Dev.to #systemdesign·March 17, 2026

Designing Robust Systems: Differentiating Errors from Exceptions

This article clarifies the distinction between errors and exceptions in software systems, arguing that errors are expected, recoverable outcomes, while exceptions signify unexpected, unrecoverable violations of system invariants. Distinguishing the two properly affects system reliability, observability, API design, and resilience in distributed environments. Adhering to this principle leads to more robust, maintainable, and predictable software architectures.

The Foundational Distinction

In building resilient systems, a core principle is understanding how to handle failure. This article emphasizes that "errors" and "exceptions" serve fundamentally different purposes, and that conflating them leads to architectural and operational problems. Errors are expected, recoverable outcomes that are part of normal system behavior, such as a user providing invalid input or a resource not being found. Exceptions are unexpected, unrecoverable violations of system invariants or assumptions, indicating a truly exceptional, often fatal, state.

Practical Rule for System Architects

If the caller can handle it, it's an error. If the system cannot safely proceed, it's an exception. This rule guides the design of resilient systems and clear API contracts.
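The rule can be illustrated with a small sketch. Everything named here (`Result`, `Account`, `withdraw`, the `insufficient_funds` variant) is hypothetical and chosen only for illustration; the point is that the outcome a caller can act on is returned as a value, while a broken invariant throws.

```typescript
// Hypothetical Result type: the expected failure is part of the return value.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E }

// Hypothetical error and account shapes, used only for this sketch.
type WithdrawError = { kind: "insufficient_funds"; shortfall: number }
interface Account { balance: number }

function withdraw(account: Account, amount: number): Result<Account, WithdrawError> {
  // Invariant violation: a negative balance means the state is already corrupt.
  // The system cannot safely proceed, so this throws.
  if (account.balance < 0) {
    throw new Error(`Invariant violated: negative balance ${account.balance}`)
  }
  // Expected outcome: the caller can handle it (for example, ask for a smaller amount).
  if (amount > account.balance) {
    return { ok: false, error: { kind: "insufficient_funds", shortfall: amount - account.balance } }
  }
  return { ok: true, value: { balance: account.balance - amount } }
}

const attempt = withdraw({ balance: 50 }, 80)
if (!attempt.ok) {
  console.log(`Declined: short by ${attempt.error.shortfall}`) // Declined: short by 30
}
```

Because the failure is in the return type, the compiler forces the caller to check `ok` before touching `value`.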

Impact on System Design and Architecture

Reliability: Treating expected errors as exceptions can cause unnecessary crashes, skipped retries, and fallback mechanisms that never run. Proper error handling ensures predictable system behavior under common failure scenarios.
Observability: Misusing exceptions for routine errors creates noisy logs and obscures genuine system issues, making incident response and debugging significantly harder.
API Design: Explicitly modeling errors in API contracts (e.g., returning `User | null` or `(User, error)` tuples) makes interfaces predictable and easier for consumers to reason about. APIs that throw exceptions for expected outcomes break their contract.
Distributed Systems Resilience: Distributed environments inherently face network partitions, timeouts, and partial failures. Architectures that treat these expected distributed challenges as exceptions are prone to cascading failures and unpredictable behavior, highlighting the need for robust, explicit error handling.
Maintainability and Team Productivity: A clear distinction reduces cognitive load for developers, leading to faster debugging, easier onboarding, and a more maintainable codebase overall.
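As one possible illustration of the reliability and distributed-resilience points, the sketch below models timeouts and unavailability as ordinary return values so that retry and fallback logic can act on them. The names `CallOutcome`, `callWithRetry`, and `flakyPriceLookup` are assumptions made for this example, and the call is kept synchronous for brevity; a real network call would be asynchronous and would normally add backoff between attempts.

```typescript
// Expected distributed failures are modeled as values, not thrown.
type CallOutcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "timeout" }
  | { kind: "unavailable" }

interface RetryReport<T> { value: T; attempts: number; usedFallback: boolean }

// Hypothetical helper: retries transient outcomes up to maxAttempts, then falls back.
// Anything the operation throws is deliberately not caught here, so genuine
// exceptions still surface immediately.
function callWithRetry<T>(op: () => CallOutcome<T>, maxAttempts: number, fallback: T): RetryReport<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = op()
    if (outcome.kind === "ok") {
      return { value: outcome.value, attempts: attempt, usedFallback: false }
    }
    // "timeout" and "unavailable" are expected, so loop and try again.
  }
  return { value: fallback, attempts: maxAttempts, usedFallback: true }
}

// Simulated flaky dependency: times out twice, then succeeds.
let calls = 0
const flakyPriceLookup = (): CallOutcome<number> =>
  ++calls < 3 ? { kind: "timeout" } : { kind: "ok", value: 42 }

const priced = callWithRetry(flakyPriceLookup, 5, 0)
console.log(priced) // { value: 42, attempts: 3, usedFallback: false }
```

Keeping thrown exceptions out of the retry loop also helps observability: routine timeouts never reach the error tracker, so what does reach it is worth investigating.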

Architectural Implications for Error and Exception Handling

Architecturally, errors should be handled gracefully through explicit returns, discriminated unions, or `Result` types, allowing upstream components to take corrective actions. Exceptions, conversely, should ideally lead to fast failures, immediate logging (e.g., to an error tracking system), and potentially system restarts or circuit breaker activations. This strategy ensures that critical system invariants are preserved and that truly anomalous situations are brought to immediate attention rather than being silently suppressed or mishandled.

typescript

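// Assumed for illustration (not in the original): a minimal User type and in-memory store.
interface User { id: string }
const users = new Map<string, User>()
const db = { find: (id: string): User | undefined => users.get(id) }

// ✅ Modeling errors explicitly in TypeScript for expected outcomes:
// "not found" is part of the return type, so the caller must handle it.
function getUser(id: string): User | null {
  return db.find(id) ?? null
}

// ❌ Using exceptions for normal outcomes leads to fragile APIs:
// the signature promises a User, but a routine miss throws instead.
function getUserThrowing(id: string): User {
  const user = db.find(id)
  if (!user) throw new Error("User not found")
  return user
}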
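When there is more than one expected outcome, a nullable return no longer says why the lookup failed. A discriminated union, one of the options mentioned above, might look like the following sketch. The variant names, the `userStore` map, and the mapping to HTTP status codes are all assumptions for illustration. The `assertNever` helper is the fail-fast half: it only runs if a variant was added without being handled, which is an invariant violation rather than an expected error.

```typescript
interface User { id: string; name: string }

// Each expected outcome is a named variant the caller must handle.
type GetUserResult =
  | { status: "found"; user: User }
  | { status: "not_found"; id: string }
  | { status: "invalid_id"; reason: string }

// Hypothetical in-memory store standing in for a real database.
const userStore = new Map<string, User>([["u1", { id: "u1", name: "Ada" }]])

function lookupUser(id: string): GetUserResult {
  if (id.trim() === "") return { status: "invalid_id", reason: "id must not be empty" }
  const user = userStore.get(id)
  return user ? { status: "found", user } : { status: "not_found", id }
}

// Reaching this at runtime means a variant exists that the code does not handle:
// an invariant violation, so fail fast instead of continuing.
function assertNever(value: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(value)}`)
}

function toHttpStatus(result: GetUserResult): number {
  switch (result.status) {
    case "found": return 200
    case "not_found": return 404
    case "invalid_id": return 400
    default: return assertNever(result)
  }
}

console.log(toHttpStatus(lookupUser("u1")))      // 200
console.log(toHttpStatus(lookupUser("missing"))) // 404
```

If a new variant is later added to `GetUserResult`, the `default` branch stops type-checking until the new case is handled, so the gap is caught at compile time before it can become a runtime exception.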

Architecture Design

Design this yourself

Design a distributed microservices platform, focusing on a robust error and exception handling strategy across services. Define clear API contracts for both expected errors and truly exceptional conditions, including mechanisms for error propagation, correlation, and centralized logging. Detail how service-level errors (e.g., resource not found) are distinguished from system-level exceptions (e.g., unhandled runtime crash) and how each is handled to maintain system reliability and observability.

Other design angles

· Design an e-commerce API gateway and backend services. Focus on how payment processing failures (expected errors) are handled vs. critical infrastructure failures (exceptions), including user-facing communication and retry mechanisms.
· Design a real-time analytics pipeline that processes large volumes of data. Detail the error handling strategy for malformed data or upstream service unavailability (errors) versus unrecoverable application crashes within a processing worker (exceptions), ensuring data integrity and pipeline resilience.