Dev.to #systemdesign·June 12, 2026

Fault Tolerance Patterns: Circuit Breakers, Bulkheads, and Graceful Failure

This article explores essential fault tolerance patterns in distributed systems, using the analogy of the Titanic's bulkheads to illustrate the importance of complete implementation. It details how cascading failures occur when shared resources are exhausted by slow dependencies and introduces patterns like Timeout, Retry with Exponential Backoff and Jitter, and Circuit Breakers to prevent such widespread outages.


Understanding Cascading Failures

In distributed systems, a slow or failing dependency can quickly lead to a system-wide outage. The article uses a clear example: if a `Payment Service` becomes slow, the `Order Service` calling it holds each thread longer. This can exhaust the `Order Service` thread pool, causing it to reject all new requests, even those unrelated to payments. The failure then propagates upward through the call graph until the whole platform is unresponsive. The core problem is that a slow dependency consumes a shared resource (here, threads) that unrelated operations also need.

plaintext
Step 1: Payment Service becomes slow (database under load, 5 seconds per call instead of 50ms)
Step 2: Order Service calls Payment Service, waits...
  Order Service has a thread pool of 100 threads
  Each call to Payment Service now holds a thread for 5 seconds (instead of 50ms)
  Each call holds its thread 100x longer, so 100x more threads are busy at the same request rate
Step 3: Order Service's thread pool exhausts
  All 100 threads are blocked waiting on Payment Service
  New incoming requests to Order Service have no threads available
  Order Service starts rejecting/timing out ALL requests 
Step 4: Services calling Order Service experience the same problem
Step 5: Cascade continues upward through the entire call graph
  The ENTIRE platform becomes unresponsive 
  because ONE service (Payment) got slow.
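
The arithmetic behind Steps 2 and 3 can be sketched with Little's law (average threads in use = arrival rate x time each call holds a thread). This is a minimal illustration, not part of the original article: the pool size and the two latencies come from the example above, while the request rate of 50 requests per second and the function name are assumptions chosen for the sketch.

```python
POOL_SIZE = 100          # from the example: Order Service has 100 threads
HEALTHY_LATENCY_MS = 50  # from the example: normal Payment Service latency
SLOW_LATENCY_MS = 5000   # from the example: latency while the database is under load
REQUESTS_PER_SECOND = 50  # assumption: illustrative arrival rate, not from the article


def threads_in_use(requests_per_second, latency_ms):
    """Average number of threads blocked on the dependency (Little's law)."""
    return requests_per_second * latency_ms / 1000


healthy = threads_in_use(REQUESTS_PER_SECOND, HEALTHY_LATENCY_MS)
degraded = threads_in_use(REQUESTS_PER_SECOND, SLOW_LATENCY_MS)

print(f"healthy:  {healthy} threads busy of {POOL_SIZE}")
print(f"degraded: {degraded} threads needed, pool has {POOL_SIZE}")
print("pool exhausted:", degraded >= POOL_SIZE)
```

Under these assumed numbers the healthy service keeps only a few threads busy, while the slow dependency demands more threads than the pool holds, which is the point at which every request is rejected regardless of whether it touches payments.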

Essential Fault Tolerance Patterns

1. Timeout: Never Wait Forever

The most fundamental pattern is setting explicit timeouts for all external calls. Without a timeout, a hung dependency can hold a thread indefinitely, contributing to resource exhaustion. With a timeout, the thread is released after a bounded duration, allowing it to serve other requests. Timeouts should be based on the p99 latency of the dependency, providing headroom for normal variance while failing fast for genuine hangs. Crucially, timeouts must be applied at every layer, including HTTP clients, database drivers, and connection pool acquisitions, to prevent hidden resource leaks.

python
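import requests

# WITH timeout: connecting, and each wait for response data, is bounded to 2 seconds
response = requests.get("http://payment-service/charge", timeout=2.0)
# If the limit is exceeded, requests raises requests.exceptions.Timeout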

  • Timeouts that are too short: Can cause legitimately slow requests to fail unnecessarily during normal load spikes.
  • Timeouts that are too long: Still allow threads to be tied up for extended periods during failures, so cascading failures still happen, only more slowly.
  • Rule of thumb: Set the timeout based on the p99 latency of the dependency, e.g., if p99 is 200ms, set the timeout to 500ms-1s.
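
The rule of thumb can be turned into a small helper that derives a timeout from observed latencies. This is a sketch under stated assumptions: the function names, the 3x multiplier (which lands inside the 500ms-1s range for a 200ms p99), and the floor and ceiling bounds are all illustrative choices, not values prescribed by the article.

```python
def nearest_rank_percentile(samples, pct=99):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # integer ceil(n * pct / 100)
    return ordered[rank - 1]


def suggest_timeout(samples, multiplier=3.0, floor=0.1, ceiling=5.0):
    """Suggest a timeout in seconds: p99 latency times a headroom multiplier.

    multiplier, floor, and ceiling are hypothetical defaults for this sketch.
    """
    p99 = nearest_rank_percentile(samples, 99)
    return min(ceiling, max(floor, p99 * multiplier))


# Hypothetical latency samples in seconds: mostly 50ms, with a 200ms tail.
latencies = [0.05] * 98 + [0.2] * 2
timeout = suggest_timeout(latencies)
print(f"p99 = {nearest_rank_percentile(latencies)}s, suggested timeout = {timeout:.2f}s")
```

The resulting value could then be passed as the `timeout` argument of the HTTP client call shown earlier. Recomputing it periodically from fresh samples keeps the timeout aligned with the dependency's actual behavior.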

2. Retry with Exponential Backoff and Jitter

Transient errors often resolve on retry, but naive retries can exacerbate issues. A "synchronized retry storm" occurs when many clients retry simultaneously, overwhelming a recovering service. Exponential backoff addresses this by increasing the delay between retries, giving the service more time to recover. However, even with exponential backoff, if clients start retrying at the same time, they can still hit the service in synchronized waves. Jitter introduces randomness to these delays, desynchronizing retries and spreading the load more evenly, which is critical for allowing a struggling service to stabilize. AWS's "full jitter" approach, which randomizes the wait time between zero and the calculated exponential backoff, is highlighted as a robust strategy.
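
A minimal sketch of retry with exponential backoff and full jitter follows. Each wait is drawn uniformly between zero and the capped exponential delay, as described above. The `TransientError` class, the function names, and the default `base`, `cap`, and `max_attempts` values are assumptions made for illustration. The `sleep` and `rng` parameters are injectable only so the behavior can be checked without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Return a random delay in [0, min(cap, base * 2**attempt)) seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retry(operation, max_attempts=5, base=0.1, cap=10.0,
                    sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the failure to the caller
            sleep(full_jitter_delay(attempt, base, cap, rng))


# Usage: a hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_charge():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("payment service busy")
    return "charged"


print(call_with_retry(flaky_charge, base=0.01))
```

Only errors known to be transient should be retried this way, and the operation should be safe to repeat. Retrying a non-idempotent call such as a charge without an idempotency key risks duplicating its effect.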

Tags: fault tolerance, resilience, circuit breaker, bulkhead, timeout, retry, exponential backoff, jitter
