# Circuit Breaker Pattern
Prevent cascade failures: closed, open, and half-open states. Trip thresholds, recovery strategies, and monitoring circuit breaker health.
## Why Services Need a Circuit Breaker
In a distributed system, every network call can fail. When an upstream service is slow or down, callers that keep retrying can exhaust their own thread pools, fill their own connection pools, and cascade the failure downstream. A circuit breaker is a proxy that tracks recent call outcomes and automatically stops forwarding requests to a degraded dependency — giving it time to recover and protecting the caller from resource exhaustion.
The name comes from the electrical circuit breaker that trips when current spikes too high, disconnecting the circuit to prevent damage. The software version trips when the error rate (or slow-call rate) crosses a threshold, opening the circuit to fast-fail all further calls.
## The Three States
| State | Behaviour | Next Transition |
|---|---|---|
| Closed | Requests pass through normally. Failures are counted in a sliding window. | → Open when error rate exceeds threshold |
| Open | All requests fail immediately with a fallback error. No calls reach the dependency. | → Half-Open after a reset timeout (e.g., 30 s) |
| Half-Open | A single probe request is let through to test recovery. | → Closed on success; → Open on failure |
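The table can be read as a small state machine. The following is a minimal sketch of just the transitions; the state and event names are illustrative, not taken from any particular library:

```typescript
// The three breaker states and the events that move between them.
type State = "closed" | "open" | "half-open";
type Event =
  | "error-rate-exceeded"   // sliding-window failure rate crossed the threshold
  | "reset-timeout-elapsed" // waited long enough in open to try a probe
  | "probe-succeeded"       // half-open probe came back healthy
  | "probe-failed";         // half-open probe failed

// Transition table mirroring the "Next Transition" column above.
const transitions: Record<State, Partial<Record<Event, State>>> = {
  "closed":    { "error-rate-exceeded": "open" },
  "open":      { "reset-timeout-elapsed": "half-open" },
  "half-open": { "probe-succeeded": "closed", "probe-failed": "open" },
};

// Apply an event; events not valid in the current state leave it unchanged.
function next(state: State, event: Event): State {
  return transitions[state][event] ?? state;
}
```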
## Key Configuration Parameters
- Failure threshold — percentage of calls that must fail within the window to trip the breaker (e.g., 50 %)
- Minimum request volume — don't trip if only 2 of 2 calls failed; require at least N calls (e.g., 20) before evaluating
- Sliding window size — count-based (last N calls) or time-based (last N seconds)
- Slow-call threshold — treat calls slower than X ms as failures (prevents latency spikes from being invisible)
- Reset timeout — how long to wait in Open state before trying a probe (e.g., 30 s, with exponential growth on repeated trips)
- Half-open permitted calls — how many probes to allow before deciding to close or re-open
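Taken together, these parameters might be grouped into a configuration object like the following; the names and defaults are a hypothetical sketch, not any specific library's API:

```typescript
// Hypothetical configuration shape covering the parameters listed above.
interface CircuitBreakerOptions {
  failureRateThreshold: number;   // trip when >= this fraction of calls fail (e.g., 0.5)
  minimumRequestVolume: number;   // don't evaluate the rate until at least this many calls
  slidingWindowSize: number;      // last N calls (count-based) or N seconds (time-based)
  slowCallThresholdMs: number;    // calls slower than this are counted as failures
  resetTimeoutMs: number;         // how long to stay open before the first probe
  halfOpenPermittedCalls: number; // probes allowed before deciding close vs re-open
}

const defaults: CircuitBreakerOptions = {
  failureRateThreshold: 0.5,
  minimumRequestVolume: 20,
  slidingWindowSize: 100,
  slowCallThresholdMs: 1_000,
  resetTimeoutMs: 30_000,
  halfOpenPermittedCalls: 3,
};

// The minimum-volume guard: a window only trips the breaker once it has
// seen enough traffic for the failure rate to be meaningful.
function shouldTrip(failures: number, total: number, opts: CircuitBreakerOptions): boolean {
  return total >= opts.minimumRequestVolume && failures / total >= opts.failureRateThreshold;
}
```

Note how `shouldTrip(2, 2, defaults)` stays `false` even though 100 % of calls failed: two calls are below the minimum request volume.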
## Sequence: Normal, Tripped, and Recovery

- Normal: the circuit is closed; each request passes through and the breaker records its outcome in the sliding window.
- Tripped: the error rate crosses the threshold; the breaker opens and every request fast-fails to the fallback without touching the dependency.
- Recovery: after the reset timeout the breaker moves to half-open and lets a probe through; success closes the circuit, failure re-opens it and restarts the timeout.
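This sequence can be simulated end to end with a compact, self-contained sketch (separate from the fuller implementation later in the article). The clock is injected so the 30 s reset timeout can be skipped instantly; the threshold and timeout values are illustrative:

```typescript
// Minimal breaker used only to demonstrate the lifecycle:
// closed -> (failures) -> open -> (timeout) -> half-open -> (probe ok) -> closed.
class SimBreaker {
  state: "closed" | "open" | "half-open" = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly now: () => number,
    private readonly threshold = 3,
    private readonly resetTimeoutMs = 30_000
  ) {}

  record(success: boolean): void {
    // Open -> half-open once the reset timeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    if (this.state === "open") return; // fast-fail: the dependency is never called
    if (success) {
      this.failures = 0;
      this.state = "closed"; // normal success, or a successful half-open probe
    } else {
      this.failures++;
      // A failed probe re-opens immediately; closed trips at the threshold.
      if (this.state === "half-open" || this.failures >= this.threshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
    }
  }
}

// Drive the three phases with a fake clock and record each resulting state.
let clock = 0;
const sim = new SimBreaker(() => clock);
const trace: string[] = [];
for (const ok of [true, false, false, false]) {
  sim.record(ok);
  trace.push(sim.state);
}
clock += 31_000;  // wait out the reset timeout
sim.record(true); // probe succeeds: half-open -> closed
trace.push(sim.state);
// trace: ["closed", "closed", "closed", "open", "closed"]
```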
## Fallback Strategies
An open circuit should not just throw an error — it should degrade gracefully. Common fallback approaches include returning cached stale data, returning an empty/default response (e.g., an empty recommendations list), queueing the request for later processing, or redirecting to a secondary service. Netflix Hystrix popularized this pattern with its `getFallback()` method that each command implements.
### Fallbacks Must Not Call the Same Dependency
A fallback that calls the same failing service defeats the purpose. Fallbacks should be purely local (return cached data, return a default) or call a genuinely different service.
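A purely local fallback for the empty-recommendations case might look like this; the names and cache shape are hypothetical:

```typescript
// Hypothetical local fallback: serve stale cached recommendations if we have
// them, otherwise a safe empty default. No network call is made, so this
// cannot fail the same way the primary dependency did.
const recommendationCache = new Map<string, string[]>();
recommendationCache.set("user-1", ["movie-a", "movie-b"]); // stale data from an earlier success

function fallbackRecommendations(userId: string): string[] {
  return recommendationCache.get(userId) ?? []; // stale beats unavailable
}
```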
## Implementation Example (Pseudocode)

```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly threshold = 5,           // trip after 5 consecutive failures
    private readonly resetTimeoutMs = 30_000  // 30 s reset timeout
  ) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
        this.state = "half-open"; // reset timeout elapsed: let a probe through
      } else {
        return fallback(); // fast-fail without touching the dependency
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      return fallback();
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed"; // a successful probe (or normal call) closes the circuit
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A failed half-open probe re-opens immediately; in the closed state,
    // trip only once the failure threshold is reached.
    if (this.state === "half-open" || this.failureCount >= this.threshold) {
      this.state = "open";
    }
  }
}
```

This sketch uses a simple consecutive-failure counter rather than the sliding-window error rate described above, and it does not limit concurrent probes in the half-open state; production implementations handle both.

## Real-World Examples
Netflix Hystrix (now in maintenance mode) was the original popular implementation, wrapping every inter-service call. Resilience4j is its modern successor for the JVM ecosystem, offering count-based and time-based sliding windows. AWS App Mesh and Istio service meshes implement circuit breaking at the infrastructure layer via Envoy proxy, removing the need for library-level code. Spring Cloud Gateway and AWS API Gateway also expose circuit breaker filters.
## Interview Tip
Interviewers love hearing you distinguish circuit breaker from retry: retries are for transient errors on individual calls; circuit breakers protect against sustained degradation of a downstream service. Use them together — retries inside a closed circuit, fast-fail when open. Always mention fallback strategies and observability (metrics for state transitions, open duration, fallback invocations).
## Observability
Every state transition should emit a metric or log event. Key metrics: `circuit_breaker_state` (gauge: 0=closed, 1=open, 2=half-open), `circuit_breaker_calls_total` labeled by outcome (success/failure/fallback/rejected), and `circuit_breaker_open_duration_seconds`. An alert on `circuit_breaker_state == 1` for more than 60 seconds often indicates a real outage rather than a transient blip.
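A transition hook that emits both the state gauge and the open-duration metric could be sketched as follows; `emit` is a stand-in for whatever metrics client is in use (an assumption, not a specific library):

```typescript
// Illustrative transition hook for the metrics named above.
// Gauge encoding matches the text: 0=closed, 1=open, 2=half-open.
const gaugeValue: Record<string, number> = { "closed": 0, "open": 1, "half-open": 2 };

let circuitOpenedAt: number | null = null;

function onStateTransition(
  to: string,
  nowMs: number,
  emit: (metric: string, value: number) => void
): void {
  emit("circuit_breaker_state", gaugeValue[to]);
  if (to === "open") {
    circuitOpenedAt = nowMs; // start timing the outage window
  } else if (to === "closed" && circuitOpenedAt !== null) {
    // Report how long the circuit was open once it fully recovers.
    emit("circuit_breaker_open_duration_seconds", (nowMs - circuitOpenedAt) / 1000);
    circuitOpenedAt = null;
  }
}
```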