Reliability, Availability & Fault Tolerance
Learn the differences between reliability and availability, how to measure them with nines, and patterns for building fault-tolerant systems.
Reliability vs Availability
These terms are often used interchangeably in casual conversation, but they measure distinct properties. Getting them right in an interview demonstrates precision.
| Property | Definition | Example |
|---|---|---|
| Availability | The percentage of time the system is operational and responsive | 99.9% uptime = 8.76 hours downtime/year |
| Reliability | The probability that the system performs its intended function correctly over a time period | 99.9% of requests return correct results |
| Fault Tolerance | The system's ability to continue operating correctly in the presence of component failures | A node crashes but users see no interruption |
| Durability | The guarantee that committed data is not lost | S3's 99.999999999% (11 nines) durability for stored objects |
Durability is not availability
S3 is famous for 11 nines of durability — your data will almost never be lost. But S3 has had outages where it was unavailable (unable to serve requests) while still being durable (data was not destroyed). These are orthogonal guarantees.
The Nines of Availability
Availability is typically expressed as a percentage of uptime. The industry uses a shorthand of 'nines' to describe these percentages:
| Nines | Availability | Downtime / Year | Downtime / Month | Typical Use Case |
|---|---|---|---|---|
| Two nines (99%) | 99% | 3.65 days | 7.3 hours | Internal tools, batch systems |
| Three nines (99.9%) | 99.9% | 8.76 hours | 43.8 minutes | Standard web services |
| Four nines (99.99%) | 99.99% | 52.6 minutes | 4.4 minutes | Business-critical systems |
| Five nines (99.999%) | 99.999% | 5.3 minutes | 26.3 seconds | Carrier-grade telecom, payments |
Each additional nine is approximately 10x harder and more expensive to achieve. Going from three nines to four nines is not an incremental improvement — it requires fundamental architectural changes: redundancy at every layer, automatic failover, and very disciplined operational practices.
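The downtime columns in the table follow from a one-line conversion: the fraction of the year you are allowed to be down is simply 1 minus the availability. A minimal Python sketch (the function name is illustrative):

```python
# Convert an availability percentage into allowed downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```

Running this reproduces the table: three nines allows about 525.6 minutes (8.76 hours) per year, four nines only about 52.6 minutes.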
Why Dependencies Reduce Availability
When a request must pass through multiple services in sequence (in series), overall availability is the product of each component's availability:
# System availability for components in series:
Overall = A1 × A2 × A3 × ...
# Example: 3 services each at 99.9%
Overall = 0.999 × 0.999 × 0.999 = 0.997 = 99.7%
# Microservices at scale: 10 services each at 99.9%
Overall = 0.999^10 = 0.990 = 99.0% ← only two nines!
# For components in parallel (redundant):
Overall = 1 - (1-A)^N
# Two services each at 99%, running in parallel:
Overall = 1 - (0.01 × 0.01) = 1 - 0.0001 = 0.9999 = 99.99% ← four nines!

This math has profound implications. A microservices architecture with 50 services in a critical path, each at 99.9%, would have an overall availability of 95.1% — worse than two nines. This is why high-availability systems are designed with redundancy (parallel paths) and why minimizing synchronous dependencies is crucial.
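The series and parallel formulas can be checked directly; a minimal Python sketch:

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of components in series: the product of each."""
    return prod(components)


def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant copies, each with availability a."""
    return 1 - (1 - a) ** n


print(series_availability(*[0.999] * 10))  # ten 99.9% services in series: ~99.0%
print(series_availability(*[0.999] * 50))  # fifty in a critical path: ~95.1%
print(parallel_availability(0.99, 2))      # two 99% copies in parallel: ~99.99%
```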
Fault Tolerance Patterns
Fault tolerance is achieved through a combination of architectural patterns. Understanding these patterns and when to apply them is central to system design interviews.
Redundancy and Replication
The most fundamental fault tolerance technique: run multiple copies of a component so that if one fails, others take over. This can be active-passive (one leader handles traffic, replica takes over on failure) or active-active (multiple nodes handle traffic simultaneously, any can fail without impact).
Circuit Breaker
Named after the electrical safety device, a circuit breaker monitors calls to a dependency. If failures exceed a threshold, it 'opens' the circuit — subsequent calls fail immediately without attempting the downstream service. After a timeout, it enters a 'half-open' state and allows a probe request through. If that succeeds, the circuit 'closes' and normal operation resumes.
Netflix's Hystrix library (now largely superseded by Resilience4j) popularized circuit breakers in microservices. When Netflix's recommendation service has an outage, the circuit breaker ensures that streaming requests fail fast rather than waiting for a timeout — preserving threads and allowing the core playback experience to continue.
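The closed → open → half-open state machine described above can be sketched in a few lines of Python. This is a simplified single-threaded illustration, not Hystrix's or Resilience4j's actual implementation:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, allow this one probe through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success: close the circuit
            return result
```

While open, calls fail in microseconds instead of holding a thread for a full timeout, which is exactly the resource-preservation property the Netflix example relies on.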
Bulkhead Isolation
Borrowed from ship design (watertight compartments that prevent a single hull breach from sinking the vessel), the bulkhead pattern partitions resources to isolate failures. Different services or tenants get separate thread pools, connection pools, or even separate infrastructure. A spike in one area cannot exhaust resources needed by another.
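One common implementation gives each dependency its own bounded pool of concurrency slots. A sketch using a semaphore as a stand-in for a per-dependency thread pool (names are illustrative):

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve others."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Each downstream service gets its own compartment:
recommendations = Bulkhead("recommendations", max_concurrent=10)
checkout = Bulkhead("checkout", max_concurrent=50)
```

A flood of slow recommendation calls can exhaust at most 10 slots; checkout's 50 slots remain untouched.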
Timeouts and Retries
- Timeouts — Every network call must have a timeout. A slow dependency without a timeout will hold threads indefinitely, cascading into a full system failure.
- Retry with exponential backoff — On a transient failure, retry after a short delay. Double the delay on each subsequent retry (backoff). Add jitter (randomness) to prevent all retrying clients from hammering the service simultaneously (thundering herd).
- Idempotency — Retries are only safe if the operation is idempotent (repeated calls produce the same result). Payment systems must be especially careful: use idempotency keys to prevent duplicate charges.
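The three bullets combine naturally into one helper: retry on transient errors, back off exponentially with jitter, and attach a stable idempotency key so retries cannot double-charge. A sketch, where `charge_card` and the header name are hypothetical stand-ins for a real payment API:

```python
import random
import time
import uuid


def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.1):
    """Retry a transient failure, doubling the delay and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = base_delay * (2 ** attempt)  # exponential backoff
            delay += random.uniform(0, delay)    # jitter to avoid thundering herd
            time.sleep(delay)


# The same key is sent on every retry so the server can deduplicate.
idempotency_key = str(uuid.uuid4())


def charge_card():
    # Hypothetical payment call; the stable key makes retries safe.
    return {"Idempotency-Key": idempotency_key, "status": "charged"}


result = retry_with_backoff(charge_card)
```

Because the key is generated once, outside the retry loop, a charge that succeeded server-side but timed out client-side is recognized as a duplicate rather than billed again.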
Designing for Failure
The most resilient systems assume that every component will fail, and design for graceful degradation. Amazon pioneered "GameDay" exercises that deliberately inject failures in production to verify the system handles them correctly, and Netflix formalized the practice as chaos engineering: its Chaos Monkey randomly terminates EC2 instances during business hours to ensure services never depend on a single node being alive.
Correlated failures are the real danger
Redundancy protects against independent failures. The dangerous scenario is correlated failure — where the same bug, the same power outage, or the same traffic spike takes out all your replicas simultaneously. This is why truly high-availability systems distribute across multiple availability zones and, for the most critical workloads, multiple geographic regions.
Interview Tip
When designing a fault-tolerant system in an interview, address failure at multiple layers: 'Application servers have multiple instances behind a load balancer with health checks. The database uses synchronous replication to a standby in a second AZ with automatic failover. We use circuit breakers to isolate the recommendation service so an outage there doesn't affect checkout. Retries are idempotent with exponential backoff and jitter.' This layered approach demonstrates maturity.