
Reliability, Availability & Fault Tolerance

Learn the differences between reliability and availability, how to measure them with nines, and patterns for building fault-tolerant systems.

15 min read · High interview weight

Reliability vs Availability

These terms are often used interchangeably in casual conversation, but they measure distinct properties. Getting them right in an interview demonstrates precision.

| Property | Definition | Example |
| --- | --- | --- |
| Availability | The percentage of time the system is operational and responsive | 99.9% uptime = 8.76 hours of downtime/year |
| Reliability | The probability that the system performs its intended function correctly over a time period | 99.9% of requests return correct results |
| Fault tolerance | The system's ability to continue operating correctly in the presence of component failures | A node crashes but users see no interruption |
| Durability | The guarantee that committed data is not lost | S3's 99.999999999% (11 nines) durability for stored objects |
ℹ️

Durability is not availability

S3 is famous for 11 nines of durability — your data will almost never be lost. But S3 has had outages where it was unavailable (unable to serve requests) while still being durable (data was not destroyed). These are orthogonal guarantees.

The Nines of Availability

Availability is typically expressed as a percentage of uptime. The industry uses a shorthand of 'nines' to describe these percentages:

| Nines | Availability | Downtime / year | Downtime / month | Typical use case |
| --- | --- | --- | --- | --- |
| Two nines | 99% | 3.65 days | 7.3 hours | Internal tools, batch systems |
| Three nines | 99.9% | 8.76 hours | 43.8 minutes | Standard web services |
| Four nines | 99.99% | 52.6 minutes | 4.4 minutes | Business-critical systems |
| Five nines | 99.999% | 5.26 minutes | 26.3 seconds | Carrier-grade telecom, payments |

Each additional nine is approximately 10x harder and more expensive to achieve. Going from three nines to four nines is not an incremental improvement — it requires fundamental architectural changes: redundancy at every layer, automatic failover, and very disciplined operational practices.
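The downtime figures in the table above follow directly from the percentages. A quick Python sketch (the helper name is illustrative, not from any library):

```python
# Illustrative helper: converts an availability percentage into the
# allowed downtime per year and per month, matching the table above.
def downtime(availability_pct):
    """Return (hours of downtime per year, minutes of downtime per month)."""
    unavailable = 1 - availability_pct / 100
    hours_per_year = unavailable * 365 * 24           # 8,760 hours in a year
    minutes_per_month = unavailable * 365 * 24 * 60 / 12
    return hours_per_year, minutes_per_month

hours, minutes = downtime(99.9)
print(f"Three nines: {hours:.2f} h/year, {minutes:.1f} min/month")
# → Three nines: 8.76 h/year, 43.8 min/month
```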

Why Dependencies Reduce Availability

When a request must pass through multiple services in sequence (in series), overall availability is the product of each component's availability:

```text
# System availability for components in series:
Overall = A1 × A2 × A3 × ...

# Example: 3 services each at 99.9%
Overall = 0.999 × 0.999 × 0.999 = 0.997 = 99.7%

# Microservices at scale: 10 services each at 99.9%
Overall = 0.999^10 = 0.990 = 99.0%  ← only two nines!

# For components in parallel (redundant):
Overall = 1 - (1-A)^N
# Two services each at 99%, running in parallel:
Overall = 1 - (0.01 × 0.01) = 1 - 0.0001 = 99.99%  ← four nines!
```

This math has profound implications. A microservices architecture with 50 services in a critical path, each at 99.9%, would have an overall availability of 95.1% — worse than two nines. This is why high-availability systems are designed with redundancy (parallel paths) and why minimizing synchronous dependencies is crucial.
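The series and parallel formulas above can be verified in a few lines of Python (function names are chosen for illustration):

```python
import math

def series(*availabilities):
    """Overall availability of components in series: the product."""
    return math.prod(availabilities)

def parallel(availability, n):
    """Overall availability of n redundant copies in parallel."""
    return 1 - (1 - availability) ** n

print(f"{series(*[0.999] * 3):.4f}")   # 0.9970 — three services in series
print(f"{series(*[0.999] * 10):.4f}")  # 0.9900 — ten services: two nines
print(f"{parallel(0.99, 2):.4f}")      # 0.9999 — two redundant 99% nodes
```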

Fault Tolerance Patterns

Fault tolerance is achieved through a combination of architectural patterns. Understanding these patterns and when to apply them is central to system design interviews.

Redundancy and Replication

The most fundamental fault tolerance technique: run multiple copies of a component so that if one fails, others take over. This can be active-passive (one leader handles traffic, replica takes over on failure) or active-active (multiple nodes handle traffic simultaneously, any can fail without impact).

Active-passive failover: the primary handles all traffic; the standby takes over if the primary fails.

Circuit Breaker

Named after the electrical safety device, a circuit breaker monitors calls to a dependency. If failures exceed a threshold, it 'opens' the circuit — subsequent calls fail immediately without attempting the downstream service. After a timeout, it enters a 'half-open' state and allows a probe request through. If that succeeds, the circuit 'closes' and normal operation resumes.

Netflix's Hystrix library (now largely superseded by Resilience4j) popularized circuit breakers in microservices. When Netflix's recommendation service has an outage, the circuit breaker ensures that streaming requests fail fast rather than waiting for a timeout — preserving threads and allowing the core playback experience to continue.
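The three-state machine described above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the Hystrix or Resilience4j API; the threshold and timeout defaults are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative, not a real library API).

    closed    -> calls pass through; consecutive failures are counted
    open      -> calls fail fast until reset_timeout elapses
    half-open -> one probe call is allowed; success closes the circuit
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip the breaker
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result
```

Note that failing fast is the point: a call rejected by an open breaker returns in microseconds instead of holding a thread for a full timeout.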

Bulkhead Isolation

Borrowed from ship design (watertight compartments that prevent a single hull breach from sinking the vessel), the bulkhead pattern partitions resources to isolate failures. Different services or tenants get separate thread pools, connection pools, or even separate infrastructure. A spike in one area cannot exhaust resources needed by another.
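In code, a bulkhead can be as simple as dedicating a bounded thread pool (or semaphore) to each dependency. A Python sketch, with the pool names and sizes chosen purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative bulkheads: each downstream dependency gets its own small,
# bounded thread pool, so a slow "recommendations" dependency cannot
# exhaust the threads reserved for "checkout".
pools = {
    "checkout": ThreadPoolExecutor(max_workers=10),
    "recommendations": ThreadPoolExecutor(max_workers=4),
}

def submit(dependency, fn, *args):
    """Run fn on the pool reserved for the named dependency."""
    return pools[dependency].submit(fn, *args)

future = submit("checkout", lambda x: x * 2, 21)
print(future.result())  # 42
```

The same idea applies to connection pools and rate limits: the partition boundary, not the specific resource, is what contains the failure.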

Timeouts and Retries

  • Timeouts — Every network call must have a timeout. A slow dependency without a timeout will hold threads indefinitely, cascading into a full system failure.
  • Retry with exponential backoff — On a transient failure, retry after a short delay. Double the delay on each subsequent retry (backoff). Add jitter (randomness) to prevent all retrying clients from hammering the service simultaneously (thundering herd).
  • Idempotency — Retries are only safe if the operation is idempotent (repeated calls produce the same result). Payment systems must be especially careful: use idempotency keys to prevent duplicate charges.
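The retry guidance above can be combined into one small helper. A Python sketch with full jitter; the function signature and defaults are illustrative, not from any particular library:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn, retrying on failure with exponential backoff and full jitter.

    Only safe if fn is idempotent. Delay doubles each attempt, capped at
    max_delay, and jitter spreads retrying clients out over time.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter

# A dependency that fails twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # prints "ok"
```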

Designing for Failure

The most resilient systems assume that every component will fail and design for graceful degradation. Chaos engineering formalizes this mindset: deliberately injecting failures in production to verify that the system handles them correctly. Netflix's Chaos Monkey, the best-known example, randomly terminates EC2 instances during business hours to ensure services never depend on a single instance being alive.

⚠️

Correlated failures are the real danger

Redundancy protects against independent failures. The dangerous scenario is correlated failure — where the same bug, the same power outage, or the same traffic spike takes out all your replicas simultaneously. This is why truly high-availability systems distribute across multiple availability zones and, for the most critical workloads, multiple geographic regions.

💡

Interview Tip

When designing a fault-tolerant system in an interview, address failure at multiple layers: 'Application servers have multiple instances behind a load balancer with health checks. The database uses synchronous replication to a standby in a second AZ with automatic failover. We use circuit breakers to isolate the recommendation service so an outage there doesn't affect checkout. Retries are idempotent with exponential backoff and jitter.' This layered approach demonstrates maturity.
