Chaos Engineering
Proactively test system resilience: chaos experiments, blast radius control, Netflix Chaos Monkey, Gremlin, and building a chaos engineering practice.
From Reactive to Proactive Reliability
Traditional reliability engineering is reactive: wait for production incidents, learn from post-mortems, fix the root cause. Chaos engineering inverts this: deliberately introduce failure in a controlled way to discover weaknesses before they become real outages. The motto is: *'If it hurts, do it more often.'* A system that is never tested under failure is a system whose failure behavior is unknown.
Chaos Engineering vs Stress Testing
Stress testing pushes a system beyond its stated capacity (e.g., 10x normal traffic). Chaos engineering tests the system's resilience to specific failure modes at normal or slightly elevated load — instance crashes, network partitions, dependency failures, disk filling up. They are complementary, not competing.
The Chaos Engineering Process
A chaos experiment follows a repeatable loop:
1. Define steady state: concrete, measurable criteria for "normal" tied to your SLOs (error rate, p99 latency).
2. Form a hypothesis: "if we inject failure X, steady state will hold because of mechanism Y."
3. Limit the blast radius: start in staging or with a small canary percentage of production.
4. Inject the failure, with an automated abort if steady state is violated.
5. Analyze the results, fix the weaknesses discovered, then widen the scope and repeat.
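The skeleton of this loop fits in a few lines. A minimal sketch, assuming a hypothetical `get_error_rate()` metrics query (in practice a Prometheus or similar lookup) and caller-supplied inject/stop hooks:

```python
import time

def get_error_rate():
    """Hypothetical metrics query; stubbed here. In practice, poll your
    observability stack (e.g. Prometheus) for the steady-state signal."""
    return 0.004  # 0.4% error rate

def run_experiment(inject, stop, steady_state_max=0.005, abort_max=0.05,
                   duration_s=300, poll_s=10):
    """Inject a fault, watch steady state, abort the moment it is violated."""
    if get_error_rate() > steady_state_max:
        raise RuntimeError("system not in steady state; do not start")
    inject()
    try:
        elapsed = 0
        while elapsed < duration_s:
            time.sleep(poll_s)
            elapsed += poll_s
            if get_error_rate() > abort_max:  # automated kill switch
                return "aborted"
        # fault ran for the full duration; did steady state hold?
        return "passed" if get_error_rate() <= steady_state_max else "failed"
    finally:
        stop()  # always remove the fault, even on abort or crash
```

The `finally` block is the important design choice: the fault must be removed on every exit path, including an abort, or the experiment itself becomes the outage.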
Types of Chaos Experiments
| Failure Type | Example Experiment | Tests For |
|---|---|---|
| Instance/Pod termination | Kill a random API pod every hour | Auto-healing, replica recovery |
| Network latency | Add 500ms latency to payment service calls | Timeout handling, circuit breaking |
| Packet loss | Drop 20% of packets between services | Retry logic, graceful degradation |
| Disk full | Fill the data volume to 100% | Error handling, alerting |
| Dependency unavailability | Take the cache (Redis) offline | Fallback to database, error messages |
| CPU / memory stress | Peg CPU at 95% on a database node | Autoscaling, replication failover |
| DNS failure | Return NXDOMAIN for a dependent service | Connection timeout handling |
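Several of these failure modes (latency, packet loss, dependency failure) can be emulated in-process with a call wrapper. A toy sketch for local testing only, not a substitute for network-level injection with `tc` or Chaos Mesh:

```python
import random
import time

def chaotic(func, latency_s=0.0, failure_rate=0.0):
    """Wrap a call site to inject artificial latency and random failures."""
    def wrapper(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)            # emulate network latency
        if random.random() < failure_rate:   # emulate packet loss / dead dependency
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Usage (names illustrative): `pay = chaotic(call_payment_api, latency_s=0.5, failure_rate=0.2)` exercises the caller's timeout and retry logic without touching the network.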
Netflix's Simian Army
Netflix pioneered chaos engineering and open-sourced a suite of tools called the Simian Army. Chaos Monkey terminates random EC2 instances during business hours, forcing engineers to design for instance loss. Chaos Gorilla simulates the failure of an entire AWS availability zone. Latency Monkey introduces artificial delays. Conformity Monkey checks instances against best-practice rules. Security Monkey finds security policy violations.
Netflix runs Chaos Monkey in production during business hours — when engineers are at their desks to respond. The philosophy: if you're going to experience failure anyway (and you will), better to experience it on your schedule with your best engineers available.
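The core idea is small: pick a random instance from a group and terminate it, but only during business hours. A stripped-down sketch; `terminate_instance` is a hypothetical stand-in for a cloud API call:

```python
import random
from datetime import datetime

def terminate_instance(instance_id):
    """Hypothetical stand-in for a cloud API call (e.g. EC2 TerminateInstances)."""
    print(f"terminating {instance_id}")

def chaos_monkey_tick(instances, now=None):
    """Kill one random instance per tick, weekday business hours only."""
    now = now or datetime.now()
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return None  # engineers aren't at their desks; stand down
    victim = random.choice(instances)
    terminate_instance(victim)
    return victim
```

The business-hours guard encodes the philosophy from above: failure happens on your schedule, with responders available.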
Tools: Gremlin and Chaos Mesh
| Tool | Type | Key Features |
|---|---|---|
| Chaos Monkey | Open-source (Netflix) | EC2 instance termination, Spinnaker integration |
| Gremlin | Commercial SaaS | GUI, attack library, blast radius controls, GameDay planning |
| Chaos Mesh | Open-source (CNCF) | Kubernetes-native, pod/network/time chaos, web UI |
| LitmusChaos | Open-source (CNCF) | Kubernetes-native, ChaosHub with pre-built experiments |
| Pumba | Open-source | Docker container chaos: kill, pause, network |
Prerequisites for Safe Chaos Engineering
- Strong observability — you need metrics, logs, and traces to define steady state and detect deviation
- Defined steady state — concrete, measurable success criteria (e.g., p99 latency < 300ms, error rate < 0.5%)
- Automated kill switch — ability to abort the experiment instantly if steady state is violated
- Runbooks — engineers know how to respond to the failure mode being tested
- Start small — begin in staging, limit blast radius (canary percentage or specific namespace), then graduate to production
- GameDays — scheduled chaos exercises where teams practice responding to failures together
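The blast-radius prerequisite can be made concrete: instead of targeting the whole fleet, sample a capped percentage of it. A sketch with illustrative names:

```python
import math
import random

def select_blast_radius(targets, percent):
    """Pick a random canary subset capped at `percent` of the fleet.

    At least one target (otherwise the experiment is a no-op), and never
    the entire fleet (otherwise there is no healthy control group).
    """
    k = max(1, math.floor(len(targets) * percent / 100))
    if len(targets) > 1:
        k = min(k, len(targets) - 1)
    return random.sample(targets, k)
```

Graduating an experiment then means raising `percent` run by run, rather than rewriting the experiment.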
Example: Chaos Mesh network latency experiment
Inject 200ms latency on all traffic from the `checkout-service` to the `payment-service`. Hypothesis: the checkout service's 500ms timeout will trigger a circuit breaker, and users will see a graceful 'payment unavailable, try again' message rather than a hanging request. Expected: error rate stays below 2%, no 5xx responses escape to the client. Abort condition: error rate exceeds 5%.
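This experiment could be expressed as a Chaos Mesh `NetworkChaos` manifest along these lines (a sketch: the namespace, pod labels, and duration are assumptions about the example cluster, not values from the text):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-to-payment-delay
  namespace: shop              # assumed namespace
spec:
  action: delay
  mode: all                    # all matching checkout pods
  selector:
    namespaces: [shop]
    labelSelectors:
      app: checkout-service    # assumed pod label
  direction: to
  target:
    mode: all
    selector:
      namespaces: [shop]
      labelSelectors:
        app: payment-service   # assumed pod label
  delay:
    latency: "200ms"
  duration: "10m"              # auto-expires; pair with the 5% error-rate abort
```

The `duration` field bounds the experiment even if the operator walks away; the abort condition (error rate above 5%) still needs an external watcher or kill switch.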
Interview Tip
If asked 'how would you improve system reliability beyond standard redundancy?' — describe chaos engineering. Key points: (1) define steady state with SLOs, (2) form a hypothesis about a specific failure mode, (3) limit blast radius (start in staging), (4) inject failure with automated rollback if steady state is violated, (5) fix weaknesses discovered. Mention that chaos engineering requires strong observability as a prerequisite — this shows architectural depth.