Chaos Engineering
Proactively test system resilience: chaos experiments, blast radius control, Netflix Chaos Monkey, Gremlin, and building a chaos engineering practice.
From Reactive to Proactive Reliability
Traditional reliability engineering is reactive: wait for production incidents, learn from post-mortems, fix the root cause. Chaos engineering inverts this: deliberately introduce failure in a controlled way to discover weaknesses before they become real outages. The motto is: *'If it hurts, do it more often.'* A system that is never tested under failure is a system whose failure behavior is unknown.
Chaos Engineering vs Stress Testing
Stress testing pushes a system beyond its stated capacity (e.g., 10x normal traffic). Chaos engineering tests the system's resilience to specific failure modes at normal or slightly elevated load — instance crashes, network partitions, dependency failures, disk filling up. They are complementary, not competing.
The Chaos Engineering Process
A chaos experiment follows a repeatable loop:
1. Define steady state: concrete, measurable criteria for "normal" tied to your SLOs (error rate, p99 latency).
2. Form a hypothesis: "if we inject failure X, steady state will hold because of mechanism Y."
3. Limit the blast radius: start in staging or with a small canary percentage of production.
4. Inject the failure, with an automated abort if steady state is violated.
5. Analyze the results, fix the weaknesses discovered, then widen the scope and repeat.
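The skeleton of this loop fits in a few lines. A minimal sketch, assuming a hypothetical `get_error_rate()` metrics query (in practice a Prometheus or similar lookup) and caller-supplied inject/stop hooks:

```python
import time

def get_error_rate():
    """Hypothetical metrics query; stubbed here. In practice, poll your
    observability stack (e.g. Prometheus) for the steady-state signal."""
    return 0.004  # 0.4% error rate

def run_experiment(inject, stop, steady_state_max=0.005, abort_max=0.05,
                   duration_s=300, poll_s=10):
    """Inject a fault, watch steady state, abort the moment it is violated."""
    if get_error_rate() > steady_state_max:
        raise RuntimeError("system not in steady state; do not start")
    inject()
    try:
        elapsed = 0
        while elapsed < duration_s:
            time.sleep(poll_s)
            elapsed += poll_s
            if get_error_rate() > abort_max:  # automated kill switch
                return "aborted"
        # fault ran for the full duration; did steady state hold?
        return "passed" if get_error_rate() <= steady_state_max else "failed"
    finally:
        stop()  # always remove the fault, even on abort or crash
```

The `finally` block is the important design choice: the fault must be removed on every exit path, including an abort, or the experiment itself becomes the outage.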
Types of Chaos Experiments
| Failure Type | Example Experiment | Tests For |
|---|---|---|
| Instance/Pod termination | Kill a random API pod every hour | Auto-healing, replica recovery |
| Network latency | Add 500ms latency to payment service calls | Timeout handling, circuit breaking |
| Packet loss | Drop 20% of packets between services | Retry logic, graceful degradation |
| Disk full | Fill the data volume to 100% | Error handling, alerting |
| Dependency unavailability | Take the cache (Redis) offline | Fallback to database, error messages |
| CPU / memory stress | Peg CPU at 95% on a database node | Autoscaling, replication failover |
| DNS failure | Return NXDOMAIN for a dependent service | Connection timeout handling |
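Several of these failure modes (latency, packet loss, dependency failure) can be emulated in-process with a call wrapper. A toy sketch for local testing only, not a substitute for network-level injection with `tc` or Chaos Mesh:

```python
import random
import time

def chaotic(func, latency_s=0.0, failure_rate=0.0):
    """Wrap a call site to inject artificial latency and random failures."""
    def wrapper(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)            # emulate network latency
        if random.random() < failure_rate:   # emulate packet loss / dead dependency
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Usage (names illustrative): `pay = chaotic(call_payment_api, latency_s=0.5, failure_rate=0.2)` exercises the caller's timeout and retry logic without touching the network.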
Netflix's Simian Army
Netflix pioneered chaos engineering and open-sourced a suite of tools called the Simian Army. Chaos Monkey terminates random EC2 instances during business hours, forcing engineers to design for instance loss. Chaos Gorilla simulates the failure of an entire AWS availability zone. Latency Monkey introduces artificial delays. Conformity Monkey checks instances against best-practice rules. Security Monkey finds security policy violations.
Netflix runs Chaos Monkey in production during business hours — when engineers are at their desks to respond. The philosophy: if you're going to experience failure anyway (and you will), better to experience it on your schedule with your best engineers available.
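The core idea is small: pick a random instance from a group and terminate it, but only during business hours. A stripped-down sketch; `terminate_instance` is a hypothetical stand-in for a cloud API call:

```python
import random
from datetime import datetime

def terminate_instance(instance_id):
    """Hypothetical stand-in for a cloud API call (e.g. EC2 TerminateInstances)."""
    print(f"terminating {instance_id}")

def chaos_monkey_tick(instances, now=None):
    """Kill one random instance per tick, weekday business hours only."""
    now = now or datetime.now()
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return None  # engineers aren't at their desks; stand down
    victim = random.choice(instances)
    terminate_instance(victim)
    return victim
```

The business-hours guard encodes the philosophy from above: failure happens on your schedule, with responders available.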
Tools: Gremlin and Chaos Mesh
| Tool | Type | Key Features |
|---|---|---|
| Chaos Monkey | Open-source (Netflix) | EC2 instance termination, Spinnaker integration |
| Gremlin | Commercial SaaS | GUI, attack library, blast radius controls, GameDay planning |
| Chaos Mesh | Open-source (CNCF) | Kubernetes-native, pod/network/time chaos, web UI |
| LitmusChaos | Open-source (CNCF) | Kubernetes-native, ChaosHub with pre-built experiments |
| Pumba | Open-source | Docker container chaos: kill, pause, network |
Prerequisites for Safe Chaos Engineering
- Strong observability — you need metrics, logs, and traces to define steady state and detect deviation
- Defined steady state — concrete, measurable success criteria (e.g., p99 latency < 300ms, error rate < 0.5%)
- Automated kill switch — ability to abort the experiment instantly if steady state is violated
- Runbooks — engineers know how to respond to the failure mode being tested
- Start small — begin in staging, limit blast radius (canary percentage or specific namespace), then graduate to production
- GameDays — scheduled chaos exercises where teams practice responding to failures together
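The blast-radius prerequisite can be made concrete: instead of targeting the whole fleet, sample a capped percentage of it. A sketch with illustrative names:

```python
import math
import random

def select_blast_radius(targets, percent):
    """Pick a random canary subset capped at `percent` of the fleet.

    At least one target (otherwise the experiment is a no-op), and never
    the entire fleet (otherwise there is no healthy control group).
    """
    k = max(1, math.floor(len(targets) * percent / 100))
    if len(targets) > 1:
        k = min(k, len(targets) - 1)
    return random.sample(targets, k)
```

Graduating an experiment then means raising `percent` run by run, rather than rewriting the experiment.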
Example: Chaos Mesh network latency experiment
Inject 200ms latency on all traffic from the `checkout-service` to the `payment-service`. Hypothesis: the checkout service's 500ms timeout will trigger a circuit breaker, and users will see a graceful 'payment unavailable, try again' message rather than a hanging request. Expected: error rate stays below 2%, no 5xx responses escape to the client. Abort condition: error rate exceeds 5%.
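This experiment could be expressed as a Chaos Mesh `NetworkChaos` manifest along these lines (a sketch: the namespace, pod labels, and duration are assumptions about the example cluster, not values from the text):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-to-payment-delay
  namespace: shop              # assumed namespace
spec:
  action: delay
  mode: all                    # all matching checkout pods
  selector:
    namespaces: [shop]
    labelSelectors:
      app: checkout-service    # assumed pod label
  direction: to
  target:
    mode: all
    selector:
      namespaces: [shop]
      labelSelectors:
        app: payment-service   # assumed pod label
  delay:
    latency: "200ms"
  duration: "10m"              # auto-expires; pair with the 5% error-rate abort
```

The `duration` field bounds the experiment even if the operator walks away; the abort condition (error rate above 5%) still needs an external watcher or kill switch.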
Interview Tip
If asked 'how would you improve system reliability beyond standard redundancy?' — describe chaos engineering. Key points: (1) define steady state with SLOs, (2) form a hypothesis about a specific failure mode, (3) limit blast radius (start in staging), (4) inject failure with automated rollback if steady state is violated, (5) fix weaknesses discovered. Mention that chaos engineering requires strong observability as a prerequisite — this shows architectural depth.