This article discusses strategies for building resilient systems capable of handling significant traffic spikes, drawing insights from SeatGeek's architecture. It introduces a multi-layer defense approach (Edge Shield, Gateway Shield, and Platform Shield) that absorbs bursts, controls flow, and protects core services. The discussion highlights the importance of early signals, graceful degradation, and resource isolation in preventing system collapse.
Original article: InfoQ Architecture.
SeatGeek operates in an environment prone to "traffic stampedes," in which demand arrives faster than the system can adapt and can push it toward collapse. Two sources of stress stand out: the "Noisy Neighbor Problem" in multi-tenant systems, where one tenant's load degrades service for the others, and the "Scaling Gap," the period during which scaling mechanisms lag behind demand. The core strategy for addressing both is threefold: Absorb the Burst, Control the Flow, and Protect the Core.
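The article does not say how SeatGeek isolates tenants. One common way to contain a noisy neighbor is per-tenant rate limiting, so that each tenant spends only its own budget. The sketch below uses a token bucket per tenant; the class names, rates, and the injectable clock (used only to make the example deterministic) are hypothetical choices for illustration.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second and holds at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float]) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so a short burst is absorbed
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TenantRateLimiter:
    """Hypothetical limiter: one bucket per tenant, so a noisy tenant drains only its own budget."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str) -> bool:
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[tenant].allow()


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
limiter = TenantRateLimiter(rate=1.0, capacity=3, clock=lambda: fake_now[0])
noisy = [limiter.allow("tenant-a") for _ in range(5)]  # [True, True, True, False, False]
quiet = limiter.allow("tenant-b")                       # True: tenant-b has its own bucket
fake_now[0] = 1.0
refilled = limiter.allow("tenant-a")                    # True: one token refilled after 1 s
```

Keying buckets by tenant is the design choice that matters here: a single shared bucket would let tenant-a's burst consume capacity that tenant-b needs.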
SeatGeek employs a multi-shield defense system with three layers (Edge Shield, Gateway Shield, and Platform Shield), each with distinct responsibilities for keeping the system resilient.
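The value of layering is that each shield can stop a request independently, and the outer layers spare the inner ones work. The sketch below composes three hypothetical shields in order; the assignment of "absorb the burst" to the edge, "control the flow" to the gateway, and "protect the core" to the platform follows the order in which the summary lists them and is an assumption, as are all thresholds and names. Window resets for the counters are omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Request:
    tenant: str
    path: str


# A shield returns None to pass the request inward, or a rejection reason.
Shield = Callable[[Request], Optional[str]]


def edge_shield(capacity: int) -> Shield:
    """Hypothetical outer layer: admit at most `capacity` requests in total."""
    admitted = 0

    def check(req: Request) -> Optional[str]:
        nonlocal admitted
        if admitted >= capacity:
            return "edge: over burst capacity"
        admitted += 1
        return None
    return check


def gateway_shield(per_tenant: int) -> Shield:
    """Hypothetical middle layer: cap each tenant's share of admitted traffic."""
    seen = Counter()

    def check(req: Request) -> Optional[str]:
        if seen[req.tenant] >= per_tenant:
            return f"gateway: quota exceeded for {req.tenant}"
        seen[req.tenant] += 1
        return None
    return check


def platform_shield(critical: Set[str], degraded: Callable[[], bool]) -> Shield:
    """Hypothetical inner layer: under pressure, serve only critical paths."""
    def check(req: Request) -> Optional[str]:
        if degraded() and req.path not in critical:
            return f"platform: shed non-critical {req.path}"
        return None
    return check


def compose(shields: List[Shield]) -> Callable[[Request], str]:
    """Apply shields outermost-first; the first rejection stops the request."""
    def handle(req: Request) -> str:
        for shield in shields:
            reason = shield(req)
            if reason is not None:
                return reason
        return "served"
    return handle


handle = compose([
    edge_shield(capacity=4),
    gateway_shield(per_tenant=2),
    platform_shield(critical={"/checkout"}, degraded=lambda: True),
])
outcomes = [handle(Request(t, p)) for t, p in [
    ("a", "/checkout"), ("a", "/checkout"), ("a", "/checkout"),
    ("b", "/browse"), ("b", "/checkout"),
]]
# ['served', 'served', 'gateway: quota exceeded for a',
#  'platform: shed non-critical /browse', 'edge: over burst capacity']
```

Because each shield is a plain function with the same signature, layers can be added, removed, or reordered without touching the others, which is one reading of the "Composition" principle below.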
Resilience Principles
A resilient system is built on four core principles: Composition (resilience is layered), Protect the Core (preserve critical paths), Observe Pressure (signals reveal stress), and Controlled Failure (fail gracefully). Early, accurate signals let the system adapt sooner and avoid collapse.
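Observe Pressure and Controlled Failure can be combined in a simple priority-based load shedder: as a pressure signal rises, lower-priority work is turned away first so that critical paths keep their capacity. This is a minimal sketch, assuming queue fill ratio as the signal and three hypothetical priority tiers with made-up thresholds; the article does not say which signal or thresholds SeatGeek uses.

```python
from enum import IntEnum
from typing import Tuple


class Priority(IntEnum):
    CRITICAL = 0    # hypothetical: e.g. a purchase path the platform must protect
    NORMAL = 1
    BACKGROUND = 2


# Hypothetical thresholds: the queue fill ratio at which each tier is shed.
SHED_AT = {
    Priority.BACKGROUND: 0.5,
    Priority.NORMAL: 0.8,
    Priority.CRITICAL: 1.0,
}


def admit(priority: Priority, queue_depth: int, queue_limit: int) -> bool:
    """Observe pressure (queue fill) and admit only tiers still under their threshold."""
    pressure = queue_depth / queue_limit
    return pressure < SHED_AT[priority]


def handle(priority: Priority, queue_depth: int, queue_limit: int) -> Tuple[int, str]:
    if admit(priority, queue_depth, queue_limit):
        return 200, "ok"
    # Controlled failure: reject fast with a retryable status instead of letting
    # the request wait until it times out while holding resources.
    return 503, "busy, retry later"
```

At 60% queue fill, for example, background work is already rejected while normal and critical requests are still admitted, so the system degrades in a chosen order rather than all at once.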
Signals and scaling form a feedback loop: a traffic spike increases queue size, the growing queue triggers scaling mechanisms such as the Kubernetes Horizontal Pod Autoscaler (HPA) to add capacity, and the added capacity drains the queue. Signals from all three defense layers feed this loop, which lets the system respond sooner.
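The Kubernetes documentation gives the HPA's core rule as desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)], skipped when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule to a queue-depth-per-replica signal and adds a startup delay so the Scaling Gap becomes visible. The metric, target, per-replica throughput, delay, and traffic numbers are illustrative assumptions, and real HPA behavior such as stabilization windows, pod readiness handling, and scale-down policies is not modeled.

```python
import math
from typing import List, Tuple


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """Core HPA rule: ceil(current * current_metric / target), clamped to bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def simulate(arrivals: List[int], replicas: int = 2, per_replica: int = 10,
             target_per_replica: float = 20.0,
             startup_delay: int = 2) -> List[Tuple[int, int]]:
    """Return (serving replicas, queue depth) per tick. Scale-down is omitted."""
    queue = 0
    starting: List[Tuple[int, int]] = []  # (tick at which ready, replica count)
    history: List[Tuple[int, int]] = []
    for tick, arriving in enumerate(arrivals):
        # Replicas whose startup delay has elapsed begin serving.
        replicas += sum(n for ready_at, n in starting if ready_at <= tick)
        starting = [(r, n) for r, n in starting if r > tick]
        # Serve what current capacity allows; the rest waits in the queue.
        queue = max(0, queue + arriving - replicas * per_replica)
        history.append((replicas, queue))
        # Signal: average queue depth per scheduled replica versus the target.
        scheduled = replicas + sum(n for _, n in starting)
        want = desired_replicas(scheduled, queue / scheduled, target_per_replica)
        if want > scheduled:
            starting.append((tick + startup_delay, want - scheduled))
    return history


history = simulate([20, 20, 200, 200, 200, 20, 20, 20, 20, 20])
# [(2, 0), (2, 0), (2, 180), (2, 360), (9, 470), (18, 310), (24, 90), (24, 0), (24, 0), (24, 0)]
```

In this run the first scale-up decision fires on the tick the spike arrives, yet the queue keeps growing for two more ticks while new replicas start; that interval is the Scaling Gap. A signal that arrives earlier shifts the start of that delay earlier, which is why the summary stresses early signals.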