This article delves into the critical difference between average latency and percentile latency in production systems, highlighting how resource contention can drastically impact tail latencies. It explores the dynamics of queueing effects, saturation, and the importance of monitoring metrics beyond simple averages to build resilient and performant distributed systems.
When designing and operating distributed systems, engineers often focus on average latency, which can be misleading. Production environments frequently reveal issues like high p99 or p99.9 latencies, indicating a poor experience for a significant fraction of users. These tail latencies are crucial for understanding user satisfaction and system health, especially in microservices architectures where cumulative latencies can quickly degrade overall performance.
Average latency hides the experience of outliers. For instance, an average latency of 120ms might mean most requests are 50ms, but some are 1000ms. In a service-oriented architecture, if multiple services are called serially, even a small percentage of slow requests in one service can lead to a much larger percentage of slow end-user requests. This compounding effect makes monitoring tail latencies (like p99 or p99.9) indispensable for identifying performance bottlenecks.
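A quick synthetic experiment makes the gap concrete. The sketch below assumes a hypothetical 99%/1% mix of ~50 ms and ~1000 ms requests (the split, the sample sizes, and the `percentile` helper are all illustrative, not from any particular system):

```python
import random

random.seed(42)

# Hypothetical workload: 99% of requests take ~50 ms, 1% take ~1000 ms.
latencies = [random.gauss(50, 5) for _ in range(990)] + \
            [random.gauss(1000, 50) for _ in range(10)]

def percentile(values, p):
    """Nearest-rank percentile: value below which ~p% of samples fall."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]

avg = sum(latencies) / len(latencies)
print(f"average: {avg:.0f} ms")                        # looks modest
print(f"p50:     {percentile(latencies, 50):.0f} ms")  # looks fine
print(f"p99:     {percentile(latencies, 99):.0f} ms")  # exposes the tail
```

The average sits comfortably between the two modes and hides the outliers entirely; only the p99 reveals them. The compounding effect follows the same arithmetic: with a 1% slow tail per service, a request fanned out serially across 10 services has a 1 − 0.99¹⁰ ≈ 9.6% chance of hitting at least one slow call.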
High tail latencies are frequently a symptom of resource contention. When a system's capacity is approached, requests start to queue, leading to increased latency. This isn't just about CPU; it can involve network I/O, disk I/O, database connections, or thread pools. As utilization nears 100%, the latency increase becomes non-linear and can rapidly spiral out of control.
Little's Law for System Design
Little's Law (L = λW, where L is the average number of items in a queueing system, λ is the average arrival rate, and W is the average time an item spends in the system) ties together concurrency, throughput, and latency. For a given throughput λ, rising concurrency L forces the average time W = L/λ upward, which is exactly what happens as resource saturation slows processing and requests pile up in flight.
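A worked example, with hypothetical numbers chosen for illustration:

```python
# Little's Law: L = lambda * W, rearranged to W = L / lambda.
# Hypothetical service: 500 req/s arriving, 60 requests in flight on average.
arrival_rate = 500        # lambda: requests per second
in_flight = 60            # L: average concurrent requests in the system

avg_time_s = in_flight / arrival_rate      # W, in seconds
print(f"W = {avg_time_s * 1000:.0f} ms")   # -> 120 ms
```

The law is agnostic to *why* requests linger; if in-flight count climbs to 120 at the same 500 req/s (say, because a downstream dependency slows down), average latency has necessarily doubled to 240 ms.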
To mitigate these issues, system designers must:

1. Monitor beyond averages, tracking p90, p95, p99, and p99.9 latencies.
2. Implement proper load testing and capacity planning that account for saturation points.
3. Design systems with backpressure mechanisms, graceful degradation, and circuit breakers to prevent cascading failures when one component becomes saturated.

Understanding saturation dynamics is key to building resilient systems that perform well under load.
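As a sketch of the circuit-breaker idea, here is a minimal, illustrative implementation (the class, thresholds, and API are hypothetical, not from any library; production breakers also need thread safety and richer state reporting):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative, not production-ready).

    After max_failures consecutive failures the circuit "opens" and calls
    fail fast instead of queueing onto a saturated dependency; after
    reset_after seconds one trial call is allowed through ("half-open")."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

The key property is that an open circuit converts slow failures (requests queueing behind a saturated dependency) into fast ones, shedding load so the dependency can recover instead of dragging its callers' tail latencies up with it.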