This article explores the critical distinction between slow requests (stragglers) and failed requests in distributed systems, particularly in fan-out architectures. It details how stragglers, often overlooked by traditional monitoring, significantly contribute to high p99 latency and how naive retries exacerbate the problem. The core solution presented is an adaptive hedging mechanism that proactively sends backup requests based on real-time latency distributions, while employing a token bucket budget to prevent load amplification during genuine outages.
Read original on InfoQ Architecture
In microservice architectures, especially those with significant fan-out (where a single user request calls multiple downstream services), individual service health metrics can be misleading. A low straggler rate in each service (e.g., 1% slow requests) compounds dramatically at the system level. For instance, with 100 downstream services and a 1% straggler rate per service, about 63% of top-level requests will hit at least one straggler (1 - 0.99^100 ≈ 0.634, assuming stragglers occur independently), which severely degrades p99 (99th percentile) latency. This is why system-wide tail latency often remains high even when every individual service appears healthy.
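The fan-out arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes stragglers in different services occur independently; the function name `fraction_hit_by_straggler` is made up for illustration.

```python
def fraction_hit_by_straggler(per_service_rate: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    # P(no straggler) = (1 - rate) ** fan_out, so take the complement.
    return 1.0 - (1.0 - per_service_rate) ** fan_out


# A 1% per-service straggler rate at increasing fan-out widths.
for n in (1, 10, 50, 100):
    print(f"fan-out {n:>3}: {fraction_hit_by_straggler(0.01, n):.1%}")
# The last line printed is "fan-out 100: 63.4%".
```

Under the same independence assumption, a fan-out of 10 already pushes the affected fraction to roughly one request in ten.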
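The article emphasizes the difference between a failure (a request that doesn't complete) and a straggler (a request that completes, but slowly, due to issues like GC pauses, hot partitions, or kernel blips). Both hurt p99 latency, but they call for different remedies: retries are a response to failures and, applied naively to slow requests, only add load, whereas stragglers are better handled by hedging, that is, sending a backup request while the original is still in flight and using whichever response arrives first.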
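As a rough illustration of what hedging a straggler looks like in code, here is a minimal asyncio sketch. It is not the article's implementation: `hedged_call`, `fake_backend`, and the 50 ms hedge delay are hypothetical names and values chosen for the demo.

```python
import asyncio


async def hedged_call(make_request, hedge_after: float):
    """Start one request; if it is still pending after `hedge_after` seconds,
    start a backup and return whichever finishes first."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        # Finished in time (or failed fast; a failure is a retry concern,
        # so the exception simply propagates in this sketch).
        return primary.result()
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower attempt is no longer needed
    return next(iter(done)).result()


async def demo():
    delays = iter([0.5, 0.01])  # first attempt is a straggler, backup is fast

    async def fake_backend():
        delay = next(delays)
        await asyncio.sleep(delay)
        return delay

    return await hedged_call(fake_backend, hedge_after=0.05)


print(asyncio.run(demo()))  # 0.01: the backup answered first
```

A plain retry would have waited for the full 0.5 s attempt to time out or fail before trying again; the hedge returns after roughly 0.06 s in this demo.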
The effectiveness of hedging depends on knowing *when* to hedge. Static thresholds are brittle in production, where latency distributions shift constantly. The proposed solution is an adaptive mechanism built around DDSketch for real-time, per-host latency tracking. DDSketch is a streaming quantile sketch that estimates quantiles in constant memory with relative-error guarantees, so it can accurately track the p90 latency (or another target percentile) of the current distribution for each backend host. A request that is still outstanding past that threshold becomes a candidate for a hedge.
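To make the idea of a relative-error quantile sketch concrete, the following is a simplified log-bucket sketch in the spirit of DDSketch. It is an illustrative assumption, not the API of any DDSketch library: the class name `LogBucketSketch` is invented, it accepts only positive values, and it omits the bucket-collapsing step that real implementations use to cap memory.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Simplified DDSketch-style sketch for positive values (e.g. latencies)."""

    def __init__(self, relative_accuracy: float = 0.01):
        # Bucket i covers (gamma**(i-1), gamma**i]; any value in a bucket is
        # within `relative_accuracy` of the bucket's representative value.
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()
        self.count = 0

    def add(self, value: float) -> None:
        if value <= 0:
            raise ValueError("this sketch only handles positive values")
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q: float) -> float:
        if self.count == 0:
            raise ValueError("no samples recorded")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value of the bucket holding the target rank.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable for 0 <= q <= 1")


sketch = LogBucketSketch(relative_accuracy=0.01)
for latency_ms in range(1, 1001):  # synthetic latencies: 1 ms .. 1000 ms
    sketch.add(latency_ms)
print(f"estimated p90: {sketch.quantile(0.90):.1f} ms")
```

With these synthetic samples the exact p90 is 900 ms, and the estimate should land within about 1% of it. In a hedging client, one such sketch per backend host would supply the hedge threshold.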
Key Components of Adaptive Hedging
DDSketch: Estimates real-time latency quantiles (e.g., p90) for each target host. It uses logarithmic bucketing and a tumbling window (e.g., two sketches rotating every 30 seconds) to adapt to changing conditions and age out stale data without requiring manual tuning.
Token Bucket Budget: Prevents load amplification during genuine outages. It caps the hedge rate at a configurable percentage of total traffic (e.g., 10%). If all requests are slow, the budget quickly exhausts, stopping hedging and allowing graceful degradation rather than doubling load on an already overwhelmed backend.
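One plausible way to build such a budget is sketched below. The mechanics are an assumption rather than the article's code: every primary request deposits a fraction of a token, every hedge spends a whole one, and the bucket is capped so idle periods cannot bank an unlimited burst. The class name `HedgeBudget` and its parameters are hypothetical, and credits are kept as integers (100 credits per hedge) to avoid floating-point drift.

```python
class HedgeBudget:
    """Token-bucket budget that limits hedges to a percentage of traffic."""

    COST = 100  # credits consumed by one hedge

    def __init__(self, hedge_percent: int = 10, burst_hedges: int = 5):
        self.deposit = hedge_percent              # credits earned per request
        self.capacity = burst_hedges * self.COST  # cap on banked credits
        self.credits = 0

    def on_request(self) -> None:
        """Call once for every primary request sent."""
        self.credits = min(self.capacity, self.credits + self.deposit)

    def try_hedge(self) -> bool:
        """Return True and spend credits if a hedge is currently allowed."""
        if self.credits >= self.COST:
            self.credits -= self.COST
            return True
        return False


# Simulate an outage: all 1000 requests are slow and every one wants a hedge.
budget = HedgeBudget(hedge_percent=10)
hedges = 0
for _ in range(1000):
    budget.on_request()
    if budget.try_hedge():
        hedges += 1
print(hedges)  # 100: extra load is capped at 10% instead of doubling
```

In normal operation only a small share of requests cross the latency threshold, so the budget rarely binds; it matters precisely when every request is slow.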