This article details a complex bug in Cloudflare's QUIC implementation of the CUBIC congestion control algorithm. It highlights how a Linux kernel optimization for handling idle TCP connections led to a critical issue in QUIC where the congestion window permanently collapsed, severely impacting data transfer rates. The case study provides insights into the intricacies of congestion control, the challenges of porting kernel-level optimizations to user-space implementations, and systematic debugging of performance-critical network protocols.
Congestion control algorithms (CCAs) like CUBIC are fundamental to how TCP and QUIC connections manage network bandwidth, detect loss, and recover from congestion. They determine the congestion window (cwnd) — the maximum amount of data in flight at any given moment. A larger cwnd enables higher throughput, while a smaller one throttles data transfer. Loss-based algorithms increase the sending rate when the network is healthy and decrease it upon detecting packet loss, assuming congestion. This article explores a specific bug where CUBIC's cwnd gets permanently pinned at its minimum, leading to connection stalls.
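The loss-based dynamic described above can be sketched as follows. This is a minimal illustration, not quiche's actual CUBIC code: the `Cwnd` struct, `on_ack`, and `on_loss` are hypothetical names, and the growth rule is a simplified additive-increase rather than the real cubic curve.

```rust
// Minimal sketch of a loss-based congestion controller (illustrative only).
const MAX_DATAGRAM_SIZE: usize = 1350;
const MINIMUM_WINDOW: usize = 2 * MAX_DATAGRAM_SIZE; // two full-size packets

struct Cwnd {
    bytes: usize,
    ssthresh: usize,
}

impl Cwnd {
    // While the network is healthy, ACKs grow the window: slow start
    // doubles it per RTT; congestion avoidance adds roughly one packet
    // per RTT (simplified; CUBIC uses a cubic function of time instead).
    fn on_ack(&mut self, acked: usize) {
        if self.bytes < self.ssthresh {
            self.bytes += acked; // slow start
        } else {
            self.bytes += MAX_DATAGRAM_SIZE * acked / self.bytes; // avoidance
        }
    }

    // Loss is taken as a congestion signal: halve the window,
    // but never shrink below the two-packet minimum.
    fn on_loss(&mut self) {
        self.ssthresh = (self.bytes / 2).max(MINIMUM_WINDOW);
        self.bytes = self.ssthresh;
    }
}

fn main() {
    let mut cc = Cwnd { bytes: MINIMUM_WINDOW, ssthresh: usize::MAX };
    cc.on_ack(MINIMUM_WINDOW); // one RTT of slow start: window doubles
    assert_eq!(cc.bytes, 2 * MINIMUM_WINDOW);
    cc.on_loss(); // loss detected: window halves back to the minimum
    assert_eq!(cc.bytes, MINIMUM_WINDOW);
}
```

The key invariant for this article is the floor: no matter how much loss occurs, cwnd never drops below two full-size packets — which is exactly where the bug pins it.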
Cloudflare observed unexpected failures in integration tests for their `quiche` QUIC implementation. Specifically, in scenarios with heavy early packet loss, CUBIC failed to recover, with downloads timing out 60% of the time. Investigation revealed that after the loss stopped, the cwnd remained at its minimum (two full-size packets), and the congestion state oscillated between "recovery" and "congestion avoidance" once per RTT. This behavior contradicted CUBIC's core logic: in the absence of loss, the cwnd should grow to utilize available bandwidth.
The bug originated from a 2017 Linux kernel optimization for TCP CUBIC, designed to prevent cwnd inflation after an application goes idle. The fix adjusted the `epoch_start` timestamp, which CUBIC uses to anchor its growth curve, shifting it forward by the idle duration instead of resetting it. This preserved the shape of the growth curve. When this optimization was ported to `quiche`'s user-space QUIC implementation, a subtle difference in how `bytes_in_flight == 0` was handled (in `on_packet_sent` vs. the kernel's `CA_EVENT_TX_START` callback) exposed a flaw. A follow-up kernel fix that addressed this edge case was missed during the port.
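The idle-shift idea can be sketched as below. This is a simplified illustration under assumed names (`Cubic`, `on_first_packet_after_idle`, `last_sent_time` are hypothetical), not the kernel's or quiche's actual code: on the first send after an idle period, `epoch_start` is advanced by the idle gap so the cubic growth curve resumes where it left off rather than restarting or jumping ahead.

```rust
use std::time::{Duration, Instant};

// Sketch of the idle-handling optimization (names are illustrative).
struct Cubic {
    epoch_start: Option<Instant>,    // anchor of the cubic growth curve
    last_sent_time: Option<Instant>, // when we last transmitted anything
}

impl Cubic {
    // Invoked when sending with nothing in flight (the connection was
    // idle): shift epoch_start forward by the idle duration instead of
    // resetting it, preserving the shape of the growth curve.
    fn on_first_packet_after_idle(&mut self, now: Instant) {
        if let (Some(epoch), Some(last)) = (self.epoch_start, self.last_sent_time) {
            let idle = now.saturating_duration_since(last);
            self.epoch_start = Some(epoch + idle);
        }
        self.last_sent_time = Some(now);
    }
}

fn main() {
    let t0 = Instant::now();
    let mut c = Cubic { epoch_start: Some(t0), last_sent_time: Some(t0) };

    // After 5 seconds of idle, epoch_start moves forward by the same
    // 5 seconds, so elapsed time on the curve is unchanged.
    let now = t0 + Duration::from_secs(5);
    c.on_first_packet_after_idle(now);
    assert_eq!(c.epoch_start, Some(t0 + Duration::from_secs(5)));
}
```

The hazard is in deciding what counts as "idle": in the kernel this hooks a dedicated transmit-start event, while a user-space port keying off `bytes_in_flight == 0` in `on_packet_sent` can misclassify ordinary ACK-clocked sends at a tiny cwnd as idle periods.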
The Self-Perpetuating Recovery Trap
The flaw caused `congestion_recovery_start_time` to be pushed into the future during ACK processing. At the minimum cwnd (two packets), every ACK cycle triggered a false idle detection, incorrectly advancing `congestion_recovery_start_time` again. This created a "death spiral": the connection constantly re-entered a recovery state, preventing cwnd growth and permanently pinning it at the minimum.
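Why a future timestamp is so damaging can be seen in the standard recovery check. The sketch below assumes an RFC 9002-style predicate (the function name and signature are illustrative, not quiche's exact API): a packet is treated as sent during recovery if its send time is at or before the recovery start time, and ACKs for such packets do not grow the window.

```rust
use std::time::{Duration, Instant};

// RFC 9002-style check (illustrative): a packet counts as in-recovery
// if it was sent at or before the recovery episode began. ACKs for
// in-recovery packets must not increase cwnd.
fn in_congestion_recovery(sent_time: Instant, recovery_start: Option<Instant>) -> bool {
    match recovery_start {
        Some(start) => sent_time <= start,
        None => false,
    }
}

fn main() {
    let t0 = Instant::now();
    let now = t0 + Duration::from_secs(10);

    // Healthy state: recovery began in the past, so a packet sent now
    // is not in recovery and its ACK may grow cwnd.
    assert!(!in_congestion_recovery(now, Some(t0 + Duration::from_secs(5))));

    // Buggy state: the false idle shift advanced recovery_start past
    // "now", so even a brand-new packet compares as in-recovery and
    // window growth is suppressed forever.
    assert!(in_congestion_recovery(now, Some(t0 + Duration::from_secs(20))));
}
```

With the start time perpetually ahead of the clock, every packet the connection ever sends looks like a recovery-era packet, which is why the cwnd stays pinned at two packets indefinitely.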