This article discusses Cloudflare's implementation of Dynamic Path MTU Discovery (PMTUD) within the Cloudflare One Client to enhance network resilience. It addresses the "PMTUD Black Hole" problem, where intermediate network devices silently drop oversized packets without providing feedback, leading to connection timeouts. By actively probing the path and dynamically adjusting packet sizes, the client keeps connections stable. This matters most for modern security protocols, whose added overhead must traverse diverse and often legacy network infrastructure.
The Maximum Transmission Unit (MTU) defines the largest packet a network link can carry without fragmentation. While standard Ethernet typically uses a 1500-byte MTU, many specialized or legacy networks (like LTE/5G, satellite, or older routers) have lower limits. Modern security protocols and VPNs, such as those used by the Cloudflare One Client, add metadata and encryption overhead, often pushing packet sizes beyond these lower MTU limits. The "PMTUD Black Hole" occurs when an intermediate network device receives an oversized packet but, instead of sending an ICMP "Destination Unreachable (Fragmentation Needed)" message (or, on IPv6, an ICMPv6 "Packet Too Big" message) back to the sender, silently drops the packet. Without that feedback, the sender keeps retransmitting packets that never arrive, leading to application timeouts and perceived connection issues.
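To make the black-hole failure mode concrete, here is a minimal simulation sketch. Everything in it is hypothetical and for illustration only: the `Hop` type, the `send` function, the hop MTUs, and the 80-byte `TUNNEL_OVERHEAD` are assumptions, not figures from Cloudflare. It shows the same oversized packet producing a useful ICMP report on a well-behaved path and no signal at all on a black-hole path.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hop:
    mtu: int          # largest packet this hop will forward
    sends_icmp: bool  # False models a hop that drops without reporting


def send(packet_size: int, path: List[Hop]) -> Tuple[str, Optional[int]]:
    """Simulate one unfragmentable packet crossing the path.

    Returns (outcome, reported_mtu). reported_mtu is only set when the
    dropping hop actually sends ICMP feedback.
    """
    for hop in path:
        if packet_size > hop.mtu:
            if hop.sends_icmp:
                return ("icmp_too_big", hop.mtu)
            return ("silently_dropped", None)
    return ("delivered", None)


# Hypothetical numbers: a 1420-byte inner packet plus an assumed
# 80 bytes of tunnel overhead yields a 1500-byte outer packet.
TUNNEL_OVERHEAD = 80
inner_packet = 1420
outer_packet = inner_packet + TUNNEL_OVERHEAD

healthy_path = [Hop(mtu=1500, sends_icmp=True), Hop(mtu=1400, sends_icmp=True)]
black_hole_path = [Hop(mtu=1500, sends_icmp=True), Hop(mtu=1400, sends_icmp=False)]

print(send(outer_packet, healthy_path))     # sender learns the limit
print(send(outer_packet, black_hole_path))  # sender learns nothing
```

On the healthy path the sender is told the limiting MTU and can shrink its packets. On the black-hole path the only observable symptom is loss, which is indistinguishable from any other dropped packet unless the sender probes deliberately.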
Cloudflare's Dynamic Path MTU Discovery (PMTUD) implementation, leveraging the MASQUE protocol built on QUIC, shifts from passive reliance on ICMP feedback to active, end-to-end network path interrogation. Instead of waiting for potentially dropped error messages, the Cloudflare One Client proactively sends encrypted packets of varying sizes to the Cloudflare edge. This probing mechanism allows the client to identify the maximum viable MTU for a given path without disruption.
If a probe of a specific size is acknowledged by the Cloudflare edge, the path supports that packet size. If a probe is lost, the client infers that the size is too large for some segment of the path. This process enables the client to resize its virtual interface MTU on the fly, adapting to changing network conditions (e.g., moving from Wi-Fi to cellular) and keeping application sessions intact.
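The acknowledge-or-lose logic above can be sketched as a search over candidate sizes. This is a minimal illustration, not Cloudflare's actual algorithm: the source does not say how probe sizes are chosen, so the binary search, the function names `discover_path_mtu` and `make_simulated_probe`, the 1280-byte floor (the IPv6 minimum link MTU, assumed here to always pass), and the example path MTUs are all assumptions. The `probe` callable stands in for "send a padded packet of this size and report whether the edge acknowledged it".

```python
from typing import Callable


def discover_path_mtu(
    probe: Callable[[int], bool],
    floor: int = 1280,    # assumed always deliverable
    ceiling: int = 1500,  # largest size worth trying
) -> int:
    """Return the largest size in [floor, ceiling] that probe() accepts.

    Binary search: an acknowledged probe raises the lower bound,
    a lost probe lowers the upper bound.
    """
    lo, hi = floor, ceiling
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid              # acknowledged: this size fits the path
        else:
            hi = mid - 1          # lost: this size is too large
    return lo


def make_simulated_probe(true_path_mtu: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for a real network probe."""
    def probe(size: int) -> bool:
        return size <= true_path_mtu
    return probe


# Hypothetical path MTUs for two networks the client might roam between.
wifi_mtu = discover_path_mtu(make_simulated_probe(1500))
cellular_mtu = discover_path_mtu(make_simulated_probe(1381))
print(wifi_mtu, cellular_mtu)
```

In a real client, a single lost probe could also be caused by congestion rather than size, so a practical implementation would likely retry a size before ruling it out and would re-run discovery when the underlying network changes.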
System Design Implication: Robustness through Active Probing
Relying on passive error reporting (like ICMP) in distributed systems introduces fragility, as intermediate network components might interfere with or drop these critical messages. Implementing active probing or handshake mechanisms, as Cloudflare does with PMTUD, makes systems more resilient to such network "black holes" by explicitly verifying path capabilities rather than assuming successful communication of errors.