Cloudflare's incident with core unit boot times escalating from minutes to hours highlights critical considerations in managing bare-metal infrastructure. The core issue stemmed from inefficient network boot processes and firmware quirks, leading to substantial operational overhead. This case study details their methodical approach to diagnosing and resolving these issues, offering insights into automation, vendor collaboration, and UEFI intricacies for maintaining fleet efficiency.
Read original on Cloudflare BlogCloudflare encountered a significant challenge where bare-metal servers in their core data centers experienced boot times increasing from minutes to nearly four hours. This drastic increase impacted operational efficiency, maintenance windows, and the cost of managing their infrastructure. The problem was particularly evident during firmware updates and when bringing powered-off nodes back online, underscoring the complexities of managing a large, distributed bare-metal fleet.
Servers in Cloudflare's core infrastructure rely on network boot interfaces (primarily PXE and UEFI HTTPS boot via iPXE) to load their operating systems. This method is crucial for centralized, automated, and scalable control across a diverse fleet. The issue arose when a firmware update caused servers to perform a linear search through all available network boot interfaces, timing out on each incorrect one before finally finding the correct interface. Each failed attempt added approximately five minutes, compounding into twenty minutes of wasted time per boot cycle and nearly four hours for multi-reboot firmware upgrades.
System Design Implication: Automated vs. Optimized Boot
While network booting provides excellent automation capabilities for large fleets, its default behavior can lead to significant performance bottlenecks if not meticulously optimized. Relying on default search behaviors for critical components can introduce unpredictable delays and operational overhead, especially in scenarios requiring frequent reboots or large-scale deployments.
To mitigate the issue, Cloudflare implemented several architectural changes and automation strategies:
| Metric | Before ordering change | After ordering change |
|---|
The successful resolution of this problem demonstrates the importance of deep-level infrastructure understanding, strong vendor partnerships, and robust automation for managing large-scale bare-metal environments. It highlights how even low-level firmware interactions can significantly impact the overall efficiency and reliability of a distributed system.
# construct path to read the update variable set buffer-var-guid 91468514-75bc-4bb5-8f33-91efff9e9b1f set var-upd-path efivar/CfHIIVarUpd-${buffer-var-guid} #Run the config change command imgexec <signed CF UEFI configuration App> set ${uefi-setting}=${uefi-value} #Compare the update variable with the expected value if it has changed. #If it has changed, set the local variable to reboot the system iseq ${uefi-same-hex} ${${var-upd-path}} || set has-changed ${uefi-diff-hex}