☁️Cloudflare Blog·February 21, 2026

Cloudflare BYOIP Outage Postmortem: BGP Withdrawal due to Software Bug

This postmortem details a Cloudflare outage caused by an internal software bug leading to the unintentional withdrawal of customer Bring Your Own IP (BYOIP) prefixes via BGP. It highlights system design flaws in automated processes, configuration management, and recovery mechanisms, offering critical lessons in building resilient distributed systems.

Read original on Cloudflare Blog

Cloudflare experienced a significant outage where a software bug in an automated cleanup sub-task led to the withdrawal of approximately 1,100 customer Bring Your Own IP (BYOIP) prefixes via Border Gateway Protocol (BGP). This incident underscores the critical importance of robust change management, comprehensive testing, and resilient recovery strategies in large-scale distributed systems, especially when dealing with core network functionalities.

Root Cause: Flawed Automation and API Misinterpretation

The outage was triggered by a new automated sub-task, part of Cloudflare's 'Code Orange: Fail Small' initiative, intended to remove unneeded BYOIP prefixes. A bug in the API query within this sub-task caused it to misinterpret a request for 'pending deletion' prefixes as a request for *all* BYOIP prefixes. Consequently, the system marked all returned prefixes for deletion, systematically withdrawing them from Cloudflare's network.

go
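// Client side (cleanup sub-task): the flag is sent without a value.
resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)

// Server side (API handler): Get returns "" for a valueless parameter,
// so v != "" is false and this branch is skipped.
if v := req.URL.Query().Get("pending_delete"); v != "" {
    // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
    prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
    // ...
}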

API Design and Implementation Pitfalls

The bug highlights a critical API design flaw: the handler did not distinguish a query parameter that is present without a value (`?pending_delete`) from one that is absent. `Query().Get("pending_delete")` returns an empty string in both cases, so the `v != ""` check failed, the pending-deletion branch was skipped, and the request was served as a request for *all* prefixes rather than only those pending deletion. The cleanup sub-task then treated every returned prefix as queued for deletion, which turned a read into an unintended destructive operation. Strict parameter validation and clear semantic interpretation are crucial for robust APIs.

System Architecture: Addressing API and BGP Interaction

Cloudflare's Addressing API acts as the authoritative source of truth for IP addresses, with changes immediately propagating to the global network. The process involves customer signals, API instructions to change prefix advertisements, BGP updates on routers, and service bindings to assign products to these ranges. The incident exposed weaknesses in this propagation and configuration management pipeline, particularly where manual processes still existed or where automated processes lacked sufficient validation.

The incident also exposed gaps in testing and recovery:

  • Lack of comprehensive testing in staging environments that simulates the task-runner service executing changes independently.
  • Insufficient test coverage for scenarios where automated internal tasks modify user data without explicit customer input.
  • Varying states of impact during recovery, because prefixes were damaged to different degrees (some only withdrawn, others with service bindings removed), which required complex and varied mitigation strategies.
Tags: postmortem, bgp, network, outage, api-design, automation, configuration-management, cloud-infrastructure
