This article details an eight-hour platform-wide outage at Railway, a platform built on Google Cloud, AWS, and bare metal, caused by an automated suspension of its GCP production account. It highlights a critical architectural weakness: a single cloud provider was a single point of failure for core services such as the network control plane, so the failure cascaded across all environments. The incident underscores the importance of true multi-cloud/hybrid-cloud resilience beyond traditional multi-AZ/region strategies.
Read original on InfoQ Architecture
Railway's outage was triggered by an automated suspension of their Google Cloud production account, which, critically, hosted their network control plane. While workloads on AWS and Railway's bare-metal infrastructure (Railway Metal) initially continued running due to cached routing tables at the edge, the expiration of these caches rendered all services unreachable. This demonstrates a fundamental architectural flaw: even with workloads distributed across multiple providers, a shared control plane or critical dependency on a single provider can create a cascading failure point. The physical workloads were still active, but effectively isolated and inaccessible.
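The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Railway's implementation: `EdgeRouteCache`, `FakeControlPlane`, the 60-second TTL, and the hostnames are all hypothetical names chosen for illustration. It shows an edge proxy that keeps resolving routes from its cache while the control plane is down, then loses them once the cached table passes its TTL, even though the backend itself is still running.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised when the (hypothetical) control plane cannot be reached."""


class FakeControlPlane:
    """Stand-in for a control plane hosted on a single provider."""

    def __init__(self) -> None:
        self.up = True

    def fetch_routes(self) -> Dict[str, str]:
        if not self.up:
            raise ControlPlaneUnavailable()
        return {"app.example.com": "aws-backend-1"}


@dataclass
class EdgeRouteCache:
    """Edge-side cache of the routing table with a fixed TTL (assumed design)."""

    fetch_routes: Callable[[], Dict[str, str]]
    ttl_seconds: float
    _routes: Dict[str, str] = field(default_factory=dict)
    _fetched_at: Optional[float] = None

    def resolve(self, hostname: str, now: float) -> Optional[str]:
        expired = self._fetched_at is None or now - self._fetched_at >= self.ttl_seconds
        if expired:
            try:
                self._routes = self.fetch_routes()
                self._fetched_at = now
            except ControlPlaneUnavailable:
                # The cached table is past its TTL and cannot be refreshed:
                # the backend may be healthy, but the edge can no longer find it.
                return None
        return self._routes.get(hostname)


control_plane = FakeControlPlane()
edge = EdgeRouteCache(control_plane.fetch_routes, ttl_seconds=60)

edge.resolve("app.example.com", now=0)   # table fetched while the control plane is up
control_plane.up = False                 # provider account suspended
edge.resolve("app.example.com", now=30)  # still served from the cached table
edge.resolve("app.example.com", now=61)  # cache expired, route lost
```

The clock is passed in as `now` so the behaviour is easy to reason about; a real edge proxy would read a monotonic clock instead.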
Beware of Hidden Single Points of Failure
Distributing compute resources across multiple clouds or regions is a good start, but ensure your control plane, metadata services, identity management, and critical infrastructure components are also resilient and provider-independent. A seemingly small dependency can bring down your entire distributed system if it resides on a single platform.
The recovery process further exposed architectural interdependencies. Simply restoring account access didn't immediately resolve the issue; persistent disks, compute instances, and networking required separate, phased recovery steps. This complexity was compounded by a backlog of queued deployments that needed careful draining to prevent overwhelming build systems. Additionally, GitHub's rate-limiting of OAuth and webhook integrations due to a burst of retried requests highlighted how external service dependencies can exacerbate an outage, affecting even user authentication and build processes.
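Two of the recovery problems above, draining a backlog without overwhelming build systems and avoiding bursts of retried requests, map onto well-known patterns. The sketch below is illustrative only and does not reflect Railway's tooling: `drain_in_batches`, `backoff_delays`, and their parameters are hypothetical. The first function releases queued deployments in bounded batches; the second computes capped exponential backoff delays with full jitter, so that retrying clients spread out instead of hitting an external API at the same moment.

```python
import random
from collections import deque
from typing import Deque, Iterator, List, Optional


def drain_in_batches(backlog: Deque[str], max_in_flight: int) -> Iterator[List[str]]:
    """Yield queued deployments in batches of at most `max_in_flight`.

    The caller is expected to wait for each batch to finish (or pause)
    before asking for the next one, so builders are never flooded.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    while backlog:
        size = min(max_in_flight, len(backlog))
        yield [backlog.popleft() for _ in range(size)]


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Capped exponential backoff with full jitter.

    The delay before retry number `n` (starting at 0) is drawn uniformly
    from [0, min(cap, base * 2**n)].
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Usage: drain five queued deployments two at a time.
backlog = deque(f"deploy-{i}" for i in range(5))
for batch in drain_in_batches(backlog, max_in_flight=2):
    print(batch)
```

Batch size and backoff cap are tuning knobs; the values used here are placeholders rather than recommendations.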
In response, Railway plans significant architectural changes to achieve true provider independence. Key remediations include removing Google Cloud from the data plane's hot path, extending high-availability database shards across AWS and Metal, and redesigning the mesh network. The goal is to ensure that even if one interconnect fails, routing tables can still be populated from surviving paths across other providers. This shifts from a multi-provider setup with a single point of failure to a genuinely fault-tolerant, multi-active hybrid cloud architecture.
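The idea of populating routing tables from surviving paths can be sketched as follows. This is a minimal illustration under assumptions that go beyond the source: it supposes each provider path can return a copy of the routes it knows about, and `build_routing_table`, the provider names, and the backends are hypothetical. A source that fails is skipped, and the table is built from whatever the remaining sources return.

```python
from typing import Callable, Dict, Mapping

RouteTable = Dict[str, str]


def build_routing_table(sources: Mapping[str, Callable[[], RouteTable]]) -> RouteTable:
    """Merge routes from every reachable source, skipping the ones that fail.

    When several sources know the same hostname, the first reachable
    source (in iteration order) wins.
    """
    merged: RouteTable = {}
    for provider, fetch in sources.items():
        try:
            routes = fetch()
        except ConnectionError:
            # This path is down; keep going with the surviving ones.
            continue
        for hostname, backend in routes.items():
            merged.setdefault(hostname, backend)
    return merged


def gcp_path() -> RouteTable:
    raise ConnectionError("interconnect unavailable")


def aws_path() -> RouteTable:
    return {"app.example.com": "aws-backend-1", "api.example.com": "metal-backend-2"}


def metal_path() -> RouteTable:
    return {"api.example.com": "metal-backend-2", "db.example.com": "metal-backend-3"}


table = build_routing_table({"gcp": gcp_path, "aws": aws_path, "metal": metal_path})
print(table)
```

With the first path failing, the resulting table still contains every route known to the other two sources, which is the property the planned redesign aims for.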