InfoQ Architecture · May 30, 2026

Architecting for Cloud Provider Resiliency: Lessons from Railway's Google Cloud Outage

This article details an eight-hour, platform-wide outage at Railway, a platform built on Google Cloud, AWS, and bare metal, caused by an automated suspension of its GCP production account. It highlights a critical architectural weakness: a single cloud provider hosted core services such as the network control plane, so it became a single point of failure and the outage cascaded across all environments. The incident underscores the importance of true multi-cloud and hybrid-cloud resilience, beyond traditional multi-AZ and multi-region strategies.


The Cascade of a Single Point of Failure

Railway's outage was triggered by an automated suspension of their Google Cloud production account, which, critically, hosted their network control plane. Workloads on AWS and on Railway's bare-metal infrastructure (Railway Metal) kept running and initially stayed reachable because routing tables were cached at the edge. Once those caches expired, all services became unreachable. This demonstrates a fundamental architectural flaw: even with workloads distributed across multiple providers, a shared control plane or other critical dependency on a single provider can turn one provider's failure into a platform-wide one. The workloads themselves were still running, but isolated and inaccessible.
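
The cache-expiry behavior described above can be illustrated with a minimal sketch. It is not Railway's implementation: the class names, the `fetch` callback, the 60-second TTL, and the `serve_stale` option are all assumptions made for illustration. The sketch shows why routes keep resolving for a while after the control plane disappears, and how an edge that is allowed to serve stale entries would behave differently from one that strictly honors the TTL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class ControlPlaneDown(Exception):
    """Raised by the (hypothetical) control-plane client when it is unreachable."""


@dataclass
class CachedRoute:
    backend: str
    fetched_at: float


class EdgeRouteCache:
    """Edge-side route cache with a TTL and an optional serve-stale fallback."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 serve_stale: bool = False) -> None:
        self._fetch = fetch              # asks the control plane for a route
        self._ttl = ttl                  # seconds a cached route counts as fresh
        self._serve_stale = serve_stale  # keep using expired routes if refresh fails
        self._routes: Dict[str, CachedRoute] = {}

    def resolve(self, host: str, now: float) -> Optional[str]:
        entry = self._routes.get(host)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.backend         # fresh hit: control plane not contacted
        try:
            backend = self._fetch(host)
        except ControlPlaneDown:
            if entry is not None and self._serve_stale:
                return entry.backend     # degrade: last known route
            return None                  # strict TTL: host is now unreachable
        self._routes[host] = CachedRoute(backend, now)
        return backend


# Hypothetical control plane that can be switched off to simulate the suspension.
control_plane = {"up": True}

def fetch_route(host: str) -> str:
    if not control_plane["up"]:
        raise ControlPlaneDown(host)
    return "aws/10.0.0.7"

strict = EdgeRouteCache(fetch_route, ttl=60.0)
lenient = EdgeRouteCache(fetch_route, ttl=60.0, serve_stale=True)
warm = (strict.resolve("app.example", now=0.0),
        lenient.resolve("app.example", now=0.0))

control_plane["up"] = False                                # account suspended
during_ttl = strict.resolve("app.example", now=30.0)       # still cached
after_ttl_strict = strict.resolve("app.example", now=61.0)
after_ttl_lenient = lenient.resolve("app.example", now=61.0)
```

Serving stale routes trades correctness for availability: a backend that moved during the outage would be misrouted, so a real system would likely bound how long stale entries may be used.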


Beware of Hidden Single Points of Failure

Distributing compute resources across multiple clouds or regions is a good start, but ensure your control plane, metadata services, identity management, and critical infrastructure components are also resilient and provider-independent. A seemingly small dependency can bring down your entire distributed system if it resides on a single platform.
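
One way to look for such hidden dependencies is to model where each component runs and what it depends on, then ask what fails when one provider disappears. The sketch below is a simplified illustration under stated assumptions: only the placement of the network control plane on GCP comes from the incident description, while the other component names, placements, and dependency edges are hypothetical, and every dependency is treated as a hard one.

```python
from typing import Dict, Set


def blast_radius(provider: str,
                 placement: Dict[str, Set[str]],
                 depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the components that fail if `provider` becomes unavailable.

    A component fails directly when it runs only on that provider, and
    transitively when anything it depends on has failed.
    """
    failed = {name for name, providers in placement.items()
              if providers == {provider}}
    changed = True
    while changed:                       # propagate until nothing new fails
        changed = False
        for component, deps in depends_on.items():
            if component not in failed and failed & deps:
                failed.add(component)
                changed = True
    return failed


# Hypothetical model; only "network control plane on GCP" is from the article.
placement = {
    "network control plane": {"gcp"},
    "edge routing": {"aws", "metal"},
    "tenant reachability": {"gcp", "aws", "metal"},
}
depends_on = {
    "edge routing": {"network control plane"},
    "tenant reachability": {"edge routing"},
}

gcp_loss = blast_radius("gcp", placement, depends_on)
aws_loss = blast_radius("aws", placement, depends_on)
```

In this toy model, losing GCP takes out all three components even though two of them run on several providers, while losing AWS takes out none.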

Recovery Challenges and Interdependencies

The recovery process further exposed architectural interdependencies. Restoring account access did not resolve the issue on its own: persistent disks, compute instances, and networking each required separate, phased recovery steps. This complexity was compounded by a backlog of queued deployments that had to be drained carefully to avoid overwhelming the build systems. In addition, a burst of retried requests led GitHub to rate-limit Railway's OAuth and webhook integrations, which affected user authentication and builds and showed how external service dependencies can prolong an outage.
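
Draining a backlog without overwhelming downstream systems is commonly done by pacing releases. The following is a minimal token-bucket sketch of that idea, not a description of Railway's tooling: the class, the rate of one deployment per second, the burst capacity of three, and the `deploy-N` job names are all assumptions. Time is passed in explicitly so the behavior is deterministic.

```python
from collections import deque
from typing import Deque, List


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `rate_per_s` releases."""

    def __init__(self, rate_per_s: float, capacity: float) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def try_take(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain(backlog: Deque[str], bucket: TokenBucket, now: float) -> List[str]:
    """Release as many queued jobs as the bucket currently allows, in order."""
    released: List[str] = []
    while backlog and bucket.try_take(now):
        released.append(backlog.popleft())
    return released


# Hypothetical backlog of ten queued deployments.
backlog = deque(f"deploy-{i}" for i in range(10))
bucket = TokenBucket(rate_per_s=1.0, capacity=3.0)

first = drain(backlog, bucket, now=0.0)    # initial burst of 3
second = drain(backlog, bucket, now=2.0)   # 2 seconds later: 2 more
remaining = len(backlog)
```

The same pacing applies in the other direction: a client that spreads its retries out, rather than replaying them all at once, is less likely to trip an upstream rate limit.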

Architectural Remediation: Towards True Provider Independence

In response, Railway plans significant architectural changes to achieve true provider independence. Key remediations include removing Google Cloud from the data plane's hot path, extending high-availability database shards across AWS and Metal, and redesigning the mesh network. The goal is to ensure that even if one interconnect fails, routing tables can still be populated from surviving paths across other providers. This shifts from a multi-provider setup with a single point of failure to a genuinely fault-tolerant, multi-active hybrid cloud architecture.

Remove single cloud provider from the data plane's hot path.
Extend high-availability database shards across multiple providers (AWS, Metal).
Redesign the mesh network to allow routing table population from surviving paths in case of any interconnect failure (no single control plane dependency).
Ensure user access to database backups and critical system dashboards is available even during a platform-wide outage.
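
The mesh redesign goal, populating routing tables from whatever paths survive, can be sketched as merging route advertisements from several independent sources and tolerating the failure of any of them. This is an illustrative sketch only: the `Route` type, the per-route version number used to pick the newest advertisement, the source names, and the addresses are assumptions, not details of Railway's design.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Route(NamedTuple):
    next_hop: str
    version: int   # assumed monotonically increasing per destination


RouteSource = Callable[[], Dict[str, Route]]


def build_routing_table(
    sources: Dict[str, RouteSource],
) -> Tuple[Dict[str, Route], List[str]]:
    """Merge routes from every reachable source; report the unreachable ones."""
    table: Dict[str, Route] = {}
    unreachable: List[str] = []
    for name, fetch in sources.items():
        try:
            advertised = fetch()
        except ConnectionError:
            unreachable.append(name)     # skip this path, keep the others
            continue
        for destination, route in advertised.items():
            current = table.get(destination)
            if current is None or route.version > current.version:
                table[destination] = route   # newest advertisement wins
    return table, unreachable


# Hypothetical sources, one per provider interconnect.
def from_gcp() -> Dict[str, Route]:
    raise ConnectionError("interconnect down")

def from_aws() -> Dict[str, Route]:
    return {"svc-a": Route("aws/10.0.0.7", 4), "svc-b": Route("aws/10.0.0.9", 2)}

def from_metal() -> Dict[str, Route]:
    return {"svc-b": Route("metal/192.168.1.5", 3)}

table, unreachable = build_routing_table(
    {"gcp": from_gcp, "aws": from_aws, "metal": from_metal}
)
```

With the GCP source down, the table is still fully populated from AWS and Metal, which is the property the remediation list asks for.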

cloud outage, multi-cloud architecture, hybrid cloud, single point of failure, disaster recovery, resilience, network control plane, system interdependencies


Architecture Design

Design this yourself

Design a highly resilient, multi-cloud platform-as-a-service (PaaS) that minimizes single points of failure across various cloud providers (e.g., GCP, AWS, bare-metal). Focus on a control plane architecture that can withstand an entire cloud provider account suspension without platform-wide outages, ensuring continuous routing, database access, and deployment capabilities for tenant applications.


Other design angles

· Design a fully autonomous, distributed control plane for a PaaS that operates independently of any single cloud provider's API or authentication systems.
· Architect a disaster recovery strategy for a multi-cloud PaaS, specifically addressing account-level suspensions and ensuring tenant data accessibility and service restoration across remaining providers.
· Design a network mesh for a hybrid cloud environment that maintains routing and service discovery capabilities even if a critical component or an entire cloud provider becomes unreachable.