Meta Engineering · June 3, 2026

Meta's Instant Power Loss Readiness: Resilient Data Center Architecture

This article details Meta's Instantaneous PowerLoss Storm testing paradigm, a crucial initiative to validate and enhance the resilience of their data centers against zero-notice power outages. It explores the defense-in-depth strategies implemented across Meta's infrastructure, from hardware to their Twine orchestrator, and highlights architectural challenges such as circular dependencies and the 'boomerang' effect during system bootstrapping and recovery. The piece also discusses the trade-offs made to balance reliability with engineering velocity and the incremental validation approach used to test readiness in production environments.


Meta's approach to instantaneous power loss readiness is an example of designing for resilience in large distributed systems. The "Instantaneous PowerLoss Storm" is a testing paradigm within the existing Disaster Readiness (DR) "Storm" program. It simulates immediate, unannounced power failures across entire data center regions so that their effects can be found and mitigated. The initiative builds defense-in-depth strategies into every layer of the infrastructure, with the goal of system continuity even during catastrophic events.

Defense-in-Depth Strategies for Power Loss Tolerance

The capability to handle instant power loss is built in from the ground up, spanning mechanical and electrical facilities, server racks, storage, compute, and the core Twine container orchestrator. Key architectural components include battery-backed persistence of in-memory data and a region-wide asynchronous signaling mechanism, Unavailability Events (UEs), that Twine services use to orchestrate graceful shutdown and recovery. The challenge grows significantly at the scale of an entire data center region, which is 50-60 times larger than a typical fault domain and adds the complexities of scale, replica placement, and autonomous bootstrapping.
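To make the signaling idea concrete, here is a minimal sketch of a service reacting to an Unavailability Event by flushing in-memory state and stopping. It is not Meta's implementation: the class names, fields, and `broadcast` helper are all hypothetical, delivery is synchronous here for simplicity (the article describes the real mechanism as asynchronous), and the "durable" list merely stands in for whatever the battery-backed window allows a service to persist.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class UEKind(Enum):
    POWER_LOSS = "power_loss"
    MAINTENANCE = "maintenance"  # hypothetical second kind, for contrast


@dataclass(frozen=True)
class UnavailabilityEvent:
    kind: UEKind
    region: str


@dataclass
class Service:
    name: str
    region: str
    in_memory: List[str] = field(default_factory=list)  # unflushed state
    durable: List[str] = field(default_factory=list)    # stand-in for persistent storage
    running: bool = True

    def on_unavailability_event(self, event: UnavailabilityEvent) -> None:
        if event.region != self.region:
            return  # the event targets a different region
        # Flush while backup power (an assumption of this sketch) still holds, then stop.
        self.durable.extend(self.in_memory)
        self.in_memory.clear()
        self.running = False


def broadcast(event: UnavailabilityEvent, services: List[Service]) -> int:
    """Deliver the event to every service and return how many stopped."""
    stopped = 0
    for service in services:
        was_running = service.running
        service.on_unavailability_event(event)
        if was_running and not service.running:
            stopped += 1
    return stopped
```

Calling `broadcast(UnavailabilityEvent(UEKind.POWER_LOSS, "region-a"), services)` stops only the services in `region-a` and leaves their buffered state in `durable`; services in other regions are untouched.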

Overcoming Bootstrapping Challenges

Bootstrapping a powered-off region requires millions of services to start simultaneously and discover each other autonomously. Meta identified two prominent issues during this process:

Circular Dependencies (Ouroboros Risk): Critical control plane services for the Twine orchestrator (Scheduler, Allocator, Broker, Zelos) can't run without each other, creating a 'chicken and egg' problem during a cold start. Meta addressed this by identifying critical startup dependencies, using Belljar tests in CI/CD pipelines for early detection, and implementing a purpose-built Twine recovery kit (Twrko) to 'jumpstart' these core services.
Boomerang Problem: Unavailability Events (UEs), intended to orchestrate service shutdown and recovery, paradoxically ended up shutting down the orchestrator control plane services themselves. The solution was to allow control plane services to simply ignore power-related UE shutdown signals, preventing them from being orphaned.
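Both problems can be illustrated with a short sketch. The first function finds a cycle in a startup dependency graph with a depth-first search, which is the kind of check a CI test could run; the second expresses the boomerang fix as a predicate. The dependency edges in the usage example are invented for illustration (the article only says the four services cannot run without each other), and `find_cycle`, `should_shut_down`, and `CONTROL_PLANE` are hypothetical names, not Belljar or Twine APIs.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one startup-dependency cycle as a list of names, or None."""
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the path loops back to it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Service names taken from the article; lowercase identifiers are this sketch's choice.
CONTROL_PLANE = frozenset({"scheduler", "allocator", "broker", "zelos"})


def should_shut_down(service: str, ue_kind: str) -> bool:
    """Control plane services ignore power-related UEs; all other cases obey."""
    return not (service in CONTROL_PLANE and ue_kind == "power_loss")
```

With the made-up graph `{"scheduler": ["allocator"], "allocator": ["broker"], "broker": ["scheduler"]}`, `find_cycle` reports the loop through all three services, which is the signal that a cold start needs an external jumpstart. Recursion keeps the sketch short; a graph with very deep chains would need an iterative traversal.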

Trade-offs and Validation

Building watertight tolerance for instant power loss can lead to over-engineering and opportunity costs, so Meta established clear boundaries between tolerable and intolerable impacts. Data loss, permanent facility damage, and sustained impact beyond a single region were deemed unacceptable. Tolerable risks included transient service errors, bounded staleness in routing tables, and rack failures within predefined thresholds.

Validation was incremental: it started with self-contained problems in new or pre-production regions, progressed to 'shadow' regions, and finally de-energized large production regions housing critical workloads, with no preemptive actions beforehand, to simulate real-world conditions. This iterative process has driven architectural improvements and improved organizational readiness.
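One way to make the tolerable/intolerable boundary explicit is to encode it as a check that a test run must pass. The sketch below is an assumption-laden illustration, not Meta's tooling: the `Observation` fields, the `classify` function, and the rack threshold are hypothetical, and only the rules themselves (no data loss, no permanent facility damage, no sustained impact beyond one region, rack failures within a threshold) come from the article.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    TOLERABLE = "tolerable"
    INTOLERABLE = "intolerable"


@dataclass(frozen=True)
class Observation:
    """What a power-loss exercise measured (hypothetical fields)."""
    data_lost: bool
    permanent_facility_damage: bool
    regions_with_sustained_impact: int
    failed_racks: int


def classify(obs: Observation, max_failed_racks: int) -> Verdict:
    """Apply the article's boundaries; the threshold is supplied by the caller."""
    if obs.data_lost or obs.permanent_facility_damage:
        return Verdict.INTOLERABLE
    if obs.regions_with_sustained_impact > 1:
        return Verdict.INTOLERABLE  # sustained impact spread beyond a single region
    if obs.failed_racks > max_failed_racks:
        return Verdict.INTOLERABLE  # rack failures exceeded the predefined threshold
    # Transient errors and bounded routing staleness are tolerated, so not checked here.
    return Verdict.TOLERABLE
```

Keeping the threshold as a parameter reflects that the article mentions predefined thresholds without giving their values.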


Key System Design Takeaways

This article underscores the importance of resilience engineering at scale. Key lessons include designing for defense-in-depth, meticulously addressing bootstrapping sequences and dependency management in distributed systems, understanding and explicitly defining tolerable failure modes, and implementing rigorous, incremental testing (including chaos engineering principles like 'Storms') to validate disaster recovery capabilities in production.



Architecture Design

Design this yourself

Design a highly resilient data center infrastructure that can autonomously recover from an instantaneous, zero-notice region-wide power loss with minimal data loss and service disruption. Focus on architectural strategies for handling critical service dependencies, ensuring correct startup sequencing, and mitigating 'boomerang' effects during recovery. Detail the defense-in-depth mechanisms, from hardware to orchestration, required to achieve high availability for critical services like storage, AI, and data warehousing.


Other design angles

· Design a highly available distributed control plane for a container orchestrator that can self-heal and manage circular dependencies during a cold start of an entire data center region.
· Propose a comprehensive disaster recovery testing strategy, including incremental validation and chaos engineering principles, for a large-scale cloud infrastructure to ensure resilience against unexpected region-wide failures.
· Design a system to manage and persist in-memory data across multiple server racks, ensuring consistency and availability during instantaneous power loss scenarios.