This article details Airbnb's journey in building a highly scalable and fault-tolerant metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of time series data. It explores architectural decisions for multi-tenancy, operational challenges, and strategies for ensuring reliability and performance at immense scale, including single and multi-cluster deployments.
Airbnb faced significant engineering challenges when transitioning from a hosted metrics provider to an internal solution due to its massive scale: 1.3 billion active time series and 50 million samples per second. The core mandate was to persist and serve this data performantly, reliably, and cost-effectively, leading to a focus on multi-tenancy, operational aspects, and distributed architecture.
To manage the complexity of numerous services, Airbnb opted to assign one tenant per service or process, providing a stable grouping for attributing metric growth and enforcing guardrails. A critical technique employed for fault isolation and performance was shuffle sharding. This ensures that each tenant interacts with only a subset of storage and query nodes, preventing a single tenant from overwhelming the entire system. For example, a DDoS attack from one application would only impact that tenant's limited 'shuffled set' of resources, protecting other tenants.
Shuffle Sharding Benefits
Shuffle sharding provides a strong isolation boundary in shared clusters, improving fault tolerance and localizing the impact of failures or misbehaving tenants. It's a key pattern for building robust multi-tenant systems where resource contention is a concern.
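To make the pattern concrete, here is a minimal Go sketch of shuffle sharding: each tenant is hashed to a stable pseudo-random subset of nodes, so two tenants rarely share their full shard and a misbehaving tenant can only saturate its own subset. The node names, shard size, and function shape are illustrative assumptions, not Airbnb's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// shuffleShard deterministically picks k nodes for a tenant from the full
// node list. Seeding a PRNG with the tenant's hash makes the shard stable
// across calls while still looking random across tenants, which is what
// bounds the blast radius of any one tenant. (Hypothetical helper, not
// Airbnb's API.)
func shuffleShard(tenantID string, nodes []string, k int) []string {
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	// Take the first k indices of a tenant-specific permutation.
	idx := r.Perm(len(nodes))[:k]
	sort.Ints(idx)

	shard := make([]string, 0, k)
	for _, i := range idx {
		shard = append(shard, nodes[i])
	}
	return shard
}

func main() {
	nodes := []string{"store-0", "store-1", "store-2", "store-3",
		"store-4", "store-5", "store-6", "store-7"}
	// Two tenants land on mostly disjoint subsets of the same cluster.
	fmt.Println(shuffleShard("svc-payments", nodes, 3))
	fmt.Println(shuffleShard("svc-search", nodes, 3))
}
```

With 8 nodes and shards of 3, most tenant pairs overlap on at most one or two nodes, so even a node-saturating workload from one tenant leaves the majority of other tenants' shards healthy.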
Managing a multi-tenant metrics system introduced substantial operational overhead, particularly around tenant onboarding and configuration. Airbnb addressed this by building a consolidated control plane. The control plane automated new tenant onboarding by watching for service creation and applied configuration updates automatically, significantly reducing manual steps and deployment times. It also simplified limit management by exposing only necessary parameters (e.g., series limits) and deriving the rest (e.g., ingestion rate), streamlining operational workflows.
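As a sketch of the "expose only what's needed, derive the rest" idea, the hypothetical configuration below lets operators set only a series limit, while the ingestion-rate limit is computed from it under an assumed scrape interval. The struct, field names, and derivation formula are assumptions for illustration, not Airbnb's control-plane schema.

```go
package main

import "fmt"

// TenantLimits is a hypothetical per-tenant configuration. Operators set
// only MaxActiveSeries; dependent limits are derived, so the control plane
// keeps one source of truth and avoids inconsistent hand-edited values.
type TenantLimits struct {
	MaxActiveSeries int
	ScrapeInterval  int // seconds; assumed fleet-wide default
}

// DerivedIngestionRate returns the samples/second implied by the series
// limit: each active series emits one sample per scrape interval.
func (t TenantLimits) DerivedIngestionRate() float64 {
	return float64(t.MaxActiveSeries) / float64(t.ScrapeInterval)
}

func main() {
	l := TenantLimits{MaxActiveSeries: 2_000_000, ScrapeInterval: 30}
	fmt.Printf("derived ingestion limit: %.0f samples/s\n", l.DerivedIngestionRate())
}
```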
The initial focus was on stabilizing a single cluster. This involved rigorous benchmarking to determine resource usage, implementing per-replica limits for capacity planning, and setting tenant-level write and read guardrails (e.g., max series emitted, fetched series/chunks per query). Query sharding normalized read loads, and critical evaluation query paths were isolated from ad-hoc queries. Stateful components were made zone-aware and deployed across three availability zones to enhance fault tolerance.

Once a single cluster was reliable, Airbnb adopted a multi-cluster architecture to create multiple failure domains, reduce blast radius, and enable regional flexibility. This involved dedicated clusters for specialized workloads and application clusters, managed by tooling for tenant-to-cluster mapping and automated deployments using Kubernetes operators. They also leveraged Promxy for cross-cluster querying and alerting.
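To illustrate the tenant-to-cluster mapping in the multi-cluster setup, the sketch below hashes each tenant to one of several application clusters. Airbnb's tooling presumably maintains an explicit mapping store with overrides for dedicated clusters; the consistent-hash fallback and names here are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// clusterFor maps a tenant to one of several application clusters. An
// explicit override table models dedicated clusters for specialized
// workloads; everything else falls back to a stable hash. (Hypothetical
// logic, not Airbnb's control plane.)
func clusterFor(tenantID string, clusters []string, dedicated map[string]string) string {
	if c, ok := dedicated[tenantID]; ok {
		return c
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return clusters[int(h.Sum32())%len(clusters)]
}

func main() {
	clusters := []string{"metrics-a", "metrics-b", "metrics-c"}
	dedicated := map[string]string{"svc-tracing": "metrics-dedicated"}
	for _, t := range []string{"svc-payments", "svc-search", "svc-tracing"} {
		fmt.Printf("%s -> %s\n", t, clusterFor(t, clusters, dedicated))
	}
}
```

In practice an explicit mapping table makes tenant moves and dedicated-cluster overrides auditable, which matters once each cluster is meant to be an independent failure domain.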