Airbnb Engineering · April 21, 2026

Designing a Fault-Tolerant, Multi-Tenant Metrics Storage System at Scale

This article details Airbnb's journey in building a highly scalable and fault-tolerant metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of time series data. It explores architectural decisions for multi-tenancy, operational challenges, and strategies for ensuring reliability and performance at immense scale, including single and multi-cluster deployments.


Airbnb faced significant engineering challenges when transitioning from a hosted metrics provider to an internal solution due to their massive scale: 1.3 billion active time series and 50 million samples per second. The core mandate was to persist and serve this data performantly, reliably, and cost-effectively, leading to a focus on multi-tenancy, operational aspects, and distributed architecture.

Multi-Tenancy and Isolation Strategies

To manage the complexity of numerous services, Airbnb opted to assign tenants at the level of a service or process, providing a stable grouping for attributing metric growth and enforcing guardrails. A critical technique employed for fault isolation and performance was shuffle sharding: each tenant interacts with only a subset of storage and query nodes, preventing a single tenant from overwhelming the entire system. For example, a DDoS attack from one application would only impact that tenant's limited 'shuffled set' of resources, protecting other tenants.
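The idea can be sketched in a few lines. The following is a minimal illustration of shuffle sharding (not Airbnb's implementation); node names and the shard size are hypothetical:

```python
import hashlib

def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Deterministically map a tenant to a small subset of nodes.

    Every node is ranked by a hash of (tenant, node); the tenant's shard is
    the top `shard_size` nodes. Two tenants rarely share an entire shard, so
    a misbehaving tenant degrades only its own subset of the cluster.
    """
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{tenant_id}/{n}".encode()).hexdigest(),
    )
    return ranked[:shard_size]

# Illustrative cluster of eight nodes with 3-node shards per tenant.
nodes = [f"node-{i}" for i in range(8)]
shard_a = shuffle_shard("listing-service", nodes, shard_size=3)
shard_b = shuffle_shard("payments-service", nodes, shard_size=3)
# The two shards overlap at most partially, so an overload confined to
# shard_a leaves most of shard_b's nodes unaffected.
```

Because the mapping is a pure function of the tenant ID and the node list, every router replica computes the same shard without coordination.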

💡

Shuffle Sharding Benefits

Shuffle sharding provides a strong isolation boundary in shared clusters, improving fault tolerance and localizing the impact of failures or misbehaving tenants. It's a key pattern for building robust multi-tenant systems where resource contention is a concern.

Operationalizing at Scale with a Control Plane

Managing a multi-tenant metrics system introduced substantial operational overhead, particularly around tenant onboarding and configuration. Airbnb addressed this by building a consolidated control plane. This automated new tenant onboarding by monitoring service creation and allowed for automatic configuration updates, significantly reducing manual steps and deployment times. It also simplified limit management by exposing only necessary parameters (e.g., series limits) and deriving others (e.g., ingestion rate), streamlining operational workflows.
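A limit-derivation step like the one described might look as follows. This is a hypothetical sketch, not Airbnb's control plane; the scrape interval, headroom factor, and field names are assumptions for illustration:

```python
from dataclasses import dataclass

# Assumed constants: one sample per series per 30s scrape, with 1.5x
# headroom on the derived rate to absorb restarts and series churn.
SCRAPE_INTERVAL_SECONDS = 30
BURST_HEADROOM = 1.5

@dataclass
class TenantLimits:
    max_active_series: int          # the only operator-facing parameter
    max_samples_per_second: float   # derived, never set by hand

def derive_limits(max_active_series: int) -> TenantLimits:
    # Each active series emits roughly one sample per scrape interval,
    # so the ingestion-rate limit follows directly from the series limit.
    rate = max_active_series * BURST_HEADROOM / SCRAPE_INTERVAL_SECONDS
    return TenantLimits(max_active_series, rate)

limits = derive_limits(1_000_000)
# 1M series x 1.5 headroom / 30s interval -> 50,000 samples/s
```

Exposing one parameter and deriving the rest keeps tenant configuration to a single reviewable number, which is what makes automated onboarding tractable.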

Achieving High Reliability: Single and Multi-Cluster Architecture

The initial focus was on stabilizing a single cluster. This involved rigorous benchmarking to determine resource usage, implementing per-replica limits for capacity planning, and setting tenant-level write and read guardrails (e.g., max series emitted, fetched series/chunks per query). Query sharding normalized read loads, and critical evaluation query paths were isolated from ad-hoc queries. Stateful components were made zone-aware and deployed across three availability zones to enhance fault tolerance.

Once a single cluster was reliable, Airbnb adopted a multi-cluster architecture to create multiple failure domains, reduce blast radius, and enable regional flexibility. This involved dedicated clusters for specialized workloads and application clusters, managed by tooling for tenant-to-cluster mapping and automated deployments using Kubernetes operators. They also leveraged Promxy for cross-cluster querying and alerting.
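The tenant-level write and read guardrails mentioned above can be sketched as admission checks at the edge of the ingest and query paths. Class and limit names here are hypothetical, chosen only to illustrate the pattern:

```python
class GuardrailExceeded(Exception):
    """Raised when a tenant request would exceed a configured limit."""

class TenantGuardrails:
    def __init__(self, max_series: int, max_chunks_per_query: int):
        self.max_series = max_series
        self.max_chunks_per_query = max_chunks_per_query
        self.active_series: set[str] = set()

    def admit_sample(self, series_key: str) -> None:
        # Write path: a sample for a brand-new series is rejected once the
        # tenant hits its active-series limit; samples for existing series
        # keep flowing, which bounds memory without dropping established data.
        if series_key not in self.active_series:
            if len(self.active_series) >= self.max_series:
                raise GuardrailExceeded("per-tenant series limit reached")
            self.active_series.add(series_key)

    def admit_query(self, estimated_chunks: int) -> None:
        # Read path: queries that would fetch too many chunks are rejected
        # up front, protecting shared nodes from unbounded ad-hoc reads.
        if estimated_chunks > self.max_chunks_per_query:
            raise GuardrailExceeded("query would fetch too many chunks")
```

Rejecting at admission time, before any storage work happens, is what turns a per-tenant limit into effective capacity planning for the replica behind it.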

  • Cross-cluster querying costs: Federated queries were found to be 5-10x more resource-intensive, necessitating adjustments in tenant consolidation for hot read patterns.
  • Deployment consistency: Automation and standardized deployments using Kubernetes operators were crucial for managing configuration across numerous stateful clusters and preventing drift.
  • Cluster management philosophy: The goal was to treat clusters as "cattle, not pets," enabling clusters to be added or replaced with minimal operational overhead through self-tuning and automation.
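One way to make tenant-to-cluster mapping compatible with the "cattle, not pets" goal is rendezvous (highest-random-weight) hashing, where replacing a cluster only moves the tenants that hashed to it. This is a hypothetical sketch of that technique, not Airbnb's tooling; cluster names are illustrative:

```python
import hashlib

def assign_cluster(tenant_id: str, clusters: list[str]) -> str:
    # Rendezvous hashing: score every cluster against the tenant and pick
    # the highest. Removing a cluster only reassigns the tenants whose
    # winning cluster was the one removed; everyone else stays put.
    return max(
        clusters,
        key=lambda c: hashlib.sha256(f"{tenant_id}/{c}".encode()).hexdigest(),
    )

clusters = ["metrics-a", "metrics-b", "metrics-c"]
mapping = {t: assign_cluster(t, clusters) for t in ("svc-1", "svc-2", "svc-3")}
# Swapping metrics-b for metrics-d moves only tenants whose winner was
# metrics-b, plus any tenant the newcomer out-hashes; the rest are untouched.
```

Minimal movement on membership change is what keeps adding or replacing a cluster a low-overhead operation rather than a mass tenant migration.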
Tags: metrics, time series, observability, multi-tenancy, shuffle sharding, fault tolerance, scalability, Kubernetes
