This article details Airbnb's journey in building a highly scalable and fault-tolerant metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of time series data. It explores architectural decisions for multi-tenancy, operational challenges, and strategies for ensuring reliability and performance at immense scale, including single and multi-cluster deployments.
Airbnb faced significant engineering challenges when transitioning from a hosted metrics provider to an internal solution due to its massive scale: 1.3 billion active time series and 50 million samples per second. The core mandate was to persist and serve this data performantly, reliably, and cost-effectively, leading to a focus on multi-tenancy, operational aspects, and distributed architecture.
To manage the complexity of numerous services, Airbnb opted to assign one tenant per service or process, providing a stable grouping for attributing metric growth and enforcing guardrails. A critical technique employed for fault isolation and performance was shuffle sharding. This ensures that each tenant interacts with only a subset of storage and query nodes, preventing a single tenant from overwhelming the entire system. For example, a DDoS attack from one application would only impact that tenant's limited 'shuffled set' of resources, protecting other tenants.
Shuffle Sharding Benefits
Shuffle sharding provides a strong isolation boundary in shared clusters, improving fault tolerance and localizing the impact of failures or misbehaving tenants. It's a key pattern for building robust multi-tenant systems where resource contention is a concern.
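To make the pattern concrete, here is a minimal Go sketch of shuffle sharding: each tenant is hashed to a stable pseudo-random subset of nodes, so two tenants rarely share their full shard and a misbehaving tenant can only saturate its own subset. The node names, shard size, and function shape are illustrative assumptions, not Airbnb's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// shuffleShard deterministically picks k nodes for a tenant from the full
// node list. Seeding a PRNG with the tenant's hash makes the shard stable
// across calls while still looking random across tenants, which is what
// bounds the blast radius of any one tenant. (Hypothetical helper, not
// Airbnb's API.)
func shuffleShard(tenantID string, nodes []string, k int) []string {
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	// Take the first k indices of a tenant-specific permutation.
	idx := r.Perm(len(nodes))[:k]
	sort.Ints(idx)

	shard := make([]string, 0, k)
	for _, i := range idx {
		shard = append(shard, nodes[i])
	}
	return shard
}

func main() {
	nodes := []string{"store-0", "store-1", "store-2", "store-3",
		"store-4", "store-5", "store-6", "store-7"}
	// Two tenants land on mostly disjoint subsets of the same cluster.
	fmt.Println(shuffleShard("svc-payments", nodes, 3))
	fmt.Println(shuffleShard("svc-search", nodes, 3))
}
```

With 8 nodes and shards of 3, most tenant pairs overlap on at most one or two nodes, so even a node-saturating workload from one tenant leaves the majority of other tenants' shards healthy.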
Managing a multi-tenant metrics system introduced substantial operational overhead, particularly around tenant onboarding and configuration. Airbnb addressed this by building a consolidated control plane. The control plane automated new tenant onboarding by watching for service creation and applied configuration updates automatically, significantly reducing manual steps and deployment times. It also simplified limit management by exposing only necessary parameters (e.g., series limits) and deriving the rest (e.g., ingestion rate), streamlining operational workflows.
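As a sketch of the "expose only what's needed, derive the rest" idea, the hypothetical configuration below lets operators set only a series limit, while the ingestion-rate limit is computed from it under an assumed scrape interval. The struct, field names, and derivation formula are assumptions for illustration, not Airbnb's control-plane schema.

```go
package main

import "fmt"

// TenantLimits is a hypothetical per-tenant configuration. Operators set
// only MaxActiveSeries; dependent limits are derived, so the control plane
// keeps one source of truth and avoids inconsistent hand-edited values.
type TenantLimits struct {
	MaxActiveSeries int
	ScrapeInterval  int // seconds; assumed fleet-wide default
}

// DerivedIngestionRate returns the samples/second implied by the series
// limit: each active series emits one sample per scrape interval.
func (t TenantLimits) DerivedIngestionRate() float64 {
	return float64(t.MaxActiveSeries) / float64(t.ScrapeInterval)
}

func main() {
	l := TenantLimits{MaxActiveSeries: 2_000_000, ScrapeInterval: 30}
	fmt.Printf("derived ingestion limit: %.0f samples/s\n", l.DerivedIngestionRate())
}
```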
The initial focus was on stabilizing a single cluster. This involved rigorous benchmarking to determine resource usage, implementing per-replica limits for capacity planning, and setting tenant-level write and read guardrails (e.g., max series emitted, fetched series/chunks per query). Query sharding normalized read loads, and critical evaluation query paths were isolated from ad-hoc queries. Stateful components were made zone-aware and deployed across three availability zones to enhance fault tolerance.

Once a single cluster was reliable, Airbnb adopted a multi-cluster architecture to create multiple failure domains, reduce blast radius, and enable regional flexibility. This involved dedicated clusters for specialized workloads and application clusters, managed by tooling for tenant-to-cluster mapping and automated deployments using Kubernetes operators. They also leveraged Promxy for cross-cluster querying and alerting.
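To illustrate the tenant-to-cluster mapping in the multi-cluster setup, the sketch below hashes each tenant to one of several application clusters. Airbnb's tooling presumably maintains an explicit mapping store with overrides for dedicated clusters; the consistent-hash fallback and names here are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// clusterFor maps a tenant to one of several application clusters. An
// explicit override table models dedicated clusters for specialized
// workloads; everything else falls back to a stable hash. (Hypothetical
// logic, not Airbnb's control plane.)
func clusterFor(tenantID string, clusters []string, dedicated map[string]string) string {
	if c, ok := dedicated[tenantID]; ok {
		return c
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return clusters[int(h.Sum32())%len(clusters)]
}

func main() {
	clusters := []string{"metrics-a", "metrics-b", "metrics-c"}
	dedicated := map[string]string{"svc-tracing": "metrics-dedicated"}
	for _, t := range []string{"svc-payments", "svc-search", "svc-tracing"} {
		fmt.Printf("%s -> %s\n", t, clusterFor(t, clusters, dedicated))
	}
}
```

In practice an explicit mapping table makes tenant moves and dedicated-cluster overrides auditable, which matters once each cluster is meant to be an independent failure domain.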