This article from Airbnb Engineering discusses critical architectural decisions for building highly reliable monitoring systems. It focuses on eliminating circular dependencies where observability infrastructure relies on the very systems it's meant to monitor. Key strategies involve isolating compute, decoupling networking from the main service mesh, and implementing robust meta-monitoring with a dead man's switch to ensure visibility during incidents.
Reliable observability is paramount for incident response, but monitoring systems often inadvertently create circular dependencies by running on the same infrastructure they observe. This article details Airbnb's approach to breaking these dependencies, ensuring their observability stack remains functional even when core production systems are experiencing failures. The core principle is to make the observability stack more reliable than the systems it monitors by isolating failure domains.
The primary challenge identified was that Airbnb's metrics pipeline was built on the same systems it observed. This meant that during an outage, the tools designed to aid recovery would often become unavailable themselves. The solution centered on providing a redundant, highly available path for collecting metrics, deliberately isolating observability components from general production infrastructure.
Instead of running observability components on shared production Kubernetes clusters (which introduces circular dependencies) or operating entirely separate clusters (high operational overhead), Airbnb opted for dedicated Kubernetes clusters managed by their internal Cloud team. These clusters are not shared with product or infrastructure applications, thereby reducing shared failure domains while still leveraging a managed platform. This balance minimizes operational burden for the observability team while providing necessary isolation.
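The post describes fully dedicated clusters rather than any particular scheduling mechanism, but as a rough illustration of the same isolation principle, the sketch below shows a Kubernetes Deployment pinned to nodes reserved for observability workloads. All names, labels, and taints here are hypothetical, not Airbnb's.

```python
# Illustrative only: a Deployment constrained to nodes set aside for
# observability, keeping it off shared product infrastructure.
import yaml  # pip install pyyaml

observability_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "metrics-ingest", "namespace": "observability"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "metrics-ingest"}},
        "template": {
            "metadata": {"labels": {"app": "metrics-ingest"}},
            "spec": {
                # Schedule only onto nodes labeled for observability...
                "nodeSelector": {"workload-class": "observability"},
                # ...and tolerate the taint that keeps product workloads off them.
                "tolerations": [{
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "observability",
                    "effect": "NoSchedule",
                }],
                "containers": [{
                    "name": "metrics-ingest",
                    "image": "example.registry/metrics-ingest:latest",
                }],
            },
        },
    },
}

print(yaml.safe_dump(observability_deployment, sort_keys=False))
```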
Observability data generates orders of magnitude more traffic than typical business traffic. Relying on the same service mesh (Istio, in Airbnb's case) for both observability and business traffic therefore introduced problems: telemetry volume could contend with business traffic for mesh capacity, and any mesh failure would also cut off the path used to observe it.
To address this, Airbnb built a custom Layer 7 network ingress based on Envoy. This proxy runs independently of the shared compute and service mesh, providing fault tolerance, custom routing (e.g., header-based routing for tenant identification), strict prioritization for telemetry, and fine-grained access controls. This allows the observability team to own and optimize their network path for telemetry without managing the entire Kubernetes compute layer themselves.
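As a sketch of what header-based tenant routing might look like at an Envoy ingress (the post does not publish Airbnb's actual configuration, so the header key and cluster names below are hypothetical), the snippet emits a minimal Envoy route configuration:

```python
# Illustrative only: an Envoy route configuration that sends telemetry to
# different backend clusters based on a tenant-identifying header.
import yaml  # pip install pyyaml

route_config = {
    "name": "telemetry_ingress",
    "virtual_hosts": [{
        "name": "telemetry",
        "domains": ["*"],
        "routes": [
            {
                # Requests tagged for the "metrics" tenant go to the metrics backend.
                "match": {
                    "prefix": "/",
                    "headers": [{
                        "name": "x-telemetry-tenant",
                        "string_match": {"exact": "metrics"},
                    }],
                },
                "route": {"cluster": "metrics_backend"},
            },
            {
                # Everything else falls through to a default telemetry cluster.
                "match": {"prefix": "/"},
                "route": {"cluster": "telemetry_default"},
            },
        ],
    }],
}

print(yaml.safe_dump(route_config, sort_keys=False))
```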
Monitoring the Monitors
Meta-monitoring is crucial for ensuring the observability stack itself is healthy. Without it, you could experience a 'silent failure' where monitoring tools fail without anyone noticing, leading to delayed incident detection.
To monitor the monitors, Airbnb employs a separate set of Prometheus instances and Alertmanagers. These run on Kubernetes nodes isolated from the primary observability stack and in different availability zones to prevent correlated failures. To address the recursive problem of 'who monitors the meta-monitors?', they implement a Dead Man's Switch. An alerting rule continuously fires as long as Prometheus is scraping correctly. These alerts are sent to an external AWS SNS topic, monitored by a CloudWatch alarm. If the stream of alerts stops, the CloudWatch alarm triggers an on-call page, signaling a failure in the meta-monitoring layer itself.
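A common way to implement such a switch is an always-firing "watchdog" alert. The sketch below shows a Prometheus rule that evaluates to true on every cycle, plus a boto3 call that raises a CloudWatch alarm when publishes to the SNS topic stop; rule names, the topic, thresholds, and ARNs are hypothetical, not taken from the post.

```python
# Illustrative dead man's switch: a Prometheus rule that fires continuously,
# and a CloudWatch alarm that pages when the resulting SNS publishes stop.
import yaml   # pip install pyyaml
import boto3  # pip install boto3

# 1) Always-firing heartbeat alert; Alertmanager forwards it to an SNS topic.
watchdog_rule = {
    "groups": [{
        "name": "meta-monitoring",
        "rules": [{
            "alert": "MetaMonitoringHeartbeat",
            "expr": "vector(1)",  # always true while Prometheus is evaluating rules
            "labels": {"severity": "none", "route": "deadmansswitch"},
            "annotations": {
                "summary": "Heartbeat proving the meta-monitoring pipeline is alive",
            },
        }],
    }],
}
print(yaml.safe_dump(watchdog_rule, sort_keys=False))

# 2) CloudWatch alarm on the topic: if heartbeat publishes stop, page on-call.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="meta-monitoring-heartbeat-missing",
    Namespace="AWS/SNS",
    MetricName="NumberOfMessagesPublished",
    Dimensions=[{"Name": "TopicName", "Value": "observability-deadmansswitch"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # no data at all also triggers the alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pager-escalation"],
)
```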