Airbnb Engineering · March 4, 2026

Building a Modern Observability as Code Platform at Airbnb

Airbnb built an Observability as Code (OaC) platform to solve a specific problem: validating how an alert will behave before it is deployed. The platform combines local-first development, rich change reports, and bulk historical backtesting, collapsing alert development cycles from weeks to minutes. Its design emphasizes compatibility with Prometheus, guardrails around backtest execution, and a developer-centric UX to keep the system reliable at scale.

Read original on Airbnb Engineering

The Challenge of Alert Development at Scale

At Airbnb's scale, managing 300,000 alerts with an "Observability as Code" (OaC) approach became increasingly difficult. Traditional code review and unit tests could validate syntax but could not predict how alerts would behave in production. Engineers faced a trade-off. Refining alert templates to reduce noise risked missing critical signals, and without pre-deployment validation the only safe option was a costly one: deploying the old and new alerts side by side in production and observing them for weeks. This workflow gap led to slow iteration and a tolerance for noisy alerts, which eroded trust and hindered effective incident response.

Airbnb's Solution: A Rebuilt OaC Platform

Airbnb re-architected its OaC platform to provide fast feedback loops and pre-deployment validation for alerts. Key components include:

  • Local-first development: Ensuring the same code and inputs run identically on a developer's laptop, in CI, and in production.
  • Change Reports: Initially text-based diffs, evolving into a dedicated UI showing side-by-side alert modifications.
  • Bulk Backtesting: Simulating proposed alerts against historical data using Prometheus's rule manager to predict firing behavior, noisiness, and impact before deployment. This crucial feature allows engineers to understand how alerts would have behaved, answering the most important question: "How will this alert behave in production?"
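To make the backtesting idea concrete, here is a minimal sketch of replaying one alert rule over historical samples. It is a simplified stand-in, not Prometheus's rule manager: the AlertRule class, the backtest function, and the sample data are all hypothetical, and the rule is reduced to "value > threshold, held for a duration".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    """Hypothetical, simplified rule: fire when value > threshold for `for_seconds`."""
    name: str
    threshold: float
    for_seconds: int


def backtest(rule: AlertRule,
             samples: List[Tuple[int, float]]) -> List[Tuple[int, Optional[int]]]:
    """Replay (timestamp, value) samples, sorted by time, against the rule.

    Returns (fired_at, resolved_at) pairs; resolved_at is None when the
    alert is still firing at the end of the data.
    """
    firings: List[Tuple[int, Optional[int]]] = []
    pending_since: Optional[int] = None
    firing_since: Optional[int] = None
    for ts, value in samples:
        if value > rule.threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if firing_since is None and ts - pending_since >= rule.for_seconds:
                firing_since = ts  # held long enough to fire
        else:
            if firing_since is not None:
                firings.append((firing_since, ts))
            pending_since = None
            firing_since = None
    if firing_since is not None:
        firings.append((firing_since, None))
    return firings


# One sample per minute; made-up error ratios for illustration only.
values = [0.1, 0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9]
history = [(i * 60, v) for i, v in enumerate(values)]

current = AlertRule("HighErrorRate", threshold=0.5, for_seconds=0)
proposed = AlertRule("HighErrorRate", threshold=0.5, for_seconds=120)

print("current firings: ", len(backtest(current, history)))   # 3
print("proposed firings:", len(backtest(proposed, history)))  # 2
```

Comparing the two counts on the same history is the kind of signal a backtest gives before deployment: the proposed rule ignores the one-minute blip while still catching both sustained episodes.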

Architectural Principles and Design Decisions

Compatibility Over Novelty

The platform prioritizes compatibility by using Prometheus's standard rule groups and evaluation engine directly. This lets the team reuse existing tooling and expose backtest results through standard query APIs, which keeps the system portable and lets it fit into existing developer workflows.
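As an illustration of what building on standard formats can look like, the sketch below assembles a rule group using the field names from the Prometheus rule file format (name, rules, alert, expr, for, labels, annotations) and builds a range query over the ALERTS series via the /api/v1/query_range HTTP endpoint. The helper names, the base URL, the metric names, and the alert itself are hypothetical; the source does not describe Airbnb's actual rule definitions or endpoints.

```python
import json
from urllib.parse import urlencode


def make_rule_group(group_name: str, alert_name: str, expr: str,
                    for_duration: str, severity: str) -> dict:
    """Build one rule group shaped like an entry under `groups:` in a rule file."""
    return {
        "name": group_name,
        "rules": [{
            "alert": alert_name,
            "expr": expr,
            "for": for_duration,
            "labels": {"severity": severity},
            "annotations": {"summary": f"{alert_name} is firing"},
        }],
    }


def firing_history_url(base_url: str, alert_name: str,
                       start: int, end: int, step: str) -> str:
    """Build a query_range URL for the ALERTS series of one alert in firing state."""
    query = f'ALERTS{{alertname="{alert_name}", alertstate="firing"}}'
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


group = make_rule_group(
    group_name="checkout-service",  # hypothetical service name
    alert_name="HighErrorRate",
    expr='sum(rate(http_requests_total{code=~"5.."}[5m]))'
         ' / sum(rate(http_requests_total[5m])) > 0.05',
    for_duration="10m",
    severity="page",
)
print(json.dumps({"groups": [group]}, indent=2))

# "http://backtest-prometheus.example" is a placeholder host, not a real endpoint.
url = firing_history_url("http://backtest-prometheus.example",
                         "HighErrorRate", 1700000000, 1700086400, "60s")
print(url)
```

Because both the rule definition and the result query use standard shapes, any tool that already understands Prometheus can consume them without a custom adapter.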

Guardrails for Robustness

Simulating thousands of alerts over extended periods required careful resource management. Each backtest runs in its own Kubernetes pod with autoscaling. Concurrency limits, error thresholds, and multiple circuit breakers prevent cascading failures, ensuring the validation system doesn't destabilize production itself.
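The sketch below shows how a concurrency limit and an error threshold can combine into a simple circuit breaker. It is an in-process approximation under stated assumptions: threads stand in for the per-backtest Kubernetes pods, and ErrorBudgetBreaker, run_backtests, and the default limits are hypothetical names and values, not Airbnb's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


class ErrorBudgetBreaker:
    """Opens (stops new work) once the number of failures reaches max_errors."""

    def __init__(self, max_errors: int) -> None:
        self.max_errors = max_errors
        self._errors = 0
        self._lock = threading.Lock()

    def record_error(self) -> None:
        with self._lock:
            self._errors += 1

    @property
    def is_open(self) -> bool:
        with self._lock:
            return self._errors >= self.max_errors


def run_backtests(jobs: Iterable[str],
                  run_one: Callable[[str], object],
                  max_concurrency: int = 4,
                  max_errors: int = 3) -> Dict[str, object]:
    """Run each job with bounded parallelism; skip remaining jobs once the breaker opens."""
    breaker = ErrorBudgetBreaker(max_errors)

    def guarded(job: str):
        if breaker.is_open:
            return job, "skipped"  # shed load instead of piling on failures
        try:
            return job, run_one(job)
        except Exception:
            breaker.record_error()
            return job, "failed"

    results: Dict[str, object] = {}
    # max_workers is the concurrency limit: at most that many jobs run at once.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for job, outcome in pool.map(guarded, jobs):
            results[job] = outcome
    return results


def flaky_backend(job: str) -> str:
    """Stand-in for a backtest whose data source is unavailable."""
    raise RuntimeError(f"query backend unavailable for {job}")


print(run_backtests(["alert-a", "alert-b", "alert-c", "alert-d"],
                    flaky_backend, max_concurrency=1, max_errors=2))
```

With a single worker the jobs run in order, so the first two fail and the rest are skipped once the error threshold is reached. With more workers, a few extra jobs may start before the breaker opens, which is usually an acceptable trade for the simpler design.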

Iterative Development and User Experience

The team adopted an "80% solution" approach: ship immediate value and use the UI to bridge the remaining gaps, for example by prompting users to resolve recording rule dependencies separately. This focus on developer experience, which abstracts away low-level Prometheus complexity, was critical to the team's "zero touch" North Star, in which product engineers inherit best-practice monitoring automatically.

The impact was significant: a successful migration of 300,000 alerts to Prometheus, collapsed development cycles, and a cultural transformation towards proactively improving alert hygiene rather than tolerating noise.

Tags: Observability, Prometheus, Alerting, Monitoring, DevOps, Kubernetes, Observability as Code, Backtesting
