Airbnb Engineering·March 17, 2026

Migrating to an In-House Observability Platform: Lessons from Airbnb

Airbnb successfully migrated from vendor-managed observability to an in-house platform built on open-source components, notably Prometheus. Rising costs and a lack of control drove the move. This case study covers the lessons learned: choosing the right first migration target, keeping queries consistent through a metadata engine, and improving developer experience with better alerting and query tooling.

The article describes how Airbnb moved its observability infrastructure from third-party vendors to an internal platform. Two things motivated the move: escalating vendor costs for data ingestion, and the desire to regain control of the observability stack for customization, cost optimization, and better developer workflows. The new system is built on open-source technologies, notably Prometheus, which gives Airbnb end-to-end ownership of metrics collection, storage, and querying.

Challenges of Vendor Observability

Cost Escalation: Vendor pricing models often charge by data volume, leading to rapidly increasing costs as an organization scales.
Lack of Customization: Limited control over how data is consumed and inability to integrate tightly with internal workflows.
Hindered Optimization: Because the team sat outside the vendor's feedback loop, it was difficult to pursue cost optimizations or improve monitoring features.

Strategic Migration Approaches

Airbnb tried two distinct migration strategies. The initial approach (v1) tackled the most complex services first, aiming for a high-visibility win. It caused significant friction, false alarms, and misaligned dashboards. The successful strategy (v2) started with a tractable, high-leverage service that closely matched the destination system. This let the team validate the storage engine at scale, build translator tooling, invest in documentation, and gather UX feedback before a wider rollout.

Key Migration Learning

When undertaking a large-scale system migration, prioritize proving the migration's technical and operational feasibility with a simpler, yet impactful, initial target. This builds confidence and provides valuable feedback before scaling the effort.

Migrating the Intent of Query, Not Just Syntax

A crucial lesson was to migrate the *intent* of queries rather than translate them literally one-to-one. Existing systems accumulate flawed or inconsistent queries, such as averaging p95 metrics: percentiles cannot be meaningfully averaged, so the result is not the p95 of the combined data. To address this, Airbnb built a dedicated metadata engine into its translation layer. The engine periodically scans metrics and uses an internal label (`otel_metric_type`) to build a reliable mapping from each metric to its type. With that mapping, queries are standardized even when metric names are preserved (e.g., a p95 metric always gets a canonical histogram query), which yields accurate and consistent visibility.
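The type-driven translation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Airbnb's implementation: the `LegacyQuery` shape, the `METRIC_TYPES` catalog, the metric names, and the `translate` function are all hypothetical, and the catalog stands in for what a periodic scan of the `otel_metric_type` label might produce. The emitted strings use standard PromQL (`histogram_quantile`, `rate`) and the classic `_bucket`, `_sum`, and `_count` histogram series suffixes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog: metric name -> type, standing in for the result of a
# periodic scan of the `otel_metric_type` label. Names are illustrative only.
METRIC_TYPES = {
    "http_request_duration_seconds": "histogram",
    "http_requests_total": "counter",
    "queue_depth": "gauge",
}


@dataclass(frozen=True)
class LegacyQuery:
    """Simplified stand-in for a query written in the old vendor's language."""
    metric: str
    aggregation: str                    # e.g. "avg", "sum", "max"
    percentile: Optional[float] = None  # e.g. 0.95 when the query targets a p95
    window: str = "5m"


def translate(query: LegacyQuery, catalog: dict) -> str:
    """Build PromQL from the metric's type rather than the legacy syntax."""
    metric_type = catalog.get(query.metric)
    if metric_type is None:
        raise KeyError(f"no type metadata for metric {query.metric!r}")

    if metric_type == "histogram":
        if query.percentile is not None:
            # Ignore the legacy aggregation (e.g. "avg" of a p95) and emit the
            # canonical quantile over summed bucket rates.
            return (
                f"histogram_quantile({query.percentile}, sum by (le) "
                f"(rate({query.metric}_bucket[{query.window}])))"
            )
        # No percentile requested: mean observation value from _sum / _count.
        return (
            f"sum(rate({query.metric}_sum[{query.window}])) / "
            f"sum(rate({query.metric}_count[{query.window}]))"
        )
    if metric_type == "counter":
        # Counters are monotonic, so they are queried as a rate.
        return f"sum(rate({query.metric}[{query.window}]))"
    # Gauges can be aggregated directly; fall back to avg for unknown verbs.
    agg = query.aggregation if query.aggregation in {"avg", "sum", "min", "max"} else "avg"
    return f"{agg}({query.metric})"


if __name__ == "__main__":
    flawed = LegacyQuery("http_request_duration_seconds", "avg", percentile=0.95)
    print(translate(flawed, METRIC_TYPES))
```

In this sketch a legacy query that averaged a p95 series and one that took its maximum both translate to the same quantile expression, which is the standardization the metadata engine is meant to provide.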

Improving Developer Experience Post-Migration

PromQL Adoption: Adopting PromQL as the query language builds on its mature ecosystem and wide familiarity among engineers. AI tooling for semantic metadata and automated query generation augments it further.
Enhanced Alerting Framework: Replacing an outdated alert system with a new, code-driven workflow (including autocomplete, backtesting, and diffing) significantly improved productivity and made alerts a development workflow rather than a static config. This shift in alert management provided immediate, tangible value to developers, amplifying the perceived success of the migration beyond just cost savings.
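To make the idea of alerts as code concrete, here is a small Python sketch of backtesting and diffing an alert definition. Everything in it is an assumption made for illustration: the `AlertRule` fields, the `backtest` and `diff` helpers, and the sample data are hypothetical and do not describe Airbnb's framework. The PromQL expression is carried as an opaque string, and the backtest runs against a plain list of already-evaluated values.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition that lives in version control."""
    name: str
    expr: str         # PromQL expression, treated as an opaque string here
    threshold: float  # breach when the evaluated value is strictly above this
    for_points: int   # consecutive breaching samples required before firing


def backtest(rule: AlertRule, samples: list) -> list:
    """Return the sample indices at which the rule would have started firing."""
    fired_at = []
    streak = 0
    for i, value in enumerate(samples):
        streak = streak + 1 if value > rule.threshold else 0
        if streak == rule.for_points:  # fire once per breaching streak
            fired_at.append(i)
    return fired_at


def diff(old: AlertRule, new: AlertRule) -> dict:
    """Return {field: (old_value, new_value)} for every field that changed."""
    old_fields, new_fields = asdict(old), asdict(new)
    return {
        key: (old_fields[key], new_fields[key])
        for key in old_fields
        if old_fields[key] != new_fields[key]
    }


if __name__ == "__main__":
    history = [0.2, 0.6, 0.7, 0.3, 0.8, 0.9, 0.95, 0.4]  # made-up error ratios
    current = AlertRule("HighErrorRatio", "job:error_ratio:rate5m", 0.5, for_points=2)
    proposed = AlertRule("HighErrorRatio", "job:error_ratio:rate5m", 0.5, for_points=3)

    print(diff(current, proposed))      # which fields the change touches
    print(backtest(current, history))   # when the current rule would have fired
    print(backtest(proposed, history))  # when the proposed rule would have fired
```

Running both versions of a rule against the same history shows how a proposed change would alter paging behavior before it is merged, which is the kind of feedback a static config cannot give.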

Architecture Design

Design this yourself

Design an observability platform for a large-scale, high-growth company like Airbnb, supporting hundreds of millions of timeseries and thousands of services. Focus on enabling end-to-end ownership of metrics collection, storage, querying, and alerting, while minimizing costs and improving developer experience. Detail the architectural choices for data ingestion, storage (e.g., time-series database), querying, and an advanced alert management system with code-driven workflows and intent-based query translation.

Other design angles

· Design a distributed metrics collection agent that can seamlessly integrate with existing services and forward data to a Prometheus-compatible backend, ensuring high availability and low latency.
· Architect a metadata service for an observability platform that intelligently infers metric types and properties, standardizes query intent, and supports AI-driven query generation for diverse data sources.
· Design an advanced alerting and incident management system that integrates with a metrics platform, providing code-based alert definitions, backtesting capabilities, and automated incident routing for a large engineering organization.