Airbnb successfully migrated from vendor-managed observability to an in-house, open-source platform built on Prometheus, driven by rising costs and a lack of control. This case study details their hard-won lessons, emphasizing strategic migration approaches, data consistency through metadata engines, and enhancing developer experience with improved alerting and query tooling.
Read original on Airbnb Engineering
The article discusses Airbnb's transition of its observability infrastructure from third-party vendors to an internal platform. The move was motivated primarily by the escalating cost of ingesting data into vendor solutions and by the desire to regain control of the observability stack for better customization, cost optimization, and improved developer workflows. The new system is built on open-source technologies, notably Prometheus, giving Airbnb end-to-end ownership of metrics collection, storage, and querying.
Airbnb tried two distinct migration strategies. The initial approach (v1) tackled the most complex services first, aiming for a high-visibility win. This proved difficult and led to significant friction, false alarms, and misaligned dashboards. The successful strategy (v2) started instead with a tractable, high-leverage service that closely aligned with the destination system. This allowed the team to validate the storage engine at scale, build translator tooling, invest in documentation, and gather crucial UX feedback before a wider rollout.
Key Migration Learning
When undertaking a large-scale system migration, prioritize proving the migration's technical and operational feasibility with a simpler, yet impactful, initial target. This builds confidence and provides valuable feedback before scaling the effort.
A crucial lesson was to migrate the *intent* of queries rather than translate them literally one-to-one. Existing systems often accumulate flawed or inconsistent queries (e.g., averaging p95 metrics, which does not in general yield a true p95). To address this, Airbnb built a dedicated metadata engine into their translation layer. This engine periodically scans metrics, using an internal label (`otel_metric_type`) to build a reliable mapping from each metric to its type. As a result, even when metric names are preserved, queries are standardized (e.g., a p95 metric always resolves to a canonical histogram query), which leads to accurate and consistent visibility.
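The idea of a type mapping feeding a canonical query can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Airbnb's implementation: the function names (`build_type_index`, `canonical_p95`), the metric names, the `"histogram"` label value, and the 5-minute rate window are all hypothetical. Only the `otel_metric_type` label comes from the article, and the generated string uses the standard PromQL `histogram_quantile` form.

```python
from typing import Iterable, Mapping

# Hypothetical shape of one scanned series: a dict of its labels,
# with the metric name stored under Prometheus's "__name__" label.
Series = Mapping[str, str]


def build_type_index(series: Iterable[Series]) -> dict[str, str]:
    """Map each metric name to the type recorded in its otel_metric_type label."""
    index: dict[str, str] = {}
    for labels in series:
        name = labels.get("__name__")
        metric_type = labels.get("otel_metric_type")
        if name and metric_type:
            index[name] = metric_type
    return index


def canonical_p95(metric: str, index: Mapping[str, str], window: str = "5m") -> str:
    """Return one standard p95 query for a histogram metric.

    Assumes the label value for histograms is the string "histogram"
    and that bucket series use the conventional "_bucket" suffix.
    """
    metric_type = index.get(metric)
    if metric_type != "histogram":
        raise ValueError(
            f"{metric!r} has type {metric_type!r}; a p95 needs a histogram"
        )
    return (
        f"histogram_quantile(0.95, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )


# Example scan result with two made-up metrics.
scanned = [
    {"__name__": "request_latency_seconds", "otel_metric_type": "histogram"},
    {"__name__": "requests_total", "otel_metric_type": "counter"},
]
type_index = build_type_index(scanned)
p95_query = canonical_p95("request_latency_seconds", type_index)
```

Whatever the legacy query looked like, every p95 request for a histogram metric resolves to the same expression in this sketch, while a request against a non-histogram metric fails loudly instead of producing a misleading number.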