This article details Airbnb's migration of a large-scale metrics pipeline from StatsD to OpenTelemetry and a Prometheus-based backend. It covers the architectural decisions, dual-write strategy, performance benefits of OTLP, the introduction of a streaming aggregation layer using vmagent for cost control and scalability, and a novel 'zero injection' solution for sparse counter accuracy issues.
Migrating a production-grade, high-volume metrics pipeline requires careful architectural planning, especially when transitioning between different instrumentation protocols and storage backends. Airbnb faced the challenge of moving from a StatsD-based system to one leveraging the OpenTelemetry Protocol (OTLP) for instrumentation and Prometheus for storage, while ensuring scalability, correctness, and cost-efficiency.
The initial architecture used StatsD with Veneur sidecars. The migration involved adopting OTLP as the recommended protocol for internal services and Prometheus for OSS workloads, with StatsD remaining for legacy applications. A dual-write approach was critical: a shared metrics library was updated to emit metrics simultaneously in both StatsD (for the old pipeline) and OTLP (for the new OpenTelemetry Collector). This allowed for a broad, low-friction rollout and validation against real data.
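The dual-write idea can be sketched as a thin wrapper inside the shared metrics library. The class and method names below are hypothetical, and both backends are stubbed out for illustration; the point is that a single call site fans out to both pipelines unchanged:

```python
class RecordingStatsd:
    """Stand-in for a real StatsD client; records calls for the demo."""
    def __init__(self):
        self.calls = []

    def incr(self, name, value=1, tags=None):
        self.calls.append((name, value, tags))


class RecordingOtlpCounter:
    """Stand-in for an OpenTelemetry Counter instrument."""
    def __init__(self):
        self.calls = []

    def add(self, value, attributes=None):
        self.calls.append((value, attributes))


class DualWriteCounter:
    """Hypothetical dual-write wrapper: one increment reaches both pipelines."""
    def __init__(self, name, statsd_client, otlp_counter):
        self.name = name
        self.statsd = statsd_client
        self.otlp = otlp_counter

    def increment(self, value=1, tags=None):
        # Old pipeline: StatsD, via the existing Veneur sidecar.
        self.statsd.incr(self.name, value, tags=tags)
        # New pipeline: OTLP, exported through the OpenTelemetry Collector.
        self.otlp.add(value, attributes=tags or {})
```

Because application code calls only the wrapper, the rollout needed no per-service changes, and the two pipelines could be compared against identical traffic.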
Benefits of OTLP over StatsD
Adopting OTLP yielded significant improvements:
- Reduced CPU overhead: JVM profiling showed CPU time spent on metrics processing dropped from 10% to less than 1%.
- Improved reliability: OTLP runs over a more reliable transport than StatsD's fire-and-forget UDP.
- Simplified pipeline: eliminated the need for StatsD-to-OTLP translation.
- Prometheus-native features: enabled use of exponential histograms with full fidelity.
- Future-proofing: OTLP is a vendor-neutral CNCF standard, positioning the pipeline for future observability use cases.
A key challenge with OTLP involved performance regressions in high-volume services due to high-cardinality metrics, leading to memory pressure and increased GC activity. This was mitigated by configuring delta temporality for these specific services, which reduces in-process memory burden by not retaining full state between exports. The trade-off is that unexpected failures can cause visible data gaps.
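A simplified model, not the OTel SDK itself, of why delta temporality relieves memory pressure: a cumulative exporter must keep a running total for every series it has ever seen, while a delta exporter ships only the increments accumulated since the last export and then drops that state:

```python
from collections import defaultdict


class CumulativeState:
    """Retains a running total for every series ever observed, so memory
    grows with lifetime cardinality (painful for high-cardinality metrics)."""
    def __init__(self):
        self.totals = defaultdict(int)

    def add(self, labels, value):
        self.totals[labels] += value

    def export(self):
        return dict(self.totals)  # state is kept after export


class DeltaState:
    """Emits only increments since the last export, then clears them,
    bounding memory by per-interval rather than lifetime cardinality.
    Trade-off: a failed or dropped export is a visible data gap."""
    def __init__(self):
        self.pending = defaultdict(int)

    def add(self, labels, value):
        self.pending[labels] += value

    def export(self):
        out = dict(self.pending)
        self.pending.clear()
        return out
```

The trade-off in the final line of the `DeltaState` docstring is exactly the one noted above: increments that were exported and lost cannot be recovered from retained state.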
To manage costs and scale, a streaming aggregation layer was essential. Airbnb evaluated and rejected several options: continuing with a custom Veneur fork (maintenance burden), Prometheus recording rules (raw data must be stored before it can be aggregated), m3aggregator (operational complexity), and the OTel Collector and Vector (insufficient scaling and aggregation features). They chose vmagent from VictoriaMetrics.
The vmagent architecture consists of two layers: routers (stateless, shard metrics by consistent hashing of labels) and aggregators (perform aggregation and maintain in-memory state). This scalable design allowed Airbnb to ingest over 100 million samples per second, reducing costs significantly and providing a centralized point for metric transformations.
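The routing layer's key invariant is that every sample of a given series reaches the same stateful aggregator. A minimal sketch of label-based hash sharding (production consistent hashing, such as a hash ring or rendezvous hashing, additionally minimizes series movement when aggregators are added or removed; this simplification does not):

```python
import hashlib


def shard_for(labels: dict, num_aggregators: int) -> int:
    """Pick the aggregator for a series by hashing its full, sorted label
    set. Sorting makes the key independent of label insertion order, so
    identical series always hash to the same aggregator."""
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_aggregators
```

Because routers hold no state, they can be scaled horizontally at will; only the aggregators, which hold the in-memory running aggregates, need careful capacity planning.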
A critical issue emerged with sparse counters in the Prometheus pipeline: queries using `rate()` or `increase()` consistently undercounted compared to the StatsD system. Prometheus derives deltas from pairs of samples of a cumulative counter, so when a counter resets (e.g., due to a pod restart) and its first post-reset increment arrives as the very first sample of the series, the jump from the implicit zero is never observed and that increment is lost. This was particularly problematic for low-rate, high-dimensionality counters, where first-samples-after-reset make up a large share of all increments.
The Zero Injection Solution
Airbnb implemented a transparent 'zero injection' solution within their vmagent aggregation tier. When an aggregated count is flushed for the first time, a synthetic zero is flushed instead of the actual running total. This implicitly initializes all counters to zero, aligning with Prometheus's semantics and allowing `rate()` to function correctly. The delayed flush ensures timestamp consistency. This elegant fix was invisible to end-users and solved the problem at the source, rather than requiring complex client-side workarounds.
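The mechanics can be sketched as follows (the class name and flush model are illustrative, not vmagent's actual code): on the first flush of a newly seen series, emit a synthetic 0 in place of the running total; the real total follows one flush interval later, so `rate()` always has an explicit baseline to diff against.

```python
class ZeroInjectingAggregator:
    """Sketch of the zero-injection fix inside the aggregation tier: the
    first flush of a new series emits a synthetic 0 instead of the running
    total, giving Prometheus an explicit baseline; the actual total is
    delayed to the next flush, keeping timestamps consistent."""
    def __init__(self):
        self.totals = {}
        self.first_flush_pending = set()

    def add(self, series, value):
        if series not in self.totals:
            self.totals[series] = 0
            self.first_flush_pending.add(series)
        self.totals[series] += value

    def flush(self):
        out = {}
        for series, total in self.totals.items():
            if series in self.first_flush_pending:
                self.first_flush_pending.discard(series)
                out[series] = 0  # synthetic zero on first flush
            else:
                out[series] = total
        return out
```

From Prometheus's perspective the series now begins at 0, so the first real increment produces a visible delta, with no client-side changes required.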