This article details Airbnb's migration of a large-scale metrics pipeline from StatsD to OpenTelemetry and a Prometheus-based backend. It covers the architectural decisions, dual-write strategy, performance benefits of OTLP, the introduction of a streaming aggregation layer using vmagent for cost control and scalability, and a novel 'zero injection' solution for sparse counter accuracy issues.
Migrating a production-grade, high-volume metrics pipeline requires careful architectural planning, especially when transitioning between different instrumentation protocols and storage backends. Airbnb faced the challenge of moving from a StatsD-based system to one leveraging the OpenTelemetry Protocol (OTLP) for instrumentation and Prometheus for storage, while ensuring scalability, correctness, and cost-efficiency.
The initial architecture used StatsD with Veneur sidecars. The migration involved adopting OTLP as the recommended protocol for internal services and Prometheus for OSS workloads, with StatsD remaining for legacy applications. A dual-write approach was critical: a shared metrics library was updated to emit metrics simultaneously in both StatsD (for the old pipeline) and OTLP (for the new OpenTelemetry Collector). This allowed for a broad, low-friction rollout and validation against real data.
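The dual-write idea can be sketched as a thin wrapper inside the shared metrics library. The class and method names below are hypothetical, and both backends are stubbed out for illustration; the point is that a single call site fans out to both pipelines unchanged:

```python
class RecordingStatsd:
    """Stand-in for a real StatsD client; records calls for the demo."""
    def __init__(self):
        self.calls = []

    def incr(self, name, value=1, tags=None):
        self.calls.append((name, value, tags))


class RecordingOtlpCounter:
    """Stand-in for an OpenTelemetry Counter instrument."""
    def __init__(self):
        self.calls = []

    def add(self, value, attributes=None):
        self.calls.append((value, attributes))


class DualWriteCounter:
    """Hypothetical dual-write wrapper: one increment reaches both pipelines."""
    def __init__(self, name, statsd_client, otlp_counter):
        self.name = name
        self.statsd = statsd_client
        self.otlp = otlp_counter

    def increment(self, value=1, tags=None):
        # Old pipeline: StatsD, via the existing Veneur sidecar.
        self.statsd.incr(self.name, value, tags=tags)
        # New pipeline: OTLP, exported through the OpenTelemetry Collector.
        self.otlp.add(value, attributes=tags or {})
```

Because application code calls only the wrapper, the rollout needed no per-service changes, and the two pipelines could be compared against identical traffic.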
Benefits of OTLP over StatsD
Adopting OTLP yielded significant improvements:
- Reduced CPU overhead: JVM profiling showed CPU time spent on metrics processing dropped from 10% to less than 1%.
- Improved reliability: OTLP runs over a more reliable transport than StatsD's fire-and-forget UDP.
- Simplified pipeline: eliminated the need for StatsD-to-OTLP translation.
- Prometheus-native features: enabled use of exponential histograms with full fidelity.
- Future-proofing: OTLP is a vendor-neutral CNCF standard, positioning the pipeline for future observability use cases.
A key challenge with OTLP involved performance regressions in high-volume services due to high-cardinality metrics, leading to memory pressure and increased GC activity. This was mitigated by configuring delta temporality for these specific services, which reduces in-process memory burden by not retaining full state between exports. The trade-off is that unexpected failures can cause visible data gaps.
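A simplified model, not the OTel SDK itself, of why delta temporality relieves memory pressure: a cumulative exporter must keep a running total for every series it has ever seen, while a delta exporter ships only the increments accumulated since the last export and then drops that state:

```python
from collections import defaultdict


class CumulativeState:
    """Retains a running total for every series ever observed, so memory
    grows with lifetime cardinality (painful for high-cardinality metrics)."""
    def __init__(self):
        self.totals = defaultdict(int)

    def add(self, labels, value):
        self.totals[labels] += value

    def export(self):
        return dict(self.totals)  # state is kept after export


class DeltaState:
    """Emits only increments since the last export, then clears them,
    bounding memory by per-interval rather than lifetime cardinality.
    Trade-off: a failed or dropped export is a visible data gap."""
    def __init__(self):
        self.pending = defaultdict(int)

    def add(self, labels, value):
        self.pending[labels] += value

    def export(self):
        out = dict(self.pending)
        self.pending.clear()
        return out
```

The trade-off in the final line of the `DeltaState` docstring is exactly the one noted above: increments that were exported and lost cannot be recovered from retained state.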
To manage costs and scale, a streaming aggregation layer was essential. Airbnb evaluated and rejected several options: continuing with a custom Veneur fork (maintenance burden), Prometheus recording rules (raw data must be stored before it can be aggregated), m3aggregator (operational complexity), and the OTel Collector and Vector (insufficient scaling and aggregation features). They chose vmagent from VictoriaMetrics.
The vmagent architecture consists of two layers: routers (stateless, shard metrics by consistent hashing of labels) and aggregators (perform aggregation and maintain in-memory state). This scalable design allowed Airbnb to ingest over 100 million samples per second, reducing costs significantly and providing a centralized point for metric transformations.
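The routing layer's key invariant is that every sample of a given series reaches the same stateful aggregator. A minimal sketch of label-based hash sharding (production consistent hashing, such as a hash ring or rendezvous hashing, additionally minimizes series movement when aggregators are added or removed; this simplification does not):

```python
import hashlib


def shard_for(labels: dict, num_aggregators: int) -> int:
    """Pick the aggregator for a series by hashing its full, sorted label
    set. Sorting makes the key independent of label insertion order, so
    identical series always hash to the same aggregator."""
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_aggregators
```

Because routers hold no state, they can be scaled horizontally at will; only the aggregators, which hold the in-memory running aggregates, need careful capacity planning.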
A critical issue emerged with sparse counters in the Prometheus pipeline: queries using `rate()` or `increase()` consistently undercounted compared to the StatsD system. Prometheus derives deltas from pairs of samples of a cumulative counter, so when a counter resets (e.g., due to a pod restart) and its first post-reset increment arrives as the very first sample of the series, the jump from the implicit zero is never observed and that increment is lost. This was particularly problematic for low-rate, high-dimensionality counters, where first-samples-after-reset make up a large share of all increments.
The Zero Injection Solution
Airbnb implemented a transparent 'zero injection' solution within their vmagent aggregation tier. When an aggregated count is flushed for the first time, a synthetic zero is flushed instead of the actual running total. This implicitly initializes all counters to zero, aligning with Prometheus's semantics and allowing `rate()` to function correctly. The delayed flush ensures timestamp consistency. This elegant fix was invisible to end-users and solved the problem at the source, rather than requiring complex client-side workarounds.
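The mechanics can be sketched as follows (the class name and flush model are illustrative, not vmagent's actual code): on the first flush of a newly seen series, emit a synthetic 0 in place of the running total; the real total follows one flush interval later, so `rate()` always has an explicit baseline to diff against.

```python
class ZeroInjectingAggregator:
    """Sketch of the zero-injection fix inside the aggregation tier: the
    first flush of a new series emits a synthetic 0 instead of the running
    total, giving Prometheus an explicit baseline; the actual total is
    delayed to the next flush, keeping timestamps consistent."""
    def __init__(self):
        self.totals = {}
        self.first_flush_pending = set()

    def add(self, series, value):
        if series not in self.totals:
            self.totals[series] = 0
            self.first_flush_pending.add(series)
        self.totals[series] += value

    def flush(self):
        out = {}
        for series, total in self.totals.items():
            if series in self.first_flush_pending:
                self.first_flush_pending.discard(series)
                out[series] = 0  # synthetic zero on first flush
            else:
                out[series] = total
        return out
```

From Prometheus's perspective the series now begins at 0, so the first real increment produces a visible delta, with no client-side changes required.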