
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Airbnb Migrates High-Volume Metrics Pipeline to OpenTelemetry


Airbnb's observability engineering team has published details of a large-scale migration away from StatsD and a proprietary Veneur-based aggregation pipeline toward a modern, open-source metrics stack built on OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The resulting system now ingests over 100 million samples per second in production.

Rather than performing a phased feature migration, the team chose to front-load collection: get all metrics flowing into the new system first, then address dashboards, alerts, and user-facing tooling with real data already available. The primary challenge was bridging three coexisting instrumentation worlds: StatsD libraries, growing OTLP adoption, and a new Prometheus-compatible storage backend based on Grafana Mimir.

Roughly 40% of Airbnb services use a shared, platform-maintained metrics library. The observability team updated this library to dual-emit metrics: sending StatsD to the legacy pipeline while simultaneously emitting OTLP to the new OpenTelemetry Collector, enabling broad migration progress with minimal friction across the service fleet.
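The dual-emission pattern can be sketched as a thin facade in the shared library. The following Python sketch is illustrative only; the class and sink names are hypothetical, not Airbnb's actual library, and real sinks would be a UDP StatsD client and an OpenTelemetry SDK counter.

```python
# Illustrative sketch of the dual-emission pattern, assuming two injected
# sinks: a legacy StatsD emitter and a new OTLP emitter. Names are
# hypothetical, not Airbnb's actual library API.
class DualEmitCounter:
    """Counter that writes to both the legacy StatsD pipeline and OTLP."""

    def __init__(self, name, statsd_sink, otlp_sink):
        self.name = name
        self.statsd_sink = statsd_sink  # e.g. sends a line over UDP
        self.otlp_sink = otlp_sink      # e.g. wraps an OTel SDK counter

    def increment(self, value=1, **labels):
        # Legacy path: DogStatsD-style line protocol ("name:value|c|#tags").
        tags = ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))
        self.statsd_sink(f"{self.name}:{value}|c|#{tags}")
        # New path: the same measurement as an OTLP counter add.
        self.otlp_sink(self.name, value, labels)
```

Because both emissions happen inside one library call, every service picking up the library update migrates without code changes, which is what made broad fleet coverage cheap.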

The move to OTLP brought measurable gains: CPU time spent on metrics processing in JVM services dropped from 10% to under 1% of total CPU samples. Beyond performance, OTLP's use of TCP eliminated packet loss risks inherent in StatsD's UDP transport, and the protocol's native support for Prometheus exponential histograms removed the need for an intermediate translation layer inside the collector. However, the highest-cardinality services, emitting upward of 10,000 samples per second per instance, experienced memory pressure and increased garbage collection after enabling OTLP. The fix was switching those specific services to delta temporality via AggregationTemporalitySelector.deltaPreferred(), which avoids retaining full state of all metric-label combinations between exports. The trade-off accepted was that unexpected failures would now surface as visible data gaps rather than anomalous jumps.
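The memory difference between the two temporalities comes down to how long aggregation state is retained. The toy model below is not the OTel SDK (which handles this internally, via `AggregationTemporalitySelector.deltaPreferred()` in the Java SDK); it only illustrates why delta temporality relieves pressure on high-cardinality services.

```python
# Toy model of cumulative vs. delta temporality. With cumulative
# temporality the exporter must remember every metric-label combination
# ever seen; with delta it only holds state for the current interval.
class CounterAggregator:
    def __init__(self, temporality="cumulative"):
        self.temporality = temporality
        self.state = {}  # label tuple -> running total

    def record(self, labels, value):
        key = tuple(sorted(labels.items()))
        self.state[key] = self.state.get(key, 0) + value

    def export(self):
        snapshot = dict(self.state)
        if self.temporality == "delta":
            # Delta: forget everything after each export, so memory stays
            # proportional to the label combinations seen this interval.
            self.state.clear()
        # Cumulative: self.state keeps growing with every new label
        # combination, which is the source of the GC pressure.
        return snapshot
```

The visible-gap trade-off mentioned above follows directly: once state is dropped after each export, a missed export leaves a hole in the series instead of being absorbed into the next cumulative value.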

Airbnb's previous pipeline aggregated away instance-level labels (such as pod and hostname) using an internal fork of Veneur before forwarding to its metrics vendor. The new Prometheus-based stack required a comparable aggregation step to keep storage costs manageable.

The team evaluated several alternatives. Continuing with Veneur would have required a significant rewrite to support the Prometheus data model. Recording rules were ruled out because they require storing raw data in the TSDB before aggregation, defeating the cost-saving goal. The OpenTelemetry Collector lacked native metric aggregation support despite open proposals. Vector was eliminated due to the absence of built-in scaling and limited Rust adoption at Airbnb, and m3aggregator was deemed architecturally more complex than necessary.

In the end, vmagent was selected for its built-in streaming aggregation for Prometheus metrics, horizontal sharding support, approachable documentation, and a small codebase of roughly 10,000 lines that made internal customization tractable.
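vmagent's streaming aggregation is driven by a rules file passed via the `-streamAggr.config` flag. A minimal illustrative rule in the spirit of the pipeline described here (the metric and label names are hypothetical, not Airbnb's) might look like:

```yaml
# Illustrative -streamAggr.config rule: aggregate away instance-level
# labels before the data ever reaches storage.
- match: '{__name__="http_requests_total"}'
  interval: 30s
  # Drop per-instance labels so one series per remaining label set survives.
  without: [pod, hostname]
  outputs: [total]
```

The `without` list plays the same role the Veneur fork did in the old pipeline: storage only ever sees the aggregated series, which is what keeps Prometheus-side costs manageable.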

The production architecture uses two layers of vmagent: stateless router pods that shard metrics by consistently hashing all labels except the ones to be aggregated away, and stateful aggregator pods that maintain running totals in memory. Routers are configured with a static list of aggregator hostnames, leveraging Kubernetes StatefulSet stable network identities to avoid additional service discovery dependencies. The production cluster scaled to hundreds of aggregators. Several generic improvements were contributed back to the VictoriaMetrics upstream project.
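The routing rule can be sketched as follows. This is a simplified illustration of the described sharding scheme, not vmagent's actual implementation; the label set and hostname pattern are hypothetical. The key property is that labels slated for aggregation are excluded from the hash, so all samples destined to merge into one output series land on the same aggregator pod.

```python
import hashlib

# Labels the aggregation tier will strip; they must not influence routing.
AGGREGATED_AWAY = {"pod", "hostname"}

def pick_aggregator(metric_name, labels, aggregators):
    """Shard a sample by hashing every label except the aggregated-away ones."""
    key_labels = sorted(
        (k, v) for k, v in labels.items() if k not in AGGREGATED_AWAY
    )
    key = metric_name + "".join(f"{k}={v}," for k, v in key_labels)
    digest = hashlib.sha256(key.encode()).digest()
    return aggregators[int.from_bytes(digest[:8], "big") % len(aggregators)]
```

With a StatefulSet, the `aggregators` list can simply be the predictable pod hostnames (`aggregator-0`, `aggregator-1`, ...), which is how the routers avoid a separate service-discovery dependency.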

After completing the collection migration, the team observed that PromQL queries over certain counters consistently underreported compared to the legacy vendor. The root cause was an edge case in how Prometheus handles counter resets at low emission rates. In StatsD, each data point represents a delta relative to the flush window. In Prometheus, data points represent cumulative counts, and the rate() function derives deltas — but if a counter increments once and its pod restarts before it can increment again, that increment is lost before rate() can compute a meaningful delta.
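The lost-increment edge case is easiest to see in a stripped-down model of Prometheus's delta computation. The function below is a simplification (the real `increase()` also extrapolates across the query window), but it captures the failure mode: a counter that produces a single cumulative sample before its pod restarts contributes no computable delta.

```python
# Simplified model of Prometheus counter-delta computation with reset
# handling; not the real increase(), which also extrapolates.
def increase(samples):
    """samples: cumulative counter values for one series, in time order."""
    if len(samples) < 2:
        return 0.0  # a lone sample carries no computable delta
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter reset; the new value is the delta.
        total += cur - prev if cur >= prev else cur
    return total

# A pod increments once (cumulative value 1) and restarts before the next
# increment: the single sample yields nothing, so the increment vanishes.
assert increase([1]) == 0.0     # the lone increment is lost
assert increase([0, 1]) == 1.0  # with an initial zero sample it is counted
```

The second assertion is the crux: if the series had started from an observed zero, the increment would survive, which is exactly what the zero-injection fix described below arranges.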

At Airbnb, this edge case proved more impactful than anticipated. Many counters track high-dimensional, low-frequency events (requests per currency per user per region, for example) where any given label combination may increment only a few times per day. These were frequently business-critical metrics, and their systematic undercounting stalled user migration.

The team rejected pre-initializing all counters to zero (impractical at scale with unpredictable label combinations), replacing counters with gauges (contrary to Prometheus conventions), and pushing workaround PromQL onto every dashboard and alert author. The chosen solution was a transparent "zero injection" technique implemented inside the vmagent aggregation tier: the first time an aggregated counter is flushed, the aggregator emits a synthetic zero rather than the actual running total. The real accumulated value is flushed on the subsequent interval. This implicitly initializes every counter to zero, matching Prometheus semantics, while the delayed first flush ensures the synthetic zero's timestamp does not collide with any prior samples. The trade-off is a single flush-interval lag on the first recorded increment.
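The zero-injection behavior can be sketched as follows. This is a hypothetical simplification of Airbnb's vmagent patch, not the actual code; it shows the essential move of holding back the real running total for one flush interval.

```python
# Sketch of "zero injection" in a stateful counter aggregator: the first
# flush of any series emits a synthetic 0, and the real cumulative total
# only appears one interval later.
class ZeroInjectingAggregator:
    def __init__(self):
        self.totals = {}        # series key -> running cumulative total
        self.published = set()  # series whose synthetic zero has flushed

    def add(self, key, value):
        self.totals[key] = self.totals.get(key, 0) + value

    def flush(self):
        out = {}
        for key, total in self.totals.items():
            if key not in self.published:
                # First flush: emit 0 so Prometheus sees an initialized
                # counter; the real total is delayed by one interval,
                # guaranteeing the zero's timestamp precedes it.
                out[key] = 0
                self.published.add(key)
            else:
                out[key] = total
        return out
```

From a query's perspective, every series now begins with an observed zero, so `rate()` and `increase()` can compute a delta even if the source pod increments once and dies.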

The completed pipeline processes over 100 million samples per second in a single production cluster, with cost reduced by roughly an order of magnitude compared to the previous vendor-based architecture. The centralized aggregation tier also became a general-purpose transformation layer: operators can drop problematic metrics caused by bad instrumentation changes without touching application code, or temporarily dual-emit raw metrics for debugging purposes.

Flipkart and Shopify tackled versions of the same underlying problem from different angles. Flipkart's engineers faced 80 million simultaneous time series from roughly 2,000 API Gateway instances, a scale at which StatsD had already collapsed under the weight of long-range queries, and solved the aggregation problem with hierarchical Prometheus federation: local servers strip high-cardinality instance labels via recording rules before exposing summarised series upward through /federate endpoints. Shopify's motivation was primarily financial: three separate vendors for metrics, logs, and traces had become ruinously expensive as the platform scaled, so the team rebuilt from scratch on Prometheus, Loki, Tempo, and Grafana under a single in-house platform called Observe, using the same dual-write pattern Airbnb would later employ to validate parity before cutting over.

Neither post goes as deep as Airbnb's on the specific mechanics of aggregation or the correctness hazards of counter semantics. Together, though, they confirm that the StatsD exit, the vendor cost reckoning, and the move toward a Prometheus-anchored open-source stack are not idiosyncratic choices: they are a pattern playing out across hyperscale engineering organisations wherever the legacy push-based monitoring model has hit its ceiling.
