Metrics Collection at Scale: Learning from Uber's M3

In a recent InfoQ podcast, Rob Skillington, co-founder and CTO at Chronosphere, shared his experience and opinions on observability in modern distributed systems. Key topics covered: metrics collection at scale, including insight into the design and operation of Uber’s M3 and M3DB metrics tooling; multi-dimensional metrics and high cardinality; the importance of the developer experience with observability tooling; and how open standards, such as OpenMetrics, help developers and platform engineers who implement observability tooling.

Over the past ten years, the requirements for monitoring and alerting, and the approaches taken to implement them, have changed considerably. Engineers want to instrument more things and generate more insight for themselves and others in the organisation. In addition, compute is now ephemeral and dynamic, and services are more numerous.

A core challenge with metric data is that a collected value carries little context on its own. Multi-dimensional metrics address this. The dimensions of a metric are name-value pairs that carry additional data describing the metric value. High dimensionality can lead to high cardinality, because each unique combination of dimension values is stored as a distinct time series.
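To make the link between dimensionality and cardinality concrete, here is a minimal Python sketch. The metric name, dimension names, and values are all hypothetical and are not taken from the podcast or from M3; the point is only that the number of possible time series is bounded by the product of the number of values in each dimension.

```python
from itertools import product
from math import prod

# Hypothetical dimension values for a single metric, e.g. "http_requests".
dimensions = {
    "service": ["rides", "payments", "maps"],
    "region": ["us-east", "eu-west"],
    "status": ["200", "404", "500"],
}


def series_upper_bound(dims):
    """Upper bound on the number of distinct time series for one metric name."""
    return prod(len(values) for values in dims.values())


def enumerate_series(name, dims):
    """Yield every possible dimension combination as a series identifier."""
    keys = sorted(dims)
    for combo in product(*(dims[k] for k in keys)):
        labels = ",".join(f'{k}="{v}"' for k, v in zip(keys, combo))
        yield f"{name}{{{labels}}}"


# Three small dimensions: 3 * 2 * 3 combinations.
baseline = series_upper_bound(dimensions)

# Adding one per-container dimension with 1,000 values multiplies the bound by 1,000.
with_container = dict(dimensions, container=[f"c{i}" for i in range(1000)])
expanded = series_upper_bound(with_container)
```

In practice not every combination occurs, so the real series count is usually below this bound, but a single dimension with many values (a container or request identifier, for example) still multiplies everything else.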

Another important challenge is implementing a scalable monitoring solution that is fit for purpose. Skillington reminisced about his early days working at Uber, where the team responsible for metrics collection initially used tooling such as Nagios and Graphite.

"When I was at Uber, we ran Nagios at the beginning [...] It worked quite effectively in the early days and especially when we were on physical hosts, a lot of things mapped very natively to the world's Nagios. As we scaled up and started to use compute frameworks, such as Mesos, and the rest of the world mainly using Kubernetes, none of the concepts really mapped to it."

This led to the creation of Uber’s M3 metrics collection system. The system initially used open source components such as Cassandra and Elasticsearch for storage and indexing. As usage of M3 grew, these open source components were gradually replaced by custom ones, such as M3DB and other utilities.

"Around the second Halloween at Uber, the amount of Graphite queries coming in just completely overwhelmed the Python Stack we had, and just a huge amount of Python servers comparative to the storage servers we were running, so at that point we rewrote the Graphite query language into Go."

As the M3 tooling evolved, Skillington and his fellow Chronosphere co-founder, Martin Mao, also learned the value of creating an effective user experience for engineers working with this observability tooling.

"We had a thousand unique visitors to our internal Grafana every day, which was backed by M3. So more than half of the engineering team was using this tool daily. So it had to be fast, had to be scalable, had to be easy to use, but we didn't focus as much on the actual user interfaces that the engineers were using."

Building an effective user experience for operational tooling is vitally important. Engineers rely on these tools both for alerting and for locating and understanding what is occurring during production issues. This is where the Chronosphere team seeks to add value with their commercially supported products.

Skillington concluded the podcast by arguing that open standards are vitally important for interoperability and for driving collaboration and innovation appropriately. The OpenMetrics project is an effort to create an open standard for transmitting metrics at scale, with support for both text representation and protocol buffers.
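As an illustration of what the text representation looks like, the following Python sketch renders a single counter in an OpenMetrics-style text exposition. It is a simplified approximation rather than a complete or authoritative implementation of the standard: the function names and the sample metric are hypothetical, and optional features such as timestamps, exemplars, and units are left out.

```python
def escape_text(value):
    """Escape backslash, double quote, and newline for the text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family as OpenMetrics-style text.

    samples is a list of (labels_dict, value) pairs. Counter samples carry
    the "_total" suffix, and the exposition ends with a "# EOF" line.
    """
    lines = [
        f"# TYPE {name} counter",
        f"# HELP {name} {escape_text(help_text)}",
    ]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_text(val)}"' for key, val in sorted(labels.items())
        )
        # Omit the braces entirely when a sample has no labels.
        label_block = f"{{{pairs}}}" if pairs else ""
        lines.append(f"{name}_total{label_block} {value}")
    lines.append("# EOF")
    return "\n".join(lines) + "\n"


# Hypothetical metric used only for illustration.
exposition = render_counter(
    "http_requests",
    "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 1027)],
)
```

Because the format is line-oriented plain text, any system that can serve or scrape an HTTP endpoint can produce or consume it, which is part of what makes an open standard here useful for interoperability.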

The podcast audio and full transcript can be found in the article, “Rob Skillington on Metrics Collection, Uber’s M3, and OpenMetrics”.
