Scaling Observability at Uber: Building In-House Solutions, uMonitor and Neris

Uber's infrastructure consists of thousands of microservices supporting mobile applications, infrastructure, and internal services. To provide high observability of these services, Uber's Observability team built two in-house monitoring solutions: uMonitor for time-series metrics-based alerting, and Neris for host-level checks and metrics. Both systems use a common pipeline for alerting and deduplication.

According to Shreyas Srivatsan, senior software engineer on the Observability team, Uber quickly outgrew its initial monitoring and alerting platform as the business scaled. Originally the team used Nagios to run Graphite threshold checks against Carbon metrics using source-controlled scripts. However, Nagios required code to be written and deployed for each metric check, which did not scale as teams and products grew. Around the same time, the team began having scalability issues with its Carbon cluster. This led the Observability team to create its own metrics database, M3.

To process the metrics in M3, the team built uMonitor, a time-series metrics-based alerting system. At the time of posting, uMonitor has 125,000 alert configurations that check 700 million data points over 1.4 million time series each second. An alert configuration in uMonitor consists of a query (either Graphite or M3QL) and a set of thresholds. The query returns one or more time series from M3, and the thresholds are applied to each series. An alert is triggered if any returned series violates one of the thresholds.
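As a rough illustration of the query-plus-threshold model described above, the following Python sketch applies one static upper bound to every series a query returns. The `AlertConfig` class, the `violating_series` function, the series names, and the query string are all hypothetical; this is not uMonitor's actual schema or API, and the query is never executed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertConfig:
    """Hypothetical alert definition: a query plus one static threshold."""
    name: str
    query: str         # query text, kept only as a label in this sketch
    max_value: float   # upper bound applied to every returned series


def violating_series(config: AlertConfig, series: Dict[str, List[float]]) -> List[str]:
    """Return the names of series whose latest data point exceeds the bound."""
    return sorted(
        name
        for name, points in series.items()
        if points and points[-1] > config.max_value
    )


# Stand-in for a query result: series name -> recent data points.
result = {
    "api.errors.dc1": [2.0, 3.0, 12.0],
    "api.errors.dc2": [1.0, 1.0, 2.0],
}
config = AlertConfig(name="api-errors", query="api.errors.*", max_value=10.0)

# The alert fires if at least one series violates the threshold.
print(violating_series(config, result))
```

Here only the first series crosses the bound, so the alert would fire on that series alone.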

uMonitor consists of three separate components: a storage service with an alert management API, a scheduler, and workers. The storage service wraps a Cassandra database that stores the alert definitions and the state machine for the alerts. The scheduler keeps track of the alerts and dispatches alert checks approximately every minute. The workers execute the alert checks against the underlying metrics while maintaining their state in Cassandra to ensure they are not over-alerting.
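One way to picture how persisted state keeps a worker from over-alerting is a small state machine that notifies only on transitions. The sketch below is an assumption for illustration, not Uber's implementation: an in-memory dictionary stands in for the Cassandra-backed state, and the state names and the `run_check` function are hypothetical.

```python
from typing import Dict, Optional

OK, FIRING = "ok", "firing"


def run_check(state: Dict[str, str], alert_id: str, violated: bool) -> Optional[str]:
    """Record the new state and return an action only when the state changes."""
    previous = state.get(alert_id, OK)
    current = FIRING if violated else OK
    state[alert_id] = current  # stand-in for a write to the state store
    if previous == current:
        return None  # no transition, so no repeat notification
    return "trigger" if current == FIRING else "resolve"


# Simulate four consecutive scheduler ticks for one alert.
state: Dict[str, str] = {}
actions = [run_check(state, "api-errors", v) for v in (True, True, False, False)]
print(actions)
```

With this approach, an alert that stays in violation across consecutive checks produces a single notification rather than one per check.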

uMonitor architecture (credit: Uber)


Alerts on standard metrics such as endpoint errors or CPU/memory consumption are generated automatically by uMonitor. Other alerts can be created manually as determined by each team. Currently uMonitor supports two types of thresholds for its alerts: static and anomaly. Static thresholds are useful for steady-state metrics, such as queries that return consistent values. Anomaly thresholds are supported through Argos, Uber's anomaly detection platform, which generates dynamic thresholds based on historical data.
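The difference between the two threshold types can be sketched as follows. The dynamic bounds here use a simple mean plus or minus k standard deviations over recent history, which is a stand-in technique chosen for illustration; the article does not describe how Argos computes its thresholds, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def static_violation(value: float, upper: float) -> bool:
    """Static threshold: compare against a fixed upper bound."""
    return value > upper


def dynamic_bounds(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive a band from historical data: mean +/- k population std deviations."""
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def anomaly_violation(value: float, history: List[float], k: float = 3.0) -> bool:
    """Anomaly threshold: flag values that fall outside the historical band."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high


history = [100.0, 102.0, 98.0, 101.0, 99.0]

# A fixed bound of 150 misses a jump to 120; the history-based band catches it.
print(static_violation(120.0, upper=150.0))
print(anomaly_violation(120.0, history))
```

The example shows why a fixed bound can be too loose for a metric whose normal range is narrow.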

Uber maintains a second system, Neris, for tracking host metrics that are not available in their M3 metrics system. According to Srivatsan, the team found it inefficient to store their "1.5 million host metrics generated per minute across 40,000 hosts per data center" in a centralized database and instead opted to query the hosts directly. Neris has one agent running on each host to execute alert checks on that host. At startup, the agent pulls its configuration from Uber's central config store, known as Object Config. The configuration dictates the role of the host, which in turn determines which checks are set up. The agent sends the results of those checks to an aggregation tier, which then forwards the data to Origami. Origami is responsible for deduplicating alerts and deciding, based on a series of rules, which alerts should be sent out.
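The role-driven setup described above might look roughly like the sketch below: the configuration names a role, the role selects a set of checks, and the agent runs them locally. The role names, check names, limits, and metric keys are invented for illustration and are not taken from Neris or Object Config.

```python
from typing import Callable, Dict, List

HostMetrics = Dict[str, float]
Check = Callable[[HostMetrics], bool]

# Hypothetical checks; each returns True when the host is healthy.
CHECKS: Dict[str, Check] = {
    "disk_usage": lambda m: m["disk_used_pct"] < 90.0,
    "load": lambda m: m["load_avg"] < 8.0,
    "replication_lag": lambda m: m["replication_lag_s"] < 30.0,
}

# Hypothetical mapping from host role to the checks that role needs.
ROLE_CHECKS: Dict[str, List[str]] = {
    "web": ["disk_usage", "load"],
    "database": ["disk_usage", "load", "replication_lag"],
}


def run_host_checks(config: Dict[str, str], metrics: HostMetrics) -> Dict[str, bool]:
    """Run every check assigned to this host's role against local metrics."""
    return {name: CHECKS[name](metrics) for name in ROLE_CHECKS[config["role"]]}


# Stand-ins for the pulled configuration and locally sampled metrics.
config = {"role": "database"}
metrics = {"disk_used_pct": 95.0, "load_avg": 1.2, "replication_lag_s": 4.0}

results = run_host_checks(config, metrics)
print(results)  # in the real system, results would go to an aggregation tier
```

Because the checks run on the host itself, only the pass/fail results need to leave the machine, not the raw metrics.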

Srivatsan states that "high cardinality has always been the biggest challenge for our alerting platform." As Aaron Sun writes, "cardinality in the context of monitoring systems is defined as the number of unique metric time series stored in your system's time series database." Originally, Uber handled high cardinality by having alert queries return multiple series and using rules that trigger only if enough series crossed a threshold. This worked well with queries that returned a bounded number of series with well-defined dependencies. However, once teams started writing queries that alert on a per-city, per-product, and per-app-version basis to support new product lines, the queries no longer fit this constraint.
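A rule that fires only when enough series cross a threshold could be sketched as below. The fraction-based policy, the `should_alert` function, and the sample values are assumptions for illustration; the article does not specify how uMonitor expresses such rules.

```python
from typing import Dict


def should_alert(latest: Dict[str, float], threshold: float, min_fraction: float) -> bool:
    """Fire only if at least min_fraction of the series exceed the threshold."""
    if not latest:
        return False
    crossing = sum(1 for value in latest.values() if value > threshold)
    return crossing / len(latest) >= min_fraction


# Latest value per series returned by one alert query (hypothetical data).
latest = {"host1": 0.9, "host2": 0.2, "host3": 0.95, "host4": 0.1}

print(should_alert(latest, threshold=0.8, min_fraction=0.5))   # 2 of 4 cross
print(should_alert(latest, threshold=0.8, min_fraction=0.75))  # not enough
```

A rule like this depends on knowing roughly how many series a query returns, which is why it fits bounded queries better than open-ended per-city or per-version breakdowns.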

The team began using Origami to help with these more complicated queries. As noted above, Origami is capable of deduplication and rollup of alerts. It can also create alerts on combinations of city, product, and app version, which are then triggered on aggregate policies. These alert definitions are stored in each team's Git repositories and then synced to Object Config.
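Deduplication and rollup over dimension combinations can be pictured with the sketch below, which drops exact duplicate alerts, groups the rest by a chosen set of dimensions, and reports only groups that meet a minimum count. The `rollup` function, the grouping policy, and the sample alerts are hypothetical and do not reflect Origami's actual rule format.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rollup(
    alerts: List[Dict[str, str]],
    group_by: Tuple[str, ...],
    min_firing: int,
) -> Dict[Tuple[str, ...], int]:
    """Deduplicate alerts, group them, and keep groups with enough distinct alerts."""
    groups: Dict[Tuple[str, ...], Set[Tuple[Tuple[str, str], ...]]] = defaultdict(set)
    for alert in alerts:
        key = tuple(alert[dim] for dim in group_by)
        # Adding a canonical tuple to a set drops exact duplicates.
        groups[key].add(tuple(sorted(alert.items())))
    return {
        key: len(members)
        for key, members in groups.items()
        if len(members) >= min_firing
    }


alerts = [
    {"city": "sf", "product": "rides", "app_version": "1.0"},
    {"city": "sf", "product": "rides", "app_version": "1.0"},  # duplicate
    {"city": "sf", "product": "rides", "app_version": "1.1"},
    {"city": "nyc", "product": "eats", "app_version": "2.0"},
]

# One notification per (city, product) with at least two distinct firing alerts.
print(rollup(alerts, group_by=("city", "product"), min_firing=2))
```

In this example the four raw alerts collapse into a single rolled-up notification for one city and product pair.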

As the platform has evolved, alerting has progressed from simply notifying the on-call engineer to automatically triggering rollbacks and other mitigation activities. uMonitor provides full support for rollbacks, and it can also POST to a route to trigger more complex mitigation strategies. Further roadmap improvements include more efficient metric collection, streamlining the alert execution infrastructure, and creating UIs to facilitate correlating data across multiple sources.
