Atlas: Netflix's Primary Telemetry Platform

Netflix has open sourced Atlas, part of their next-generation monitoring platform they have been working on since early 2012. The company developed Atlas to store time series data in order to provide near real-time operational insight to teams. It features in-memory data storage allowing it to gather, store, and report very large numbers of metrics at scale. Roy Rapoport speaks to how large scale it is, saying, "We routinely see Atlas fetch and graph many billions of datapoints per second."

Atlas replaced their previous solution, a combination of an in-house tool called Epic and a commercial product. They built it to cope with ever-increasing volumes of data the company generates. In 2011, Netflix was monitoring two million metrics related to their streaming service. By 2014, this number had jumped to 1.2 billion metrics.

The team focused on making the correct trade-offs for their goal when designing Atlas - which was giving operational insight across Netflix. They define operational insight as allowing people to determine what's going on right now, as well as historically. This culminated in the following rules about operational data:

Data becomes exponentially less important as it gets older
Restoring service is more important than preventing data loss
Try to degrade gracefully

To keep things fast, Netflix chose to store most data in memory. The team evaluated a variety of backends, but they ended up choosing to keep most data available for querying in memory, either in or off the JVM heap. This result is a high-performance, but operationally-expensive, system. However, they have done a few things to mitigate this. First, it controlled the number of data replicas of data. The company is mostly using replicas for redundancy only, rather than increasing query performance. Older data, which is stored persistently, is run using only a single replica, cutting costs by 50% or more. The second strategy involves automatic rollup; this drops excess data on older, less important metrics. Anything older than six hours is subject to rollup, dropping things like node-dimensions to save storage space.

To give an example of why automatic rollup is important, as well how dimensional metrics can explode in size, consider the following example from Roy:

Consider a simple graph showing the number of requests per second hitting a service for the last three hours. Assuming minute resolution that is 180 datapoints for the final output. On a typical service we would get one time series per node showing the number of requests so if we have 100 nodes the intermediate result set is around 18k datapoints. For one service users went hog-wild with dimensions breaking down requests by device (~1000s) and country (~50) leading to about 50k time series per node. If we still assume 100 nodes that is about 900M datapoints for the same three hour line.

Atlas is only the first part of an wider number of products which Netflix intends on open sourcing. They've also built:

User interfaces
- Main UI for browsing data and constructing queries.
- Dashboards
- Alerts
Platform
- Inline aggregation of reported data before storage layer
- Storage options using off-heap memory and lucene
- Percentile backend
- Publish and persistence applications
- EMR processing for computing rollups and analysis
- Poller for SNMP, healthchecks, etc
Client
- Supports integrating Servo with Atlas
- Local rollups and alerting
Real-Time Analytics
- Metrics volume report
- Automated Canary Analysis
- Outlier and anomaly detection
- Automated server culling based on outlier characteristics

For a more detailed overview, please see their wiki here.

InfoQ Software Architects' Newsletter

Follow us on

Rate this Article

This content is in the Cloud Computing topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter