The Robinhood server operations team published a series of articles detailing their metrics collection, monitoring and alerting infrastructure. OpenTSDB, Grafana, Kafka and Riemann form the core of the stack, with Kafka acting as a proxy layer from which the data is pushed into Riemann for stream processing of the metrics and into OpenTSDB for storage.
The tech stack at Robinhood is mostly Python with some Golang. The team depends heavily on metrics to debug and monitor the production servers. Metrics collection is built around OpenTSDB, a time-series database whose companion agent, tcollector, ships with standard collectors for various software stacks. OpenTSDB also supports custom collectors, which gather metrics and push them in through its telnet or HTTP endpoints. In Robinhood's case, the metrics data is first sent to a Kafka proxy.
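As a concrete illustration of that push path, here is a minimal sketch of a custom collector sending one data point through OpenTSDB's telnet-style put endpoint (default port 4242). The host name, metric name and tags are illustrative assumptions, and in Robinhood's setup collectors write to Kafka rather than talking to OpenTSDB directly:

```python
import socket
import time

def put_metric(name, value, tags, host="opentsdb.internal", port=4242):
    """Send one data point using OpenTSDB's line protocol:
    put <metric> <unix-timestamp> <value> <tagk=tagv> ..."""
    tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
    line = f"put {name} {int(time.time())} {value} {tag_str}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical metric name and tags, for illustration only.
put_metric("app.request.latency", 42.5, {"host": "web01", "role": "webserver"})
```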
On each server, standard and custom tcollectors send data to Kafka. Applications are instrumented with statsd libraries, which send application metrics to a local statsd process on each box; the statsd server implementation used is statsite, written in C. A custom adapter converts the statsd metrics into tcollector metrics, which then travel to OpenTSDB via Kafka: the tcollector process writes metrics to stdout, and a Python script pushes them into Kafka.
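The article does not include Robinhood's script, but a minimal sketch of that last hop could look like the following: read tcollector lines from stdin (e.g. invoked as a pipe from the tcollector process) and produce each one to a Kafka topic. The topic name and broker address are assumptions:

```python
import sys
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(bootstrap_servers="kafka.internal:9092")

# Each incoming line is in OpenTSDB's line format:
# <metric> <unix-timestamp> <value> <tagk=tagv> ...
for line in sys.stdin:
    line = line.strip()
    if line:
        producer.send("metrics", line.encode("utf-8"))
producer.flush()
```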
Since OpenTSDB forms the backbone of the metrics collection system, it needs to be highly available. InfoQ got in touch with Aravind Gottipati, operations engineer at Robinhood, to find out more:
Robinhood runs multiple independent OpenTSDB instances, each of which independently consumes the same metric stream from Kafka. This allows us to load balance requests (since they are all the same) to any of these OpenTSDB instances. This also gives us easy HA. We don't run a full HBase cluster - instead each of these instances runs a local single node HBase server (+ master).
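One straightforward way to realize this pattern with Kafka is to give every OpenTSDB instance its own consumer group, so each group independently receives the complete topic. The sketch below assumes kafka-python and illustrative names:

```python
from kafka import KafkaConsumer

INSTANCE_ID = "opentsdb-node-1"  # unique per OpenTSDB box (assumption)

def handle(payload: bytes) -> None:
    # Placeholder: a real consumer would feed the line into the local
    # OpenTSDB/HBase node (see the bridge sketch further down).
    print(payload.decode("utf-8", errors="replace"))

consumer = KafkaConsumer(
    "metrics",
    bootstrap_servers="kafka.internal:9092",
    group_id=f"opentsdb-{INSTANCE_ID}",  # distinct group => full copy of the stream
)
for record in consumer:
    handle(record.value)
```

Because every instance ends up with a full copy of the data, any of them can serve any read query, which is what makes the simple load balancing and failover possible.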
Kafka is used as an intermediary so that multiple consumers can process the data in different ways, one of which is to transform metrics and push them into OpenTSDB. It also allows growing data volumes to be handled by sharding the stream across multiple OpenTSDB servers if needed, and consumers can be paused and resumed for maintenance. The bridge between Kafka and OpenTSDB is a console-based consumer that prints to stdout, whose output is piped to an OpenTSDB telnet listener using netcat.
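The described bridge is essentially a shell pipeline (roughly: kafka-console-consumer ... | nc opentsdb-host 4242). A functionally equivalent Python sketch, with the host, port and topic as assumptions:

```python
import socket
from kafka import KafkaConsumer

consumer = KafkaConsumer("metrics", bootstrap_servers="kafka.internal:9092")
sock = socket.create_connection(("opentsdb.internal", 4242))

for record in consumer:
    line = record.value.decode("utf-8").strip()
    # OpenTSDB's telnet listener expects: put <metric> <ts> <value> <tags>
    if not line.startswith("put "):
        line = "put " + line
    sock.sendall((line + "\n").encode("utf-8"))
```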
Grafana, a visualization tool that supports Graphite, InfluxDB and OpenTSDB backends, is used for viewing metrics. CloudWatch metrics can also be plugged into the same dashboards.
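For readers wiring up something similar, OpenTSDB can be registered as a Grafana data source through Grafana's HTTP API. The sketch below is a hedged example, with the Grafana URL, API token and datasource details all assumptions:

```python
import requests

GRAFANA_URL = "http://grafana.internal:3000"  # assumption
API_TOKEN = "REPLACE_ME"  # a Grafana API key

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "name": "opentsdb",
        "type": "opentsdb",
        "url": "http://opentsdb.internal:4242",
        "access": "proxy",
    },
)
resp.raise_for_status()
```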
Riemann forms a key part of the monitoring and alerting workflow at Robinhood. Traditional alerting systems like Sensu - which is also used at Robinhood - depend on point-in-time views of metrics. They are not amenable to historical views for various reasons, including the difficulty of writing such queries and the high latencies involved in running them. Some metrics systems may not keep historical records at all, so interpolation has to suffice for missing data points. OpenTSDB does well on some of these counts, so why Riemann? "OpenTSDB depends on HBase and in general excels at pulling an entire range of the keyspace out from HBase. Pulling out a single metric point at an exact timestamp is not its forte. Using it for alerting would mean exactly that - pulling out and examining metric points at some exact timestamps of your choosing. While the queries are usually pretty expedient, it still has to scan through an entire key range just to extract single data points," says Gottipati.
Stream processing involves defining rules and filters and then streaming data through them; an alert is triggered when a rule or filter matches. Riemann can aggregate metric streams from various sources and run them through its stream-processing language. The entire metrics collection system ties into Riemann via a Kafka consumer that pushes data in, with netcat again used for the final hop. The metric naming convention is influenced by OpenTSDB: each event carries its type, host and role as key-value tags, and the roles (e.g. webserver, db) tagged into each event by the originating tcollector are converted into Riemann tags, which makes it easy to use Riemann's built-in filter functions. An in-house wrapper DSL over Riemann's primitives makes it easier for developers to work with, as illustrated in the sketch below.
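As an illustration of that path (the article describes a netcat-based pipe; this sketch instead uses the bernhard Python Riemann client, with all names and addresses as assumptions), a consumer might parse a tcollector-style line and promote its role tag to a Riemann tag like so:

```python
import bernhard  # Python Riemann client

riemann = bernhard.Client(host="riemann.internal", port=5555)

def forward(line: str) -> None:
    # tcollector line format: <metric> <timestamp> <value> <tagk=tagv> ...
    parts = line.split()
    name, ts, value = parts[0], int(parts[1]), float(parts[2])
    tags = dict(kv.split("=", 1) for kv in parts[3:])
    riemann.send({
        "host": tags.get("host", "unknown"),
        "service": name,
        "metric": value,
        "time": ts,
        # Promote the role tag so Riemann's filter functions can match on it.
        "tags": [tags["role"]] if "role" in tags else [],
    })

forward("app.request.latency 1462212345 42.5 host=web01 role=webserver")
```

Such systems are key to DevOps collaboration, so what were the key milestones in instituting a DevOps culture at Robinhood? Gottipati responded: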
We built some sample dashboards that helped highlight the different system metrics we were collecting, as well as application metrics dashboards (from statsd metrics). We started using and sharing these dashboards in response to a variety of troubleshooting requests and thus got the senior folks to start using them. After a little while, we helped the same folks to put together more application specific dashboards and this got the process going. The backend/application teams build and maintain a variety of dashboards - some of which operations folks don't even know about. They train the incoming engineers on looking up and using these dashboards.
For viewing events in Riemann, an Elasticsearch (ELS) instance is used instead of the default Riemann dashboard. Around 50% of the events from Kafka are pushed into ELS, peaking at around 20,000 events per second.
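As a rough sketch of the indexing side (hedged: the host, index name and document shape are all illustrative assumptions, not Robinhood's actual pipeline), an event can be written to Elasticsearch over its REST API:

```python
import json
import requests

# A hypothetical Riemann-style event document.
event = {
    "host": "web01",
    "service": "app.request.latency",
    "metric": 42.5,
    "tags": ["webserver"],
    "@timestamp": "2016-05-02T18:05:45Z",
}
resp = requests.post(
    "http://elasticsearch.internal:9200/riemann-events/_doc",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```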