Inside Stack Overflow’s Monitoring Systems

Nick Craver, architecture lead at Stack Exchange, wrote about their monitoring systems in a recent article. He discussed the philosophy and motivation behind their monitoring strategy and described their toolset, mainly Bosun, Grafana and Opserver.

Stack Overflow and its sister sites on Stack Exchange run on .NET, MS SQL Server, and IIS web servers, with HAProxy as the load balancer and additional services provided by Redis and Elasticsearch. Their primary datacenter is in New York, with a failover site in Oregon. Monitoring at Stack Exchange, Craver notes, generally consists of "logs, metrics, health checks and profiling", and the primary tools are Bosun, Opserver, Grafana and MiniProfiler.

The sources of data for Stack Exchange's monitoring systems are logs, health checks, and time series metrics. Logging happens through both standard mechanisms and custom libraries that push to a database; it also includes Logstash and summarized log events of HTTP requests from the HAProxy load balancers. Health checks test what the end user actually sees, such as the home page. Metrics are collected and stored in Bosun, their custom-built, open source monitoring tool, with OpenTSDB as the backend. Bosun also sends alerts, and PagerDuty handles escalation management. Opserver, which shows a dashboard view of the entire monitoring system, completes the picture.

Image courtesy https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/

All Stack Exchange apps use an error logging library called StackExchange.Exceptional, which sends the logs to MS SQL Server. The library is a fork of ELMAH, a .NET logging library. Redis, Elasticsearch and SQL Server log to their standard logging locations, although it is not clear whether these logs are then sent to a central server for aggregation and search. Logs from network equipment are sent to Logstash and are viewable through a Kibana dashboard. Page load times can be analyzed in detail with MiniProfiler, which displays method call timings across the various tiers.

Bosun is a monitoring tool built at Stack Exchange and later open sourced. Its key features are the ability to test alerts against historical data, a query language for time series evaluation, templatized alerts, and alerting on and forecasting of time series trends. In contrast to traditional monitoring tools such as Nagios and Zabbix, and similar to modern ones such as Prometheus, Bosun does not require individual alerts to be set up for each server. A single threshold check suffices for the time series that measures, say, CPU usage across all servers. The alert carries the list of time series that violated the threshold, which identifies the problematic servers.
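The idea of one rule covering every server can be illustrated with a small Python sketch. This is not Bosun's expression language or its implementation; the function name `evaluate_threshold`, the host names, and the sample values are all hypothetical, and the rule simply averages the recent samples for each host.

```python
from statistics import mean


def evaluate_threshold(series_by_host, threshold):
    """Return {host: average} for every host whose average exceeds the threshold.

    series_by_host maps a host name to its recent samples for ONE metric,
    so a single rule covers all servers that report that metric.
    """
    violations = {}
    for host, samples in series_by_host.items():
        if not samples:
            continue  # no data for this host; skip rather than alert
        average = mean(samples)
        if average > threshold:
            violations[host] = average
    # Sort by host name so the alert body is stable between runs.
    return dict(sorted(violations.items()))


# Hypothetical CPU usage samples (percent), keyed by a hypothetical host tag.
cpu_usage = {
    "web01": [35.0, 40.0, 45.0],
    "web02": [92.0, 95.0, 98.0],
    "sql01": [70.0, 85.0, 91.0],
}

alerts = evaluate_threshold(cpu_usage, threshold=80.0)
for host, average in alerts.items():
    print(f"CPU above threshold on {host}: {average:.1f}%")
```

Adding a new server only adds a key to the input; the rule itself does not change, which is the property the article attributes to Bosun and Prometheus.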

Bosun supports multiple backends for storage, and Stack Exchange uses OpenTSDB (with HBase). This is one of their pain points: since they "don't use HBase anywhere else, the administrative overhead eats up a lot of time", writes Kyle Brandt, one of the original authors of Bosun. Bosun's companion agent is scollector, which collects metrics from the monitored machines. It is a Go-based replacement for OpenTSDB's tcollector agent. Application metrics are pushed with BosunReporter.
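For context, OpenTSDB accepts data points over HTTP as JSON objects with `metric`, `timestamp`, `value`, and `tags` fields. The sketch below only builds such a request body and does not send it; the helper names, the metric name, and the tag values are illustrative assumptions, and it does not represent how scollector or BosunReporter is implemented.

```python
import json
import time


def make_datapoint(metric, value, tags, timestamp=None):
    """Build one data point in the JSON shape OpenTSDB expects."""
    if not tags:
        # OpenTSDB requires at least one tag on every data point.
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "value": value,
        "tags": dict(tags),
    }


def to_put_body(datapoints):
    """Serialize data points into a JSON array suitable for a batch write."""
    return json.dumps(list(datapoints))


# Hypothetical metric and host tag, with a fixed timestamp for repeatability.
point = make_datapoint("os.cpu", 42.5, {"host": "web01"}, timestamp=1500000000)
body = to_put_body([point])
print(body)
```

Tagging each point with its host is what lets one query, and therefore one alert, span every machine that reports the metric.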

Health checks focus on the end user experience as well as the health of internal services. Pingdom checks the externally reachable URLs. Checks of end user facing URLs, such as the home page, are key because "the home page checks things we may not otherwise check, and a holistic check is important", writes Craver. Fastly acts as a CDN and proxy to the Stack Exchange sites, and its health checks ensure that failover to the secondary datacenter happens when the primary goes down. Apart from server side monitoring, client side timings are also tracked using browser APIs.
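A holistic check of this kind can be sketched in Python as a request that verifies both the status code and a piece of content the user should see. This is an assumption-laden illustration, not how Pingdom or Fastly implement their checks; `check_page`, the URL, and the expected text are hypothetical, and the example run uses a stub fetcher so that it does not touch the network.

```python
from urllib.request import urlopen


def fetch(url, timeout=5.0):
    """Fetch a URL and return (status_code, body_text)."""
    with urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def check_page(url, expected_text, fetcher=fetch):
    """Return (healthy, reason) based on status code and visible content."""
    try:
        status, body = fetcher(url)
    except Exception as exc:  # network errors and HTTP errors raised by urlopen
        return False, f"request failed: {exc}"
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        # The page rendered, but something behind it is probably broken.
        return False, "expected content missing"
    return True, "ok"


def stub_fetcher(url):
    """Stand-in for a real request, returning a canned home page."""
    return 200, "<html><h1>Top Questions</h1></html>"


result = check_page("https://example.com/", "Top Questions", fetcher=stub_fetcher)
print(result)
```

Checking for rendered content rather than only a 200 response is what makes the check exercise the tiers behind the page, in the spirit of Craver's remark.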

Tying all of these together are Grafana and Opserver. Grafana plugs into Bosun's data to display time series metrics. Opserver, on the other hand, focuses on overall monitoring status across the infrastructure. Why did the team build Opserver instead of using Nagios or similar tools? Craver explains that no single tool fulfilled all their needs at that time. Like most of their toolset, it evolved out of specific requirements. The Opserver dashboard can be used to drill down into individual services and servers. Its configuration must be provided statically in JSON format, which poses some hurdles when monitoring cloud environments where machines are ephemeral.
