Avoiding Alerts Overload from Microservices: Sarah Wells at QCon London
At QCon London, Sarah Wells presented "Avoiding Alerts Overload from Microservices", and cautioned that developers and operators must fundamentally change the way they think about monitoring when building a distributed microservice-based system. Key takeaways included: build a system that can be supported; focus on monitoring 'stuff that matters', such as core user journeys and business functionality, when creating monitoring and alerts; and continually and proactively cultivate and improve alerts.
Wells, a Principal Engineer at the Financial Times, began the talk by stating that knowing when there is a problem is not enough: an alert should be triggered only when action by a human is required. A microservices architecture may allow the development team to move fast, but there is an operational cost, and the number (and complexity) of alerts generated by a microservice-based system can be overwhelming.
The FT.com website is powered by a microservice backend, primarily utilising the Java and Go programming languages, packaged and deployed with Docker and CoreOS onto the Amazon Web Services (AWS) platform. Data stores include MongoDB, Elasticsearch, Neo4j and Apache Kafka. There are 99 functional services (with 350 running instances at any given time), and 52 non-functional services (with 218 running instances). Wells stated that if each of the 568 service instances were checked every minute, this would result in 817,920 checks per day. Running containers on shared Virtual Machines (VMs) requires a further 92,160 system-level checks per day, for a total of 910,080 checks per day. In addition, any microservice-based application is a distributed system, and accordingly services do not run independently. If something fails, it can often lead to cascading failures, which further complicates monitoring and alerting.
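The check counts quoted above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative, and the system-level figure is taken directly from the talk rather than derived.

```python
# Reproduce the daily check counts quoted in the talk.
MINUTES_PER_DAY = 24 * 60  # 1,440 one-minute check intervals per day

functional_instances = 350      # instances of the 99 functional services
non_functional_instances = 218  # instances of the 52 non-functional services
service_instances = functional_instances + non_functional_instances

# One check per service instance per minute.
service_checks_per_day = service_instances * MINUTES_PER_DAY

# Quoted as-is from the talk; not derived here.
system_checks_per_day = 92_160

total_checks_per_day = service_checks_per_day + system_checks_per_day

print(service_instances, service_checks_per_day, total_checks_per_day)
# prints: 568 817920 910080
```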
With a microservice-based application, which is inherently a distributed system, you have to change how you think about monitoring
In order to adapt to the challenges of monitoring a microservices-based application, Wells suggested a three-pronged approach: build a system that can be supported; concentrate on "stuff that matters"; and cultivate alerts and the information they contain.
In order to build a system that can be supported, the following basic tools are required:
- Log aggregation - due to the volume of services and potential latency introduced via communication over a network, logs may go missing or get increasingly delayed. This means that log-based alerts may miss issues (particularly time-sensitive issues). Effective log aggregation requires a method to find all related logs, and accordingly the FT team use a transaction ID for correlation.
- Monitoring - traditional tooling like Nagios is often limited, in that it does not provide a 'service-level' view, and the default (infrastructure) checks include things that cannot be fixed. In a microservices-based system, monitoring should be at the service and VM level. Monitoring needs to be aggregated and made visual, and the FT technical team utilise a custom framework named SAWS (built by Silvano Dossan) and Dashing. There is also extensive use of graphing via Graphite and Grafana.
FT.com microservices alert dashboard, which is powered by the dashing.io framework
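The idea of correlating logs by transaction ID can be illustrated with a minimal sketch. The log entry shape and the field names (`transaction_id`, `service`, `message`) are assumptions made for this example; the talk did not describe the FT's actual log format.

```python
from collections import defaultdict


def group_by_transaction(entries):
    """Group aggregated log entries by their transaction ID.

    Each entry is assumed to be a dict with a "transaction_id" key
    (a hypothetical field name chosen for this sketch).
    """
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["transaction_id"]].append(entry)
    return dict(grouped)


# Hypothetical entries collected from several services.
logs = [
    {"transaction_id": "tid_1", "service": "ingester", "message": "received article"},
    {"transaction_id": "tid_2", "service": "ingester", "message": "received article"},
    {"transaction_id": "tid_1", "service": "transformer", "message": "mapping failed"},
]

# All entries for one request, across services, in arrival order.
for entry in group_by_transaction(logs)["tid_1"]:
    print(entry["service"], "-", entry["message"])
```

With every service logging the same identifier for a given request, a single lookup shows the path of that request through the system.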
When developing polyglot services, logging and monitoring integration must be made easy for any language that is used. The expectations, or operational contract, must be specified, and each service owner is responsible for implementing functionality to meet this requirement. For example, the FT healthcheck standard requires that every service expose a healthcheck endpoint over HTTP, 'http://service/__health', which returns a 200 status if the service can run the healthcheck, together with a JSON document containing multiple checks, which can contain additional information but must return '"ok": true' or '"ok": false'.
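A healthcheck endpoint of this kind might look roughly like the following sketch, written with Python's standard library. Only the '/__health' path, the 200 status and the "ok" field come from the description above; the "checks" wrapper key, the "name" and "detail" fields, the example checks and the port are assumptions made for illustration, not the FT's actual standard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks():
    """Hypothetical checks; each result carries a mandatory boolean "ok"."""
    return [
        {"name": "database reachable", "ok": True},
        {"name": "message queue reachable", "ok": False,
         "detail": "connection timed out"},
    ]


def health_document(checks):
    """Wrap check results in a JSON-serialisable document (assumed layout)."""
    return {"checks": checks}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/__health":
            self.send_error(404)
            return
        body = json.dumps(health_document(run_checks())).encode("utf-8")
        # 200 means the healthcheck itself ran, not that every check passed.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the healthcheck server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Note the separation between the HTTP status and the individual results: a monitoring tool reads the "ok" flags in the body to decide whether the service is healthy.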
A core goal of monitoring and alerting is to know about problems before clients do, and accordingly the practice of running 'synthetic requests' that mimic user behaviour is vital. If functionality relating to a key user journey is broken (for example, an FT editor cannot publish a new article), then this must be fixed immediately; in other words, "concentrate on the stuff that matters". The FT technical team have also created dashboards showing core client statistics, such as number of errors and response latency, but Wells stressed that it is "the end-to-end [business functionality] that matters".
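A synthetic request for the publishing journey could be sketched as follows. The function and its parameters are hypothetical: `publish` and `fetch` stand in for whatever calls a real system would make to publish a test article and read it back, and the timeout values are arbitrary.

```python
import time
import uuid


def synthetic_publish_check(publish, fetch, timeout_seconds=5.0, poll_interval=0.1):
    """Publish a uniquely named test article and wait for it to become readable.

    `publish(article_id)` and `fetch(article_id)` are caller-supplied
    callables (assumptions for this sketch). Returns True if the article
    can be fetched before the timeout, False otherwise.
    """
    article_id = f"synthetic-{uuid.uuid4()}"
    publish(article_id)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if fetch(article_id):
            return True
        time.sleep(poll_interval)
    return False


# In-memory stand-in for a real publishing pipeline.
store = set()
healthy = synthetic_publish_check(store.add, lambda article_id: article_id in store)
print("publish journey healthy:", healthy)
```

Run on a schedule, a check like this exercises the whole journey end to end, so a failure points at broken business functionality rather than at a single component.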
A microservices architecture lets you move fast, but there is an associated operational cost. Make sure it's a cost you're willing to pay
Alerts must continually be cultivated, and if an alert is received that doesn't make sense, or does not require human interaction, it must be corrected or removed. If an issue occurs, and there was no alert, then one should be added as part of the fix. Key information must be included within each alert, for example, an overview of the business impact, the associated run book location, and the corresponding transaction IDs that triggered the issue. The FT team use dedicated 'Ops Cops' (on-call members of the development team, rotated regularly) to watch for issues with monitoring, and have integrated alerting within the team's Slack messaging system. A pre-defined list of emojis is used to indicate when and how an issue is being managed and resolved.
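One way to make sure each alert carries this key information is to model it explicitly. The class below is a hypothetical sketch: the field names, the message layout and the example URL are assumptions, and it only builds the text of a message rather than posting it to any chat system.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """An alert that carries the context a responder needs (hypothetical shape)."""
    summary: str
    business_impact: str
    runbook_url: str
    transaction_ids: list = field(default_factory=list)

    def to_message(self):
        """Render the alert as plain text suitable for a chat channel."""
        lines = [
            f"ALERT: {self.summary}",
            f"Business impact: {self.business_impact}",
            f"Run book: {self.runbook_url}",
        ]
        if self.transaction_ids:
            lines.append("Transaction IDs: " + ", ".join(self.transaction_ids))
        return "\n".join(lines)


alert = Alert(
    summary="Publishing pipeline is failing",
    business_impact="Editors cannot publish new articles",
    runbook_url="https://example.com/runbooks/publishing",  # placeholder URL
    transaction_ids=["tid_1", "tid_7"],
)
print(alert.to_message())
```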
Concluding the talk, Wells suggested that creating alerts should be part of the normal development workflow: "code, test, alerts". In order to ensure that the development team know if an alert stops working, tests should be added to validate the alert. The FT technical team subscribe to the philosophy of chaos testing and, inspired by Netflix's Simian Army and Chaos Monkey, they have created a 'Chaos Snail' (which is "smaller than a monkey, and written in Bash shell"!). Wells cautioned that proactivity is required when maintaining and dealing with alerts in a non-trivial system, and that out-of-date information can be worse than none at all. Automate updates wherever possible, and find ways to share what is changing.
The slides for Sarah Wells' QCon London talk, "Avoiding Alerts Overload From Microservices", can be found on Speaker Deck. The public availability schedule of the conference talk recordings can be found on the QCon London website.