
Monitoring Microservices and Containers: A Challenge by Adrian Cockcroft


At GlueCon 2015, Adrian Cockcroft presented a list of rules for monitoring microservice and container-based applications. In addition to these guidelines, Cockcroft also highlighted a series of challenges for monitoring cloud-native container-based systems, and introduced his ‘Spigo’ microservice simulation and visualisation tool, which may assist developers with testing microservice monitoring at scale.

Cockcroft, a technology fellow at Battery Ventures, began the talk by presenting a series of microservice and container monitoring rules, which were an update of the rules initially presented at Monitorama 2014:

  1. Spend more time working on code that analyses the meaning of metrics than code that collects, moves, stores and displays metrics
  2. Reduce key business metric latency to less than the human attention span (~10s)
  3. Validate that your measurement system has enough accuracy and precision. Collect histograms of response time (a minimal sketch follows this list)
  4. Monitoring systems need to be more available and scalable than the systems (and services) being monitored
  5. Optimise for monitoring distributed, ephemeral, 'cloud-native', containerised microservices
  6. Fit metrics to models in order to understand relationships (this is a new rule)
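
As a rough illustration of the third rule, the short sketch below records response times into fixed latency buckets rather than collapsing them into a single average. It is not taken from the talk, and the bucket boundaries are arbitrary:

    // Illustrative only: count response times into fixed latency buckets,
    // preserving the shape of the distribution rather than a lone average.
    package main

    import (
        "fmt"
        "time"
    )

    // Histogram counts observations against ascending upper bounds,
    // with one extra bucket for anything larger than the last bound.
    type Histogram struct {
        bounds []time.Duration
        counts []int
    }

    func NewHistogram(bounds []time.Duration) *Histogram {
        return &Histogram{bounds: bounds, counts: make([]int, len(bounds)+1)}
    }

    // Observe places one response time into the first bucket it fits under.
    func (h *Histogram) Observe(d time.Duration) {
        for i, b := range h.bounds {
            if d <= b {
                h.counts[i]++
                return
            }
        }
        h.counts[len(h.bounds)]++ // overflow bucket
    }

    func main() {
        h := NewHistogram([]time.Duration{
            10 * time.Millisecond, 50 * time.Millisecond, 200 * time.Millisecond, time.Second,
        })
        for _, d := range []time.Duration{3 * time.Millisecond, 40 * time.Millisecond, 900 * time.Millisecond, 2 * time.Second} {
            h.Observe(d)
        }
        fmt.Println(h.counts) // [1 1 0 1 1]
    }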

Cockcroft argues that the new rule, ‘fit metrics to models’, is essential because infrastructure, data flow and ownership/organisation structure are often orthogonal and need to be linked in order to make sense of the metrics. After defining microservices as ‘loosely coupled service oriented architectures with bounded contexts’, a series of challenges associated with monitoring both microservices and container technology (such as Docker) were discussed in depth.

The first challenge presented was 'complexity'. Cockcroft stated that monolithic applications have effectively unlimited internal dependencies, and can be vastly more complex than microservices, whose dependencies are explicit and externally visible. Instrumenting all of these external dependencies can, however, be difficult. The second challenge, 'speed of change', presents difficulties with the rate of change that is now possible with continuous deployment of container-based microservices:

Measuring CPU usage once a minute makes no sense for containers… Coping with rate of change is a big challenge for monitoring tools.

'Scale' was cited as the third monitoring challenge, and it covers more than simply the number of running containers and machines. 'Cloud-native' concepts, such as regions and availability zones, must be considered, as must the services themselves and the potentially multiple versions of each running in parallel.
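
As an illustration of how these dimensions multiply (the sketch below is not from the talk, and the dimension values are invented), a single logical metric fans out into dozens of distinct time series once region, zone, service and version are attached to it:

    // Illustrative only: one logical counter becomes many time series
    // once cloud-native dimensions are attached as labels.
    package main

    import "fmt"

    // SeriesKey identifies one time series of a metric.
    type SeriesKey struct {
        Metric, Region, Zone, Service, Version string
    }

    func main() {
        regions := []string{"us-east-1", "eu-west-1"}
        zones := []string{"a", "b", "c"}
        services := []string{"edge", "api", "catalog", "checkout"}
        versions := []string{"v101", "v102"} // two versions deployed side by side

        series := map[SeriesKey]int64{}
        for _, r := range regions {
            for _, z := range zones {
                for _, s := range services {
                    for _, v := range versions {
                        series[SeriesKey{"requests_total", r, z, s, v}] = 0
                    }
                }
            }
        }
        // 2 regions x 3 zones x 4 services x 2 versions = 48 series for one metric.
        fmt.Println(len(series))
    }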

Cockcroft explained that ‘data flow’ is also an inherent challenge with a microservice architecture. Several tools such as Netflix’s Atlas (and associated applications), AppDynamics’ Application Performance Management application, and Twitter’s Zipkin can show the request flow across a few services. However, the interesting architectures have a lot of microservices, and this means that visualisation is a real challenge.
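
The shared idea behind these tools is propagating a request identifier from service to service so that each hop can later be stitched into an end-to-end flow. The sketch below illustrates that idea only; the X-Request-Id header and downstream address are hypothetical, and this is not the wire format used by Zipkin, Atlas or AppDynamics:

    // Illustrative only: reuse the incoming request ID (or mint one at the
    // edge) and pass it downstream, so per-hop logs and metrics can be
    // joined into a single end-to-end view of the request flow.
    package main

    import (
        "fmt"
        "log"
        "math/rand"
        "net/http"
    )

    func requestID(r *http.Request) string {
        if id := r.Header.Get("X-Request-Id"); id != "" {
            return id // propagate the caller's ID unchanged
        }
        return fmt.Sprintf("%016x", rand.Int63()) // edge service mints a new ID
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        id := requestID(r)
        log.Printf("service=frontend request_id=%s path=%s", id, r.URL.Path)

        // Forward the same ID on any downstream call (address is hypothetical).
        req, _ := http.NewRequest("GET", "http://backend.local/items", nil)
        req.Header.Set("X-Request-Id", id)
        // resp, err := http.DefaultClient.Do(req) // downstream call elided

        fmt.Fprintf(w, "handled request %s\n", id)
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }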

The possibility of 'failure' is an ever-present challenge in microservice applications. This is further compounded when failures occur in cloud environments. For example, what should a monitoring/analytics platform show if an availability zone partition or failure occurs? By design, a cloud-native application will continue to work through a partial availability zone failure, and so this is not necessarily a 'failure' per se. However, the system operators should be informed, and potentially application deployments halted.

The challenge here is to understand and communicate common microservice failure patterns.

Testing microservice and container monitoring tools at scale can quickly become expensive, and accordingly Cockcroft suggested that simulation could be a viable alternative. He introduced his 'spigo' (or 'simianviz') microservice simulator, which allows interesting microservice architectures to be modelled and visualised.

Spigo/simianviz is a Go/D3.js-based application that can generate synthetic/test microservice systems. Large-scale configurations of systems can be simulated, and the eventual goal is to support stress testing of real monitoring tooling. Additionally, Cockcroft plans to add support to the tool for dynamically varying code-push and autoscaling configurations, modelling of Netflix's Chaos Gorilla for zone and region failures, and a WebSocket connection between the spigo and simianviz displays.
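
As a rough sketch of the simulation idea itself (the code below is illustrative and is not Spigo's code or configuration format), a synthetic dependency graph can be defined and walked by simulated requests, counting how often each service is exercised:

    // Illustrative only, not Spigo: define a synthetic service dependency
    // graph and walk simulated requests through it, counting hits per service.
    package main

    import "fmt"

    // deps maps each synthetic service to the services it calls.
    var deps = map[string][]string{
        "edge":     {"api"},
        "api":      {"catalog", "checkout"},
        "catalog":  {"store"},
        "checkout": {"store"},
        "store":    {},
    }

    // simulate walks one request from the given service through its dependencies.
    func simulate(service string, hits map[string]int) {
        hits[service]++
        for _, d := range deps[service] {
            simulate(d, hits)
        }
    }

    func main() {
        hits := map[string]int{}
        for i := 0; i < 1000; i++ {
            simulate("edge", hits) // every synthetic request enters at the edge
        }
        fmt.Println(hits) // map[api:1000 catalog:1000 checkout:1000 edge:1000 store:2000]
    }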

My challenge to you: Build your architecture in Spigo. Stress monitoring tools with it. Help fix monitoring for microservices!

More information and the slides for Adrian Cockcroft's GlueCon talk can be found on Cockcroft's SlideShare account. The code for Spigo/simianviz can be found on GitHub. GlueCon is an annual developer conference that focuses on cloud, DevOps, mobile, APIs and big data; more information can be found on the GlueCon website.
