The Future of Monitoring: an Interoperable Architecture

Jason Dixon of GitHub presented his view of the current state and the future of monitoring tools at DevOps Days in Rome. He envisions composable monitoring systems built from interchangeable components, each focused on a single responsibility.

According to Jason, such a system architecture should have the following characteristics:

composable ("well-defined responsibilities, interfaces and protocols")
resilient ("resilient to outages within the monitoring architecture")
self-service ("doesn't require root access or an Ops member to deploy")
automated ("it's capable of being automated")
correlative ("implicitly model relationships between services")
craftsmanship ("it's a pleasure to use")

Such a system would require the following components communicating with each other as depicted in the diagram below:

sensors "are stateless agents that gather and emit metrics to a log stream, over HTTP as JSON, or directly to the metrics store"
aggregators "are responsible for transformation, aggregation, or possibly simply relaying of metrics"
state engine "tracks changes within the event stream, ideally it can ascertain faults according to seasonality and forecasting"
storage engines "should support transformative functions and aggregations, ideally should be capable of near-realtime metrics retrieval and output in standard formats such as JSON, XML or SVG"
scheduler "provides an interface for managing on-call and escalation calendars"
notifiers "are responsible for composing alert messages using data provided by the state engine and tracking their state for escalation purposes"
visualizers "consist of dashboards and other user interfaces that consume metrics and alerts from the system"
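As a rough illustration of how three of these components might hand data to one another, here is a minimal sketch of a sensor, an aggregator, and a state engine. It is not taken from the talk: the function and class names, the JSON field names (`name`, `value`, `ts`), and the 60-second window are all assumptions made for this example. The state engine here uses a fixed limit per metric, which is a simplification of the seasonality and forecasting that Jason describes as the ideal.

```python
import json
import time
from collections import defaultdict
from statistics import mean


def sensor_emit(name, value, timestamp=None):
    """Stateless sensor: serialize one metric sample as a JSON line.

    The field names are hypothetical, chosen for this sketch only.
    """
    ts = time.time() if timestamp is None else timestamp
    return json.dumps({"name": name, "value": value, "ts": ts})


def aggregate(json_lines, window=60):
    """Aggregator: average samples per metric within fixed time windows."""
    buckets = defaultdict(list)
    for line in json_lines:
        sample = json.loads(line)
        # Align each sample to the start of its window.
        window_start = int(sample["ts"] // window) * window
        buckets[(sample["name"], window_start)].append(sample["value"])
    return {key: mean(values) for key, values in buckets.items()}


class StateEngine:
    """State engine: report OK/FAULT transitions against a fixed limit.

    A fixed limit is an assumption of this sketch, not the forecasting
    approach described in the talk.
    """

    def __init__(self, limits):
        self.limits = limits  # metric name -> upper limit
        self.states = {}      # metric name -> last known state

    def update(self, name, value):
        new_state = "FAULT" if value > self.limits[name] else "OK"
        old_state = self.states.get(name, "OK")
        self.states[name] = new_state
        # Only state changes are reported, so a notifier is not flooded.
        return (old_state, new_state) if new_state != old_state else None
```

In this sketch a notifier would act only on the transitions returned by `update`, which keeps alert composition separate from fault detection, in line with the single-responsibility idea above.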

Jason also stressed the need to plan for data collection, and for the architectural changes it requires, in order to gather granular metrics. Granular data makes it possible to track trends and to detect violations of thresholds predicted from historical data analysis.
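One simple way to derive a threshold from historical data is to take the mean of past samples plus a multiple of their standard deviation. The sketch below assumes that approach; the talk does not prescribe a specific method, and the function names and the default multiplier `k=3.0` are illustrative choices.

```python
from statistics import mean, stdev


def threshold_from_history(history, k=3.0):
    """Upper threshold: mean of the history plus k sample standard deviations."""
    return mean(history) + k * stdev(history)


def violations(history, recent, k=3.0):
    """Return (index, value) pairs from `recent` that exceed the threshold."""
    limit = threshold_from_history(history, k)
    return [(i, v) for i, v in enumerate(recent) if v > limit]


# Hypothetical sample data: earlier measurements and a newer batch to check.
history = [10, 12, 11, 9, 13, 10, 11, 12]
recent = [11, 14, 20]
flagged = violations(history, recent)
```

The finer the granularity of the collected metrics, the more samples feed into `history`, and the more meaningful such a derived threshold becomes.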

InfoQ asked Jason about his current projects in this area:

On the visibility side, I continue to work on tools like Tasseo and Descartes to help improve Ops' response to outages. In particular with the latter, I think it's vitally important that we're able to correlate disparate metrics in real time. Often we find that outages are the result of cascading failures that are rarely visible from singleton graphs.
Separately, one of my pet peeves with Graphite is its lack of authorization and namespacing for metrics. I'm planning to add tokenized access for metrics submission to the Backstop project. This will allow admins to grant specific access to metric namespaces to individual developers or applications.
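To make the idea of token-scoped metric namespaces concrete, here is a minimal sketch of how such a check could work. It does not reflect Backstop's actual or planned implementation: the class `NamespaceAuthorizer`, its methods, and the dot-separated namespace convention are assumptions made purely for illustration.

```python
import hmac


class NamespaceAuthorizer:
    """Hypothetical token check: each token is granted metric namespaces."""

    def __init__(self):
        self._grants = {}  # token -> set of namespace prefixes

    def grant(self, token, namespace):
        """Allow `token` to submit metrics under `namespace`."""
        self._grants.setdefault(token, set()).add(namespace)

    def is_allowed(self, token, metric):
        """True if `token` may submit `metric` (for example "apps.web.requests")."""
        for known, namespaces in self._grants.items():
            # Constant-time comparison avoids leaking token contents via timing.
            if hmac.compare_digest(known, token):
                return any(
                    metric == ns or metric.startswith(ns + ".")
                    for ns in namespaces
                )
        return False


auth = NamespaceAuthorizer()
auth.grant("tok-web", "apps.web")
```

Matching on `ns + "."` rather than on the bare prefix keeps a grant for `apps.web` from also covering an unrelated namespace such as `apps.webhooks`.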

This and other presentations from DevOps Days in Rome were streamed here.
