
The Future of Monitoring: an Interoperable Architecture


Jason Dixon from GitHub presented his view on the current and future state of the art in monitoring tools at DevOps Days in Rome. He envisions composable monitoring systems built from interchangeable components, each focused on a single responsibility.

According to Jason, such an architecture should exhibit the following characteristics:

  • composable ("well-defined responsibilities, interfaces and protocols")
  • resilient ("resilient to outages within the monitoring architecture")
  • self-service ("doesn't require root access or an Ops member to deploy")
  • automated ("it's capable of being automated")
  • correlative ("implicitly model relationships between services")
  • craftsmanship ("it's a pleasure to use")

 Such a system would require the following components communicating with each other as depicted in the diagram below:


  • sensors "are stateless agents that gather and emit metrics to a log stream, over HTTP as JSON, or directly to the metrics store"
  • aggregators "are responsible for transformation, aggregation, or possibly simply relaying of metrics"
  • state engine "tracks changes within the event stream, ideally it can ascertain faults according to seasonality and forecasting"
  • storage engines "should support transformative functions and aggregations, ideally should be capable of near-realtime metrics retrieval and output in standard formats such as JSON, XML or SVG"
  • scheduler "provides an interface for managing on-call and escalation calendars"
  • notifiers "are responsible for composing alert messages using data provided by the state engine and tracking their state for escalation purposes"
  • visualizers "consist of dashboards and other user interfaces that consume metrics and alerts from the system" 
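The sensor role above can be very small in practice: a stateless agent that takes a sample and POSTs it as JSON over HTTP. The sketch below is a minimal illustration in Python; the endpoint URL and metric names are invented for the example and are not part of any tool Jason describes.

```python
import json
import os
import socket
import time
import urllib.request

def collect() -> dict:
    """Gather one sample; the sensor keeps no state between calls."""
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        # 1-minute load average as an example metric (Unix only)
        "load1": os.getloadavg()[0],
    }

def emit(metrics: dict, endpoint: str) -> None:
    """POST the sample as a JSON document over HTTP."""
    body = json.dumps(metrics).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=2)
```

A cron job or supervisor loop calling `emit(collect(), "http://metrics.example.com/samples")` would suffice; because the agent is stateless, a crashed sensor loses nothing but the current sample.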

Jason also stressed the need to plan for data collection, and for the architectural changes required to gather granular metrics. This enables tracking trends and detecting threshold violations predicted from historical data analysis.
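The simplest form of a threshold derived from historic data is a baseline of recent samples plus an allowed deviation. The following is only a sketch of that idea, under the assumption of a static baseline; the seasonality-aware forecasting mentioned in the state-engine description is considerably more involved.

```python
import statistics

def violates_threshold(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag `latest` when it deviates more than k standard deviations
    from the mean of the recent history (a static-baseline check)."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    # Guard against a zero-variance history with a tiny floor value.
    return abs(latest - mean) > k * max(stdev, 1e-9)
```

With a steady history around 100, a reading of 150 is flagged while 101 is not; granular metrics matter here because a coarse average would smooth away exactly the deviations this check looks for.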

InfoQ asked Jason about his current projects in this area:

On the visibility side, I continue to work on tools like Tasseo and Descartes to help improve Ops' response to outages. In particular with the latter, I think it's vitally important that we're able to correlate disparate metrics in real time. Often we find that outages are the result of cascading failures that are rarely visible from singleton graphs.

Separately, one of my pet peeves with Graphite is its lack of authorization and namespacing for metrics. I'm planning to add tokenized access for metrics submission to the Backstop project. This will allow admins to grant specific access to metric namespaces to individual developers or applications.
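The namespacing idea Jason describes amounts to binding each submission token to a metric-name prefix and rejecting writes outside it. The sketch below illustrates that check in Python; the token table, names and structure are hypothetical and do not reflect Backstop's actual API.

```python
# Hypothetical token table: each submission token is bound to a
# metric namespace prefix (illustrative only, not Backstop's design).
TOKENS = {
    "team-web-token": "web.",
    "team-db-token": "db.",
}

def authorized(token: str, metric_name: str) -> bool:
    """Allow a submission only when the metric name falls under the
    namespace prefix granted to that token."""
    prefix = TOKENS.get(token)
    return prefix is not None and metric_name.startswith(prefix)
```

An admin would then grant a developer or application a token scoped to its namespace, so `web.` metrics cannot be overwritten by a holder of the `db.` token.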

This and other presentations from DevOps Days in Rome were streamed here.



Community comments

  • seriously "metrics" monitoring....

    by William Louth,


    The answer lies in emergent computing and adaptive control that is local and immediate. Local in that observation, judgement and reaction are collocated with the normal processing via embedded controllers and sensors weaved into applications (at runtime). Immediate in that the time interval between measuring, sensing and signaling (possibly to a remote station) and the actuating is at the same resolution of the underlying task/transaction processing that is being monitored, managed and controlled.

    For this to happen we need for IT to change starting with how it (or its systems) observe. Moving from logging to signaling. Moving from monitoring to metering. Moving from correlation to causation. Moving from process to code then context. Moving from state to behavior then traits. Moving from delayed to immediate. Moving from past to present. Moving from central to local. Moving from collecting to sensing. When that has occurred we can then begin to control via built in controllers and supervisors.

    In considering the runtime application diagnostics and performance analysis provided by an application performance monitoring solution, particular attention should be given to time, space and data: time is the delay period from the moment an event occurs until it is classified and analyzed; space is the distribution of monitoring functionality, which can be centralized or distributed, partitioned or replicated; and data is the collection, modeling and sharing of measurement-based observations.

    Data in Real-time (DIRT)
    With environments becoming much more dynamic in terms of workload, capacity, code, and topology, as well as increasingly distributed, it seems futile to keep trying to manage the performance of applications the way it has always been done. What is needed is DIRT – data in real-time. Data that is accessible at the point of its creation (measurement and collection) and within its current execution context, be it a process, thread, transaction or request. Data that informs the application of its immediate past, its current processing and its predicted path. This data has much more value, but only if it can be acted on in near real-time at the resolution of the processing itself. Beyond this, the data is bound for a black hole unless it can be mined for patterns which are then codified into controllers or supervisory routines.

    And finally if you really want to rethink monitoring on a much bigger stage
