The Future of Monitoring: an Interoperable Architecture
Jason Dixon from GitHub presented his view on current and future state-of-the-art monitoring tools at DevOps Days in Rome. He envisions composable monitoring systems built from interchangeable components, each focused on a single responsibility.
According to Jason, such a system architecture should have the following characteristics:
- composable ("well-defined responsibilities, interfaces and protocols")
- resilient ("resilient to outages within the monitoring architecture")
- self-service ("doesn't require root access or an Ops member to deploy")
- automated ("it's capable of being automated")
- correlative ("implicitly model relationships between services")
- craftsmanship ("it's a pleasure to use")
Such a system would require the following components communicating with each other as depicted in the diagram below:
- sensors "are stateless agents that gather and emit metrics to a log stream, over HTTP as JSON, or directly to the metrics store"
- aggregators "are responsible for transformation, aggregation, or possibly simply relaying of metrics"
- state engine "tracks changes within the event stream, ideally it can ascertain faults according to seasonality and forecasting"
- storage engines "should support transformative functions and aggregations, ideally should be capable of near-realtime metrics retrieval and output in standard formats such as JSON, XML or SVG"
- scheduler "provides an interface for managing on-call and escalation calendars"
- notifiers "are responsible for composing alert messages using data provided by the state engine and tracking their state for escalation purposes"
- visualizers "consist of dashboards and other user interfaces that consume metrics and alerts from the system"
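To make the division of responsibilities concrete, here is a minimal Python sketch of three of the components above: a sensor, an aggregator and a state engine. It is an illustration under stated assumptions, not code from the talk: the function and class names, the JSON field names (`metric`, `value`, `ts`) and the fixed-threshold fault rule are all hypothetical.

```python
import json
import statistics
from collections import defaultdict


def sensor_emit(name, value, timestamp):
    """Stateless sensor: serialize one measurement as a JSON line.

    The field names are an assumption made for this sketch.
    """
    return json.dumps({"metric": name, "value": value, "ts": timestamp})


def aggregate(json_lines, window=60):
    """Aggregator: average the values of each metric per fixed time window."""
    buckets = defaultdict(list)
    for line in json_lines:
        event = json.loads(line)
        bucket_start = event["ts"] - event["ts"] % window
        buckets[(event["metric"], bucket_start)].append(event["value"])
    return {key: statistics.mean(values) for key, values in buckets.items()}


class StateEngine:
    """State engine: keep OK/FAULT per metric and report only transitions.

    A fixed threshold stands in for the seasonality-aware fault detection
    described in the talk.
    """

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.states = {}

    def observe(self, metric, value):
        new_state = "FAULT" if value > self.thresholds[metric] else "OK"
        old_state = self.states.get(metric, "OK")
        self.states[metric] = new_state
        if new_state != old_state:
            return (metric, old_state, new_state)
        return None
```

Because each piece only consumes and produces plain data (JSON lines, dictionaries, tuples), any one of them could be swapped for another implementation without touching the others, which is the point of the composable design.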
Jason also stressed the need to plan for data collection and necessary architectural changes to be able to gather granular metrics. That will enable tracking trends and violation of thresholds predicted from historic data analysis.
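As a rough illustration of a threshold derived from historic data, the sketch below flags samples that exceed the historical mean by more than `k` standard deviations. This rule is a simple stand-in chosen for the example, not the method proposed in the talk, and it ignores seasonality; the function names and the default `k=3.0` are assumptions.

```python
import statistics


def predicted_threshold(history, k=3.0):
    """Upper bound: mean of past samples plus k population standard deviations."""
    return statistics.mean(history) + k * statistics.pstdev(history)


def violations(history, recent, k=3.0):
    """Return the recent samples that exceed the threshold predicted from history."""
    limit = predicted_threshold(history, k)
    return [value for value in recent if value > limit]
```

The finer the granularity of the collected history, the more meaningful such a bound becomes, which is why the data collection has to be planned up front.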
InfoQ asked Jason about his current projects in this area:
On the visibility side, I continue to work on tools like Tasseo and Descartes to help improve Ops' response to outages. In particular with the latter, I think it's vitally important that we're able to correlate disparate metrics in real time. Often we find that outages are the result of cascading failures that are rarely visible from singleton graphs.
Separately, one of my pet peeves with Graphite is its lack of authorization and namespacing for metrics. I'm planning to add tokenized access for metrics submission to the Backstop project. This will allow admins to grant specific access to metric namespaces to individual developers or applications.
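The idea of tokenized, namespaced metric submission can be sketched as follows. This is not Backstop's implementation or API: the token values, the `GRANTS` table and the `may_submit` function are hypothetical, and prefix matching on dotted metric names is an assumption about how namespaces would be expressed.

```python
import hmac

# Hypothetical grants: token -> metric namespace prefixes it may write to.
GRANTS = {
    "token-app-checkout": ["apps.checkout."],
    "token-dev-alice": ["dev.alice.", "apps.sandbox."],
}


def may_submit(token, metric_name, grants=GRANTS):
    """Return True if the token is known and covers the metric's namespace."""
    for known_token, prefixes in grants.items():
        # Constant-time comparison avoids leaking token contents via timing.
        if hmac.compare_digest(token.encode(), known_token.encode()):
            return any(metric_name.startswith(prefix) for prefix in prefixes)
    return False
```

A submission endpoint would call such a check before relaying the metric to the store and reject anything outside the namespaces granted to the token.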
seriously "metrics" monitoring....
The answer lies in emergent computing and adaptive control that is local and immediate. Local in that observation, judgement and reaction are collocated with the normal processing via embedded controllers and sensors woven into applications (at runtime). Immediate in that the time interval between measuring, sensing and signaling (possibly to a remote station) and the actuating is at the same resolution as the underlying task/transaction processing that is being monitored, managed and controlled.
For this to happen we need IT to change, starting with how it (or its systems) observes. Moving from logging to signaling. Moving from monitoring to metering. Moving from correlation to causation. Moving from process to code then context. Moving from state to behavior then traits. Moving from delayed to immediate. Moving from past to present. Moving from central to local. Moving from collecting to sensing. When that has occurred we can then begin to control via built-in controllers and supervisors.
In considering runtime application diagnostics and performance analysis provided by an application performance monitoring solution, particular attention should be given to time, space and data, wherein time is the delay from the moment an event occurs until it is classified and analyzed, space is the distribution of monitoring functionality, which can be centralized or distributed, partitioned or replicated, and data is the collection, modeling and sharing of measurement-based observations.
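One way to picture a built-in controller is a wrapper that meters each call inside the process and reacts immediately when the recent average latency exceeds a budget. The Python sketch below is only an illustration of that idea: the decorator name, the `Overloaded` exception, the moving-average rule and the load-shedding reaction are all assumptions, not something described in the comment.

```python
import time
from collections import deque
from functools import wraps


class Overloaded(RuntimeError):
    """Raised when the local controller sheds a call."""


def locally_controlled(budget_s, window=5, clock=time.monotonic):
    """Meter call durations in-process and shed load when they exceed a budget."""

    def decorator(func):
        samples = deque(maxlen=window)  # most recent call durations, in seconds

        @wraps(func)
        def wrapper(*args, **kwargs):
            # Judge: act only once a full window of measurements is available.
            if len(samples) == window and sum(samples) / window > budget_s:
                samples.clear()  # start a fresh window after shedding one call
                raise Overloaded(func.__name__)
            # Observe: measure the call at the point of execution.
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                samples.append(clock() - start)

        return wrapper

    return decorator
```

Measurement, judgement and reaction all happen in the same process and on the same call path as the work itself, so nothing has to travel to a central system before the application can respond.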
Data in Real-time (DIRT)
With environments becoming much more dynamic in terms of workload, capacity, code, and topology, as well as increasingly distributed, it seems futile to still be trying to manage the performance of applications the way it has always been done. What is needed is DIRT – data in real-time. Data that is accessible at the point of its creation (measurement and collection) and within its current execution context, be it a process, thread, transaction or request. Data that informs the application of its immediate past, its current processing and its predicted path. This data has much more value, but only if it can be acted on in near real-time at the resolution of the processing itself. Beyond this the data is bound for a black hole unless it can be mined for patterns which are then codified into controllers or supervisory routines.
And finally if you really want to rethink monitoring on a much bigger stage