5 years of metrics and monitoring
In his DevOps Days Belgium talk, Lindsay Holmwood presented a retrospective on metrics and monitoring: he outlined a typical metrics and monitoring pipeline, exposed some flaws in current monitoring systems, and shared his view of what the future may bring to the field.
Lindsay addressed the typical retrospective questions (what did we learn, what did we do well, and what should we do differently) and listed the most widely used metrics and monitoring tools:
- Collection: Collectd and StatsD.
- Storage: Graphite, OpenTSDB and InfluxDB.
- Aggregation: Riemann.
- Checking: Nagios, Sensu.
- Alerting: PagerDuty, VictorOps and OpsGenie.
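To make the collection stage of this pipeline concrete, here is a minimal sketch of how an application might hand a metric to StatsD, which accepts plain-text lines of the form name:value|type, usually over UDP. The metric names, the helper functions, and the local host and port below are illustrative assumptions, not something shown in the talk.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric as a StatsD text line: name:value|type.

    kind is "c" for a counter, "g" for a gauge, or "ms" for a timer.
    """
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send. Host and port are assumptions for a
    StatsD daemon running locally on its conventional port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
counter = statsd_line("web.requests", 1, "c")
timer = statsd_line("web.response_time", 320, "ms")

# With a StatsD daemon listening, the lines could be emitted like this:
# send_metric(counter)
# send_metric(timer)
```

In a setup like the one listed above, the daemon would aggregate these samples and flush them to a storage backend such as Graphite.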
In Lindsay's opinion, some solutions are plainly wrong, with popular tools (Graphite, for example) ignoring findings from data display studies. Those studies show, for example, that the basic graph layout should be black on white and that pie chart comparisons are less accurate than bar chart comparisons, yet pie charts are still being used. Nagios was mentioned as a tool that, even with its flaws, is here to stay, because unfortunately there is no compelling alternative.
Another unresolved task is analyzing and acting on the data. Checks need to move beyond comparing plain numbers or strings to anomaly and fault detection with trend analysis and thresholding. More complex conditions are needed, as well as self-learning algorithms, to improve real-time monitoring.
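As a rough illustration of what such a check could look like, the sketch below combines a static threshold with a simple trend test: a sample is flagged when it exceeds a hard limit, or when it deviates from the rolling mean of recent samples by more than a chosen number of standard deviations. The class name, window size, and limits are hypothetical choices for this example, and a rolling z-score is only one basic technique, not the approach proposed in the talk.

```python
from collections import deque
from statistics import mean, pstdev


class TrendCheck:
    """Hypothetical check: static threshold plus rolling z-score."""

    def __init__(self, hard_limit, window=5, z_limit=3.0):
        self.hard_limit = hard_limit          # classic fixed threshold
        self.z_limit = z_limit                # allowed deviation in std devs
        self.history = deque(maxlen=window)   # most recent samples only

    def observe(self, value):
        """Return "critical", "anomaly", or "ok" for one new sample."""
        status = "ok"
        if value > self.hard_limit:
            status = "critical"
        elif len(self.history) == self.history.maxlen:
            # Only judge the trend once the window is full.
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:
                status = "anomaly"
        self.history.append(value)
        return status


# Made-up CPU usage samples (percent), for illustration only.
check = TrendCheck(hard_limit=95.0, window=5, z_limit=3.0)
samples = [40, 42, 41, 39, 43, 41, 70, 97]
results = [check.observe(s) for s in samples]
print(results)
```

Here the jump to 70 stays below the fixed limit of 95 but is still reported, because it is far outside the recent trend. Note that this sketch also adds anomalous samples to the history, which shifts the baseline; a more careful check might exclude them.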
Monitoring is CI for production.
Lindsay mentioned alert fatigue as one of the most recognized problems with monitoring: too many alerts make it impossible to tell which ones are important, causing real problems to go unattended. This is another area where tools and implementations should improve in the future.
While the past 5 years focused on building tools, formalizing relationships and searching for parallels in other industries, Lindsay predicts that the next 5 years will bring stabilization of tools, emerging standards, exploitation of the parallels with other industries, and mitigation of human impact. The future lies in richer metadata about metrics, used to build appropriate visualizations automatically and to give developers access to monitoring data. Operations needs to act as an enabler, not a gatekeeper, providing the platform and coaching on what makes a good check or alert, and listening to the needs of the end user.
DevOps Days will continue tomorrow with more talks, celebrating 5 years since the first DevOps Days Belgium. The conference talks are streamed live, and comments can be followed on Twitter using the #devopsdays tag.