7 Habits of Highly Effective Monitoring Infrastructures

Do you feel that your monitoring system is a complicated ball of contrived tools tenuously strung together, and ignored by everyone except the team holding the strings?

You should not have to solve the monitoring problem. Rather, monitoring should be solving problems for you. There is a right way and a wrong way to engineer effective telemetry systems, and there is a finite set of practices that are predictive of success, whatever your choice of individual tools.

If you are building or designing your next monitoring system, take a look at this short list of habits exhibited by the most successful monitoring systems in the world today.

1. It’s About the Data

Monitoring tools are merely a means to obtain data. Awesome monitoring tools treat metrics and telemetry data as first class citizens, and go out of their way to make them easy to export. They like to send the data "up" to be processed, stored and analyzed, together with all the other data collected by all the other tools, on all the other systems, organization wide.

Tools that make data a first class citizen make it easier to cross-correlate measurements that were collected by other tools. You can, for example, quantify the effect of JVM garbage collection on service latency, or measure the relationship between thread-count and memory utilization. You know you are doing it right when you can "tee" off a subset of your monitoring data at will, and send it as input to any new tool you decide to use, in whatever format that tool expects.
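To make that concrete, here is a minimal sketch (not taken from any particular product) of what a data-centric pipeline can look like: every measurement is a plain name/value/timestamp record, and "teeing" a subset off to a new tool is just registering another consumer. The backend classes and metric names below are hypothetical placeholders.

```python
import time

# Hypothetical backends: each only needs a send(name, value, timestamp) method.
class GraphiteBackend:
    def send(self, name, value, ts):
        print(f"graphite: {name} {value} {int(ts)}")

class NewAnalyticsBackend:
    def send(self, name, value, ts):
        print(f"analytics: {name}={value} @ {ts}")

# "Teeing" the stream is just adding another subscriber to the list.
subscribers = [GraphiteBackend(), NewAnalyticsBackend()]

def publish(name, value, ts=None):
    ts = ts if ts is not None else time.time()
    for backend in subscribers:
        backend.send(name, value, ts)   # same datum, many consumers

publish("jvm.gc.pause_ms", 42.0)
publish("service.latency_ms", 118.3)
```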

Monolithic monitoring tools, on the other hand, often assume that you’ll never need to export the data they collect for you. The classic example is Nagios, which is, as you probably know, a tool designed to collect availability data at around 1-minute resolution. Because Nagios views itself as a monolithic monitoring tool, a plethora of single-purpose tools have sprung into being, for no other purpose than to take data from Nagios and place it in X, where X is some other monitoring tool from which it is usually even more difficult to extract the monitoring data. What you end up with is the now infamous anti-pattern of overly complex, difficult to repurpose, impossible to scale, single-purpose monitoring systems. Each tool we add to this chain locks us in further to the rest of the tool chain, by making it more difficult to replace any single piece.

If, however, we treat Nagios as merely one of many data collectors, using it in a data-centric way instead of a tool-centric way (for example, by placing a transmission layer like Heka or Riemann above it), we see that even classically monolithic tools can be adapted so they no longer depend on each other. In the process, we create a single source of telemetry data that we can wire to any new tool we want to use in the future.
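As a sketch of what treating Nagios as just another data collector can look like, the snippet below parses the performance-data portion of a standard plugin output line and republishes each measurement upstream. The parsing is deliberately simplified (thresholds, units, and min/max are discarded), and the final print is a placeholder for handing the datum to whatever transmission layer, such as Heka or Riemann, sits above Nagios.

```python
import time

def parse_perfdata(plugin_output):
    """Extract (label, value) pairs from the perfdata section of a Nagios
    plugin output line. Thresholds, units, and min/max are ignored here."""
    if "|" not in plugin_output:
        return []
    _, perfdata = plugin_output.split("|", 1)
    metrics = []
    for item in perfdata.split():
        label, data = item.split("=", 1)
        raw_value = data.split(";")[0]
        # Strip a trailing unit of measure such as "MB", "ms", or "%".
        numeric = "".join(ch for ch in raw_value if ch.isdigit() or ch in ".-")
        metrics.append((label.strip("'"), float(numeric)))
    return metrics

# Example output from a standard load-average check plugin.
output = "OK - load average: 0.18, 0.23, 0.29 | load1=0.180;5;10;0; load5=0.230;4;8;0; load15=0.290;3;6;0;"
for label, value in parse_perfdata(output):
    # Stand-in for forwarding the datum "up" to the transmission layer.
    print("publish", f"nagios.webserver01.{label}", value, int(time.time()))
```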

2. Use Monitoring for Feedback

What would you monitor if you were able to pick and choose your metrics? How many of those metrics would you track and alert on?

Great monitoring systems are driven by purpose. They are designed to provide operational feedback about production systems to people who understand how those systems work. Importantly, these same people have chosen what to monitor about those systems based on that knowledge. The engineers in your organization should understand the metrics you monitor because each metric should have been configured by an engineer to answer a specific question, or provide a concrete insight about the operational characteristics of your service.

Monitoring isn't an endeavor unto itself. It is not a backup system or a disaster recovery plan, or any other sort of expensive, labor-intensive burden heaped on Ops to satisfy the checklist requirements of a regulatory body or an arbitrary quarterly goal. It is not a ritual the grown-ups tell us to follow — like keeping one’s hands and arms inside the vehicle at all times — to stave off some nameless danger that no one can quite articulate.

Monitoring is an engineering process. It exists to provide feedback from the things we build, maintain and care about. It is your best means of understanding the operational characteristics of the systems you depend on. It is the depth-gauge in your barrel of money, and the pressure meter on your propane tank. Through monitoring, we gain visibility into places we cannot go. We use it to quantify our success as well as to prevent explosions from happening.

3. Alert on What You Draw

When an engineer in your organization receives an alert from a monitoring system, and moves to examine a graph of monitoring data to analyze and isolate the problem, it's critically important that the same data was used to generate both the alert and the graph. If, for example, you're using Nagios to check and alert, and Ganglia to draw the graphs, you're raising the likelihood of uncertainty, stress, and human error during the critically important time of incident response.

One monitoring system or the other could be generating false positives or negatives, they could each be monitoring subtly different things under the guise of the same name, or they could be measuring the same thing in subtly different ways. It actually doesn't matter, because there is likely no way to objectively tell which system is correct without a substantial effort.

Even if you do figure out which of your systems is lying, it's unlikely you will be able to take a meaningful corrective action to synchronize the behavior of the systems. Ultimately, what you've done is shifted the problem from "improve an unreliable monitoring system" to "make two unreliable monitoring systems agree with each other in every case". The inevitable result is that your engineers will begin to ignore both monitoring systems because neither can be trusted. Great monitoring systems require a single source of truth.
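One way to picture that single source of truth: the dashboard panel and the alert rule are both defined against the same query over the same metric store, so "what paged me" and "what the graph shows" cannot drift apart. The in-memory store, metric name, and threshold in this sketch are hypothetical stand-ins.

```python
from statistics import mean

# Hypothetical metric store: most recent samples last.
metric_store = {
    "api.latency_ms": [112, 118, 131, 240, 260, 255],
}

def query(name, last_n):
    return metric_store[name][-last_n:]

def render_panel(name, last_n=6):
    # The graph is drawn from the same query the alert evaluates.
    return {"title": name, "points": query(name, last_n)}

def evaluate_alert(name, threshold, last_n=3):
    window = query(name, last_n)    # same data the panel draws
    return mean(window) > threshold

print(render_panel("api.latency_ms"))
print("page on-call:", evaluate_alert("api.latency_ms", threshold=200))
```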

4. Standardize Processing, but Liberate Collection

There’s a popular notion among consultants that the proper way to implement a monitoring solution is to first create a plan that lists every possible service you could ever want to monitor, and then choose a tool that satisfies that list. In our experience, great monitoring systems do the opposite. They plan and build a substrate: a common, organization-wide service for processing telemetry data from monitoring systems. Then they enable and encourage every engineer, regardless of team affiliation or title, to send monitoring data to it by whatever means necessary.

Awesome monitoring systems standardize on a single means of metrics processing, storage, analysis, and visualization, but they declare open season on data collectors. Every engineer should be free to implement whatever means she deems appropriate to monitor the services she's responsible for. Monitoring new stuff should be hassle free.
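A concrete example of "open season on collectors" is a wire format simple enough that any script in any language can speak it. The sketch below emits one measurement in the Graphite-style plaintext line format (name, value, timestamp over TCP); the host, port, and metric name are placeholder assumptions for whatever your organization's standard ingestion endpoint happens to be.

```python
import socket
import time

def send_plaintext_metric(name, value, host="metrics.internal", port=2003):
    """Emit one measurement as a Graphite-style plaintext line:
    "<name> <value> <timestamp>". Host and port are placeholders for your
    organization's shared ingestion endpoint."""
    line = f"{name} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical example: a cron job reporting how long last night's backup took.
send_plaintext_metric("backups.nightly.duration_seconds", 742)
```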

5. Let the Users Define Their Own Interactions

Another popular notion in the corporate world is that monitoring systems should provide a "single pane of glass", i.e., a single, unified dashboard that shows a high-level overview of the entire system state. The best monitoring tools focus instead on enabling engineers to create and manage their own dashboards, thresholds, and notifications. Turn-key dashboards are a good start, but it’s far more important to create a system that encourages the people who know how the systems work to curate meaningful collections of metrics.

Great monitoring systems represent a single source of truth that is so compelling and easy to interact with that the engineers naturally rely on them to understand what's going on in production. When they want to track how long a function takes to execute in production, they naturally choose to instrument their code and observe feedback using the monitoring system. When they have an outage, their first thought is to turn to the dashboard for that service before they attempt to SSH to one of the hosts they suspect is involved.
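For instance, timing a function in production can be as small as a decorator that reports elapsed time into the shared pipeline. The metric name is made up for illustration, and the print statement stands in for an actual send to the monitoring system.

```python
import functools
import time

def timed(metric_name):
    """Report how long the wrapped function takes, in milliseconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000.0
                # Stand-in for sending the measurement to the pipeline.
                print("publish", metric_name, round(elapsed_ms, 2))
        return wrapper
    return decorator

@timed("checkout.charge_card.duration_ms")
def charge_card(order_id):
    time.sleep(0.05)              # stand-in for the real work
    return f"charged {order_id}"

print(charge_card("order-42"))
```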

Great monitoring systems inspire and invite adoption. A monitoring system that people must be coerced into using is solving predefined, theoretical problems rather than the actual problems your engineers face. If your engineers are avoiding the monitoring system or rolling their own tools to work around it, ask yourself why they prefer those tools over the ones you want them to use, and focus on building a system around their needs.

6. Include Monitoring in the Software Development Lifecycle

Monitoring is unit testing for operations. For all distributed applications – and, we'd argue, for a great many traditional services – it is the best, if not the only, way to verify that your design and engineering assumptions bear out in production. Further, instrumentation is the only way to gather in-process metrics that directly correspond to the well-being and performance of your production applications.

Therefore, instrumentation is production code. It is a legitimate part of your application, not extraneous debugging text that can be slovenly implemented with the implicit assumption that it will be removed later.

Your engineers should have libraries at their disposal that enable them to thoughtfully and easily instrument their applications in a way that is commonly understood and repeatable. Libraries like Coda Hale's Metrics are a fantastic choice if you don't want to roll your own. In the same way a feature isn't complete until you provide a test for it, your application is not complete until it is instrumented so its inner workings can be verified by the monitoring data stream.
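Coda Hale's Metrics is a JVM library; to illustrate the general shape of such an API without tying the example to its actual interface, here is a minimal Python sketch of a registry that lives in the application as production code and is flushed to the monitoring pipeline. The names and the print-based flush are illustrative assumptions.

```python
import threading

class MetricsRegistry:
    """A library-style registry carried inside the application process.
    (Illustrative only; not the API of Coda Hale's Metrics.)"""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._gauges = {}

    def increment(self, name, delta=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + delta

    def gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def flush(self):
        # Snapshot under the lock, emit outside it; the print stands in for a
        # real send to the shared telemetry service.
        with self._lock:
            snapshot = {**self._counters, **self._gauges}
        for name, value in snapshot.items():
            print("publish", name, value)

registry = MetricsRegistry()
registry.increment("orders.accepted")
registry.increment("orders.accepted")
registry.gauge("worker.queue_depth", 17)
registry.flush()
```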

7. Evolve Rather Than Transform

Healthy monitoring systems don't need a semi-monthly maintenance procedure. They remain relevant because they're constantly being iterated on by the engineers who rely on them to solve their everyday problems. New metrics are added by engineers who are instrumenting a new service, or trying to understand the behavior of some reclusive piece of infrastructure or code. The team that created a measurement removes it when it's no longer needed and is merely cluttering up the dashboards. Collecting, storing and visualizing new metrics is so painless that engineers add new metrics throughout the day.

By focusing on the data, relying on the accuracy of the results, and enabling everyone to iteratively fix the pieces they rely on, your monitoring system will evolve into exactly what your organization needs it to be: a stable, self-reliant, scalable and efficient tool that your team actually likes using.

About the Author

As the developer evangelist for Librato, a part of SolarWinds Cloud, Dave Josephsen hacks on tools and documentation, writes about statistics, systems monitoring, alerting, metrics collection and visualization, and generally does anything he can to help engineers and developers close the feedback loop in their systems. He's written books for Prentice Hall and O'Reilly, speaks shell, Go, C, Python, Perl and a little bit of Spanish, and has never lost a game of Calvinball.
