Key Takeaways
- Mike Julian’s new book, Practical Monitoring, provides a foundational introduction to the topic of monitoring, and presents an overview of core principles, monitoring antipatterns, and monitoring design patterns.
- Monitoring is an action -- a thing you do -- while observability is an attribute of a system that enables monitoring. The more observable a system is, the better you can monitor it, reason about it, and simply understand how it works.
- The rise of ephemeral infrastructure (e.g. containers, short-lived cloud instances, serverless) and distributed architectures has drastically changed how we do monitoring.
- The antipatterns “tool obsession”, “monitoring-as-a-job”, and “checkbox monitoring” reinforce Julian’s argument that any approach to monitoring should be holistic.
- The most common, and most insidious, antipattern is constantly looking for the next hot tool that promises to solve all your problems.
- When it comes to business metrics it is crucial that everyone at least understands what these metrics are, why they matter, and how the app/infrastructure makes them available.
- Good KPIs to monitor can be identified by asking a product manager or nearest VP several key questions: How does the business make money? How do we know if we’re doing well or doing poorly? What are the targets for those metrics?
Mike Julian has recently published Practical Monitoring with O’Reilly, which aims to provide readers with a foundational introduction to the topic of monitoring, as well as practical guidelines on how to monitor service-based applications and cloud infrastructure.
Julian discusses in the preface how the monitoring landscape of today is vastly different from what it was only a few years ago, and even more so from ten years ago. With the widespread popularity of ephemeral cloud infrastructure and architectural approaches like microservices came new problems for monitoring, as well as new ways to solve old problems. The book aims to address these issues, and also to answer common monitoring questions, such as: “Do you have a nagging feeling that your monitoring needs improvement, but you’re just not sure where to start or how to do it? Are you plagued by constant, meaningless alerts? Does your monitoring system routinely miss real problems?”
Practical Monitoring is aimed at readers seeking a foundational understanding of monitoring. The preface states that it is suitable for junior staff as well as non-technical staff looking to increase their knowledge of monitoring, and that “if you already have a great grasp on monitoring, this probably is not the book for you”. Although Julian introduces and discusses many modern monitoring tools -- such as statsd, InfluxDB, Prometheus, and Sensu -- he does not provide deep dives into specific tools, and instead focuses on practical, real-world examples of how such tools should be deployed within a holistic approach to monitoring.
The book begins with an overview of monitoring principles, and looks at monitoring antipatterns as well as current “good practice” monitoring design patterns. The antipatterns “tool obsession”, “monitoring-as-a-job”, and “checkbox monitoring” reinforce Julian’s argument that any approach to monitoring should be holistic. The best-practice patterns presented, such as “composable monitoring”, “monitor from the user perspective”, and “continual improvement”, also demonstrate the influence of modern software engineering approaches, such as a focus on modularity, cultivating a user-centric approach, and principles from Lean. Julian also covers how to create effective alerting, and discusses the associated people and organisational challenges of being on-call and managing incidents. This section of the book concludes with a basic primer on the use of statistics within monitoring, and covers the mean (average) and median, as well as quantiles and standard deviation.
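To see why those basics matter, consider how a single outlier affects each measure. The following sketch (not from the book; the latency values are invented) uses only Python’s standard library:

```python
# A minimal illustration of the statistics the primer covers,
# using invented latency samples. Note how the single 220 ms
# outlier skews the mean while the median holds steady.
import statistics

latencies_ms = [12, 15, 14, 13, 220, 16, 15, 14, 13, 15]

print("mean:  ", statistics.mean(latencies_ms))    # 34.7 -- pulled up by the outlier
print("median:", statistics.median(latencies_ms))  # 14.5 -- robust to the outlier
print("stdev: ", round(statistics.stdev(latencies_ms), 1))  # spread around the mean
# quantiles() with n=10 returns nine decile cut points;
# the last one approximates the 90th percentile.
print("p90:   ", statistics.quantiles(latencies_ms, n=10)[-1])
```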
The remainder of Practical Monitoring covers monitoring tactics, and includes a discussion of monitoring from the perspective of both the business and the technology. The chapter on monitoring the business discusses concepts such as Key Performance Indicators (KPIs), and provides techniques to identify and capture these. Real-world use cases are also presented, which make the explanations and guidance easy to follow. Monitoring tactics from the technology perspective are covered in a chapter for each of the following: frontend, application (backend), server, network, and security.
The concluding chapter examines how to conduct a monitoring assessment, and revisits key concepts from the rest of the book with a focus on how to identify associated current strengths and weaknesses within an organisation. Suggestions on how to prioritise monitoring efforts are provided, and Julian leaves a clear message for the reader that “monitoring is never done, since the business, application and infrastructure will continue to evolve over time.”
Practical Monitoring will be fully released in November, and is currently available to purchase via the companion website and also via Safari Books. InfoQ recently sat down with Mike Julian to find out more about the motivations for writing the book.
InfoQ: Hi Mike, many thanks for speaking to InfoQ today. Could you introduce yourself, and also say a little about your motivations for writing your new book "Practical Monitoring" please?
Hi, I’m Mike Julian. I’m a former operations engineer turned business owner. I run a consulting company called Aster Labs where I focus on helping companies improve their monitoring. I’m the editor of Monitoring Weekly, a weekly email newsletter about all-things-monitoring. I’m also involved in a few other projects, which you can find on my personal site, mikejulian.com.
Every time I was at an event, the pub, or coffee shop and someone would find out that I’m “the monitoring guy,” the very next question would be something along the lines of, “My monitoring stinks. What should I do?” or my personal favorite, “What’s the best monitoring tool these days?” After the bajillionth time answering the same questions with the same answers, I decided to just write a book about it all. Specifically, I wanted a book that wasn’t oriented around how to use this tool or that tool and instead talked about the principles behind monitoring. And thus, Practical Monitoring was born.
InfoQ: Your book makes good use of design patterns, which many developers can relate to. Can we ask why you chose this approach?
You know, it was actually accidental. I begin the book with what I think is the most important topic: antipatterns and things not to do. After writing that chapter, I realized I should probably tell people what they should do too, making “Chapter 2: Design Patterns” a thing. I think it worked out quite well. Ultimately, it fits very well with my tool-agnostic approach to monitoring in that you should focus on good patterns, avoid bad ones, and everything else will fall into place.
InfoQ: Can you explain a little about how operational and infrastructure monitoring has evolved over the last five years? How have cloud, containers, new data store technologies and new language runtimes impacted monitoring?
The rise of ephemeral infrastructure (e.g. containers, short-lived cloud instances, serverless) and distributed architectures has changed how we do monitoring drastically. Even five years ago, Graphite and statsd were still cutting edge tools, and emitting metrics from inside the app was a novel idea for many teams. Nowadays, not only is such a setup commonplace, but many teams are finding it insufficient.
Specifically, we’re now talking about how to handle millions of metrics, how to monitor code that exists for fractions of seconds, how to effectively trace requests through hundreds of microservices, and more. I think these problems transcend languages and storage backends, and speak more directly to how we reason about the systems we build. This is certainly a much harder (and more interesting!) problem than monitoring, say, the latest NoSQL datastore.
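For readers unfamiliar with the setup Julian mentions, emitting metrics from inside an application can be as lightweight as writing a datagram in the statsd wire format. Here is a minimal sketch in Python, where the host, port, and metric names are illustrative assumptions:

```python
# A bare-bones statsd client using only the standard library.
# statsd listens for UDP datagrams of the form "name:value|type".
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)  # statsd's default port; adjust as needed
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric, value=1):
    """Emit a statsd counter: 'name:value|c'."""
    sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)

def timing(metric, ms):
    """Emit a statsd timer: 'name:ms|ms'."""
    sock.sendto(f"{metric}:{ms}|ms".encode(), STATSD_ADDR)

# Example: instrument a (hypothetical) request handler.
start = time.monotonic()
# ... handle the request ...
incr("app.requests")
timing("app.request_latency", (time.monotonic() - start) * 1000)
```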
InfoQ: What role do you think QA/Testers have in relation to monitoring and observability of a system, both from a business and operational perspective?
I think it’s actually a mixed bag for QA teams: as applications and systems become more observable and capable of checking and reporting on their own health/functionality, the role of QA diminishes significantly.
On the other hand, QA is in a great position to work with engineering on what metrics and health checks the app needs to make the QA team’s job easier and more automated. Certainly, there are some aspects of a system that can’t easily be automated for testing, but those that can be automated should be. QA is in the best position to say how that should look.
InfoQ: How important do you believe it is for engineers to understand statistics in relation to monitoring? Can you provide any recommended key things to learn?
You can get surprisingly far with very basic statistics. A cursory understanding of the use and limitations of averages, median, and percentiles really solves for a lot of the use cases the typical engineer is likely to encounter. For example, one of the most misunderstood statistical concepts is that of the percentile and the limitations on it. If you record the 90th percentile of a dataset every week over 12 weeks and then average those 12 data points together, the answer is inaccurate (because a percentile is intentionally losing data). In order to calculate the 90th percentile for a 12 week period, you’d need to have the full 12 weeks of data.
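To see that pitfall in action, here is a minimal sketch in plain Python; the latency data is synthetic and the nearest-rank percentile method is an illustrative choice:

```python
# Demonstrates that averaging weekly 90th percentiles does not
# give the true 90th percentile of the combined period.
import random

random.seed(42)

def p90(values):
    """Return the 90th percentile via the nearest-rank method."""
    ordered = sorted(values)
    return ordered[max(0, int(0.9 * len(ordered)) - 1)]

# Twelve weeks of fake latency samples (ms), drifting upward week by week.
weeks = [[random.gauss(100 + 5 * w, 20) for _ in range(1000)] for w in range(12)]

weekly_p90s = [p90(week) for week in weeks]
avg_of_p90s = sum(weekly_p90s) / len(weekly_p90s)

all_samples = [sample for week in weeks for sample in week]
true_p90 = p90(all_samples)

print(f"average of weekly p90s: {avg_of_p90s:.1f} ms")
print(f"true p90 over 12 weeks: {true_p90:.1f} ms")  # generally differs
```

Because each weekly percentile has already discarded the shape of that week’s data, averaging them cannot recover the distribution of the full twelve weeks.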
If you want a book about stats that’s more approachable than your college textbook, I really enjoyed reading Naked Statistics by Charles Wheelan during my research.
InfoQ: What is the most common monitoring antipattern you see? Can you recommend an approach to avoid this?
The most common one, and the most insidious, is constantly looking for the next hot tool that’s going to solve all your problems (Chapter 1, Anti-pattern #1: Tool Obsession). You can read more about it in the book, but the quick version is that there is no magic here and teams have done quite well with awful tools. I’ve seen plenty of teams using the latest tools and failing miserably to get any effective monitoring built. As the old saying goes, a craftsperson doesn’t blame their tools for bad work.
The solution is to recognize that your tools probably aren’t the problem, and that you need to look much deeper at what you’re actually doing, how you’re monitoring apps and infrastructure, and why you think your monitoring isn’t very good. There’s a 99% chance that your tools are fine, and that your strategy is in fact to blame.
InfoQ: There has been some great discussion recently about monitoring versus observability from engineers like Cindy Sridharan and Charity Majors. Can we ask your thoughts on this topic please?
I think Cindy is brilliant and totally on point about all of it. If I really had to sum it up, I’d put it this way: monitoring is an action -- a thing you do -- while observability is an attribute of a system that enables monitoring (credit to Baron Schwartz for that take on it).
Many of you have no doubt been in the situation where you’re trying to monitor some homegrown application only to realize it’s a black box with no logs or metrics - that’s an unobservable system. The more observable a system is, the better you can monitor it, the better you can reason about it, the better you can simply understand how it works. Really, improving observability is a matter of improving the application.
I don’t talk about observability much in the book, and instead conflate it with monitoring. That was intentional. Observability versus monitoring is a nuanced topic and not one that really matters when you’re in the situation of just getting started with monitoring. I imagine that once your monitoring matures, the concept of observability will begin to matter a lot more to you and your team.
InfoQ: You talk about business metrics and KPIs in the book. Who do you believe is most responsible for ensuring these are implemented: product owners, developers or operators? Or is it a team effort, and if so, how should everyone work together?
It’s really a team effort, though everyone has a different role to play. Take, for example, user growth over time on a SaaS app. Product owners/managers define that this is something they care about, and developers write the code to make reporting on that data easy.
In a more complex scenario, technical operations/system administrators have a role to play: the cost to service a user. Calculating how much your infrastructure costs per user is a great way to eventually increase profit margins, and also helps you understand whether your current infrastructure is tenable. For example, if the cost to provide service to a customer outpaces the revenue from that customer, you’ve got a bit of a problem on your hands, and this kind of data is something that system administrators will have (or can calculate) that the business is often just guessing at.
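As a toy worked example of that calculation (all figures invented):

```python
# Compare infrastructure cost per user with revenue per user
# to check whether the margin holds up. Numbers are made up.
monthly_infra_cost = 42_000.00   # servers, bandwidth, managed services ($)
monthly_revenue = 75_000.00      # subscription revenue ($)
active_users = 30_000

cost_per_user = monthly_infra_cost / active_users        # $1.40
revenue_per_user = monthly_revenue / active_users        # $2.50

print(f"cost per user:    ${cost_per_user:.2f}")
print(f"revenue per user: ${revenue_per_user:.2f}")
if cost_per_user >= revenue_per_user:
    print("Serving a user costs more than they bring in: a problem.")
```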
No matter who does what, when it comes to business metrics I think it’s crucial that everyone at least understands what these metrics are, why they matter, and how the app/infrastructure makes them available.
InfoQ: Can you share any tactics for an engineer that wants to understand and implement KPIs for the business? Where is the best source of KPIs, and how should engineers present results back to the business?
Sit down with a product manager or your nearest VP and ask them a few questions: How does the business make money? How do we know if we’re doing well or doing poorly? What are the targets for those metrics?
You’ll get a great sense of how the business actually functions, and what matters. You can follow it up with one last question: what data that you don’t currently have would make decisions easier? Sometimes you can help with that problem, sometimes you can’t.
Either way, having a better understanding of how the business works and what data is used to judge the health of the company is always valuable.
InfoQ: Thanks once again for taking the time to sit down with us today. Is there anything else you would like to share with the InfoQ readers?
Thank you! It’s been a pleasure. The last thing I want to say is this: improving monitoring is a journey, and often a long one. Improve a small amount every day and you’ll do fine, but don’t expect a major overhaul overnight or even by next month.
Further information on the book can be found on the companion website, and also on Safari.
About the Book Author
Mike Julian is a consultant who helps companies build better monitoring for their applications and infrastructure and the Editor of Monitoring Weekly, an online publication about all-things-monitoring. Mike has previously worked as an Operations/DevOps Engineer for Taos Consulting, Peak Hosting, Oak Ridge National Lab, and others. Mike is originally from Knoxville, TN and currently resides in San Francisco, CA. Outside of work, he spends his time driving mountain roads in a classic BMW, reading, and traveling. You can find Mike at: Mike Julian, Aster Labs, Monitoring Weekly