Charity Majors on Observability and Understanding the Operational Ramifications of a System

Key Takeaways

  • The current best-practice approaches to developing software -- microservices, containers, cloud native, schedulers, and serverless -- are all ways of coping with massively more complex systems. However, our approach to monitoring has not kept pace.
  • Majors argues that the health of the system no longer matters.  We've entered an era where what matters is the health of each individual event, each individual user's experience, or each shopping cart's experience (or other high-cardinality dimensions).
  • Engineers are now talking about observability instead of monitoring, about unknown-unknowns instead of known-unknowns.
  • Databases and networks were the last two priesthoods of system specialists.  They had their own special tooling, insider language, and specialists, and they didn't really belong to the engineering org.  That time is over.
  • It will always be the engineer's responsibility to understand the operational ramifications and failure models of what we're building, auto-remediate the ones we can, fail gracefully where we can't, and shift as much operational load as possible to the providers whose core competency it is.
  • Don't attempt to "monitor everything". You can't. Engineers often waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft. 
  • In the chaotic future we're all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts - not more.
  • Majors qualifies her advice by noting that many of us don't have the problems of large distributed systems.  If you can get away with a monolith, a LAMP stack, and a handful of monitoring checks, you should absolutely do that.
     

InfoQ recently sat down with Charity Majors, CEO of honeycomb.io and co-author of “Database Reliability Engineering” (with Laine Campbell), and discussed the topics of observability and monitoring.

InfoQ: Hi Charity, many thanks for speaking to InfoQ today. Could you introduce yourself, and also share a little about your experience of monitoring systems, especially data store technologies?

Yes!  I'm an operations engineer, co-founder, and (wholly accidentally) CEO of honeycomb.io.  I've been on-call for various corners of the Internet ever since I was 17 years old -- university, Second Life, Parse, Facebook.  I've always gravitated towards operations because I love chaos, and towards data because I have a god complex.  I do my best work when the material is critical, unpredictable, and dangerously high-stakes.  Actually, when I put it that way, maybe it was inevitable for me to end up as CEO of a startup...

One thing I have never loved, though, is monitoring.  I've always avoided that side of the room.  I will prototype and build the v1 of a system, or I will do a deep-dive and debug or put right a system, but I steer away from the stodgy areas of building out metrics and dashboards, and curating monitoring checks.  It doesn't help that I'm not so capable when it comes to visualization and graphs.

InfoQ: Can you explain a little about how operational and infrastructure monitoring has evolved over the last five years? How have cloud, containers, and new (old) modular architectures impacted monitoring?

Oh man.  There's a tidal wave of technological change that's been gaining momentum over the past five years.  Microservices, containers, cloud native, schedulers, serverless... all these movements are ways of coping with massively more complex systems (driven by Moore's law, the mobile diaspora, and the platformization of technical products).  The center of gravity is moving relentlessly to the generalist software engineer, who now sits in the middle of all these APIs for in-house services and third-party services.  And their one job is to craft a usable piece of software out of the center of this storm.

What's interesting is that monitoring hasn't really changed.  Not in the past ... 20 years.

You've still got metrics, dashboards, and logs.  You've got much better ones!  But monitoring is a very stable set of tools and techniques, with well-known edge cases and best practices, all geared around making sure the system is still in a known good state.

However, I would argue that the health of the system no longer matters.  We've entered an era where what matters is the health of each individual event, each individual user's experience, or each shopping cart's experience (or other high-cardinality dimensions).  With distributed systems you don't care about the health of the system; you care about the health of the event or the slice.

This is why you're seeing people talk about observability instead of monitoring, about unknown-unknowns instead of known-unknowns, and about distributed tracing, honeycomb, and other event-level tools aimed at describing the internal state of the system to external observers.
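
To make the distinction concrete, here is a minimal sketch of event-level instrumentation: one wide, structured event per request, carrying high-cardinality fields such as the user or cart ID alongside the usual status and latency numbers.  The field names and the print-as-JSON transport are hypothetical stand-ins for illustration, not Honeycomb's actual API.

```python
import json
import time
import uuid


def handle_request(user_id, cart_id, endpoint):
    """Handle one request and emit a single wide, structured event describing it."""
    start = time.time()
    status, error = 200, None
    try:
        pass  # ... application logic would run here ...
    except Exception as exc:
        status, error = 500, str(exc)
    finally:
        # One event per request, carrying high-cardinality fields (user_id,
        # cart_id, request_id) so you can later ask "how was *this* user's
        # or *this* cart's experience?" rather than "is the system healthy?"
        event = {
            "timestamp": time.time(),
            "request_id": str(uuid.uuid4()),
            "endpoint": endpoint,
            "user_id": user_id,    # high-cardinality dimension
            "cart_id": cart_id,    # high-cardinality dimension
            "status": status,
            "error": error,
            "duration_ms": round((time.time() - start) * 1000, 2),
        }
        print(json.dumps(event))  # stand-in for shipping to an event store


handle_request(user_id="u-8675309", cart_id="c-42", endpoint="/checkout")
```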

InfoQ: Focusing in on data store technology and your new book you have co-authored with Laine Campbell -- "Database Reliability Engineering" -- how has the approach to monitoring these technologies changed over the last few years?

Databases and networks were the last two priesthoods of system specialists.  They had their own special tooling, insider language, and specialists, and they didn't really belong to the engineering org.  That time is over.  Now you have roles like "DBRE" (database reliability engineer), which acknowledge the deep specialist knowledge while also bringing those specialists into the fold of continuous integration/continuous deployment, code review, and infrastructure automation.

This goes for monitoring and observability tooling as well.  Tools create silos.  If you want your engineering org to be cross-functionally literate, if you want a shared on-call rotation ... you have to use the same tools to debug and understand your databases as you do the rest of your stack.  That's why honeycomb and other next-generation services focus on providing a software-agnostic interface for ingesting data.  Anything you can turn into a data structure, we can help you debug and explore.  This is such a powerful leap forward for engineering teams.
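
As a sketch of that idea (the ingest URL, dataset name, and fields below are made up for illustration, not Honeycomb's real API): once a database event has been turned into a plain data structure, it can flow through exactly the same ingest path as any application event.

```python
import json
import urllib.request

INGEST_URL = "https://ingest.example.com/events"  # hypothetical ingest endpoint


def send_event(dataset, payload):
    """POST any dict as a structured event to a generic, schema-free ingest API."""
    req = urllib.request.Request(
        f"{INGEST_URL}/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# A parsed slow-query log entry is just another data structure, so it goes
# through the same pipeline as application events.
slow_query = {
    "source": "mysql",
    "query_fingerprint": "SELECT * FROM carts WHERE user_id = ?",
    "duration_ms": 1840.2,
    "rows_examined": 120000,
    "host": "db-primary-1",
}
send_event("databases", slow_query)
```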

InfoQ: With the rise in popularity of DBaaS technologies like AWS RDS and Google Spanner, do you think the importance of monitoring database technologies has risen or fallen? And what has been the impact for the end users/operators?

Monitoring isn't really the point.  I outsource most of my monitoring to AWS, and it's terrific!  We use RDS and Aurora at honeycomb, despite being quite good at databases ourselves, because it isn't our core competency.  If AWS goes down, let them get paged.

Where that doesn't let me off the hook is observability, instrumentation, and architecture.  We have architected our system to be resilient to as many problems as possible, including an AWS Availability Zone (AZ) going down.  We have instrumented our code, and we slurp lots of internal performance information out of MySQL, so that we can ask any arbitrary question of our stack -- including databases.  This rich ecosystem of introspection and instrumentation is not particularly biased towards the traditional monitoring stack's concerns of actionable alerts and outages.
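
For example, pulling a few internal MySQL counters and folding them into a structured event might look roughly like the sketch below.  It assumes the PyMySQL driver and made-up connection details, and is not the actual instrumentation Honeycomb runs.

```python
import json

import pymysql  # assumed driver; any MySQL client library would work


def mysql_performance_snapshot(host="db-primary-1", user="monitor", password="..."):
    """Pull a handful of internal MySQL counters and return them as a flat dict."""
    interesting = {"Threads_running", "Threads_connected", "Slow_queries",
                   "Questions", "Innodb_row_lock_time"}
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS")
            rows = cur.fetchall()  # (variable_name, value) pairs
    finally:
        conn.close()
    return {name: value for name, value in rows if name in interesting}


# Attach the database internals to a structured event so they can be queried
# with the same tools as the rest of the stack.
event = {"source": "mysql", "host": "db-primary-1"}
event.update(mysql_performance_snapshot())
print(json.dumps(event))
```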

It will always be the engineer's responsibility to understand the operational ramifications and failure models of what we're building, auto-remediate the ones we can, fail gracefully where we can't, and shift as much operational load as possible to the providers whose core competency it is.  But honestly, databases are just another piece of software.  In the future, you want to treat databases as much like stateless services as possible (while recognizing that, operationally speaking, they aren't), and as much like the rest of your stack as possible.

InfoQ: What role do you think QA/Testers have in relation to monitoring and observability of a system, both from a business and operational perspective? Should the QA team be involved with the definition of SLOs and SLAs?

I've never worked with QA or testers.  I kind of feel like QA missed the boat a decade ago, and failed to move with the times.  I deeply love the operations engineering profession, and I'm trying to make sure the same doesn't happen to ops.  There will always, always be a place for operational experts ... but we are increasingly a niche role, and for most people we will live on the other side of an API.

Developers will be owning and operating their own services, and this is a good thing!  Our role as operational experts is to empower, to educate, and to be force amplifiers -- and to build the massive, world-class platforms they can use to build composable infrastructure stacks and pipelines, like AWS ... and honeycomb.

InfoQ: What is the most common monitoring antipattern you see, both from the perspective of the data store and application? Can you recommend any approaches to avoid these?

"Monitor everything".  Dude, you can't.  You *can't*.  People waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft.  In the chaotic future we're all hurtling toward, you actually have to have the discipline to have radically *fewer* paging alerts ... not more.  Request rate, latency, error rate, saturation.  Maybe some end-to-end checks that stress critical Key Performance Indicator (KPI) code paths.

People are over-paging themselves because their observability blows and they don't trust their tools to let them reliably debug and diagnose the problem.  So they lean heavily on clusters of tens or hundreds of alerts, which they pattern-match for clues about what the root cause might be.  They're flying blind for the most part; they can't just explore what's happening in production and casually sate their curiosity.  I remember living that way too, and that's why we wrote honeycomb.  So we would never have to go back.

InfoQ: Thanks once again for taking the time to sit down with us today. Is there anything else you would like to share with the InfoQ readers?

Nothing I say should be taken as gospel.  Lots of people don't have the problems of large distributed systems, and if you don't have those problems, you shouldn't take any of my advice.  If you can get away with a monolith and a LAMP stack and a handful of monitoring checks, you should absolutely do that.  Someday you may reach a tipping point where it becomes harder and more complicated to achieve your goals *without* microservices and explorable event-driven observability, but you should do your best to put that day off.  Live and build as simply as you possibly can.

About the Interviewee

Charity Majors is a cofounder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging of complex systems. She has worked at companies like Facebook, Parse, and Linden Lab, as a systems engineer and engineering manager, but always seems to end up responsible for the databases too. She loves free speech, free software and a nice peaty single malt.
