Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Observability-Driven Development for Tackling the Great Unknown


Key Takeaways

  • As microservices, containers and other new architectural components make systems more distributed, complicated and unpredictable, there is an increase of unknown unknowns.
  • Monitoring only after release no longer scales, as it only works when systems fail in predictable ways. There’s a drive to embrace the uncertainty of modern systems with an open-ended, exploratory interrogation of them.
  • Observability-driven development (ODD) uses data and tooling to observe the state and behavior of a system before, during and after development to learn more about its patterns of weakness.
  • Observability is about gathering detail at the event level in a format that lets us ask any question about our systems from the user's perspective, known or unknown, without shipping new code to ask that question.
  • Observability, like the entire DevOps movement, is about being a better software steward, leaving bread crumbs to explain to the next developer why you did that to your product.

There’s no doubt that systems become more and more complicated the more distributed they are. It makes 24/7 monitoring and on-call rotations essential for most companies. But how has observability affected our ever shorter DevOps feedback loops? In this article, we summarize learnings about observability-driven development from Charity Majors, founder of Honeycomb and fierce proponent of observability.

Observability: The key to software ownership

“Originally the feedback loop was you would break stuff, people would yell at you, and then they would praise you when you fixed it, but then the Internet became a thing and our systems got more complicated,” Majors told the crowd at CloudNative 2018 in London.

Majors recalled that we all began by owning our own software — because who else was going to own it — but, as we got farther from that ownership, we lost the ability to know when something isn’t right. Silos got developers further and further away from running their code and the DevOps movement appeared as “an attempt to return to grace, to return to the state of that virtuous feedback loop,” Majors said.

Each developer needs to own their code, with the ability to deploy it and debug it in production. Anything less is not full DevOps ownership, Majors argues:

It makes software better: the person who is debugging has the most relevant context to debug it when it’s live.

For a long time, software engineers could kinda get away with writing code, hitting save, and going home and leaving the consequences for ops to battle. It was never great, but it was ‘OK’. But now? Most of the time, the only person who has a chance of swiftly debugging the complex interactions that led to a new error surfacing is the person who still has all that code and parts of the system in their head right now: the person who was just modifying it.

The purpose of DevOps automation isn’t just speed, it’s about leveraging the intrinsic motivation and creativity of developers again by freeing them from non-creative, tedious repair work:

What makes us feel fulfilled and satisfied is all about autonomy, feeling empowered, building something that matters, and caring about it being done well.

This, of course, is in line with Dan Pink’s three key intrinsic motivators for knowledge workers: autonomy, mastery, and purpose.

However, time and again devs release code that looks fine in their local setup, but as soon as they deploy, all sorts of issues ensue. It can take days to discover what’s wrong, let alone fix it. Majors warns:

The first lesson of distributed systems is that your system is never up: many catastrophic states exist at any given time.

However, once observability-driven development is in place with the right stack, instrumentation, and, especially, visualization, those same system flaws can be discovered and addressed much faster, typically within hours or even minutes.

“Every software engineer should have it burned in as muscle memory: When you ship code, you should go look at that code. Is it doing what you expected it to? Does anything else look weird? You will catch 80 percent of problems before your users ever notice if you make a disciplined habit of this,” Majors said.

Testing in production is essential because each deploy is a unique combination of artifact, infrastructure state, date and time, and deploy scripts and environments, making some tests completely unique.

How observability-driven development is educating developers about their systems

What is observability-driven development? It’s leveraging tools and hands-on developers to observe the state and behavior of a system in a way that you learn more about said system, including patterns of weakness. ODD is actually interrogating the system, while monitoring is just setting and measuring thresholds for it.

Majors argues that test-driven development — the process of writing a test and then writing code that passes that test — is now ready to evolve into observability-driven development. Both fall under the umbrella of behavior-driven development, but ODD shows a greater understanding of that behavior.

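Majors says observability-driven development is a process you can also apply while on call. The term observability comes from control theory, where it describes how well a system's internal state can be inferred from its outputs, and it raises the question of how we can possibly control complicated distributed systems.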

You understand what’s happening inside your system, inside your code, just by interrogating it. Can you answer new questions without shipping new code?

Majors emphasized that it’s about achieving the right level of abstraction, not contributing to a more complicated codebase:

When you do have an observable system your team will quickly and reliably track down any new problem with no prior knowledge. [They will] understand UX and behavior and reasons for your codes and bugs.

Observability doesn’t negate monitoring, which is still an important piece of DevOps coverage. But according to Majors, monitoring has not kept up in the last 20 years, being still mostly suited for on-premise requirements.

It takes the remnants of outages gone past to translate it into what those dashboards mean — only about two percent of software engineers understand that.

Majors quoted Greg Poirier, VP of Engineering at workflow automation tool Sensu, as saying “monitoring is dead.” Poirier argues that it’s the act of observing and checking the behavior and outputs of a system and its components over time (a good definition of observability-driven development) that makes monitoring an outdated model for complex systems.

“It’s important to build tools for people that make sense with them so they live in one reality,” Majors said, talking about the need for clear, cross-organizational dashboards.

For Majors, observability is about making sure the “known unknowns” are greater than or equal to the “unknown unknowns.”

“There are some problems you can only see if you are way out. You have to gather the detail that will let you find any of those,” she continued.

Majors calls observability a game of looking for outliers — if you have a dozen failures, what do they all have in common, based on collected and queried data? She said you should care about whether each request can succeed and if you can get the resources to work end-to-end.
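The "what do the failures have in common" question can be sketched in a few lines. This is only an illustration under stated assumptions, not Majors' or Honeycomb's method: the event fields (`status`, `build_id`, `region`, `endpoint`), the sample data and the `common_attributes` helper are all hypothetical. It ranks each field/value pair by how much more often it appears among failed events than among all events.

```python
from collections import Counter

# Hypothetical request events; field names and values are invented for illustration.
EVENTS = [
    {"status": 500, "build_id": "b42", "region": "eu", "endpoint": "/export"},
    {"status": 500, "build_id": "b42", "region": "us", "endpoint": "/home"},
    {"status": 200, "build_id": "b41", "region": "eu", "endpoint": "/export"},
    {"status": 200, "build_id": "b41", "region": "us", "endpoint": "/home"},
    {"status": 200, "build_id": "b41", "region": "us", "endpoint": "/export"},
    {"status": 200, "build_id": "b41", "region": "eu", "endpoint": "/home"},
]


def common_attributes(events, is_failure, ignore=()):
    """Rank (field, value) pairs by how over-represented they are among failures."""
    failures = [e for e in events if is_failure(e)]
    if not failures:
        return []

    def pair_counts(rows):
        # Count every (field, value) pair, skipping fields that define failure itself.
        return Counter(
            (k, v) for row in rows for k, v in row.items() if k not in ignore
        )

    fail_counts = pair_counts(failures)
    all_counts = pair_counts(events)
    ranked = [
        (pair, n / len(failures), all_counts[pair] / len(events))
        for pair, n in fail_counts.items()
    ]
    # Largest gap between share-of-failures and share-of-all-traffic comes first.
    ranked.sort(key=lambda r: r[1] - r[2], reverse=True)
    return ranked


ranked = common_attributes(EVENTS, lambda e: e["status"] >= 500, ignore=("status",))
for (field, value), fail_share, base_share in ranked[:3]:
    print(f"{field}={value}: {fail_share:.0%} of failures vs {base_share:.0%} overall")
```

In this made-up data every failure shares one build, while region and endpoint are evenly spread, so the build is the outlier worth following up with the next question.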

The immense challenge of distributed systems is compounded by the fact that they’re actually more akin to an interconnected network of systems, many of which are out of our sphere of control. They are therefore impossible to observe directly.

Monitoring is monitoring. Observing is event-first testing in production.

“Every piece of architecture is unique so you have to test it. And you have to test it in prod because you can only test so much before you get into prod,” Majors said, explaining that deploying code is not an on-off, binary switch.

You should, of course, test before deploy in staging and after deploy in production, but that might exhaust the often limited engineering resources. Majors advocates for embracing the reality that you are testing in production whether you intend to or not, and recommends using techniques like canary testing as guardrails to help achieve observability.
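As a rough sketch of a canary guardrail, the check below compares the error rate of a small canary slice against the existing baseline before deciding whether to continue a rollout. The function name, the dictionary shape and both thresholds are assumptions chosen for illustration. Real canary analysis usually looks at more signals than error rate alone.

```python
def canary_verdict(baseline, canary, max_extra_error_rate=0.01, min_requests=100):
    """Return "wait", "promote" or "rollback" for a canary deploy.

    `baseline` and `canary` are dicts with "requests" and "errors" counts.
    The thresholds are illustrative defaults, not recommended values.
    """
    if canary["requests"] < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    base_rate = (
        baseline["errors"] / baseline["requests"] if baseline["requests"] else 0.0
    )
    canary_rate = canary["errors"] / canary["requests"]
    if canary_rate - base_rate > max_extra_error_rate:
        return "rollback"
    return "promote"


baseline = {"requests": 10_000, "errors": 50}
print(canary_verdict(baseline, {"requests": 500, "errors": 20}))
```

The point of the guardrail is that the new code is observed on real traffic while only a small share of requests is exposed to it.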

She calls observability the missing link, allowing software owners to test in production, offering an event-first perspective of the software, how it’s being used and how it’s reacting to that use.

Majors says good automated monitoring includes these best practices:

  • Many actionable active checks and alerts
  • Proactively notifying engineers of failures and warnings
  • Maintaining a runbook for stability and predictability in production systems
  • Expecting clusters and clumps of tightly coupled systems to all break at once

But with microservices it gets much more complicated, with many more of the “unknown unknowns.”

There are so many components and storage systems, you cannot model the entire system in your head. The health of the system is irrelevant — the health of every individual request is of supreme consequence.

Monitoring only covers the known unknowns, which become a support problem. These are predictable and can be dealt with in a predictable time frame, and can be monitored on a dashboard.

The “unknown unknowns” remain an engineering problem. These are often open-ended in the timeframe it takes to fix, and they require exploration of the systems and creativity. It’s what devs should be spending their time on. Observability tackles this great unknown.

Observability is event-level introspection to align the developer’s reality with the user’s

Majors says, to do observability right, you should bring it in the moment you even consider building something, making it an inherent part of your development process.

She says it’s about hunting your unknowns, fine-tuning your instruments from the inside-out. And it’s about being a good software steward for the next user, even if that user is you six months down the line wondering why you ever made that choice.

Getting inside the software’s head and explaining it back to you and a naive user — finding those breadcrumbs and leaving them so your future self can trace it back to the source.

Monitoring is mostly about metrics, while observability is about events. Majors recommends that you first focus on debugging high-cardinality events — important but often unique information like identification numbers, user names, and email addresses — because these involve a lot of context and tracking.
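One way to picture an event with high-cardinality context is a single structured record per request, emitted as one JSON line. The sketch below is a minimal illustration using only the Python standard library. The `run_with_event` wrapper and the field names are hypothetical and not taken from the talk or from any vendor SDK.

```python
import json
import time
import uuid


def run_with_event(user_id, endpoint, work, emit=print):
    """Run `work()` and emit one wide, structured event describing the request."""
    event = {
        "request_id": str(uuid.uuid4()),  # unique per request: high cardinality
        "user_id": user_id,               # high cardinality as well
        "endpoint": endpoint,
    }
    start = time.monotonic()
    try:
        result = work()
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error_type"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so every request leaves exactly one event.
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event))


run_with_event("user-123", "/export", lambda: "done")
```

Because the identifiers travel inside the event rather than being pre-aggregated into a metric, a later query can still filter or group by any one user or request.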

She says events tell stories that help you uncover that context and the outliers, which in turn helps you identify what’s wrong. She added that bandwidth and cost restrict your ability to store “all the data all of the time,” but it’s essential to structure your aggregated data logs “because this is how our brains work and this is how our code works.”

According to Majors, creating a dashboard is not the answer:

It’s an artifact of some past failure. It jumps to the answer instead of starting with a question.

Majors argues for everything in real time, instead of grasping at trends. She went on to explain that observability requires the ability to drill down through the sampled data all the way to the raw requests. Aggregation doesn’t work because data that has been merged together can never be expanded again. Sampling, however, allows you to retain a sufficient level of detail for asking more questions later on. Observability is about asking a question and following the answer. Then asking a new question. And repeating this cycle until you discover that “unknown unknown”.
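The difference between sampling and aggregation can be sketched as follows. This is an assumption-laden illustration, not a description of any particular tool: it keeps every failed event, keeps one in N successful events, and records a `sample_rate` on each kept event so totals can still be estimated. Keeping every Nth success is a simplification chosen to make the example deterministic. Production samplers are often randomized.

```python
def sample_events(events, keep_one_in=10):
    """Keep all non-ok events and one in `keep_one_in` ok events, with full detail."""
    kept = []
    seen_ok = 0
    for event in events:
        if event["status"] != "ok":
            kept.append({**event, "sample_rate": 1})
        else:
            if seen_ok % keep_one_in == 0:
                # This one raw event stands in for `keep_one_in` similar events.
                kept.append({**event, "sample_rate": keep_one_in})
            seen_ok += 1
    return kept


def estimated_count(sampled, predicate=lambda e: True):
    """Estimate how many original events matched, using the stored sample rates."""
    return sum(e["sample_rate"] for e in sampled if predicate(e))


# Hypothetical traffic: 100 successful requests and 3 failures.
traffic = [{"status": "ok", "user_id": f"u{i}"} for i in range(100)]
traffic += [{"status": "error", "user_id": f"u{i}"} for i in (7, 7, 31)]

sampled = sample_events(traffic)
print(len(sampled), estimated_count(sampled))
```

Unlike a pre-computed average or counter, each kept record is still a raw request with its `user_id`, so it can be regrouped by any field when the next question comes up.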

Service owners, not just operators

Services really do need owners, not operators, and they need their owners to care about observability before you even write one line of code.

In the new world of always-on DevOps, this means developers need a higher level of operational literacy and to be fluent in their own code. Majors says to achieve this proficiency and to really get a grasp on what “abnormal” looks like, devs need to be watching their code run in production. Majors suggests this could bring down the number of production incidents by up to 80 percent.

She reckons that artificial intelligence and machine learning will eventually reach the level where software becomes context-aware and self-healing, but only in a distant future where the computers that write the code understand the original intent behind it.

Majors’ key premise is that proper observability will lead to drastically fewer pager alerts.

About the Author

Jennifer Riggins is a tech storyteller and writer, where digital transformation meets culture, hopefully changing the world for a better place. Follow her on Twitter @jkriggins.

Community comments

  • 'Architectural Complexity' image

    by Razvan Gaston,

    Would it be possible, for 'Architectural Complexity' image, to upload a version where you could zoom in and read the major components? (at least the bigger boxes).
    As it is right now, you can't read any of the components shown in the diagram.

  • Re: 'Architectural Complexity' image

    by Manuel Pais,

    Hi Razvan, thank you for your message.

    I'm afraid we don't have a larger image. Note that it is just meant to illustrate the growth in architecture complexity, the details of the components were purposely kept difficult to read :)

  • Misses the "How"

    by Vinit Samel,

    Sold on the Why and Details on the What... but How is missing. Working strategies on executing to ODD would have been great. Missed the landing a bit...
