
Facilitating the Spread of Knowledge and Innovation in Professional Software Development



Comparing Apples and Volkswagens: the Problem with Aggregate Incident Metrics


Summary

Courtney Nash presents data from the Verica Open Incident Database (VOID) to demonstrate how aggregate incident metrics (MTTR) aren't representative of systems' resilience.

Bio

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Nash: We're going to be talking about comparing apples with Volkswagens. I have a background in cognitive neuroscience; I was fascinated by how the brain works in terms of how we learn and how we remember things. Along the way, this funny thing called the internet showed up, and I ran off to join it. I worked at a bunch of places, and I am now at a company called Verica. There, I started this thing called the VOID, the Verica Open Incident Database, which I'll get to and talk about.

What Is Resilience?

We're in the resilience engineering track, so this is obviously a topic near and dear to my heart, and you're going to hear this a lot. We're going to talk about: what is resilience? In this VOID that I'm going to talk about, I have lots of incidents, and I'm going to share with you some things that people say about resilience. The first one is actually not from an incident report, I lied; it's from Gartner. They said that cloud services are becoming more reliable, but they're not immune to outages, which we would all agree with. The key to achieving reliability in the cloud is to build in redundancy and have a clear incident response. We're already munging words here in this space. How about what Facebook said? To ensure reliable operation, our DNS servers disable those BGP advertisements, and so on and so forth. Duo: we are committed to providing our customers a robust, highly available service; an issue exposed a bug in our integration test suite. Microsoft: although the AFD platform has built-in resiliency and capacity, we must continuously strive to improve through these lessons learned. Also agreed. We're using a lot of similar words, and we're conflating things that we don't want to conflate. As engineers, I think we like to be precise. We like to be accurate in the terms we use and the way we use them.

This is my general feeling about where we're at as an industry right now. We talk about these things, but I don't think we really mean the same thing, and I don't think we all know what we mean when we do that. The definition of resilience that I'd like to offer is that a system can adapt to unanticipated disturbances. This is a quote from a book by Sidney Dekker, "Drift into Failure." I've highlighted the things here. Resilience isn't a property. It's not something you can instill in your systems and then it just exists and you have it. It's a capability, an ongoing capability. It's actions that make resilience. You'll notice the things he talks about in here are the capability to recognize the boundaries, to steer back from them in a controlled manner, to recover from a loss of control, to detect and recognize. These are all things that humans do, that you do, to keep our systems running the vast majority of the time, and things you do when they stop doing what we hope they will do. What I'm really here to talk to you about is our efforts to try to measure this thing that we still haven't even collectively defined or agreed upon. We're already starting on shaky ground here.

Can We Measure It?

Let's talk about metrics. This is the gold standard: mean time to resolve, to restore, to remediate. It's probably something you've all heard. How many of you use this in your work? I want to talk a little bit about the origins, the history of this term. Many people might be familiar with it from the DORA work: MTTR is one of the four key metrics they talk about in terms of high-performing teams. That's not where it came from, though. MTTR came from old-school line manufacturing: widgets, things that you make over and over again. Sometimes the process or the conveyor belt or the parts break down, and you have a very predictable time window: you know, over time, that it takes this long to fix widget x or that part. That's where mean time to repair came from, a predictable, conveyor-belt-style production environment. Does that sound like anything any of you deal with? No? This was the formula that came out of that environment. It's very straightforward, and it works in that environment, but this is our environment. This clip is actually the reenactment from Will & Grace, not the original, of one of my favorite episodes of I Love Lucy. I think it really captures what we're dealing with in our reality.
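For reference, the classic manufacturing formula she is describing is just the arithmetic mean of repair times (this is the standard textbook definition, not a reproduction of the slide):

\[ \mathrm{MTTR} = \frac{\sum_{i=1}^{n} t_i}{n} \]

where \(t_i\) is the time to repair the i-th failure and \(n\) is the number of failures.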

MTTR: What Is It Good For?

I want to talk about the way we talk about MTTR. These are actual phrases people use to describe MTTR, taken from their incident reports or from around the internet. MTTR measures how quickly a team can restore a service when a failure impacts customers, allows enterprise level organizations to track the reliability and security of technical environments. Allows teams to set standards for reliability, accelerate velocities between sprints, and increase the overall quality of the product before it's delivered to the end user. MTTR captures the severity of the impact, shows how efficiently software engineering teams are fixing the problems. Specifies the severity of the impact, or perhaps offers a look into the stability of your software as well as the agility of your team in the face of a challenge. It also encourages engineers to build more robust systems.

There's that word again. Is one measure of the stability of an organization's continuous development process. Helps teams improve their systems' resilience. Evaluates the efficiency and effectiveness of a system or service. Measures the reliability and stability of the software that is delivered. Helps track reliability. Helps teams to improve their processes and reduce downtime. Assesses how resilient the software is during changes in runtime. Helps track the performance of both the Dev and Ops sides of the house. Can be a great proxy for how well your team monitors for issues and then prioritizes solving them. Low MTTR indicates that any failures will have a reduced business impact. Serves as a direct indicator of customer satisfaction. Directly impact system reliability, customer satisfaction, and operational efficiency. These are laudable things to want to know or understand or be able to measure. Could we all agree that one number couldn't possibly tell you all those things? Yes. I have even worse news. That number doesn't tell you what you think it tells you, and I have the data to prove it.

MTTR In the Wild: Data from the VOID

Now we get to talk about the VOID. The VOID, the Verica Open Incident Database, is something I started almost two years ago. It came out of research I was doing for product at this company, Verica. We had a lot of things that were focused on Kubernetes and Kafka, because those are really simple and no one ever has any problems with them. Along the way, I wanted to see what was happening in the wild, so I started collecting incident reports for those technologies. Then I just kept collecting incident reports. Then John sent me a whole lot of them, then people kept sending them to me, and the next day I had like 2,000. Now we have over 10,000. These are public incident reports. Have any of you written an incident report and published it on the internet? I read them. We collect metadata on top of these publicly written incident reports. There are maybe about 600 organizations in there, some small, some large: gigantic enterprises and 2-person startups, across a variety of formats. Retrospectives and deep post-mortem reviews are in there, but so are other things: tweets, news articles, conference talks, status pages. I have a broader research goal; that is why I have all of these things. We collect a bunch of metadata: the organization, the date of the incident, the date of the report, all of these things, if they're available in the reports. The last one is duration. If it's there, if it's in the status page metadata, or if the author of the report tells us, we use that information. I want to talk about some of the limits of this duration data, because duration is the foundation of MTTR: you take all of these durations and average them over time.

Duration: Gray Data

John Allspaw has done a great job of describing these types of data in general, these aggregate metrics, as shallow data, but duration is a particularly gray version of shallow data, and I want to show you how murky those shallows are. The problem with the data we're feeding into this metric is that duration is super high in variability but low in fidelity. It's fuzzy on both ends: how do you decide when it started or when it stopped? Who decided? Was that automated? Was it ever updated? Could you have a negative-duration incident? Yes, you could. It's sometimes automated, sometimes not, all of these things. It's a lagging indicator of what happened in your system, and it's inherently subjective. When you average all of those together, you get a big gray blob, which doesn't tell you anything about what's actually happening below.

Let's get into the weeds now. There will be some statistics. Everybody has seen a normal distribution? We all know what this is: a standard bell curve, with the mean smack in the middle. If your data are normally distributed, you can do all kinds of cool things with the mean and standard deviations, and all this really great stuff. Your data aren't normal. Nothing is really normal in what we do. These are actual histograms of duration data from incidents in the VOID. To make these histograms, we just bin the durations that we find: everything under an hour, we count those up; everything under two hours, we count those up; and so on. You're all pretty used to seeing histograms, I think. These are your data. I'm not making these up. There will be a simulation later, but these aren't simulations; this is real, this is what you're telling us is happening. Every single company's incident data, if you share them with me, look like this. I urge you, if you haven't shared them, to go and look at them, because this alone means that you can't take the mean, and that the mean is meaningless. If that alone doesn't convince you, consider that we've created a super cool new product. MTTR, it does all these things; it's the ShamWow of engineering metrics. Our new product is called MTTReduce, and it magically gives you a 10% reduction in the duration of all of your incidents.
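Before we get to that product, here is a quick synthetic illustration of why skewed duration data breaks the mean. The lognormal shape and the numbers here are just assumptions for illustration, not actual VOID data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic incident durations in minutes: heavily right-skewed, like real duration data.
durations = rng.lognormal(mean=3.5, sigma=1.2, size=500)

# Bin them roughly the way the VOID histograms do: hour-wide buckets.
bins = np.arange(0, durations.max() + 60, 60)
counts, _ = np.histogram(durations, bins=bins)
print(counts[:10])  # most incidents pile up in the first few bins, with a long tail

print(f"mean:   {durations.mean():.0f} min")
print(f"median: {np.median(durations):.0f} min")  # the median sits far below the mean
```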

You go out, you buy this product, you install it, you apply it. We're going to compare MTTReduce to no MTTReduce. This is a Monte Carlo experiment; it's very much like an A/B test, if you're familiar with that way of thinking about your website data, where you test a new feature by comparing two groups to each other. We have a control group, which is all of your incidents without the MTTReduce product applied. Then in the experimental group, we've shortened all of your incident durations by 10%. Then we run a ton of simulations where we take the mean of each group and compare them to each other. If the product really cuts durations by 10%, we should be able to detect that: when you subtract the experimental group from the control group, you should see roughly a 10% improvement. The next graph shows those curves for a bunch of companies. We've run the simulation and compared the mean of the control group to the mean of the experimental group. It should be a nice tight curve, right around 10%. This is what they actually look like. The red one I will explain, but all of these are slumpy, lumpy curves. Around that 10% mark, you have plenty of cases to the left where it looks like your durations actually got worse, and lots of cases where you think you're doing better than you actually did, which is not an environment you want to be making decisions in. Because we measure things to make decisions; why measure it if you're not going to do something with it? Can you make a decision based on these data? I wouldn't want to be in charge of that. The red line is one particular company that sees a huge amount of traffic and a huge amount of internet data, and they have thousands of incidents. Who here wants to have thousands of incidents? Because that's the only way you're going to get close to that kind of fidelity, and even then, they're still wrong sometimes. We know that more data doesn't help here: the only way to get more data for MTTR is to have more incidents, and that's not the business that we're in. This is the title of the talk: it's like comparing apples to Volkswagens. Average the sizes of all your apples, then average all your Volkswagens together. That's what you're doing when you try to use MTTR as a way of understanding what's happening in your systems.
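For the curious, here is a minimal sketch of this kind of Monte Carlo comparison, again using made-up skewed durations rather than the VOID's real data. The bootstrap-style resampling, the sample size, and the lognormal shape are assumptions for illustration, not the exact methodology from the talk:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical incident durations (minutes), heavily right-skewed like real data.
control = rng.lognormal(mean=3.5, sigma=1.2, size=200)

# "MTTReduce" applied: every incident is exactly 10% shorter.
treated = control * 0.9

n_sims = 10_000
observed_improvement = np.empty(n_sims)
for i in range(n_sims):
    # Resample each group with replacement, as in a bootstrap-style A/B comparison.
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treated, size=treated.size, replace=True)
    observed_improvement[i] = (c.mean() - t.mean()) / c.mean()

# With skewed data and realistic incident counts, the measured "improvement"
# spreads widely around 10%, and is sometimes even negative.
lo, hi = np.percentile(observed_improvement, [5, 95])
print(f"median improvement:  {np.median(observed_improvement):.1%}")
print(f"5th-95th percentile: {lo:.1%} to {hi:.1%}")
```

Even though every simulated incident really is 10% shorter, the difference in means often looks much bigger, much smaller, or negative, which is the slumpy, lumpy curve described above.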

If Not MTTR?

This is always the next question. It really is incident analysis, but with a different focus. My point is that your systems are sociotechnical systems, so you need to collect sociotechnical metrics. This is totally possible. Is it hard? Yes. Is it work? Yes. Is it worth it? Absolutely. In fact, I harbor a real belief that companies that do this kind of work have an advantage. I don't have the data to prove that yet, and I'm not sure I will, but I believe it's true. These are some of the sociotechnical metrics that you can collect. Cost of coordination is one of my favorites. This comes from Dr. Laura Maguire's dissertation work, which she's been expanding on since then. It's things like: how many people were hands-on involved in the incident, across how many unique teams, using what tools, in how many Slack or whatever channels? Were there concurrent incidents running at the same time? Who was involved in multiple incidents at the same time? Were PR and comms involved?

Did you have to get executives involved? This all tells you so much more about the scale, the impact, the intensity of that particular incident. There are other things like participation, the number of people reading write-ups, the number of people voluntarily attending post-incident review meetings. Are they linking to incident reports from code comments and commit messages, architecture diagrams, other related incident write-ups? Are executives asking for them, because they're starting to realize that there's value in these? Also, things like near misses. Are you able to look at the times where Amy was looking at the dashboard, and was like, did you guys see this? This is weird. Is that really happening? Then you fix it before it goes kablooey. That's pretty cool. That's a whole source of information about the adaptive capacity of your systems, knowledge gaps, assumptions, like misaligned mental models, and this term that I use a lot, which is safety margins.
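As a purely illustrative sketch of capturing cost-of-coordination data per incident: the field names and structure below are hypothetical, not taken from the talk or from the dissertation work it references.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentCoordination:
    """Hypothetical per-incident record of coordination effort."""
    incident_id: str
    responders: set[str] = field(default_factory=set)   # hands-on people involved
    teams: set[str] = field(default_factory=set)        # unique teams involved
    channels: set[str] = field(default_factory=set)     # Slack or other channels used
    tools: set[str] = field(default_factory=set)
    concurrent_incidents: int = 0
    pr_or_comms_involved: bool = False
    executives_involved: bool = False


def coordination_summary(incident: IncidentCoordination) -> dict:
    """Summarize the coordination footprint of a single incident."""
    return {
        "incident": incident.incident_id,
        "people": len(incident.responders),
        "teams": len(incident.teams),
        "channels": len(incident.channels),
        "tools": len(incident.tools),
        "concurrent_incidents": incident.concurrent_incidents,
        "escalated_beyond_engineering": (
            incident.pr_or_comms_involved or incident.executives_involved
        ),
    }
```

The point is not these exact fields; it's that counts like these say far more about the scale and intensity of an incident than its duration does.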

I want to talk just a little bit about how you can learn some of these things, and then give you some examples of what people have learned from this way of thinking about their systems. This is a diagram from a researcher named Jens Rasmussen, from 1997, so it's not exactly new, and it's not really old either. He arrived at this model, which I will describe, knowing nothing about technology. It's not about software systems, but it's pretty spot on. The notion is that you have these boundaries of different kinds of failure that you can cross in your system. You have your economic failure boundary up here to the right. Are the lights on? Are you paying people? Are you Twitter and you're shutting down your entire Google Cloud thing today, just to see what happens for fun? You'll know if you've crossed that boundary pretty quickly. This one down here on the bottom right, the unacceptable workload boundary, unfortunately, we're all a little familiar with as well. In order to achieve your organization's goals, are you pushing things too hard, too far? Are you burning people out? Do you not have enough people to do the things that you want to do? It's a reality that a lot of us have to deal with. Then the one that we're really talking about here is this big left-hand one, the acceptable performance boundary. That's the one where, when you cross it, things fall over. Your system is operating at some point in this space, and the other thing that I want to really convince you of is that you don't know where that point is. At any given time, it might be shifting around in a kind of Brownian motion. The only way you really understand where those boundaries are, or where you were, is when you cross them. This is the opportunity you have to learn from these kinds of systems. If you just look at MTTR, you're not going to know why you were here, and now you're here. I want to give some examples of these kinds of safety boundaries, these insights that people have learned from more in-depth sociotechnical analysis of their systems.

Safety Margins

Fong-Jones: It's trading one mechanism of safety for another. We traded off having the disk buffer, but in exchange, we've lost the ability to go back in time to replay more than a few hours of data. Whereas previously, we had 20 hours of data on disk.

Hebert: That's the concept of a safety margin. Ideally, we have something like 24 to 48 hours of buffer, so that if the core time series storage has an issue where it corrupts data, we have the ability to take a bit of time to fix the bug and replay the data, losing none of the customer information. This speaks to the expertise of the people operating the system: they understand the kinds of safety margins and measures that are put in place in some areas, and when something unexpected happens, people who have a good understanding of where all that stuff is located are able to tweak the controls, change the buttons, and turn some demand down to give capacity to some other purpose. That's one of the reasons why, when we talk about sociotechnical systems, the social aspect is so important. Everything that's codified and immutable is tweaked and adjusted by people working with the system.

Fong-Jones: Going through the cast of characters is really interesting, because I think that's how we wound up in this situation. Two years ago, we had one Kafka expert on the team, and then I started doing some Kafka work, and then the Kafka expert left the company. It was just me for a little while. Then the platform engineering manager made the decision of, ok, we think we're going to try this new tiered storage thing from Confluent. Let's sign the contract and let's figure out the migration. We thought we had accomplished all of it. Then we had one engineer from our team sign up to finish the migration: it was already running in dogfood, make it run in prod. That's when we started having a bunch of incidents.

Hebert: There was this transfer of knowledge and sometimes pieces fall on the ground, and that's where some of the surprises come from. It's not something for which you can have quantitative metrics. It's something you have qualitative metrics for, which is, do you feel nervous around that? Is this something that makes you afraid, and getting a feeling for the feelings that people have towards the system is how you figure that one out.

Nash: I'll play one more, from incident.io, which is another company providing tooling in this space. It's Lawrence Jones, an engineer there, talking about the knowledge they've accumulated over time: these instinctual patterns that only the humans running these systems can recognize, and how they've been using those patterns during incidents to help them muddle through faster and figure out what's happening a little bit better.

This is from an engineer at Reddit. I don't know if anybody remembers the GameStop stuff that happened a while ago, when Reddit got just absolutely hammered. They did a really fantastic thing they called an incident anthology: they wrote up a whole bunch of their incidents, and they also talked about some of the patterns across them. In particular, they talked about the process of doing that work and what comes out of it.

Garcia: In terms of telling these kinds of stories, it serves a purpose that is not really served by anything else that we do, even post-mortems, or even documentation, because this is basically memory. We have memory as people, and we share that memory across people by telling stories. We need to do that as engineers. The only way that we can actually do it with people across different organizations is by writing it up, and having that collective memory grow.

Nash: That's the task at hand.

Key Takeaway

The takeaway from this talk is that it's people that make resilience. It's not this technology or that tool or this piece of automation. It's the collective work that we do together, the knowledge that we build over time, this adaptive capacity for our systems. You can't measure that with something like an aggregate metric.

Questions and Answers

Participant 1: How would you suggest getting teammates, organizations on board with more of these qualitative metrics in companies that are traditionally very quantitatively focused or care more about what was the dollar cost of this incident?

Nash: There's a slide from a talk that a fellow named David Lee gave. He's given it now at a couple of places: at DevOps Enterprise last year, and at the Learning From Incidents Conference. He is in the Office of the CIO at IBM. Not a small company, and very heavily focused on quantitative metrics and measuring things. The Office of the CIO is a 12,000-person organization. He's done an amazing job over the last year and a half of doing this kind of incident analysis at a really small scale. When he started, they actually did it as a skunkworks project. It was not sanctioned by leadership, and there was no big program for it. They'd had some experience with approaching incidents this way, so he started just analyzing a few of them and sharing that, socializing it with the engineers who were involved. Those engineers were like, this is really good stuff. It was very much a grassroots, from-the-ground-up way of doing it, and I do believe that is the way this works. I don't think you're going to have a lot of success trying to tell executives, we need to go do this. This is the vegetables-in-the-smoothie way. If you look around, there will be like-minded fellow travelers. You can find them, and then do this on a small scale with them. The value will be demonstrated, and slowly but surely, you can build on that. That's exactly what David did in the Office of the CIO. They now have a monthly learning-from-incidents meeting with a rotating, trained group of folks. One of them presents a case from an incident; it's not always the most severe one, it's the one they think they'll learn the most from, and they get upwards of 200 people attending these. At the executive level, they finally got buy-in. The CIO herself would show up with her direct reports, and then suddenly everyone was paying attention. Hundreds of people come to multiple meetings, and hundreds of people watch the recordings. Now what he's doing is quantifying qualitative data, which you can totally do; social scientists have been doing it forever. My advice is to start really small, demonstrate the value amongst fellow travelers, and then build it and scale it from there.

Participant 2: In our organization we collect metrics about incidents and issues. There's this question of collective memory, or something in the back of your head that you know could blow up, because someone isn't paying attention, or the team isn't aware, or someone leaves and takes the knowledge with them. What are the top actions you derived from the metrics that ultimately shared the knowledge and improved the way that teams handle their incidents?

Nash: I really think the cost of coordination stuff is valuable here, because a lot of times the effort to manage incidents is obfuscated from the rest of the organization, and things like MTTR contribute to that obfuscation. If you start collecting metrics, you might find that a really high-severity or really long incident only took two people to work on, while this other thing over here took 15 people and someone from PR. That's where you start to show your organization the cost of those incidents. You could probably get the dollar cost too; if you're amazon.com, you know how much money you lose per second if your website's down. These are the other associated costs, and you can quantify them as well: you can say our incidents are involving this many people. That's one of the top things I would start focusing on. Then highlighting themes and patterns is a really great way of doing it. A couple of organizations that I know of are doing that in a really powerful way. One of the things David Lee talks about in his talk is the big initiatives that came out of this. They had one where they ended up rearchitecting some systems, and another where they just changed the way people worked together on something. The themes you see across your incidents can highlight the things you're trying to make a case for. Oftentimes, it gives you data for rearchitecting or fixing something that you, as the engineer, already know is going to blow up. If you can use your incidents to show that you've learned that, then you have evidence.

Participant 3: Did you ever find any incidents where teams were trying to maliciously gamify those metrics and push the blame to other dependencies, other folks, without losing trust?

Nash: I don't really find evidence of that in the incident reports, but I do know of it as a phenomenon. Any metric becomes a target: Goodhart's Law. We do see this happening, where people will do really ridiculous things to essentially pervert the incentives or gamify MTTR, to make it look like their team's MTTR is better than another's. There's also weird, behind-the-scenes horse trading that goes on. I think that is very much a phenomenon that can happen, especially with these aggregated metrics, where they don't mean a lot but you can figure out how to push them around.

Participant 4: You had mentioned near misses. It seems like that would be very difficult to quantify, because the whole point of a near miss is it's a near miss.

Nash: I wasn't suggesting you count your near misses.

Participant 4: How do you turn it into a metric, in a systemic way, actually, other than two people having a chat on it?

Nash: I actually wouldn't recommend trying to quantify near misses. You might be able to quantify the information you get from them. Really, what I'm suggesting with near misses is getting that qualitative, sociotechnical information, like, "I thought that system x did this, but you think it does that." You can then surface those as themes or patterns. Clint Byrum at Spotify is doing a really good job of this, where they're like, this incident matches the theme of squad confusion, or a misalignment of how things work between this squad and that squad. If you can roll things up into themes or patterns, then maybe you can get to some form of quantification. I'm definitely not saying take all of your near misses as the numerator and your incidents as the denominator, or the other way around, and use that as a metric. Please don't do that. Also, you can't count all your near misses, and it's a lot of work to even investigate them. I think it's just picking the ones that you have a spider-sense about. The other reason I like near misses is that they don't carry all of the trauma. They're successes, but they have all of the underlying stuff of incidents. Oftentimes, you can get details out of them, especially if you're in a pathological organization where there's finger pointing. In this case, you can call up the hero-worship stuff a little bit and be like, "Jane figured this out. Why? What was that? What was happening there? How can we take advantage of that? Or what do we need to do to not let that go off the rails next time?"

Participant 4: My point was that near misses, by definition, mean you missed them and you don't know what they were. Then, when they happen, you start identifying them as near misses.

Nash: It might be semantics.

 


 

Recorded at: Jan 18, 2024
