
How Did Things Go Right? Learning More from Incidents


Summary

Ryan Kitchens describes more rewarding ways to approach incident investigation without overly focusing on failure prevention.

Bio

Ryan Kitchens is a Site Reliability Engineer on the Core team at Netflix where he works on building capacity across the organization to ensure its availability and reliability. Before that, he was a founding member of the SRE team at Blizzard Entertainment.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kitchens: This is what the SRE world looks like. This is our world right here. We're starting to figure out all that SLO and SLI stuff. When a team blows their error budget, you probably get into a war room. Maybe you hold an incident review to figure out what happened. Then, you have to assess whether those SLOs are really appropriate and what your tolerance for failure truly is. SLOs and Error Budgets may have become the most important items in the SRE's toolkit. For SREs, they've transformed how we think about prioritizing reliability, and we've created a way of working that helps us keep steering teams back onto a happy path.
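To make the error budget arithmetic concrete, here's a minimal sketch, assuming a hypothetical 99.9% availability SLO measured over a 30-day window (the numbers are purely illustrative, not anything prescribed in the talk):

```python
# Minimal sketch of error-budget arithmetic. The 30-day window, the 99.9%
# target, and the 20-minute incident are illustrative assumptions.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo: float) -> float:
    """Total unavailability the SLO allows over the window, in minutes."""
    return (1.0 - slo) * WINDOW_MINUTES

def budget_consumed(incident_minutes: float, slo: float) -> float:
    """Fraction of the window's error budget burned by a single incident."""
    return incident_minutes / error_budget_minutes(slo)

slo = 0.999  # 99.9%
print(f"Error budget at {slo:.1%}: {error_budget_minutes(slo):.1f} minutes per 30 days")
print(f"A 20-minute incident burns {budget_consumed(20, slo):.0%} of that budget")
# Error budget at 99.9%: 43.2 minutes per 30 days
# A 20-minute incident burns 46% of that budget
```

Blowing through that budget is what typically kicks off the war room and the incident review described here.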

SLOs have created a framework to help us realize just how important failure is. In fact, failure is so important that it's no longer interesting. If all we're ever doing is engaging with teams when they're blowing their Error Budgets or having incidents, we're missing out on this whole other world of opportunity. We need to start realizing that the other side of failure - success - is at least half the story and we've been ignoring it.

My name is Ryan Kitchens, I work on the CORE team at Netflix. We are responsible for ensuring availability and we focus on the incident life cycle - that's everything from responding to on-call to driving the incident review process, and all the learnings that happen thereafter. We've heard folks like John Allspaw talk about some of this stuff before, but my team at Netflix is really implementing it into real practices. I want to start by highlighting some principles that set up the place we need to be coming from in order to learn more from incidents.

Maybe they're more like opinions, but they're totally correct. That's how opinions work. What I want to get at isn't that anyone is doing it wrong, but rather that we can step it up, dive deep, and practice more effective ways that go beyond failure prevention, because failure is the normal state. We've heard Dr. Richard Cook talk about how the surprising thing isn't that our systems sometimes fail, but that they ever work at all. What if I told you that the reason you're having incidents is because you are successful? You're going to have incidents; you can't choose not to.

Here's the thing - we're pretty good at preventing them. All those times we aren't having incidents, that's called normal work, and it's successful most of the time. We're never going to be able to prevent them all, and it's too bad. One of the biggest tropes out there is, "How can we be more proactive and prevent this from ever happening again?" That's not helpful, it's superficial. The word "proactive" is so often lip service. It may sound counterintuitive, but overly focusing on preventing incidents will end up preventing fewer of them in the long term.

Preventing incidents is going to happen; we do it every day. Don't pretend to focus on that as your goal; the goal is to learn. The most important thing we can learn is how to develop the capacity to encounter failure successfully. We should be getting to a place where it's ok to say, "That's exactly the incident we want to have." No action items necessary.

Here's a secret. This is something that Chaos Engineering helps us do. Once all the low-hanging fruit is gone, once your detection time is down to seconds, once you have a way to fail out of multiple regions and remediate in a matter of minutes, how do you think about getting proactive then? The nines don't matter if users aren't happy. Thank you, Charity Majors, for talking about this. I want to take it a little bit further and say that availability is made up. Even the margin of error that goes into capturing it is bigger than the next nine you're trying to report on. If you're spending any significant effort creating availability reports for your product, I think you should probably stop. The availability you have is probably fine, because we don't define success by the absence of failure. It's the presence of adaptive capacity. It's the ability we have to cope with the things that are changing around us continuously.
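To put rough numbers on the gap between nines (another small sketch, assuming a 30-day reporting window):

```python
# Allowed downtime per 30 days at each "nine". Each additional nine shrinks
# the number tenfold, so the gap to the next nine is only a few minutes,
# easily smaller than the ambiguity in how "available" is defined and measured.
WINDOW_MINUTES = 30 * 24 * 60

for slo in (0.99, 0.999, 0.9999, 0.99999):
    allowed = (1.0 - slo) * WINDOW_MINUTES
    print(f"{slo:.3%} -> {allowed:.1f} minutes of allowed downtime")
# 99.000% -> 432.0 minutes of allowed downtime
# 99.900% -> 43.2 minutes of allowed downtime
# 99.990% -> 4.3 minutes of allowed downtime
# 99.999% -> 0.4 minutes of allowed downtime
```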

There is no panacea. We know we can't just do DevOps, or start calling people SREs, or put SLOs on everything and expect it to solve all of our problems - except, maybe, for expertise. The game changes when you start to realize that the people on your team are the world-class experts for your systems. Right now, you work with the best people on the planet for encountering the problems you're going to face, and they're already preventing them every day. Imagine what can happen when you start to take them seriously on this.

Remember that region failover thing we do, that remediates all our problems? What do we do when that fails? More importantly, are we even paying attention to what happens during all those times it doesn't?

There's one more really big idea I'd like to see more of us advocating for, because it absolutely changes the conversations we have about learning more from incidents. There is no root cause. The problem with this term isn't just that it's singular or that the word "root" is misleading. There's more - trying to find causes at all is problematic. And Groot cause - debatable. Going and looking for causes to explain an incident limits what we find and learn. The irony is that root cause analysis is built on this idea that incidents can be fully comprehended. They can't. We already have a better phrase for this idea. It's called a perfect storm. By separating out causes and breaking incidents down into their multiple contributing factors, we're able to see that the things that led to an incident are either always or transiently present. An incident is just the first time they combined into a perfect storm of normal things that went wrong at the same time.

Three Pillars of Fallacy

I want to highlight a few pillars of fallacy in coping with complexity: comprehension, understandability, and predictability. In a complex system, none of these are entirely possible. Incidents cannot be fully comprehended; they're fraught with uncertainty. Often, we find that the remediation items coming out of incidents contribute to further incidents. This realization is a big deal: action items won't save the day. There are organizations out there who require high-severity action items following every incident. Really think about the consequences of that.

You're not going to avoid big incidents just by dealing with a bunch of smaller ones. It sounds intuitive, but it just doesn't work that way. Logs filling up on a server are probably not going to take down your whole business. Is anyone thinking, "I'm not so sure about that"? I should let you know, my team's hiring. Sometimes we do that stuff on purpose.

The little stuff doesn't cause the big stuff. Suggesting that causes are the keys to understanding an incident is misleading. Causes are something we create and construct afterwards. They are not primitives that make up incidents. They don't fundamentally exist to be found.

You're never going to be able to predict the next incident no matter how much you learn from the previous one. Later today, Jason Hand is going to talk about the trouble with learning in complex systems, so be sure to attend that talk. The state of the system is changing around us all the time no matter how immutable we try to make it.

Laplace's Demon is this person who knows the precise location and momentum of every atom in the universe. Complex systems are not deterministic. Why do we think we can foresee all of the consequences of our actions? Why do we think we can maintain an accurate mental model of the system? Why do we think we can avoid every mistake? "Why?" is the wrong question.

Ask More "How" Than "Why"

In hindsight, so many actions look like a choice. "Why did you do that?" can make people feel blamed. Instead, what we want to do is enable people's stories, draw out descriptions of what they did, and facilitate an understanding of the ways it made sense at the time. You get that by asking "how" more than "why." The "5 whys" is limiting. I'm not going to say you're doing it wrong - I'm not criticizing anyone who uses the technique, it's useful - but because this ["There is no root cause"] is true, you're going to miss stuff. Each why we go through dismisses more and more available information, and it can lead us to think we've totally understood an incident by following this path of why things went wrong.

Here's an example of this. An engineer was deploying a hotfix. They bypassed all the safeguards and ended up breaking stuff even worse. When we asked them, "Why did you deploy it that way?" the answer we ended up at was, "I felt rushed and stressed, and I probably should have slowed down." Following the whys, we kept hearing a lot of self-blame, and we were missing what those sources of pressure were about. When we asked them how they went about deploying it, they told us a story about coordinating on-call, diagnosing the problem, and how the changes were implemented.

Then, we as investigators started to realize, "Oh, no. This person was both the incident commander and the only person who knew how to fix the problem." Instead of getting at what they thought they should have done, we started to get to the point of how this actually happened and what conditions allowed it to.

At this point in our field, we've got to kick it up a notch. There's no one weird trick that's going to work. We need to start realizing that the other side of why things went wrong is understanding how things go right. I've seen organizations where management is incentivized to drive down the number of failures. This is a fool's errand. Trying to prevent all failure isn't as realistic or effective as successfully minimizing the consequences of that failure. Eventually, you're going to max out on how much you can do to prevent things from going wrong.

At some point, the complexity of the system won't let you, and if you try, you're only going to introduce more peculiar modes of failure. We don't improve our performance or increase our success only by preventing things from going wrong. We do it by making sure that things go right. People are doing this all the time, every day during their normal work. I'm going to run us through this model of how we can end up at either path, success or failure.

There's this scary-sounding stuff called performance variability. That's the reason everyday work is safe and effective. It's also the reason why things sometimes go wrong. What is it? People adjust what they do to match the situation. It's the improvising and the workarounds that people are doing so routinely that they probably don't even notice it. Think about all the times you're making small configuration changes, adjusting thresholds, or dealing with auto-scaling. Through understanding the variability of everyday performance, we start to see that the things that contribute to failure are the same things that contribute to success. People are doing the same things in both cases.

You may recall this helicopter rescue in the French Alps. This happened back in January and it was awesome. The helicopter flew right up to the mountainside, into the snow. People jumped on the helicopter, they flew off, and everything was fine. I realize this might not be normal work for most of us, but for rescuers, it is. It was a pretty big success, but let's imagine it wasn't. In the case of an accident, we usually expect a deep investigation that could unearth all kinds of stuff about what happened and how it happened. Since everything went fine, we ignored all that. It's pretty hard to ensure that things will go right next time if we aren't learning how things go right in cases like this. When we have an incident, and the response is more testing, higher guardrails, stricter process, we have to ensure that we aren't limiting the opportunity for our expertise to come through. Constraining performance variability in order to remove failures will also remove successful everyday work.

What I'm trying to say is, what if the real success was the incidents we made along the way?

Success or Just Incidents Along the Way?

This is how people talking about the new view, or Safety II, tend to model the problem. We tend to focus on that small amount of time we run into failure. What about most of the time, when things go right? Just to see that this holds true in reality, this bar is a few years' worth of actual availability data from a real customer-facing service. It's pretty close. Why are we talking about failure all the time, if all it is is this teeny tiny bit of red? If we want to focus on learning more, we need to be considering all of it.

We love to talk about data. Learning from how things go right uses most of the data available. This gets us back to the pitfalls of the nines. Emailing them around to feel good is really a lie, and it doesn't capture everyone's use case, because everyone cares about a different part of that availability. It's almost like people just want to know that it's out there somewhere and somebody, thank goodness, is thinking about it. Next time, publish that number in a Google Doc, something people actually have to take effort to click on to get to. Then, check the analytics on it and see how many people actually looked at it.

What's more effective is to think about and explore all the different dimensions that go into creating that number. There are a ton of qualitative measures that are worth way more than the percentage point that they represent in the nines. If availability is made up and the nines don't matter, then putting incentives around those numbers is worse than meaningless - it's parasitic. It's not making you better. To repurpose a phrase I've heard in security recently, I'd also say that availability is perceived. Why point this out? Why am I harping on this?

It's because that green part of availability that we call success, that huge portion of stuff we've been ignoring, is at least half the story, and it's often invisible. This is fundamentally what we are up against. I can't tell you about the incidents we didn't have. Measuring stuff like the number of incidents or the probability of failure, that we're generally able to do. How things went well, the probability of succeeding - we usually don't try to understand all those things that happen during all those times we aren't having incidents. We might want to think about: how do we keep normal work normal and keep incidents weird?

Criteria for an Incident

That starts to bring in the question: what's the criteria for an incident? This is what it looks like when you file one in a ticketing system. Who here tracks the severity of their incidents, or regularly reports on how many Sev 1, major incidents you have? Just like reporting on availability, I think we should stop doing this. Those broad categories like Sev 1, major, minor - they don't help surface patterns across incidents.

It's not good enough to get up at a quarterly business review meeting to report, "We had 10 Sev 1 incidents this quarter, and here's how it compares to last quarter." That doesn't tell you anything. It doesn't gauge customer perception and that sentiment focuses too much on preventing incidents. You're not getting better just because fewer incidents were reported. A lack of incidents isn't evidence of success.

The gist of this incident here is that the website got a burst of traffic, users saw errors, and people were unable to sign up during this time. That sounds pretty straightforward. How do we get more out of it? We can start by making the description a little more explicit. What did you actually do when you got paged? How did you find out that signups were down? Did you get an alert? Did you have to dig for it? What's a death spiral? Other people might not know what that means. Customers saw errors, but what was the impact of that on your customer service team? What did they see?

How to Approach Incidents

One thing we tend to forget is, how was the incident actually mitigated? What happened to that burst of traffic? How did we escape that death spiral? This is an outline of what my team goes after during an investigation. We have Nora Jones to thank for coming up with a lot of this, and my colleagues Lorin Hochstein and Ashish Chatwani have also done a ton of work here. My team runs a workshop to teach people how to approach incidents in this way. I'm just going to give you the crash-course overview of what this is. I want to point out that this is not a template. We're trying to move people past the form fields, and the required text boxes, and the dropdowns. We really want to get people started in telling a narrative story, and then use this to provide some structure around how to dive deeper into those particular sections of the story that they want to pull the threads on.

What we want to do is articulate the conditions that allowed the incident to happen. This is what we mean by contributors and enablers. Start by taking everything that you think looks like a cause and separate it out, and then explore each and every one of them. Talk about the properties and the attributes that were present. You're going to unearth so much more this way than trying to walk back a chain of events of this led to that.

Mitigators are what kept the incident from being worse than it was. Things can always be worse. This is probably the hardest part, because of the way that success is often invisible, so get everyone calling out all the stuff that they saw work, that helped stop larger problems. The thing to point out here is that when we talk about what went well, it's not to pat yourself on the back for doing a good job. It's about discovering sources of resilience that would otherwise be invisible if we didn't talk about them.

We think of risks as more general things that create danger. Some examples of this would be unbounded queue lengths or bulk updating things globally. In the case of this incident, here's what we can turn those into. There are about eight items per section here. Just to zoom into these - you don't have to read them, I just want you to get a feeling of the magnitude of this. In our investigations, each one of these entries has a paragraph to go with it. Sometimes they spark even further investigations. We could have stopped at, "Auto-scaling took care of it and things recovered on their own," but there were all kinds of things going on here. These seemingly simple incidents often make the best examples of how to do this work. Even out of a small incident, we can create big conversations, change the way that teams interact, and surface things that weren't on anyone's radar.

This last one - unexpected performance from upgrading to new instance types - is interesting because upgrading instances was one of those action items from a previous incident, and it contributed to this one. We really do see this happen all the time. Difficulties in handling can be things like, did we have to page in that one particular person because only they know how it works? We call this the "Islands of Knowledge" problem.

Were there any problems collaborating? We might not realize until later that we're not aligned on what a large blast radius is. We may end up affecting many more customers than we would want to.

Our action items are not really the point of doing all of this stuff. The more preparation you put into all of this, the more you'll be able to inform really good improvements at the end of it. You don't want to come into an incident review with a laundry list of action items. That means your incident review happened somewhere else. Fixing stuff - if someone thinks it's important to fix, they're going to fix it whether or not you write it down; a fixer is going to fix. Here's a different way to think about it. Can we identify actionable outcomes? Is there a different place we want to get to, a different situation we want to be in, in terms of the long-term strategy here? Or did we unearth some significant new information that would allow us to go back and revisit big design decisions?

Artifacts are the output of an incident: documents, reports, tools, dashboards, saved queries, command-line history. Consider the long-term velocity of these. How can we use these for education? How can we get this stuff informing other teams' roadmaps? It's hard to assess this in ROI terms: a lot of effort goes into creating them, they live for a really long time, and hopefully people are doing stuff with them. To find out what, you can read J. Paul Reed's recently published Lund University thesis. It's a very good read and talks all about artifacts.

I'd also like to see some better timelines. What if we could explore timelines, or highlight observations and how working theories played out? Rather than this two-column table, what if our timelines looked more like this? If you can make something like this interactive, timelines will become a point of engagement instead of just an appendix at the bottom of a document that no one's going to read. For references, be sure to link out to all the stuff that happened: the code changes, pull requests, commits, code reviews, deployment times, system changes, graphs - anything that can help add more context.

Be sure to get the easy questions out of the way early. The more preparation you do, the deeper your discussions are going to be. People's first reaction to seeing a lot of this is, "I don't have time to do that." Start to think of incidents as unplanned investments. The more you do it, the better you're going to get, and you'll start to realize which incidents are juicier and worth going after in this level of detail. We have a team of people doing this work, and we don't go after all of them this way. I hope that's a fairly practical breakdown of what my team does to get more out of incidents.
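For anyone who wants a concrete starting point, here's a minimal sketch of those sections captured as a simple data structure. The section names come straight from the breakdown above; the record shape and the TimelineEntry fields are illustrative assumptions, not Netflix's actual tooling.

```python
# A rough sketch of the investigation write-up sections described above.
# Section names (contributors/enablers, mitigators, risks, difficulties in
# handling, actionable outcomes, artifacts) come from the talk; the structure
# itself is an assumption for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimelineEntry:
    timestamp: str              # e.g. "2019-06-03T14:05Z" (hypothetical)
    observation: str            # what was seen or done at this point
    working_theory: str = ""    # how the theory of the incident evolved here

@dataclass
class IncidentWriteup:
    summary: str                # the narrative story, not a form field
    contributors_enablers: List[str] = field(default_factory=list)     # conditions that allowed it
    mitigators: List[str] = field(default_factory=list)                # what kept it from being worse
    risks: List[str] = field(default_factory=list)                     # general things that create danger
    difficulties_in_handling: List[str] = field(default_factory=list)  # e.g. islands of knowledge
    actionable_outcomes: List[str] = field(default_factory=list)       # strategy shifts, not a laundry list
    artifacts: List[str] = field(default_factory=list)                 # docs, dashboards, saved queries
    timeline: List[TimelineEntry] = field(default_factory=list)
    references: List[str] = field(default_factory=list)                # PRs, commits, graphs, deploys
```

In practice, each entry carries a paragraph of narrative, and the conversations it structures matter far more than the data model.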

You Have to Talk to People

What about when we don't have any incidents at all? Have you ever thought about what's going on when it seems like nothing is happening? Are people just sitting there? They must be doing something. How hard are people working just to keep the system healthy? You can't detect this stuff with availability data, or number of incidents, or Error Budgets.

There's an irony here and it's the big blind spot of SLOs. A question to managers would be "How do you think about feeding this back up the chain, or do you not because everything looks good?" You can have awesome availability, all of your metrics can be green, and a team can be right at the edge on a path to burnout, and we'd never know if all we're doing is engaging with them when they're blowing their Error Budgets or having incidents. This is an incredibly expensive problem. I don't mean just money costs. This is how churn happens. It's how good teams dissolve.

How do we sort through it when all we have are weak signals? I think it involves facing the hardest problem that's probably ever existed in tech. You have to talk to people. There's no getting around this. "Gosh, how much stuff have we built just to make it so that we don't have to talk to anyone to do our jobs?" Get out there and interview people. Maybe we should call it something different; "interview" sounds time-consuming. "Maybe we could chat, I don't know if I have time for an interview." You also don't want to look like you're the FBI or Internal Affairs. It's not an interrogation; we're not making a murderer. Try to get out of the conference room, go on a walk with people, or grab a cup of coffee together.

There's also this thing called Learning Teams. This comes out of the human and organizational performance, or HOP, movement. You can read Andrea Baker's work for more on this. The basic idea is, you take some event - it can be an incident, but it doesn't have to be - you assemble a working group of people from different levels across the organization, and you bring them together to answer one question: what should our organization learn from this event? Building a learning organization is a real way to be proactive. That's how you actually prevent incidents. How do we do that? We start by asking better questions. I'd start with, "Are our people ok? Did you feel supported? What were you really worried about during the incident? What could have happened? What do you think could stop that from happening? Is that enough?"

Going back to deployments, we could ask questions like, "How do you usually gain confidence in these kinds of changes? Do you do code reviews? Are there canaries? Is there progressive delivery to production? Was this a typical change? Did you use a pipeline? Was it a one-off? Were there any workarounds involved? What was preferable about doing it that way, in this case?" We learn so much more when we start to understand how a course of action seemed reasonable at the time it was taken. It's easy to see what happened after an incident's triggered. It's much harder to uncover the events leading up to one. It sure is quiet before an incident, like maybe nothing's happening. The things we are adapting to, those little problems that we're solving all the time - the effort we put into that gets hidden by the very actions we take to ensure they don't become problems, until one surprises us.

How We Respond Is Important

How we respond to that surprise is really important. "Be more careful" is never helpful advice following an incident. Carelessness or risky behavior doesn't cause incidents. It's only afterwards that we're even able to judge it that way. Best practices are not real practices. When we're interviewing people, I don't want to talk to only the people who are doing risky things and breaking stuff. I want to talk to people who are doing risky things all the time and not breaking stuff. What makes them so successful? That's where you'll find your real practices and learn about how things go right.

It can be difficult to deliver this message with compassion and sincerity. You're always going to have someone asking "Have you tried not writing bugs? What if we just didn't have errors? Why don't you just do this other thing?" Don't vilify them. It's better to demonstrate how inefficient those kinds of comments can be. Here's some of the material out there to help us with this. "The Field Guide to Understanding 'Human Error'" is a great intro for anyone, and this paper, "Language Bias in Accident Investigation." If you don't think that language matters and the words that we use influence how effectively we learn, this paper will change your mind. If the hardest problem in tech is talking to people, then words are like the second hardest problem. There's one word I would like to change. Incidents are surprises. I think we should all start calling them that.

Takeaways

To review the takeaways here: moving the focus towards recovery is going to help you think smarter about prevention. Causal explanations of incidents limit what we learn. Stop thinking about causes and start thinking about contributing factors. Measuring availability is pretty useful, but reporting on it for your product facilitates some pretty nasty incentives that will harm your availability in the long term.

Finally, learning about how the system works is just as important as finding out how it fails. It's easy to see how things fall apart; it's so much harder to see how they worked. Discover those sources of resilience in your organization that are actually making you better, and try to make your reports about that stuff.

The resources and the references that make up this talk can be found at this URL, which is jobs.netflix.com. Please follow me on Twitter, I'd love to talk to you more about this.

Questions and Answers

Participant 1: One of the things we run into is that a lot of those causes are organizational decisions that sit much earlier in the stack, and those are what really create the incidents, but we find that people have little patience for hearing about those things. How do you address that?

Kitchens: I think software in particular, as an industry, has very little patience for learning, particularly this deep investigative process. If you look at something like the NTSB with its hundred-page reports, you're lucky to get a software engineer to read a six-pager. Going back to that idea that incidents are unplanned investments, making the space for people to invest in this - the killer feature here is for people to make these write-ups discoverable and searchable, publish them, share them out, and build up a community around it, because you have a responsibility there, and the leverage in learning is all about how people engage with it. If you just file it away, then you've dropped a ton of it on the floor.

There's definitely a mindset you have to build up, and you have to find champions who want to move forward with this work. At Netflix, we have a team of people hired to do this specifically. It really depends on where you are in your organization and what level of effort you're already putting into incidents. It's hard for me to give you something prescriptive because the organizational context is different for everyone.

Participant 2: I got the impression from some of the things you talked about that this might lead to not just technical solutions, but also to process and maybe even management solutions. I was wondering if you could talk about what that looks like, but also, have you had any resistance to those types of changes, and how do you deal with it?

Kitchens: To repeat the question, some of the implications of what I said are not just technical solutions coming out of incidents, but also maybe organizational ones - have we faced any resistance in trying to make those changes? Netflix is an interesting case because of freedom and responsibility. It's all about providing context to people at Netflix, and that is not the case in every organization. It really depends on whether you are very hierarchical - if your manager says jump, you jump - or whether you have some broad objective that you're trying to hit. What I would suggest in those cases is to try to find those patterns across incidents, come up with ways to present them as themes for quarterly work, or OKRs, or that stuff that people can contribute to, and talk to leadership about the reason you think these are important items. You have to consider it holistically. The pitfall of having a bunch of action items out of an incident is that you have this giant backlog of stuff you're never going to get to, so how do you think about this a little more strategically? I wouldn't say the implication is that you should go reorg; it may just be that you highlight some pretty significant communication gaps, or places where there's a difference in pressure between product and reliability work. Making those big rocks that people can work toward is, I think, a good way to approach that.

Participant 3: It seems like the crux of the talk is trying to figure out how things go right. You were talking about, when you're doing your investigation, looking for mitigators, things that kept the incident from being worse than it would have been otherwise if those mitigating factors weren't present. Are there other techniques that you're doing in the absence of incidents to try to discover how things go right?

Kitchens: To repeat the question: in the absence of incidents, or with no mitigating factors to really point toward, are we still able to find the ways that things go right? The simplest thing to do is look for near misses. Start with that. Look for the things where people said, "That really was surprising. I didn't know it worked that way," or have people report bugs that they think are novel or significant, where simply solving the bug is not really the meat of it - it was everything they learned in approaching it, where the steps that they had to struggle with can be generalized and abstracted so that other people resonate with that story. Looking at near misses is a really good way to do that. Another way, going back to this idea of learning teams, is to use it in a way that's a bit like a pre-mortem, or maybe a bit more like a game day: you pose a problem to people and work through it, and you map out different aspects of what the problems are, where the gaps in knowledge could be, or where you might have only one person who is responsible for something - and once they've left the organization, for example, that tribal knowledge is lost, and people have to start from first principles again to pick up the work they left off. I would approach it from that direction, which requires a lot of imagination.

 


Recorded at: Oct 09, 2019
