
Incident Analysis: Your Organization's Secret Weapon


Summary

Nora Jones discusses how to move faster and focus on the things that matter by using incident analysis.

Bio

Nora Jones is the founder and CEO of Jeli. In November 2017 she keynoted at AWS re:Invent, in front of an audience of ~40,000 people, sharing her experiences helping organizations reach crucial availability and helping kick off the Chaos Engineering movement. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Jones: Incident analysis can actually be a tool for unlocking your organization's success, for collaborating better together, for working better together, for being more productive. It can be a key piece of your resilience, providing things that coding and tooling cannot. We've all had incidents, and the good news is that it means you have customers. I know it doesn't quite feel like good news at the time, but there's a lot that we can learn from them, a lot of ROI that we can get from them. They're not going to go away. That's not what this talk is about. It's not about preventing them or getting to zero incidents. They can improve and get better over time. They can feel better. They can feel like not as big of a deal. When we do have incidents right now, there are inevitable questions that sometimes creep up afterwards, especially when they're really bad. Like, what can we do to prevent this from ever happening again? What caused this? Maybe we're wondering why it took so long to fix, especially if we weren't involved in the incident but it affected us.

Why Do Incident Reviews and Postmortems?

My organization spent a while studying why people in the software industry do incident reviews or postmortems today. Here were some of the most common answers that we got. I'm honestly not sure. Management wants us to. It gives the engineers space to vent. I think people would be mad if we didn't. We have some obligations to customers. We want them for tracking purposes. Maybe we want to see that we're improving over time and getting better. Maybe our boards are asking for it. We're not sure. What we did come to was that most folks in the software industry agree that post-incident reviews are important. We all know that some form of this review is important, but we don't all agree on why it's important. Yet, even though we know it's important on some level, we don't always make efforts to improve it. We don't always make efforts to improve the process. We don't allocate the time needed to do that. A lot of the time, we don't know how. Incident analysis is a skill that can be trained and improved upon, and there are a ton of benefits to doing so, but that time needs to be given. It's different than coding time.

Activity: Cultivating Curiosity and Leveling up Learning

I want to do a quick activity. Part of doing an incident review is coming from a position of inquiry. Why did something make sense? What information is missing? Who can I ask to find out more? We're going to take a few minutes to review an RCA. The purpose of this is not to call out bad examples. I want to see if this satisfies your curiosity. Here's the RCA in question. We have here the event date. We have when it started. When the incident was detected. The time it was resolved. We have our root cause. Our DNS host scheduled a normal reboot for security patches, resulting in a simultaneous outage of DNS servers on the environment. As a result, all the connections between all hosts in the environment failed due to DNS lookup issues. They provide some more info. They share some of the customer impact. Then they list some of their action items.

Personally, my first question is, who is this written for? I wonder who the audience is here. Is it the customer? Is it fellow colleagues at the organization? I wonder how they came up with these action items. A lot of the action items, around fixing the playbook or updating documentation, could be said about any incident. I wonder what value this is providing the organization. What questions do you still have about this event? Does this report answer all your questions? For me, it doesn't answer all my questions. If I was a new employee at this organization, or if I was on a sister team of sorts, I would want to know a lot more about how they came up with these action items. I would want to learn a lot more about what a normal reboot looks like. I would want to learn a lot more about why we're putting some of these details in here.

Post-Incident Reviews are Important, But They're Not Good

I want to come back to this. I think we all know that a lot of the post-incident reviews that we do today are important, and we need to do them. We need to go through this learning. What we've heard from a lot of people in the software industry is, yes, they're important, but they're not good. What's worse is when we have an incident that is deemed to have a higher severity, when it's deemed to have been really bad, when maybe it hit Twitter, when maybe it hit the news. We actually end up giving our engineers less time and space to understand what actually happened and how it was resolved. There was an incident quite recently that was very big. That incident was blamed on a single engineer. We usually give our engineers less time to resolve these incidents, because we have our SLAs with customers that we need to uphold. However, with those incidents specifically, it's important that even more time and space is given even after that customer SLA is met. It takes more time and space to figure out what happened in those.

Quantifying Incident Reviews

That past RCA we looked at had a lot of numbers in it. I know we want to quantify a lot of our incident reviews. I want to quote something that John Allspaw said back in 2018. "Where are the people in this tracking, and where are you?" The metrics that we're tracking today, like MTTR, MTTD, incident start, incident end, and detection, give us some interesting data points. What are we actually learning from them? What is the purpose of recording those metrics? Allspaw posed an open question and challenge to us. He said, where are the people in this tracking? We haven't changed much as an industry in this regard. Gathering useful data about incidents does not come for free. It's not easy. You have to give time and space to it.
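As a rough illustration of how little those numbers actually see, here is a minimal sketch of computing MTTD and MTTR from incident timestamps. The records, field names, and exact definitions here are assumptions made up for the example; organizations define these metrics differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are assumptions for this sketch.
incidents = [
    {"started": "2021-03-02T03:04", "detected": "2021-03-02T03:21", "resolved": "2021-03-02T05:40"},
    {"started": "2021-03-18T14:10", "detected": "2021-03-18T14:12", "resolved": "2021-03-18T14:55"},
]

def minutes_between(a, b):
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

# One common reading: MTTD = mean(detected - started), MTTR = mean(resolved - started).
mttd = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
mttr = mean(minutes_between(i["started"], i["resolved"]) for i in incidents)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
# Notice what never appears in this calculation: who responded, how they
# coordinated, and what anyone learned. That is the point of Allspaw's question.
```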

I'm going to talk about why giving this time and space to your organization can actually work in your favor, through multiple stories; new paths on how you can do it in ways that aren't disruptive to your business; and next steps for you to embark on. There's a spoiler alert here, which is, sometimes a thorough analysis reveals things that we're not ready to see, hear, or change. It can be a mirror for our organization. The actual work involved is making sure it's safe to share these things so that we can actually improve.

Background Info

I'm Nora Jones. I've seen these things from the frontlines as a software engineer, as a manager, and now as I'm running my own organization. I've written two books on chaos engineering with colleagues from Netflix. I'm a student at Lund University in human factors and system safety, where I partner with other students from across the world and from other disciplines that are safety critical. In 2017, I keynoted at AWS re:Invent about the benefits of chaos engineering and my experiences implementing it at jet.com and Netflix. I then went to Slack. Back in 2019, I started a community called Learning from Incidents in Software, where over 300 people in the software industry are open sourcing and sharing their learnings from incidents and how their organizations are improving. It is a group of people who are unhappy with the status quo today and are trialing out new ways of doing incident analysis. That led me to forming my own organization, Jeli, based on a need I saw: a good post-incident review has importance and value to the whole business, but there's a real barrier to entry to getting folks to work on it.

Performance Improvement = Errors + Insights

I want to talk about performance improvement. A lot of folks measure incidents and measure reviews and want to quantify them because they want to improve performance. Maybe performance improvement looks like having fewer incidents. Maybe it looks like not having customers notice them. Counting errors is one way to measure whether or not we're improving performance. However, you're not going to get the performance improvement that you desire just from counting errors; you also have to generate insights. You can't improve performance unless you're doing both of these. Both sides of the equation are important. However, as an industry, we tend to focus a lot on the error reduction portion, almost too much, and honestly, to our detriment. We're not going to get the improvement we're looking for unless we're also measuring and disseminating these insights. This part's hard. We're not trained on this as software engineers. We're numbers driven. We like code. I'm going to talk about this insight generation.

Objectives

At the end, you'll be able to understand some different approaches to incident analysis and how it can generate more insights and improve performance. I'm going to explain some of the social side and how that improves the actual technical parts of our engineering. We're going to learn how asking questions in a different way can impact the insights gleaned and disseminated. We're going to assess appropriate ways of distributing findings and sharing this learning, sharing these insights. I'm going to do this through a couple of different cases and stories. I'm going to tell you three different stories about the value incident analysis has brought about in different organizations. These are true stories, things I have witnessed, but names and details have been changed.

Story One: Getting More People to Use Chaos Engineering Tool

When I was at Netflix, I was on a team with three other amazing software engineers. We spent years building a platform to safely inject failure in production, to help engineers understand and ask more questions about areas in their system that behaved unexpectedly when presented with the turbulent conditions we see in everyday engineering, like failure or latency. It was amazing. We were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots. There was a problem with this, though. The problem was that most of the time, the four of us were the ones using the tooling. We were using the tooling to create the chaos experiments. We were using the tooling to run the chaos experiments, and we were analyzing the results.

What were the teams doing? They were receiving our results, and we were talking to them about it. Sometimes they were fixing what we found, and sometimes they weren't. We wanted to look into that. Why were we the only ones using this? Why was that a problem? The biggest benefit of chaos engineering, and the reason it exists, is that it can help you refine your mental models. You can say, I didn't know that that worked that way. I didn't know that this was critical. I didn't know how this impacted our customers. The fact that the developers of the chaos engineering tooling were the most frequent users of it meant that ours were the mental models getting refined in this situation. We weren't on-call for these systems. A big benefit of this was missing. We weren't the ones whose mental models needed refining or understanding. Yet, we were the ones getting that refinement and understanding.

Sometimes teams would use this tooling, but that would only last for a couple of weeks, or during a high traffic period, or right after a big incident. Then we would have to remind them to use it again. What did we do? We did what any software engineer would do in this situation. We thought, we'll just automate the hard steps. We focused a lot on the tooling, on safely injecting failure and being able to mitigate the blast radius, which was great. We weren't focusing enough on what chaos experiments we needed to create, and how important the results were. We weren't giving teams enough of this context. We wanted to give them context on where they should experiment, and why, and how important a particular vulnerability found within the system was or wasn't to fix.

How did we start to do that? To know if something was or wasn't important to fix, we started looking at previous incidents. We started digging through some of them and wanted to find things like systems that were underwater a lot, or people that we relied upon a lot, or systems whose going down resulted in an incident that was a huge surprise, or incidents that were related to action items from previous incidents. We wanted to use this to bubble things up, help folks prioritize, give them context, and feed back into the chaos tooling. However, through the process of doing this, we found that looking through incidents, looking at these patterns, studying them, and learning from them had a much greater power than just helping the organization create and prioritize chaos experiments better. Spending time on it opened my eyes up to so much more. Things that could help the business beyond just the technical, beyond just helping this chaos engineering tooling.
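As a hedged sketch of the kind of cross-incident pattern-finding described here, the snippet below tallies which services keep showing up across hand-tagged incident write-ups. The report structure, tags, and service names are hypothetical; this is not the actual Netflix tooling, just an illustration of how past incidents could feed into prioritizing chaos experiments.

```python
from collections import Counter

# Hypothetical, hand-tagged incident write-ups; fields are assumptions for this sketch.
incident_reports = [
    {"id": "INC-101", "services": ["playback-api", "edge-proxy"], "surprise": True,
     "repeat_action_items": ["add fallback for edge-proxy"]},
    {"id": "INC-115", "services": ["edge-proxy"], "surprise": False,
     "repeat_action_items": []},
    {"id": "INC-142", "services": ["edge-proxy", "billing"], "surprise": True,
     "repeat_action_items": ["add fallback for edge-proxy"]},
]

# Services that keep showing up, especially in surprising incidents, and action
# items that keep recurring are candidates for the next chaos experiments.
involvement = Counter(s for r in incident_reports for s in r["services"])
surprises = Counter(s for r in incident_reports if r["surprise"] for s in r["services"])
repeats = Counter(a for r in incident_reports for a in r["repeat_action_items"])

print("Most involved services:", involvement.most_common(3))
print("Services in surprising incidents:", surprises.most_common(3))
print("Recurring action items:", repeats.most_common(3))
```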

Here's the secret that we found. Incident analysis is not actually about the incident. The incident itself is a catalyst to understanding how your organization is structured in theory, versus how it's structured in practice. It exposes that delta for you. It's a catalyst to understanding where you need to improve your socio-technical system and how people work together. Because when something's on fire, all rules go out the window. You're just trying to stop the fire. It's a catalyst to showing you what your organization is good at, and what needs improvement. We all have thoughts around this, but the incident actually exposes them. It's a missed opportunity if we don't look into it.

Story Two: It wasn't just 'Human Error'

This reminds me of story number two. This is a separate organization, where an incident had occurred at 3 a.m., which is when all the bad incidents occur. I was tasked with leading the investigation of this highly visible incident after the fact. A senior engineering leader had pulled me aside in the office the morning after and said something along the lines of, "Nora, I'm not sure if this incident is all that interesting for you to analyze." It seemed pretty interesting in my eyes: 3 a.m., all-day mitigation, and it hit the news. I asked, why? He said, "It was human error. Kieran didn't know what he was doing. He wasn't prepared to own the system. He didn't need to answer the page, it could have waited till morning." I was baffled. This was an organization that thought it was practicing blamelessness, without a deep understanding of what blameless meant. When something like this happens, when a Kieran makes an error, it usually ends up being met with instituting a new rule or process within the organization.

There was nothing wrong with this particular organization. This could have been a story from a lot of organizations in the tech industry. We have something we thought was human error. Let's make a new rule or process so that that doesn't happen again. We're going to create a gate. That actually is still blame. It's unproductive. It's not only unproductive, it's actually hurting your organization's ability to glean those new insights that lead to that performance improvement. It's also easier to add in rules and procedures. It feels good. It allows us to move on, emotionally speaking. It says, yes, checkbox, we did something about this problem. It's not going to happen again. However, these implementations of rules and procedures don't typically come from the person that's in the hot seat. They don't typically come from folks on the front line either. That's because it's easy to spot errors in hindsight. It's much more difficult to encourage insights. Unfortunately, adding in new rules and procedures actually diminishes the ability to glean new insights from these incidents.

Despite all this, despite being told not to investigate this incident, that they were just going to create a rule, have a little post-incident review meeting, and move on, I still wanted to talk to Kieran. I wanted to talk to him one on one and figure out what happened. I came into this conversation knowing that Kieran had received an alert at 3 a.m. that, had he spent more time studying the system, he would have known could have waited until business hours to fix. I started the conversation by asking him pointedly to tell me about what happened. He said, "I was debugging a Chef issue that started at 10 p.m. We finally got it stabilized and I went to bed around 1:30 a.m. At 2 a.m., I received an alert about a Kafka broker being borked." That led to interesting finding number one. Kieran was already awake and tired, and he was debugging a completely separate issue. Those two systems were unrelated. I wonder why we have engineers on-call for two separate systems in the middle of the night. I wonder why the other system didn't get handed off to someone when he had already been responding to one page.

I asked him what made him investigate the Kafka broker issue. He said, "I'd just gotten paged for it, and my team just got transferred this on-call rotation for this Kafka broker about a month ago." That's pretty recent. I asked him if he had been alerted for this before. He said, "No, but I knew this broker had some tricky nuances." That led me to interesting finding number two. Kieran's team had not previously owned this Kafka broker. I wonder how on-call transfers of expertise work. I wonder why his team got transferred this system. I wonder how often they had been working with this Kafka broker. I wonder if anyone had actually ever owned this Kafka broker before. I then asked him how long he had been at the organization. Five months. Interesting finding number three, he was new to the organization. He was on-call for two separate systems in the middle of the night. He was tired. I think if I was in his shoes, I would have done the exact same thing. This was not a Kieran change we needed to make. This was not a new rule or procedure we needed to make. This was an organizational change we needed to make. Based on these results and on talking to Kieran, this organization was able to make significant changes to how it structured on-call ownership, how people were on-call for and responded to systems in the middle of the night, and how it trained new employees. We wouldn't have been able to glean these insights without actually studying this information further. This is what leads to performance improvement.

Cognitive Interviews

Post-incident reviews are important, but they're not good. If we can ask some deeper questions, and we can talk to some of the people that might commonly be blamed, we'll find some of the answers that we're looking for, and we can make them good. The technique I used with Kieran is called cognitive questioning. It's something that industries other than software use quite frequently, and they're industries that we can learn a lot from, that have been around a lot longer than us. Cognitive interviews can determine what folks' understanding of the event was. What stood out for them as important? What stood out for them as confusing, ambiguous, or unclear? What do they believe they knew about the event, and about how things actually work, that they believe others don't? These interviews reveal those things as data. If I had just had Kieran in a post-incident review meeting, and I had not had that one on one conversation with him in a way that made him feel psychologically safe, I wouldn't have gotten all those insights. A more senior engineer would have probably spoken up and explained how all of it worked. While they may have been technically correct, I wouldn't have gotten the data about how Kieran viewed the system, about the problems he was experiencing. None of those things would have been exposed, and they wouldn't have been fixed.

Cognitive interviews can also point to a lot of interesting things to continue exploring, like relevant ongoing projects, past incidents that feel familiar, or past experiences. They can end up building up expertise. You can iteratively inform and contrast the results of these interviews with other sources of data, like Slack transcripts from the incident, Jira tickets, pull requests, docs. It's useful to start with generic questions like I did with Kieran. "From your perspective, can you explain what happened in the incident?" You want to make them feel like the expert here, because they did have expertise in that moment. How did you get brought into this event? Some questions I like to ask point them to a specific thing that they said during the incident. What did you mean when you said this? Can you tell me about the ordering of events? You said this thing didn't look normal, what does normal mean? At what point did you discover what triggered this incident? What dashboards were you looking at? What was your comfort level with assessing the situation? How long have you been here? How have you seen things done previously? Are you frequently on-call for stuff like this? Is anyone on-call for stuff like this? How did you feel this went? Giving them this open forum can reveal a lot of things, but you have to be able to ask questions in a certain way, without finger pointing or asserting your own expertise on the situation.

Story Three: Promotion Packets Were Due

I was in an organization that was having upticks in incidents at certain times of year. I was asked to study why and how certain times of year were leading to more incidents. I started looking around at what was happening during those times of year. I noticed they were around the time that promotion packets were due. Promotion packets in this particular organization were done by line managers. They would put together a packet making the case for an employee they thought deserved to be promoted. They would use references like, here are all the things they said they were going to complete at the beginning of the quarter, and here are all the ones that they did complete. As this organization grew larger, this became very numbers driven. Folks didn't have time to read this entire packet, they would just use this checklist. Did they complete these items? This was a detriment to the organization.

When the engineers would set out to do their OKRs at the beginning of the quarter, things would come up. Priorities change. They have to reevaluate. What we would see right before promotion packets were due was engineers trying to hurry up and finish all the things that they had set out to do at the beginning of the quarter in order to get promoted. It was a system that was inadvertently created. Everyone was doing this at the same time. If we hadn't looked into these incidents, and we hadn't reviewed what was happening around the incidents and what else people were dealing with, we wouldn't have realized that this incentive structure we created in the organization was actually leading to incidents. It was a catalyst for incidents. By doing some thorough incident reviews and studying, we were able to make appropriate changes to how we were doing promotion packets.

A good incident analysis should tell you where to look. It's hard to do incident analysis. It's hard to do cross-incident analysis if we don't have good individual data as well. I'll show you a quick screenshot. Where would one look here? This is some message volume after an incident, an actual incident that took place. You might look at the volume of chatter in certain places. You might look at when folks were getting alerted a lot. You might look at when conversations were going on over video conferencing. You might be interested in the tenure of folks and how long they've been here. How often were we relying on super senior folks? You might be interested in the absent chatter. You might be interested in which teams were involved, and their participation level. It seems like customer service was the only one on, on Friday night. I wonder if they were getting the support they needed. You might be interested in which days of the week things were happening, and which teams were participating the most. This can tell you where to look and can help you ask different questions. It can help you understand what folks were dealing with in the moment, so that you can understand the cost of this coordination.
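As a small illustration of the kind of participation questions above, here is a hedged sketch that aggregates message counts by team and by day from an exported incident channel. The export format, field names, and values are assumptions made up for the example, not any particular chat tool's API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of an incident channel; the fields are assumptions.
messages = [
    {"author": "alia", "team": "customer-service", "ts": "2021-06-04T22:15"},
    {"author": "sam", "team": "payments", "ts": "2021-06-05T09:02"},
    {"author": "alia", "team": "customer-service", "ts": "2021-06-04T23:40"},
]

def day(ts):
    # Abbreviated weekday name, e.g. "Fri".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M").strftime("%a")

by_team = Counter(m["team"] for m in messages)
by_day = Counter(day(m["ts"]) for m in messages)
by_team_and_day = Counter((m["team"], day(m["ts"])) for m in messages)

print("Messages per team:", by_team)
print("Messages per day:", by_day)
# e.g. was customer service the only team active on Friday night?
print("Team x day:", by_team_and_day)
```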

It might be surprising, but a good incident analysis can help you with things other than learning. It can end up helping you with headcount: we really need to make a team out of this rotation. It can help you with training. It can help you with planning promotion cycles, with quarterly planning, with unlocking tribal knowledge, with understanding how much coordination effort during incidents is actually costing you, and with understanding your bottlenecks in people.

The Components of a Strong Post-Incident Process

There are components to a strong post-incident process. Here's what we recommend, through our experience doing this as well. You have an incident. You assign it to someone that didn't participate in the event, so that they can ask the open-ended questions and people can give responses without feeling like they need to impress one of their teammates. You identify interviewees and opportunities. You analyze the disparate sources of data. You talk to people. You align with participants on what happened in the event. You facilitate a meeting, and then you generate an artifact. Not every incident deserves this entire process, but the big ones do. And it's not just the big ones; there are a lot more incidents that should be given more time and space to analyze. You can consolidate that process for the smaller ones.

Here are some examples of incidents that you might want to give more time and space to analyze. If there were more than two teams involved, especially if they had never worked together before. If it involved a misuse of something that seemed trivial, like expired certs. None of us have ever had those. If the incident or event was almost really bad. If we went, "I'm really glad no one noticed that," we should analyze it. If it took place during a big event, like an earnings call. If a new service or interaction between services was used. If more people joined the channel than usual.
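Purely as an illustration, the criteria above could be turned into a rough triage heuristic like the sketch below. The field names and thresholds are assumptions; each organization would pick its own signals and tune them.

```python
# A rough sketch of turning the criteria above into a triage heuristic.
# Field names and thresholds are assumptions; tune them for your organization.
def deserves_deep_analysis(incident):
    reasons = []
    if len(incident.get("teams_involved", [])) > 2:
        reasons.append("more than two teams involved")
    if incident.get("teams_never_worked_together"):
        reasons.append("teams had never worked together before")
    if incident.get("near_miss"):
        reasons.append("almost really bad / glad no one noticed")
    if incident.get("during_big_event"):
        reasons.append("took place during a big event")
    if incident.get("new_service_interaction"):
        reasons.append("new service or interaction between services")
    if incident.get("channel_joins", 0) > incident.get("typical_channel_joins", 0) * 2:
        reasons.append("more people joined the channel than usual")
    return reasons

example = {"teams_involved": ["search", "frontend", "checkout"],
           "near_miss": True, "channel_joins": 40, "typical_channel_joins": 12}
print(deserves_deep_analysis(example))
```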

Improving Incident Analysis

When are we ready for incident analysis? When are we ready for a non-template-driven, one-page post-incident review? Having customers means you're ready to benefit from incident analysis. It builds expertise in your organization and actually helps people collaborate and trust each other more. What can you do today to improve incident analysis? You can give folks time and space to get better at analysis. This can be trained. You can come up with different metrics. Look at the people: which teams did we have involved? This doesn't mean pointing blame. It's just figuring out how we're coordinating and collaborating together. We can come up with investigator on-call rotations, so that we're not just having people from the team involved doing the incident investigation, because we're going to gain more ROI from it. The benefit of that is that a person from another team is going to learn a little bit more about how another system works. If we do a lot more of that, everyone's expertise is going to level up. Also, allowing time for investigation of the big ones is extra important.

How to Know Incident Analysis Is Working

How do you know if it's working? How do we know if we're getting more insights generated? If we're getting these insights disseminated? How do we know that performance is improving in a way other than counting errors? If more folks are reading the incident review. If more folks are voluntarily attending the incident review without being required, maybe folks on teams that weren't involved, adjacent teams, sister teams, customer service, marketing. If you're not seeing the same folks pop into every incident. If people feel more confident, ask that confidence question. If you see teams collaborating more, proactively asking each other questions. If you're seeing a better shared understanding of the definition of an incident. If we're not debating the SEV levels all the time at the beginning of the incident, not wasting time doing that.

Here are some actual pieces of feedback that folks have gotten in previous organizations after instituting deeper incident analysis. These are not quotes from Jeli; these are quotes from around the industry about improving incident analysis. "I just changed the way I was proposing to use this part of this system in a design as a result of reading this document." "Never have I seen such an in-depth analysis of any software system I've had the pleasure of working with. It's a beautiful educational piece that anyone who plans on using this should read."

Key Takeaways

Taking time to understand and study how an incident happened can generate more insights and improve performance. It can actually improve how you understand the system as well. Understanding the social side of the incident can improve the understanding of the technical side. Asking questions in a different way, and sometimes this means one on one, can impact the insights that we glean.

Resources

Some further resources on incident analysis. The learningfromincidents.io website has plenty of blog posts from folks that are actually chopping wood and carrying water, and doing the real work to implement incident analysis in their organizations. If folks are interested in how counting errors can actually be a detriment after incidents, I would encourage you to read, "The Error of Counting Errors" by Robert L. Wears, from the medical field. It's a two-page paper and it's excellent in explaining some of this. Also, the Etsy Debriefing Facilitation Guide is an excellent review of how to ask different questions after incidents, in ways that can generate more ROI for you.

Questions and Answers

Rettori: How do you define the threshold for when to do an incident review or not? Is it a money impact? Is it a time impact? What are the learnings that you have in that area?

Jones: It's hard, especially if you're in an organization that is having incidents every single day, which I know a lot of folks in the software industry are. There are near misses. There are incidents. There's always a reason you have to jump into a channel to fix a thing. How do we decide which ones are important to do this on? I would say ones that look like anomalies for your organization. Maybe a lot more people joined the channel than usual. Maybe there were just a few people responding but a bunch of people lurking, or it was generating high interest in general, or it involved something we thought we had fixed before. I think it's good to come up with some criteria in your organization. If there's just interest in general in what we can learn from it, that's usually a good signal that you should do an incident review. I would recommend not doing it only for the ones that hit the news, or the big ones, or the ones that leadership tells you to, because those are the ones that end up being emotionally charged. You're not actually getting learning out of them because you're focused so much on prevention. If you take some of the ones that are more simple in nature, you can start to get that learning out of them faster and build up that muscle. Then, once you do have a big one later on, it's a lot easier for it not to be a witch-hunt.

Rettori: Maybe touching a little bit on how to encourage the actual application team, or the developers, to be involved in the incident reviews. Because in my experience, and you've seen this, it's often a separate team that does that, and they have a process. Most often they do it because it needs to be done, more than because they need the learnings from what happened.

Jones: I think there are a couple of things to keep in mind. It's who does the incident reviews, who facilitates them, and who attends them. I think if you have a group of folks that are not actually writing code for the systems doing the incident reviews all the time, it's a huge miss for the organization. Because what happens is, if you have people that are writing code and participating in the systems doing the incident reviews and switching off, you're leveling up the expertise of everyone in the organization. If Dio is on the search team, and I'm on the frontend team, and Dio does an incident review for my incident, he's going to learn a lot about how frontend works. Vice versa, if I go and do an incident review for the checkout team, I'm going to learn a lot about how checkout works. Over time, everyone's expertise builds up, and people start to understand a lot more of the system as a whole, rather than just their part. It's beneficial to have that incident reviewer on a separate team so that they can ask the silly questions. The review isn't for them; their learning is just an added benefit. It gives people the space.

 


 

Recorded at: Dec 09, 2021
