More More More! Why the Most Resilient Companies Want More Incidents

Summary

John Egan discusses how companies of any scale can improve their understandability by lowering their barriers to incident reporting and simplifying their processes for documenting postmortems.

Bio

John Egan is CEO and cofounder at Kintaba, the modern incident response and management product for teams. Prior to Kintaba, John helped to lead enterprise products at Facebook.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Egan: My talk is titled, "More and more and more: Why the most resilient companies want more incidents." My name is John Egan. I'm CEO and co-founder at Kintaba. We are an incident management product company that makes it easier for companies of any size to practice incident management within their organization in what we consider to be a simpler and more successful manner. Before that, I built tools at Facebook, as the company scaled from about 4,000 to 20,000 people, both internal tools as well as helped to build the external tool, Facebook Workplace, which is Facebook's public enterprise offering. I have a real passion for incidents. It's a space that I think is one of the most fascinating areas inside of our industry in terms of the way that we approach them, as well as how we're starting to adapt the way that we record them, and react to them within organizations, primarily in tech and how we're absorbing a lot of the best practices from other industries. I'm going to get into some of the maybe more controversial ideas around how we might want to track incidents. We're going to talk a lot about why more incidents are better. We'll get to dive into a couple of examples further than that.

First of all, everyone hates incidents. Maybe you have a person or a team internally that likes incidents, but I imagine on average, across your organization, most people don't respond to incidents by being excited. Most people will generally hear about an incident or be involved in an incident and have a reaction more similar to what we have with grumpy cat here, in terms of just trying to make it stop. I think that's a natural human reaction to things that are perceived as bad, things that are putting our company at risk, oftentimes, by definition. In fact, with most incidents, our first exposure to them tends to be an incident so large that we're drawn into it, despite maybe even being on the periphery of it. By definition, our first experience with an incident at most companies tends to be a negative one. This actually has some practical effects on organizations, and on the way they approach incidents, that aren't so great, I think, and that we're going to talk a little bit about countering.

The 'I hate Incidents' Metrics

I call these the 'I hate incidents' metrics. This is the way a lot of organizations we talk to are currently measuring incidents and thinking about incident management. They're thinking about mean time to recovery, mean time to failure, mean time between failures. MTTR is the other one: mean time to recovery, or repair, or resolution. You get this acronym soup of mean times. There's a bunch of reasons these aren't great. If you want to boil it all down to the 30-second answer of why these aren't the best metrics when you're building your incident management system, it's because ultimately these are vanity metrics. They're metrics that are pretty easily gameable. They're metrics where we can go in and make bad decisions for our company that ultimately produce better-looking metrics. For example, when trying to push mean times down, a natural reaction is to file incidents later, or maybe not file incidents at all. Another problem with these in general is that we're encouraging people to work towards an average across a series of black swan events, which is what major incidents should be. Because of that, we're actually ending up with a number that isn't particularly representative of any of the individual incidents that we're having. I'm not a big fan of these in general.
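To make the two complaints above concrete, here is a minimal sketch using made-up recovery times (the numbers and variable names are assumptions for illustration, not data from the talk). It shows how a mean taken across one rare large outage and several small incidents lands far from every individual incident, and how not filing, or filing late, makes the number look better without anything actually improving.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes: seven small incidents and one
# rare, very large outage (the "black swan").
durations = [12, 8, 15, 10, 9, 11, 14, 720]

mttr = mean(durations)        # the reported "mean time to recovery"
typical = median(durations)   # what a typical incident actually looks like

# Distance from the reported average to the closest real incident.
closest_gap = min(abs(d - mttr) for d in durations)

# Gaming option 1: never file the big outage.
mttr_unfiled = mean(d for d in durations if d < 60)

# Gaming option 2: file every incident 5 minutes late, so the clock starts later.
mttr_late = mean(d - 5 for d in durations)

print(f"MTTR: {mttr} min, median: {typical} min")
print(f"Closest incident is {closest_gap} min away from the MTTR")
print(f"MTTR if the outage is never filed: {mttr_unfiled:.2f} min")
print(f"MTTR if everything is filed 5 min late: {mttr_late} min")
```

With these assumed numbers the mean is 99.875 minutes while the median is 11.5, and both gaming strategies lower the reported mean even though the underlying incidents are unchanged.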

Our Goal Is to Produce Resiliency, Not Just Recovery

The way I like to think about this when I talk to people about, how are you thinking about your incident management system? Why is more, better? Why aren't we just measuring how to reduce as much of this stuff as possible? I like to back up and think a little bit more about the goals that we have within our companies when we're thinking about resilience. Our goal is that it's resiliency. It's not just recovery. Being able to recover is important, and it's something that our teams should be capable of. It's not really the primary thing that we care about long term. What we actually care about inside of companies is long term resiliency, that ultimately results in a reduction of catastrophe, not just incidents overall. When I want to back into this problem from that first principles of what we're trying to accomplish, it's easiest to look out at industry and say, who's actually good at this? Who's good at being resilient? What are they doing that's great? What are the things that they're measuring at a top level, or macro level that makes them really successful?

The Airline Safety Revolution

Whenever I talk about industries that are good at resilience, anyone from within the resilience space, resilience engineering or otherwise, always gets on the tip of their tongue, the same industry. That's the airline industry. This is an industry that over the last about 12 years, has carried more than 8 billion passengers without a single fatal crash. The airline industry really is what we call the crucible, inside of which modern incident management grew. We inherited almost everything that we now know, inside of technology organizations, when it comes to incident management and response, modern incident management response from the airline industry. I think looking at them really closely is an important method when it comes to trying to determine, what's the right way to become more resilient overall? What are the right metrics to be tracking?

I think a lot of people don't realize that when you're looking at this industry, you really actually want to look at recent history, despite the fact that a lot of the discoveries in human factors and blame-free processes were built out in the '50s and '60s, originally, from an academic standpoint. What really happened is that after a series of major crashes in the '80s and early '90s, the industry made a top level change. They came out and said, we're going to increase the overall reporting of incidents within our organization aggressively. We're going to make those incidents as available as possible to everyone else in the industry. We're going to make a concerted effort because we think our process is good, we think our method of learning is good, but we just don't have enough of these things happening because people are ultimately afraid to file them. More incidents really ended up being the answer. The industry pressed hard and got an order of magnitude increase in incident filings and reports. Out of that came the modern industry we know today, which is what we consider to be the safest mode of transport.

Incidents in the Space Industry

Another adjacent industry that we can look at, for starting to understand maybe some of these metrics of more is better, might be the space industry, which is a sister of the aviation industry. In the space industry, they faced a different problem entirely, which was they saw it was working well for the airline industry in general, and said, we need more incidents, but we don't fly these things that often. We don't want to have to collect all of our incidents on each flight, we need to find another way to get more. Organizations like NASA actually reach out and find incidents across industries that aren't theirs, or inside of organizations that aren't theirs. Then they absorb those incidents as if they were their own, and they write NASA postmortems about them. I thought this was so cool to discover. NASA actually maintains a website, where they track case studies of incidents happening outside of NASA, and they write up lessons learned, underlying causes. They talk about them as if NASA had been the one who caused the incident. It was really fascinating to me to see. It's this great indicator of just how critical it is to go and have enough incidents happening inside of your company that they would do this. They would go out and find incidents elsewhere.

How to Get More Incidents

Most of us don't really have to do this. Most of us can look at these initial learnings from these two industries. We can say, we need more incidents and it's not actually a problem within our organization that we don't have enough. It turns out that what we're not doing is we're not recording them. Then you ask, why aren't we recording these incidents that are happening inside of the company when we know that there's a benefit to getting more incidents into the system? I think there's some very simple steps that we all need to take in general. I think this is all just beginning to happen across especially medium sized organizations in the tech world as well as smaller organizations. Even large companies are pushing through this. The biggest one is really this first one, which is, how do we lower that barrier to reporting? How do we increase our number of SEV-2's, or SEV-3's, or near misses? Netflix was really at the forefront of this. How do we log every near miss? How do we make sure that we learn from it? We don't just log it and know that it happened. The immediate action that has to be taken is get them logged, get them recorded because they're happening in your company every day.

The way you do that, the way you get that barrier lower and for people to react to that lower barrier is the second item, which is, you have to simplify the reporting system and process inside of the company overall. You can't have this big mega reporting process of multiple pages of documentation that you have to follow, meetings you have to show up at. Terrifying reviews with managers and people up and down the reporting chain. You've got to make it dirt simple to come out of the gate, and say, a thing happened, let's get it logged. Let's get it recorded. Let's get it put into the system. Let's make sure we learn from it quickly. Then, finally, we have to show through actions the result of those reports. We have to make sure that people can see the changes happening in the company as a result of those incidents being filed.

You can do that two ways. I think the easiest way to understand those two, is to look at this quote from Charles Billings, who was the chief of aviation safety research at NASA, which was this realization that the reason they're able to get people to file these incidents, pilots and unions that were originally against the idea of more filings, is they, A, had a sincere interest in improving safety. Which I think is true across our industry as well. I think most people working inside of tech companies have a desire to reduce the number of major catastrophes. Number two is actually the more important one, which is a sincere belief that what they're reporting is going to be used to make things better, not just used against them. This is pretty critical. If you don't establish that culture, then you won't get more. If you don't get more, then you won't be able to increase anything about the ability of your process to function.

More Incidents Reported = Fewer Catastrophes

I wish I could just say, more incidents reported means fewer catastrophes, therefore, we just need to make this chart happen within our company, and we're good to go. Increase in overall incidents recorded, decrease in SEV-1 major outages, like, it's going to work. That's not really the case. It's not quite that simple. When we say more, we don't just mean more incidents, we mean a lot of things. We can prove this out by taking another life critical industry. We can take the healthcare industry, which tried to do exactly what I'm talking about. They tried to take these learnings from the airline industry and from NASA, and say, what we've got to do is we've got to get these incident numbers up. We've got to get everyone recording. There was a concerted effort really across most hospital systems and healthcare systems, nationalized or otherwise, to increase overall reporting. After this had happened, and been going on for a couple of years, there were studies that were run. On the upside, we found that with those incident reports increasing, litigation claims per bed seemed to be negatively correlated, which made the lawyers happy. However, the hospitals turned out to be pushing back a lot harder than originally expected, in terms of actually filing the reports. They were afraid to file them because they were worried it would make them look less effective in a competitive environment, and they were concerned that it would bring additional regulatory attention to them. That these reports of incidents would be used as ways for governments and regulatory agencies to file fines as opposed to being used primarily to make safety better, as it was being done effectively in the airline industry. Then, worse than that, it turned out that there wasn't really an apparent association between these reporting rates going up, where the industry pushed to get people comfortable with filing at all, and eventually got those filing numbers up. You didn't actually get a positive correlation between that and the mortality ratios.

This is pretty concerning. Your whole goal is you need to report more incidents. I wish I could just say that's the thing you can take away from this entire conversation, but you actually need to take this warning away as well, which is that it's not just about having more incidents. We can go back to Charles Billings again here, same guy from NASA, who went on to actually help the healthcare industry, go and try and work through this challenge of, how do we implement these solutions? His response was, it turns out that the only takeaway anyone got from us trying to push this in was that they just needed to report the incidents. Through reporting, that would be enough to fix all problems. That the fixes would just be generated magically out of nowhere, and then that would enhance safety overall. Sadly, it's not true. There's a whole process here that I think we're all aware of with incident management. I think this process becomes critical for more incidents mattering, because an incident needs to actually kick off an entire flow of response to the incident, the resolution itself. Then importantly, the reflection on the incident and the learning about the incident, and the distribution of that information. How do you make sure that when an incident happens, that it actually gets reviewed, and that you actually change the way your company runs based on it?

More Incidents Is Equivalent to More Learning

It turns out, it's not really just about more incidents. It's really about more learning. It's about that outcome. The learning doesn't happen automatically. The learning actually takes a pretty concerted effort in and of itself. I think we all in the industry have an understanding of what it means to try and learn from the really big failures. We're able to look at them and say, "There's been a major outage. We need to go and write a big report on this. Let's get 10 pages written with every root cause and every contributing factor, and every impacted person. Let's bring everyone in a room and make sure we read it." When you're thinking about a world where you want to maximize the amount of learning, you actually have to think about two other things. You've got to think about simplifying that process for creating learning documents, and then, greatly increasing the distribution of the information.

Make Your Postmortems as Public as Possible

I think the easiest way to think about doing that is to actually go and make your postmortems as public as possible, a couple of these things in parentheses. Because the organizations that are the best at this actually do that. If you go to NASA's website, they actually have a learning center where they've got over 800 postmortems written up, which is just all of the things that they've learned over time. Google publishes their postmortems publicly, generally. I think partially because the public demands it. I think ultimately, it's turned into a really powerful process for them, to make sure that they write accounts of what happened that can be consumed by anyone in industry. Ideally, people even outside of it. Cloudflare does the same. Amazon does the same. Writing these things up and publishing them as publicly as possible turns out to be almost the more important thing, but it's really a follow-on effect of having filed more incidents: you have that process in place. We write more incidents. We follow that process. We aggressively publish those postmortems out publicly. We write those postmortems in ways that they can be consumed, even if that means writing them simpler.

Key Takeaways

The takeaway has changed a little bit from just being, write more incidents, make more incidents happen, over to this world of, we need to make it easier to file the incidents. We need to make it safer to file those incidents. We need to simplify the process so that you're writing more postmortems, so you don't feel afraid of going into this gigantic and terrifying template that maybe you don't want to fill out, or you don't want to distribute, or you don't want to have to go and research. Then we have to distribute those learnings out to as many people as possible within the organization so that they all read it. When we talk about more, we're really talking about all of this stuff. The trigger for all of this working is to file more incidents, and to get comfortable with the idea that you want to see a metric there increase. You want to see an increase in incidents being filed in your organization. That's a positive thing. You're not really quite so concerned about, for example, the amount of time between incidents. If anything, you want that to decrease. You want more of these things being filed. If you do that, then you have an opportunity to be kicking off a process which hopefully is well defined in your organization. If it's not, come check out Kintaba.
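As a rough illustration of the metric described above, the sketch below counts filings per month and, separately, SEV-1 filings per month. The filing log, the severity labels, and the function names are all hypothetical; the point is only that "total filings going up" and "SEV-1s going down" are two different series to watch, and that the first one rising is the desired direction. It says nothing about causality, which depends on the learning process that follows each filing.

```python
from collections import Counter

# Hypothetical filing log: one (month, severity) pair per filed incident.
filings = [
    ("2021-01", "SEV-1"), ("2021-01", "SEV-1"), ("2021-01", "SEV-2"),
    ("2021-02", "SEV-1"), ("2021-02", "SEV-2"), ("2021-02", "SEV-3"),
    ("2021-02", "near-miss"),
    ("2021-03", "SEV-2"), ("2021-03", "SEV-3"), ("2021-03", "SEV-3"),
    ("2021-03", "near-miss"), ("2021-03", "near-miss"),
]


def filings_per_month(log):
    """Total incidents filed in each month, in chronological order."""
    counts = Counter(month for month, _ in log)
    return dict(sorted(counts.items()))


def sev1_per_month(log):
    """SEV-1 incidents filed in each month, including months with none."""
    months = sorted({month for month, _ in log})
    counts = Counter(month for month, sev in log if sev == "SEV-1")
    return {month: counts[month] for month in months}


totals = filings_per_month(filings)
sev1s = sev1_per_month(filings)

for month in totals:
    print(f"{month}: {totals[month]} filed, {sev1s[month]} SEV-1")
```

In this made-up log the total climbs from 3 to 5 filings a month while SEV-1s fall from 2 to 0, which is the shape the talk says to aim for.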

When you have that well-defined process, you'll be writing up these learnings quickly to follow up the incident that happened, no matter how small it was. Make sure to come in and write. I have a blog post called the 4-second postmortem, which is all about writing one sentence in the simplest case. Even in the slightly more complex case, you might come in and write a half page postmortem or a quick paragraph on changes we need to make. Then, finally, you have to have a concerted effort to distribute those learnings out to more people within the organization. Just putting it into a Google Drive is not enough. Emails need to be sent. Sites internally need to be maintained that potentially carry all of this. There need to be events and there need to be meetings and there needs to be a part of your culture within the organization that says we care about reading these things. We talk about them after they're read. We distribute them widely across the organization, not just within the SRE team. We distribute them everywhere, such that the whole organization can learn and improve our resiliency as we go forward. I think if you do all of those things, if you focus mostly on increasing your practice of incident management, as opposed to just focusing on decreasing the here and now of the active incident, you'll get a lot closer, I think, to putting this guy up on the moon.

I think that's really what we're all thinking about when we're practicing incident response is moving our thoughts from that grumpy cat of, how do I get this to never happen, like to stop? To, what practices and processes do I put in place to make sure that this doesn't happen again, in a catastrophic way? That's my talk about having more incidents, and more reporting, and more learning, and why more is better. Why the most resilient companies in the world practice finding as many incidents as they can, within and without their organizations.

Writing Easily Consumable Postmortems

There was a question here about writing easily consumable postmortems being time consuming, as well as difficult to do. It's not always easy, and it is time consuming, so who in the org do you suggest author the postmortem? I think really one of my points here in this conversation about lowering the barrier in general, is that you should have a spectrum of different types of postmortems that are being written. For simple incidents, incidents that you're maybe capturing down at a SEV-3 level, maybe the near miss, you might not have to have a multi-page complex technical response. You can probably have a little bit more of straightforward, immediate takeaways. Things that you would learn if you were in-person, maybe even in a hallway walking past someone and saying, we need to never configure things that way again. We need to never roll our scripts serially or in parallel. I think that's a longer conversation within the organization. When you're lowering that barrier, you really want to focus on how do we make the easiest postmortems easy, so that folks who would otherwise write them and contribute some piece of information back as a learning to the company, don't shy away from it out of fear of a multi-page, complex document that has to be put together.

Questions and Answers

Butow: One of the things I was wondering about, too, is, you had some really interesting information about NASA. How they write up postmortems for incidents that aren't even from NASA. How did you find that out? Did you find out anything else interesting when you were looking into this?

Egan: I spend a ton of time just trying to dig into postmortems. I always do this. I'm a bit of a postmortem junkie. There are some resources online with them. There aren't really too many exhaustive ones. It's a lot of Google searching. It's a lot of rat holing around what companies seem to be doing it, and then finding it. The NASA documents are actually on what used to be, I think, an employee-only page that was opened up more broadly. Unfortunately, I actually just tried to share the link to it, and I think they might have locked it back down over the last week or so. I'm going to poke around and see if maybe they've just moved it to that lessons learned site, because I can't imagine they would intentionally try to hide those. They're super valuable. I'd never seen companies doing that before. NASA is really the first time I'd seen an organization follow on so aggressively to the idea of, we've got to have more of these reports because that's the only way we learn. We don't necessarily want to generate them through our rocket launches, or through things that put human life at risk. I think the closest thing we have to that in the tech industry is probably maybe the chaos engineering space, where it's like, let's generate these incidents. Let's just make them happen, so that we can get more of these learnings. It's still direct from within your organization. I think when gamedays are run well within organizations to try and produce other incidents, they often don't look like the things that affect necessarily our infrastructure. They might look like, take down a third party. Take down a data center because of a sandstorm. I think that's maybe close to it. I'd love to see more companies sharing, how would we have interpreted this outage inside of our company, and what would the learnings be here versus necessarily waiting for it to happen to them?

Butow: That's a really interesting idea. I love that idea of maybe everyone say to a few companies, let's all write a postmortem for this specific incident or inject this specific failure and see how we handle it, like one type of third party going down. That's really interesting.

There's a correlation between an increase in observability and incidents being reported, should a tech team focus on better observability over introducing new processes? What's your thoughts there?

Egan: I think in general, these things all produce outcomes that lead towards this general goal of how do we get more of these incidents filed? They capture different types of incidents. It actually turns out at most companies, that the vast majority of incidents are filed by humans. It tends to be a human decision to say, these observability metrics, or general trailing metrics, or otherwise, are actually valid to the point that we need to go and address them directly. I think what observability does really well, is it causes accountability towards things that ought to be incidents very quickly, because you have public availability of the information. If you have more humans seeing that data, it's more likely one of them is going to make that call that we need to go and declare an incident versus trying to sweep it under the rug or hide it.

Then I think on the other side of that, the worst incidents often are ones that you don't have any visibility into, where the indications are inbound customer support requests, or a conversation happening in the hallway about a near miss that's about to go on. I think you've got to attack it from both angles. It's one of the reasons I really believe in up-leveling when you're talking about metrics, away from just, how quickly are we responding once the incident happens? Up-leveling that out to say, "No, we're really just doing whatever we can to get the incident count up." I do think it has you look at tools like observability. It also has you look at the human factors of how do we change that culture and simplify that process so we're not afraid to file or, which is even worse, avoiding filing incidents because we don't want those numbers to go up.

Butow: An interesting comment in regards to the number of incidents that you report. It says, I think that more traditional companies worry about customer's perception about the number of incidents. What's your thought there, if you do have to report these to customers?

Egan: There's certainly that moment of change within a company from going from, we don't really talk about these things, to we talk about these things. I think if you're going to be public, which is the ideal situation about the increase in incidents that are happening, I think you should be public about that adjustment of practice as well, towards the statement of we're going to start being public about it. We're going to talk more about it in a way that you're going to hear from us. I think what's important there is to be really clear about your delineation between major, minor, near miss learnings. I think NASA is a great way to point back to. NASA calls these things learnings, first and foremost. That's what the goal is. I think if you phrase it externally that way, the public is much more accepting of it. You talk about here are learnings from this past month, here are learnings as a company from this past year. It's just a different attitude, I think, to take towards it, versus here are the things that are going badly. Here are the things that aren't working. That shift, I think has to happen both internally as well as externally. I think we're seeing larger companies doing that more, not just because they have to talk about them, but because I think that it actually garners them a growth in community from the technical space. Cloudflare has done a pretty good job of this, of being public when they have trouble even before it's publicly reported. Other companies that do it as well, I just think benefit from it. Historically, they've only talked about them inside of maybe their technical communities. The more open we get about that the more we're going to be able to both be internally more aggressive about reporting, as well as externally.

Butow: That's an interesting idea as well to be able to share our incidents. You mentioned that you do think it's improved over the last few years. How do you think we can really blow that open and just see a lot more public postmortems being reported from the tech industry, because maybe in the end, NASA would be looking at our postmortems to learn more, and that would be pretty cool.

Egan: These cultural changes that happen across organizations always happen in little spurts. I think we had the beginnings of this over the last five years or so. I'm seeing a really interesting transition happening from postmortems being written and filed away, to postmortems ending up on Hacker News right away. I imagine there's an increasing desire within companies to get those postmortems published more officially from themselves to those public social channels more quickly. I think that's the next stage of this progression is how do we start to track these things across companies? Where do we all go? Right now it's a list of websites. We go to the NASA website. We go to Cloudflare's website. We hit Hacker News. Maybe there's a Reddit, subreddit. I really think that's the next change that we're looking for is, where does this community grow in terms of that general acceptance. Once that happens, you then get orders of magnitude propagation right out across the technical communities of acceptance. There's a desire to do this more often, because your company isn't really operating if you're not also publishing your learnings. Stack Overflow got us there from a social standpoint. You can build up karma from the perspective of producing learnings and sharing them around as an individual. I don't really think there's an equivalent yet for companies, and there probably needs to be.

Butow: It's interesting, too, it makes me think about what we could learn from maybe the security industry, how a lot of companies will work together, have groups where they come together to talk about recent vulnerabilities. Then they also report on vulnerabilities, so you can just stay like an RSS feed, or join a mailing list to see everything that's happening. Do you think maybe we could learn from the security engineering space, for just broader incidents as a whole?

Egan: I do. I knocked on the medical industry a little bit. The medical industry has been doing the true version of postmortems, actually after death types of research in their conferences for decades. Where when there is a life lost, they'll have conferences where all the specialists will come together, and they'll actually do a general wide scale review of why is that working? Why didn't that work? What didn't you do right? What should we change about best practices? I think that that practice really could propagate out from the security industry as it has propagated up from them out to the wider industry, so that we have more of these shared learnings instead of siloed knowledge inside of companies. I would love to see more FailCon style conferences that focus on that. The startup industry has done this forever. When startups go under there's been a series of conferences that have been, come together, talk about why it didn't work, and that community learns from it. I think, similarly, we can learn from that across organizations, within companies, and probably actually more effectively.

Butow: What's your suggestion on a template for postmortems? I know you've got to think about this a lot working at Kintaba, creating Kintaba as a company after working at Facebook. What did you really take away from that to say, this is what we really need to have?

Egan: I struggle a lot with that question, because it's turned out that the adoption within a company of a single postmortem template is actually a little bit antithetical to the idea of trying to increase the number of incidents filed. Because the template for your postmortem defines the end state that people are going to get in after they have an incident, and that influences whether or not you're willing to file the incident. Where we're going inside of Kintaba, and we're not there yet, we're a young company, but where we're trying to run towards is this idea of multiple types of templates that go everywhere from as simple as a one sentence, what did we learn here? Where you would make up for the fact that maybe you're not walking in a hallway anymore, in remote work, and you don't get to walk past the person who was involved. They say, never roll Kubernetes that way again. All the way out to the much more complex templates, which are from folks like Google that they've published. There's a pretty decent Google postmortem template in their public SRE book that digs into the way they do it, which focuses really heavily on, what happened? Where was our root cause? What went wrong? What went well? Then, critically, where did we get lucky? Which is a really important section that I think a lot of folks will occasionally overlook inside of postmortems, which talks a lot about this wasn't our process working. This was like something just happened to work out. Where we got lucky is the super important section, especially for more complex incidents, because it's easy to overlook those things as that was working versus that just happened to work out. It's critical to go through that section when you're trying to remediate into your processes to make sure you're more resilient, long term. To be able to say, we got lucky here, and here. Let's codify that stuff, so we don't have to rely on luck happening again.
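One way to picture that spectrum of templates is a single record where only the one-sentence learning is required and the heavier sections are optional. The sketch below is an assumption made for illustration: the class name, field names, and plain-text rendering are hypothetical and are not Kintaba's or Google's actual schema. The section names simply mirror the questions listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical postmortem record: one required learning, optional depth."""
    title: str
    learning: str  # the one-sentence takeaway; the only required content
    what_happened: str = ""
    root_cause: str = ""
    what_went_wrong: List[str] = field(default_factory=list)
    what_went_well: List[str] = field(default_factory=list)
    where_we_got_lucky: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render as plain text, leading with the learning and skipping empty sections."""
        lines = [self.title, f"Learning: {self.learning}"]
        for label, text in (
            ("What happened", self.what_happened),
            ("Root cause", self.root_cause),
        ):
            if text:
                lines.append(f"{label}: {text}")
        for label, items in (
            ("What went wrong", self.what_went_wrong),
            ("What went well", self.what_went_well),
            ("Where we got lucky", self.where_we_got_lucky),
        ):
            if items:
                lines.append(f"{label}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Simplest case: the hallway remark, written down.
quick = Postmortem("Bad rollout", "Never roll Kubernetes that way again.")

# Heavier case: same record type, more sections filled in.
detailed = Postmortem(
    title="Checkout outage",
    learning="Config changes need a staged rollout.",
    what_happened="A config push took checkout down.",
    where_we_got_lucky=["An engineer happened to be watching the dashboard."],
)

print(quick.render())
print()
print(detailed.render())
```

Keeping the simple and complex cases in one shape means a near miss can be filed with two fields, and the same record can grow into a fuller write-up if the incident turns out to warrant it.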

In terms of templates, that would be great. Definitely that Google template, internal to your company. You also should really track readership of postmortems, and see the ones that get read the most. In a lot of ways, writing postmortems and knowledge distribution is like an SEO problem. It's like, how do we go internally and get people excited about this stuff and reading? Then model your future postmortems after the ones that do get read, after the ones that are written well. Because you have an audience that's not necessarily mapped to everyone else's audience. Your audience in your company, if you're primarily a small SaaS business selling to a couple of major enterprises, is going to look very different than the Google major scale tech postmortem.

Butow: Do you have top three tips for how to write a postmortem that people want to read, that you can share with us?

Egan: It's like writing a newspaper article is the way that I think about it. Lead with the important data, because people might only read that first sentence. Lead with the most important takeaway in that first sentence. Make sure you include a section about what you learned as the person with the most context. It's a critical part of writing an incident report is you're the only one who was there, everyone from the outside is going to have a different set of hindsight context into the incident. If all you do is get primary learning number one into the first sentence, and then what unique perspective you had as the participant in it, in your second section, I think you'll actually get 70%, 80% out of what most people get out of postmortems.

Recorded at:

Dec 03, 2021

John Egan
