
Two Years of Incidents at Six Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals



Vanessa Huerta Granda looks at real-life examples of companies she has worked with who chose to invest in improving their incident programs and have seen it pay dividends.


Vanessa Huerta Granda is a Solutions Engineer at Jeli, helping companies make the most of their incidents. Previously, she led Resilience Engineering at Enova. She has spoken and written on incident metrics and sharing learnings, and in 2021 co-authored Jeli's Howie: The Post-Incident Guide.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Granda: We're going to talk about some incidents. Incidents prevent us from meeting our goals. You could have any goal that you want, maybe your goal is to get all of the Taylor Swift fans to make it to her concert. Maybe get folks home for the holiday seamlessly, or get goods shipped across the world, or just get people to watch Succession, or Game of Thrones, or whatever it is that HBO is doing nowadays. Sometimes things happen, and that prevents us from meeting our goals. Incidents never happen in a vacuum. For all of these high-profile incidents, we can usually highlight a number of similar experiences, or other incidents. We can also highlight a number of items that led to them happening the way that they did. For the Southwest outage, it goes back to decisions made decades ago when the airline industry was first being deregulated. My first job out of college was actually at an airline. All of these things are near and dear to my heart. In order for organizations to improve and to achieve their goals in spite of these incidents, there needs to be investment made on their end, investment in a culture of resilience.

Twins Analogy to Incidents

I have 17-month-old twins. They were born a bit early, like twins usually are. If you've had children, you probably know that while you're at the hospital, everything is just like perfect. We gave birth in Chicago and we sent them to the nursery. I got to rest. I even got to watch The Bachelor actually. Then we took them home. I have been in incidents for many years. I can tell you that this was like the hardest major incident of my life. Every night for the first few weeks, they just would not sleep. If one slept, the other one was crying. It's like, they were just playing tag team on me. We tried troubleshooting them. It was like that incident that just fixes itself at some point. At 9 p.m., they start crying, and then at 5 a.m., they were like, ok, now we're cool. Basically, if we think of our lives, if we think of our goals, at that time in my life, my goal was to just enjoy my children, maybe keep a home, maybe get to eat, and then eventually go back to work. These first few weeks, I was lucky if I even got to shower. We were sleep deprived. We were doing our best, but it just kept happening until, and I'm really acknowledging my privilege here, we started being able to get on top of it. We were able to invest our time, our expertise, and really our money.

First, we needed to get better at the response, like at those actual nights when things were hitting the fan. We tried a few things, even though we were sleep deprived. We tried some shifts. We tried formula. We invested in a night nanny, and that was the big one. With this, we were able to start getting some time for ourselves. Being able to sleep actually cleared our heads for us to understand what was going on. We could have stopped here, but I didn't have unlimited money for a night nanny. With some of that extra energy, I started trying to understand what was going on. I talked to some friends. I read some books. Every day, we would hold a post-mortem, just while eating yogurt or whatever it is that we could get our hands on. We would start applying those learnings into our lives. We fixed some of our processes around swaddling. We realized that Costco actually delivers formula and diapers, and so that's a big-time savings not having to get two car seats into your car in the winter in Chicago, and then drive to Costco. That gave us a bit more energy, more bandwidth. We could start looking at trends. Like all incidents are different, all babies are different. We realized there were some times maybe when one parent could do it on their own, and the other one gets to clean up or cook. There were times of the day when we both needed to be there. We were doing our own cross-incident analysis, and we started coming up with some ideas. Some of them were a little out there. Like, let's just move them to their own bedroom. Let's just drop the swaddle because she clearly hates the swaddle.

Investing time and effort into the process, then into the post-mortems, and then doing that cross-incident analysis gave us the bandwidth to meet our goal. By the end of my maternity leave, I was able to go back to work. I was able to cook and shower. Really, I was able to enjoy my life and enjoy my new family. Life with them is always going to bring challenges. By living in this culture of, let's just keep learning, let's just keep trying, let's be resilient, we know that we can pivot and do the things that we enjoy. That's basically what we're going to talk about. This is the lifecycle of incidents and how we can become resilient, how we can learn from them. How we can get better as a system so that we can accomplish our goals.


I'm Vanessa. I work at Jeli. When I'm not wrangling adorable twins, I spend a lot of time on incidents. I've spent the last decade working in technology and site reliability engineering, focusing on incidents. My background is in FinTech and, actually, in the airline industry. I focus on the entire lifecycle of incidents. I have been the only incident commander on-call for many years. I have owned the escalation process. I have trained others. I have run retrospective programs. Most importantly, I've scaled them. That's actually the most difficult but also the most important thing, making sure that it doesn't just live with you, with one person. While I do this for fun, because I enjoy it, I also know that the reason why I get paid to do this is because having a handle on your incidents allows your business to achieve its goals. In the past couple of years, I've had the chance to work with other companies, helping them establish their programs. I've seen how engineering teams are able to accomplish their goals by focusing on resilience.

The Mirage of Zero-Incidents

Let's go back to this picture. We will never have zero incidents. If we do, that means we're not doing anything, or nobody's using our products. I believe that incidents are a lifecycle. They're part of how a system gets work done. First, we have a system, say Ticketmaster, and then something happens. Maybe Taylor Swift announced her world tour. Tons of people try to access the system, the system fails, and now you're in an incident. People work on it, and it's resolved, and the site is back up. Life is back to normal. As you can see, though, we're not even halfway through the cycle here. Ideally, after the incident is done, you do something that is focused on learning. Even if you don't think you're doing any traditional learning, people naturally like to talk to each other, people like to debrief. Your learning might be just like getting together, grabbing some coffee with your coworkers after the incident. Or it can be like an actual post-mortem meeting, retrospective, whatever you want to call it, your 5 why's. Or it can be like your incident report, whatever you want it to be. Those learnings are then applied back into the system. Maybe we make a change to our on-call schedule, or the way that we enqueue customers, or our antitrust legislation. That is the new system. That's what we're working with.

The Negative Effects of Incidents and Outages

Why do we care about this? We have this whole track here, resilience, talking about outages. We care about this because incidents are expensive to companies. If we think of the cost of the incident, there's the incident itself. Like, no, we lost money, because the site was down for like 20 minutes. There's the reputational damage. Like, no, I wasn't able to get these tickets. I wasn't able to fly home. I'm just never going to fly this airline again. There's the workload interruption. If you have an incident that lasts an hour, and you have 10 engineers, that's time that they're spending on this incident. Then when the incident is over, they're not going to go back to their laptops and like, I'm going to work on my OKRs. That just takes a big part of your brain, that takes some time to get back to what you're doing. It has an impact on our goals and our plans. If our engineers are fighting fires, they're not working on new features. What we see is that unless we do something about it, people end up getting caught in a cycle of incidents where there's just no breathing room to get ahead.

Incidents are part of doing work, but that doesn't mean that we can't get better at them. If engineers are constantly fighting fires, it's going to impact the way that they deliver and the speed at which they deliver it. If our customers are constantly seeing outages, it's going to impact how they interact with us. It's going to impact our goals. As a company, you've got to make money. If your customers aren't using you, then that's a problem. There's good news about incidents, I mentioned this earlier. Having incidents means that somebody cares about your work, that it matters to them whether you're up or down or that you're doing the thing that you say you're doing. I think that's the thing about tech. Our users don't care about the language we use or the methods we use, they care about being able to rent cars or get their money or get tickets. We owe it to our users that they're able to do these things.

The Importance of Resilience in Incidents

Here's where resilience helps. It's the capacity to withstand or to recover quickly from difficulties, from your outages, from your errors, from your incidents. Resilience can help us turn them into opportunities. Most of the time, people don't care about resilience. What usually happens is that you have an incident, you resolve it, and then you move on. We have Chandler Bing here saying, "My work here is done." Not because they want to. I've met tons of engineers. My dad's an engineer. I'm an engineer. We're always trying to fix things. Sometimes we just don't have the bandwidth to do anything else other than move on. That's how we get stuck in this cycle of fighting incidents. There's a better way. I have been lucky enough to work with and for a number of organizations that are leading the way in improving resilience in the tech world. The better way includes a focus on three things, a focus on the incident response process, a focus on learning from individual incidents, that's like your post-mortems, your incident reports, just chatting with people outside the war room. Then, macro insights. You don't have to do it all. You can if you want to. Often, it's hard to find an organization that's going to say like, yes, go ahead, spend all your time doing all of this. There are ways that you can start doing this, one by one. I will go through all of them throughout this talk. First, some caveats. Being successful at resilience is not easy. It's not fast, and it's not cheap. A lot of this requires just like selling this new way of working with resilience as a focus. Selling isn't our best skill as technologists. Maybe instead of selling, we should call it, presenting our data and our work and making a case for it. The good news is that I have seen this work, and we can definitely get many small wins along the way.

A Focus on Incident Response

Let's do our first area of focus. What does it look like to focus on incident response? When you focus on incident response, you're focusing on these three things. You're focusing on coordination, collaboration, and communication. It makes sense to start here. This is the thing that we are already paying for. We're already spending time in outages, so we might as well focus on them. When it comes to coordination, write up your current response workflow. How do folks come together to solve a problem? Look for any gaps. Where are things breaking down? What can be done to make those gaps even just a tiny little bit smaller? When it comes to collaboration, how do you get people in the same room? Is that room Zoom? Is that room Slack? Is that room Google Meet? How do you know who to call? Is it the same people every time? Maybe it's that one expert that wrote the thing 10 years ago. You should examine your on-call rotations and your expectations. Again, what are the little things that we can do to make the lives of those on-call engineers a little bit better, just a little bit easier? Then finally, communication. How do you communicate to your stakeholders what is happening? How do you tell your customers? What is hard about it? What are they asking for? Do your engineers know what your users care about, or are they going, "Kafka," when really people care about, can I get approved for my loan or not? Write up some loose guidelines to help manage expectations for folks both inside and outside the immediate teams that are responsible for responding to incidents.

While it makes sense to start with incident response, why aren't we perfect at it yet? Why aren't we putting in the effort into getting better at this? There are a few reasons for this, including that we need to train folks on the process. I think, oftentimes, we get a new engineer, we give them the pager, and then we say good luck. Every organization has different needs and different ways of working. When we need to get our services back ASAP, there are certain procedures that make sense. You want to teach folks the skills that help the incident move forward, that urgency. That like, don't come at me with, "If only we had done this, if only we had done that." That idea of hierarchy versus roles. You don't want people during an incident to focus on hierarchy, even though that's how they're used to working. Thinking about the roles that make sense for incident response, like who is the person in charge of this incident? Who is the person that's in charge of communicating? Who are your stakeholders versus who are your responders? Do they need to know different information? Do they need to act differently? Additionally, many of the changes in this area of focus will require specific tools and specific automations. Sometimes you have these tasks that lead to the highest cognitive load, and these things can be automated, but maybe teams don't have the right tools for this, or they just don't have the bandwidth to create these tools, to develop them, if they're stuck fighting fires.

Here's where I quote my friend, Fred Hebert. He actually wrote a great blog post for The New Stack about Honeycomb's incident response process. They basically are growing as a company, like a lot of our organizations are, and they were running into some issues with their incident response process. He said, "While we ramp up and reclaim expertise, and expand it by discovering scaling limits that weren't necessary to know about before, we have to admit that we are operating in a state where some elements are not fully in our control." Fred was trying to balance two key issues. They're trying to avoid an incident framework that's more demanding to the operator than solving the issue. We've all been there, where you have these runbooks that are so long, and it's like, this is taking me away from actually resolving the thing. You don't want your runbook to just take so much time and so much effort that it is keeping you from resolving your issue. You also want to provide support and tools to people who have less on-call experience, and for whom these clear rules and guidelines actually do make sense. I've seen this throughout my career. If I'm an experienced responder, I don't need to read all of those things. I don't need to follow all these procedures. If I'm trying to onboard somebody, they're not going to be able to read my mind. We worked together on a process that automated certain tasks, using our incident bot. Some tasks can be automated, like creating an incident channel or communicating status updates to stakeholders. Doing that leaves the engineers the time to focus on the actual engineering part.

At Honeycomb, as well as other organizations, the idea is to get quick wins, like just weeks into the process of trying to fix your incident response. You can do this by restructuring on-call rotations, making sure that the page is going to the right responder. Automating some response tasks, like I mentioned, creating Slack channels. Automating how you update some folks. This allows us to spend less time on the tasks that demand our attention during incidents, but aren't necessarily engineering tasks. Because you never ever want to automate the creativity and the engineering skills of responding. There are things that we can do to just reduce that cognitive load during the high stress times. Improving little things is going to help you build momentum and get buy-in around making larger changes, both from leadership as well as from the folks holding the pagers. If you think back to my example of my children, getting those small changes, like, let's switch to formula, allowed me to get some sleep so that then I could work on the bigger changes. A focus on incident response can help improve the process, leading to a better experience for your users and engineers. It leads to spending less time on repetitive tasks, easier onboarding for on-call. Just a more efficient process and lower customer impact, which is obviously what we care about.
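To make that concrete, here's a minimal sketch of the kind of non-engineering tasks an incident bot can take off responders' plates. Everything here is illustrative: the channel-naming convention, function names, and the status-update template are my own assumptions, not any specific company's bot.

```python
from datetime import datetime, timezone

def incident_channel_name(summary: str, opened_at: datetime) -> str:
    # Predictable names like "inc-2024-03-01-checkout-errors-spiking"
    # mean nobody has to invent one mid-incident.
    slug = "-".join(summary.lower().split())[:40].rstrip("-")
    return f"inc-{opened_at:%Y-%m-%d}-{slug}"

def stakeholder_update(status: str, impact: str, next_update_min: int) -> str:
    # A fill-in-the-blanks template, so responders aren't writing
    # prose for stakeholders while they debug.
    return (f"Status: {status} | Impact: {impact} | "
            f"Next update in {next_update_min} minutes.")

opened = datetime(2024, 3, 1, 14, 2, tzinfo=timezone.utc)
print(incident_channel_name("Checkout errors spiking", opened))
print(stakeholder_update("investigating", "checkout degraded for some users", 30))
```

The point is not the code itself, but that neither task needs the responder's creativity, which is exactly why it's safe to automate.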

Anti-Pattern: MTTX

Which leads me to my first anti-pattern that I will discuss, my version of MTTX. You'll hear me talk about MTTX a lot. By that I mean your mean time to discovery, mean time to recovery, mean time to resolution. I'm on the record as saying that they don't mean anything. Because if I say, we had 50 incidents in Q1, and they averaged 51 minutes, what is that telling us? It's not really telling us anything. That doesn't mean that I think we should just not care about how long our incidents last. If I wanted to get those Taylor Swift concert tickets, I wanted to get them now. I just think that a single number should not be our goal. What I do believe is that we want to make the experiences for our users and our engineers better, because that is actually what's going to help us get ahead. There are things that we can do to make the incident response easier and faster so that our engineers are better equipped to resolve them. Just like we will never be at incident zero, we're not in control of everything. That single timing metric just should not be the goal.
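A toy example of why that single number hides more than it reveals: two made-up quarters can report the exact same mean incident duration while describing completely different realities.

```python
from statistics import mean, median

# Two hypothetical quarters of incident durations, in minutes.
q1 = [5, 5, 5, 5, 5, 5, 5, 5, 5, 465]  # mostly blips, plus one 7.75-hour outage
q2 = [51] * 10                          # ten hour-ish incidents, all painful

# Both quarters report "our incidents average 51 minutes" ...
assert mean(q1) == mean(q2) == 51

# ... but the lived experience is nothing alike.
print(median(q1), median(q2))  # 5.0 vs 51.0
```

The mean alone can't tell you whether you have one catastrophic outage to investigate or a chronic stream of mid-size incidents, and those call for very different responses.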

A Focus on Incident Analysis

The next way that we apply resilience is when we focus on incident analysis, that's like your learning from incidents. After an incident, you want to learn from it, or about it. I believe that the best way to do this is this narrative-based approach that can highlight what happened, and can highlight how folks experienced the incident from their different points of view. How I experience an incident as a responder is going to be different from how my users experience it, from how my customers experience it, from how my stakeholders experience it. What we have seen is that those template-filling sessions, those 5 why's of root cause analysis, they can be helpful. Sometimes they're not helpful. Sometimes they can actually cause harm, because they make you feel like you're doing something. They give us this false sense of security, when in reality they're not doing much. There's more that you can do. I've done this myself. I've seen it many times. You have this root cause analysis session, and we say, the root cause is human error, so let's just never do that again. The action item is for that person to just never do it again. That's not much to get out of 1 hour with 15 engineers. If you have this blame-aware, learning-focused review, you can highlight different things. Maybe you can highlight that the new on-call engineer didn't have access to some of the dashboards that can help out with the response. Maybe you can highlight that you have a QA process that just doesn't account for certain requirements. Or my favorite, when you realize that engineering and marketing aren't talking to each other, and there's a new feature that they're announcing, and we just don't have enough capacity to handle it. Those are things that we can learn from and that can actually move the needle when it comes to resilience.

Why aren't folks doing this right now? Like I said, people think that they're doing this. People think that they're doing 5 why's of root cause analysis. That is not the same as a narrative approach. I think part of it is that when we think of a narrative approach, we usually default to timelines. Creating timelines is a lot of work. I've done this for many years. You have two screens, you have three screens, and you're copying and pasting. You got Slack here. You got a Google Doc here. Then you have to open GitHub and PagerDuty. You're switching from different data sources, and you're summarizing conversations, and sometimes stuff just gets lost along the way. Then if you're like, I don't want to do this. I'm not going to do this prep. You have your post-mortem. Like I said, you have like 15 engineers for an hour. Then you're spending that time just building that timeline. That's not really a good use of your time. Your people aren't actually talking and collaborating. Then this leads to folks just not trusting the post-mortem, the learning process. If you want to know more about how to do incident analysis, we actually put out a free guide for doing this. This is called the Howie guide, for "how we got here." Dr. Laura Maguire was one of the co-authors, as well as myself. Here we outline a 12-step process to help you best understand how you got to where you ended up. I laugh at the 12-step process, because it's long. You assign. You accept an investigation. You identify your data. You prepare for interviews. You write up this calibration document. You help wrap up your investigation. You lead a learning review. You do an incident report. You do more findings, and you share it. You do all of this. John Allspaw is amazing at this. I love it. I think it's great. This is a lot. You can do this or you can actually take from this and take the spirit of what this is, which really is just like a collaborative narrative approach.

If you break it down to the basics, you want to identify your data sources. You want to try and understand who was actually involved in the incident, not just the person who responded to the incident, but like, who were the stakeholders? Are there PMs involved? Are there marketing folks involved? Are your customer support folks involved? Where in the world were they? What were they dealing with? Like if I'm in Chicago, and I have somebody in Sydney, we are interacting differently because we're at different parts of the day. Where did it take place? Were people in-person? Were they in Slack? Were they looking at each other face-to-face? Then you want to prepare for your meeting. You want to create your timeline. Usually this is where I go in Slack, I create a narrative timeline. I jot down any questions that I have for people. I can have an interview, or I can have those questions be at the review meeting in front of everyone. You can cheat. You can look at the top moments, but really, the key moments are the ones where people don't know what's going on, don't know what to do. Then you have your meeting. When I'm leading a meeting, or when I'm doing an interview, it should never be the Vanessa show. It shouldn't be like this, where I'm here and I'm lecturing you all. It should be a collaborative conversation. People should be able to tell me what they experienced from their own point of view. Like, "I'm a customer support person and I was getting inundated with requests from our users." That's impacting how I experience the incident. Then you finalize your findings. You finalize them in a format that you can share with others, that others are going to understand. Depending on your audience, you will want to share different information with them.
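The mechanics of that timeline-building step can be sketched in a few lines: pull events out of each tool, tag each one with who and when, and merge them into one chronological narrative. The event shapes and tool exports below are hypothetical stand-ins for whatever your Slack, PagerDuty, or deploy data actually looks like.

```python
from datetime import datetime

# Hypothetical exports from three data sources; real ones would be API dumps.
slack = [{"ts": datetime(2024, 5, 1, 14, 12), "who": "ana", "text": "checkout 500s climbing"}]
pages = [{"ts": datetime(2024, 5, 1, 14, 10), "who": "pagerduty", "text": "page: checkout-api error rate high"}]
deploys = [{"ts": datetime(2024, 5, 1, 14, 2), "who": "ci", "text": "deployed checkout-api v312"}]

def build_timeline(*sources):
    """Merge events from every source into one list, ordered by time."""
    return sorted((e for src in sources for e in src), key=lambda e: e["ts"])

for e in build_timeline(slack, pages, deploys):
    print(f"{e['ts']:%H:%M}  {e['who']:>9}  {e['text']}")
```

Even a merge this simple surfaces the narrative: the deploy landed eight minutes before the page, and a human noticed two minutes after that. The hard, valuable part is still the human one, asking people what they were seeing and doing at each of those moments.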

We want to make it as easy as possible for people to do this. Doing a narrative-based approach can take weeks, or it can take 20 minutes. Here's a narrative builder, which we have in Jeli. The idea is that you take your different data sources, and you create this narrative. Here, you can see that an incident is never like, we released this bug, and we reverted the bug, and now we're done. You can understand that maybe the reason why the incident lasted as long as it did was because some people realized halfway through the incident that they didn't have the full picture of what was going on, because they were looking at different dashboards. Knowing that can lead to actual learnings and actual change in how we do our work, which has a direct impact on engineers meeting their goals. Engineers not looking at the right dashboards has nothing to do with the bug, but it has a lot to do with how we respond to incidents. It's really important to give people the ability to create these documents, to create these artifacts that they can then share with others, to give them the ability to share how the incident happened from their own points of view.

Then we talk about action items, because, historically, we all think of retrospectives as a source of action items. I really believe that the action items should reflect the insights that we gained from the narrative and from the incident reviews. That it should reflect what we learned about the contributing factors and the impact. The action items shouldn't be just like, revert this bug and be done with it. We've seen this in the wild. We've seen some examples of people that I've worked with. At Zendesk, they had an incident that highlighted the need to just rethink their documentation and the information that's in their runbooks. They did this narrative builder exercise, and they realized that the responders just don't have access to the right things, or they're working with outdated information. I've been there as a responder as well. I've been there where I pull up a wiki page, and it just has things that worked two years ago. You're understanding what we can do in the future to make things better for the engineers who are solving incidents today.

At Chime, another organization that we work with, they did an incident review, and they realized how some of the vendor relations impact the way that an incident is handled, and used that to improve the process. Because at the end of the day, it's important to know how to get a hold of a vendor. Sometimes the people who are on-call are not the ones that either know how to find that person or have the access to do that. Even knowing what parts of the process belong to vendors versus us makes a huge difference next time that we have an incident. Again, we're never going to be at incident zero, so these action items are helping us solve incidents in the future. A focus on incident analysis can identify the areas of work that are going to lead to engineers working effectively to resolve issues, so that the next time that they encounter another incident, similar or otherwise, they'll be better positioned to handle it, leading to just lower customer impact and fewer interruptions.

Anti-Pattern: Action Items Factory

Then my next anti-pattern is this idea of the action items factory. When I talk about resilience, I just want to make sure to stay away from the anti-pattern of being an action items factory. The idea that we have an incident and a post-mortem, and all we do is just play Whack-a-Mole, and like, let's have an alert for this, an alert for that, an alert for that. We're going to put them all in this Google Doc and this Jira thing that's never going to get prioritized, and it's going to be there forever. Because that's not good. It erodes trust in the process. I remember when I started at a previous organization, I'd go in there. We have 100 incident reports open because they all have 10 action items that are still open from 5 years before. I was still in college back then. If the goal of the incident review is to come up with fixes, but the fixes never get done, then engineers and stakeholders aren't going to take them seriously. Instead, they're going to feel like the retrospectives are just a waste of time for a group of people that are already really busy. Instead, we should look at resilience that's coming out of incidents as part of different categories.

What's the problem with the action items today? What we hear often is that action items just don't get completed. If they do, they don't move the needle. That's like, let's just play Whack-a-Mole and create more alerts. Another problem is that they're not applicable for the whole organization. Maybe it makes sense for one person, but it's not going to get done. It's not going to actually move the needle. It's almost as if finding the right action item is like a bit of a Goldilocks situation. When we think of action items that are too small, I think they are things that shouldn't really be action items in the first place, because they're already being taken care of. Engineers love solving issues. I was trying to troubleshoot my newborns. They shouldn't wait for the post-mortem to address these issues. They shouldn't wait for the post-mortem to do a cleanup, or fix a bug, or something like that. We shouldn't spend our precious post-mortem time on these. I do think that folks should still get credit for them. I think they should still be included in the artifact, because when we have an incident that is similar to that, again, we're going to want to look back and try to see what we did.

Then we have the too-big action items. The problem here is that they're going to take just too long to complete. You have action items that are going to take whole years to complete, and by then we might not even have that product anymore. If you're tracking completion, which a lot of companies do, that's going to skew your numbers. We can talk more about numbers and stuff like that later. The other problem is that the people who are in the meeting, the people who are in the post-mortem, aren't the ones who get to do the work, aren't the ones that get to decide. Maybe it's a cross-team initiative, or maybe I say I'm going to do this, but my manager is never going to give me the time to do it. These large action items are still important takeaways from the post-mortems. In that case, I actually recommend that you keep them and you call them initiatives, and you just have them in a separate space in your artifact. Or, rethink the scope. Instead of the action item being like, let's rearchitect this thing. It's like, ok, let's have a conversation about what it would take to rearchitect this thing. Or like, let's have somebody drive the initiative. Another tip that I have is to always have an owner for your action item. This actually really helps with accountability. I actually recommend having a direct owner as well as an indirect owner. This is a hot tip that I did at a previous organization where, if I had the action item assigned to my name, my director also got it assigned to their name. Then eventually, they feel embarrassed for having so many action items open, and they're like, "Ok, Vanessa, how about you get the time to do them."

What does the perfect action item look like? It's usually something that needs to get prioritized, and that will get prioritized. It's something that can be completed within the next month or two; you don't want it to last forever. Again, that's an initiative. It's something that can be decided by the folks in the room, which is also a good reason to invite a diverse group of people to these reviews: you want your managers, you want your product managers. It's something that moves the needle in relation to the incident at hand. I really like this chart, mostly because I created it. It shows the difference between all of the things I mentioned: a quick fix, like your cleanups; an action item, like redesigning a process or updating documentation; your larger initiatives, like switching vendors or rearchitecting your systems; and then a transformation that can actually take place out of an incident, like org changes, headcount changes, things like that.
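To make those criteria concrete, here is a minimal sketch of what tracking an action item with a direct owner, an indirect owner, a due date, and a link back to its incident might look like. The `ActionItem` structure and all names in it are hypothetical illustrations, not any specific tool's schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ActionItem:
    """A post-incident action item, per the criteria above (illustrative only)."""
    description: str
    owner: str            # the person who does the work
    indirect_owner: str   # e.g. the owner's director, for accountability
    due: date
    incident_id: str      # ties the item back to the incident it came from

    def is_right_sized(self, today: date) -> bool:
        # An action item should be completable within a month or two;
        # anything longer is an initiative and belongs in a separate space.
        return self.due <= today + timedelta(days=60)

item = ActionItem(
    description="Redesign the on-call escalation process",
    owner="vanessa",
    indirect_owner="vanessa's director",
    due=date(2024, 4, 30),
    incident_id="INC-1234",
)
print(item.is_right_sized(today=date(2024, 3, 14)))  # → True
```

Having the indirect owner as a first-class field is what makes the "my director gets it assigned too" accountability trick visible in reporting, rather than relying on memory.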

Let's go into an example of how to get there. Here, you can see that we are creating a narrative. We're looking at a specific line in the Slack transcript. We're asking a question, in this case, about service dependencies, and how they relate to engineers understanding the information that's in their service map vendor. I've been in incidents like this, where you realize that you have the service map vendor, and they have an understanding of how things work, and we have a different understanding, and things just don't match up. Here you can see how that question made it into a takeaway around what the service touches and what it depends on. Finally, in the report section, you can see that there are action items around the understanding of the dependency mapping. You can see that these action items meet the requirements that I said earlier. They have ownership, they have due dates, and they move the needle in how we respond to incidents in the future.

A Focus on Cross-Incident Insights

Finally, we have a focus on cross-incident insights. Cross-incident insights are the key to getting the most out of your incidents. You will get better quality cross-incident insights the more you focus on your individual high-quality reviews; it's really a great evolution of that process. The more retrospectives and post-mortems you have, the more data you can then use to make recommendations for larger scale changes. Doing this shouldn't be one person's job. We're going to talk a lot about collaboration; we've talked a lot about it the past few days. This is done, I believe, in collaboration between leadership, multiple engineering teams, and product teams. It's your chance to drive those cultural transformations that I mentioned earlier. Maybe you're deciding that you want to bring DevOps into part of your process, or you want to rearchitect a system, because you have the breadth of data and the context to provide leadership for future-focused decisions. We spoke earlier about having goals around metrics. Doing cross-incident analysis allows you to provide context around those timing metrics and make recommendations for how to improve the experience for our users and our engineers.

Why aren't folks doing this? A lot of people want to do this, and they actually think that they're doing this, but they aren't, because they just don't have the data. High quality insights come from high quality individual incident reviews. It's very hard to get good cross-incident analysis if you're not doing actual incident analysis. Also, the places where our incident reports live are not friendly for data analysis. Google Docs are great for narrative storytelling, but they're not easily searchable or queryable. Most importantly, and this is something that is close to my heart, a lot of people think that they can be analysts, but engineers are not necessarily analysts. That's ok. If you want to do this work, you need training in presenting data in friendly formats. You need to know how to use the data to tell a story. All of this is hard work.
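One way around the "incident reports live in docs" problem is to capture each review's key fields as structured records alongside the narrative. A minimal sketch, where the field names and incident IDs are invented for illustration:

```python
# Each incident review keeps its narrative in a doc, but the key fields
# also land in a structured record that can actually be queried.
reviews = [
    {"id": "INC-101", "duration_min": 42, "services": ["checkout", "payments"]},
    {"id": "INC-102", "duration_min": 180, "services": ["search"]},
    {"id": "INC-103", "duration_min": 15, "services": ["payments"]},
]

# Now "which incidents touched payments?" is a one-liner,
# instead of a manual search through a pile of docs.
touched_payments = [r["id"] for r in reviews if "payments" in r["services"]]
print(touched_payments)  # → ['INC-101', 'INC-103']
```

The narrative doc still does the storytelling; the structured record is what makes cross-incident analysis possible later.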

Here are some examples of cross-incident insights that come with context. This is from Jeli's Learning Center. Instead of simply saying we had 300 incidents this year, we can take a look at the technologies that are involved and make recommendations based off of that. I can say that in the past year, something related to console was involved in most of the incidents, so let's take a look at those teams and maybe give them some love. Maybe help them restructure their processes, focus on their on-call experience.
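That kind of per-technology rollup can be sketched with a simple counter over structured incident records. The technology names and counts here are made up to mirror the example above:

```python
from collections import Counter

incidents = [
    {"id": "INC-201", "technologies": ["console", "kafka"]},
    {"id": "INC-202", "technologies": ["console"]},
    {"id": "INC-203", "technologies": ["postgres", "console"]},
    {"id": "INC-204", "technologies": ["kafka"]},
]

# Count how often each technology shows up across all incidents,
# to see which teams might need some love.
counts = Counter(t for inc in incidents for t in inc["technologies"])
print(counts.most_common(2))  # → [('console', 3), ('kafka', 2)]
```

The raw count isn't the insight by itself; it's the starting point for a conversation with the teams behind the most-involved technologies.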

Make those recommendations to make their lives better, if we want to lower that number. Again, this stuff works in the wild. We worked with Zendesk. They were able to look at their incidents and make recommendations around the rotations of certain teams and technologies. Had they not had the data from individual incident analysis, making this case would have been a lot more difficult. We have our friends at Honeycomb as well; they use this information to drive continuous improvement of their software. We've seen people use these insights to decide on a number of things: feature flag solutions, switching vendors, even making organizational changes from cross-incident analysis. I actually was part of a reorg thanks to that, where I was like, our SRE team has so much stuff going on, let's try to split things up, let's just experiment with something. These are the larger initiatives that we discussed. All of this creates an environment where engineers are able to do their best work and achieve the goals of their organizations.

Now let's go back to MTTX. Here's the part where I tell people, "I will never tell an analyst to tell their CTO that the metric they're requesting is silly, and that they're just not going to do it." Because I've been that engineer, I've been that analyst, and I did not get paid enough to have that conversation. When you have leadership asking for these metrics, take it as an opportunity to show your work. I call this my veggies in the pasta sauce approach. Now that I have children, I should probably call it the veggies in the smoothie approach. It's the idea that you're taking something that maybe doesn't have much nutritional value on its own, that single MTTR number, like 51 minutes, and you're adding context to make it much richer, to bring a lot more value to your teams and your organization. It can help you make strategic recommendations about the data that you have. It really helps move the needle in the direction that you want. You can show that maybe incidents lasted longer this quarter, but the teams that had the shorter incidents have some really cool tools that they've been experimenting with. Maybe we should give other teams access to those tools and spread them across the organization.
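The veggies-in-the-pasta-sauce idea, reporting the MTTR number leadership asked for but alongside the context that makes it useful, can be sketched like this. The team names and resolution times are invented, chosen so the overall mean lands on the 51 minutes mentioned above:

```python
from statistics import mean

# Resolution times in minutes, grouped by the team that handled each incident.
resolution_times = {
    "team-a": [20, 25, 30],   # team-a has been experimenting with new tooling
    "team-b": [60, 90, 81],
}

all_times = [t for times in resolution_times.values() for t in times]
print(f"MTTR overall: {mean(all_times):.0f} min")  # the single number leadership asked for
for team, times in resolution_times.items():
    print(f"  {team}: {mean(times):.0f} min")      # the context that makes it useful
```

The single number (51 min) hides a 3x difference between teams; the breakdown is what turns the metric into a recommendation, such as spreading team-a's tooling across the organization.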

Anti-Pattern: Not Communicating the Insights

That leads me to the last anti-pattern, which is, again, something we're not very good at as engineers: communicating our insights. Because if we have insights but nobody sees them, what's the point? In Jeli's Howie guide to incident analysis, when discussing sharing incident findings, we explain that your work shouldn't be completed to be filed; it should be completed so it can be read and shared across the business, even after the learning review has taken place and the corrective actions have been taken. This is true of individual incident analysis, but even more so of cross-incident analysis. I've been an engineer my whole life, and I know how we are. We're like, we all know what we need to do, we just are not doing it, senior leadership would never go for it. Yes, but do we all actually know that? Have we told other people this? Often, when people don't interact with our learnings, it's because the format just isn't meant for them. Different audiences learn different things in different ways. If you're sharing insights with your C-suite, maybe your CTO is going to understand the extremely technical language that you're using, but everybody else is going to be looking at their phones. You need to use language that your audience is going to understand. You should focus on your goal: why are you sharing this? Think of the impact. Don't spend too much time going through details that just aren't important to your goal. If you want to get approval to rearchitect a system, explain why you think it's worth doing, what problem it's going to solve, and who should be involved.

Focus on your format, whether that includes sharing insights, sharing numbers, telling a story, and sometimes those technical details. I usually like to lead with my top insights. For example, I will go up to the CTO and say, "Incidents went up this quarter, but usage also went up." Or the example that I mentioned earlier: SREs were the only on-call incident commanders for 100% of our incidents. There's your number. As we grow, this is proving not to be sustainable, so I think we should change this process and get product teams involved. That's how you move away from, we all know what's wrong and nobody's doing anything, to actual changes. I've had this experience before: getting product teams involved alongside the SREs made a real change. I've rearchitected actual CI/CD pipelines by going up to people and saying, "So many of our incidents, the root cause isn't how we release things, but it is impacting how long it takes us to resolve them. After talking to all of these subject matter experts, I think we need to have a project this year where we rearchitect this whole thing. I know it's going to take some time, I know it's going to take some money, but it's necessary if we want to move forward." Because you're not suggesting this 4-month project because you feel like it; you're suggesting it because you have the data to back it up. If you're asking me for these MTTR numbers, I'm going to give them to you, but I'm also going to use that as an opportunity to give you the context, and to give you the recommendation of the thing we all know we need to do.

Going back to this chart: instead of thinking of action items, or just recovering from downtime, as the only output of your resilience work, let's think of these as the product of treating incidents as opportunities. Doing this work is how organizations get ahead: how we make org changes, and headcount changes, and transformations. Because we will never be at zero incidents; technology evolves and new challenges come up. By focusing on resilience, on incident response, incident analysis, and cross-incident insights, we can lower the cost of incidents. We can be better prepared to handle them, leading to a better experience for our users, and a culture where engineers are engaged and have the bandwidth to not only complete their work, but to creatively improve things. That's why we're hiring engineers and paying them a lot of money: to achieve and surpass our goals, which means that people can make more money.


Recorded at:

Mar 14, 2024