Transcript
Kyle Lexmond: I'm sure most of us, if not all, heard about the AWS outage in October, or the Azure outage a few days later, or Cloudflare outage Monday. I just want you to think like, what did you do during that vendor cloud outage? How were you affected? Were you on call? Were you pulled into your own incident management process? I imagine some of you were probably able to say, ok, it's just the vendor. We just got to wait. Others were probably like, we have DR strategies. Let's apply them and fail over to somewhere else. I just want you to think back to what you were feeling in that situation. Then imagine, what do you think the incident process was like at the respective vendor? I'm Kyle. I'm a production engineer at Meta. I'm going to annoy the company recruiters a bit by saying my role is roughly analogous to a site reliability engineer role. They're very particular about that, but just think I'm here as an SRE. Today I want to talk about incidents, and particularly incidents that leave a mark on you, stick with you.
I'm focusing a bit more on the human side of incidents as well as talking about some things that I've seen that make experiencing or surviving an incident a bit easier. I've been working in tech for a while. I put down on my notes, I've been doing tech work for money since 2010. I got into SRE at Twitter in 2013, and I've seen my fair share of incidents. In going over this talk, Lorin asked me to describe what it's like to be in the room during a big incident. I'm not too sure I'll be entirely successful, but I'm definitely going to give it a shot. My talk today focuses on my personal experience of living through a bunch of incidents. I'm going to talk to you first a bit about my thesis on incidents, which is derived from my experience. Then I'm going to walk you through two incidents to try and give you that sense of being in the room. I'm going to close with, as I mentioned, some things that will hopefully make living through incidents a bit easier. This talk will have no numbers, no dates, no company names, except where that's in a positive light. This is partly because the companies themselves don't matter. As I said, I'm focusing on the human aspect of the incidents. I want to talk more about the timeline and my personal experience.
What is an Incident?
First off, how do you define incident? I personally think the term incident is a bit unnecessarily loaded. Though I will say given a choice of having an incident and not having an incident, I will prefer not to have an incident. That said, as an SRE archetype, I can usually point out the Google Site Reliability book as a reference for most things[cite: 1]. When I was preparing for this talk, I was like, surely, they must have a definition of incident. I was surprised, no. They don't actually define what an incident is. It very much seems to be treated as a you'll know it when you see it. To me, I have three factors that make up an incident. First, it's a business impacting event. There's some metric that's actively moving in the wrong direction. Secondly, this event requires an immediate response.
Either the metric is an early warning thing and you want to jump in and fix it before stuff gets worse. Or some exec has said, this is something you need to monitor. If it drops below this threshold, fix it. Finally, the event needs to be mitigated so the impact stops and everything goes back to normal. For those new to incidents, I will point out, the last bit is crucial. You're not fixing, you're not solving the entire problem. You're just mitigating it so the customer impact goes away. I will say, though, that that differs from an incident that I care about. That's a very corporate definition, where it's like the business as a whole cares about incidents. There's an additional factor for incidents that I personally care about, and that's an incident that I or my team are on the hook to go and mitigate.
Personal Thesis on Incidents
That brings me to my first point about being in the room. You're going to feel a lot of pressure during an incident. My distinction between incidents effectively comes down to how much pressure I felt to mitigate it. Roughly, you can think of this as like, pressure is going to be proportional to both the severity of the incident and how much I'm on the hook to actually go and fix it. This pressure comes in multiple areas. There's going to be things like personal pride, "My service is down. No, I've got to go fix it." There's a bit of a professional aspect, other people are depending on my service, which is down. I don't want to disappoint them. I should fix it. There's also the company level, where major incidents are generally bad optics for the company. This leads me to like, here's my thesis about why I care about incidents, generally that I own.
I like to think that I take incidents seriously. This is partly self-interest, because I believe that there's something to learn about in any incident, particularly around my mental model of the system. Did the system react the way I expected under these conditions? Yes, awesome, I have more concrete evidence. No, ok, I've got to go change my mental model of the system, update it to match. Here's what actually happened in reality. Another part of like, ok, yes, it's also the professionalism aspect. There's the business case. Yes, incidents look bad to the business, but my service, a bunch of other people internally and externally depend on it. I like to make people's days as good as possible. My service being down does not contribute to that. I would expect that most people have similar motivations for caring about incidents themselves.
I also think that your motivations will change over time. As a new grad, fresh in the industry, I placed a lot more weight on the personal pride aspect. Where it's like, no, my service is down. Ok, I've got to log in and check on it. This is a path to burnout, personally. As I became more senior, the weight that I was putting on that as a source of pressure tailed off.
Being in an Incident
With that in mind, I'm going to move on to the part about being in the actual incident itself. As you might guess from the fact that I mentioned pressure, that's going to be the main experience through this. In a high-pressure situation, you're tense, you're already stressed. Thinking back to the cloud outages, how did you personally feel if you were impacted by them? Again, what do you think it was like in the vendors' war rooms? Going in through all this, recall that your primary goal in the incident response is mitigation. Get the incident neutralized. You're going to be thinking a lot about like, how does the system work? Are there going to be feature flags where you can turn off broken features? If another system that you depend on has fallen over, can you route around it? How quickly can you get to the point where there's no impact to your customers? Crucially, can you do all this under pressure with a lot of people looking at you?
Incident 1 - A Rolling Overload
Let's talk about incident 1. This first incident has actually shaped my view on incident management, in part because it happened fairly early on in my career, in part because there was a lot that went wrong. As a bit of background, this is an actual supporting service for a product that does a few million QPS. My service can be turned off without impacting reliability of the main service, but the user experience degrades a fair bit. My story starts early in the morning, 9 a.m. This is relevant because I'm on the East Coast for once. I'm visiting friends. The majority of the team and the person who was on call are in Vancouver or West Coast. I see an alert come in on IRC. I'm junior. I'm still focused a lot on the systems. I'm up.
Not necessarily on PTO, but saying, if I work, I count this as a work day, so I don't need to take PTO. I'm online in the morning. I'm on IRC. I see our alert come in. I know the on-call is about to get woken up. How about I just help triage it a bit and see what's going on? I go and investigate, and it's like, ok, about half the machines in an AZ are down. It looks like a power outage. I know how to fix this. We've got a runbook for this. I pull it up and start running it. An hour later, I've gone and followed the runbook, but the same AZ keeps failing about five minutes after I think everything's been fixed. I try it again, the same thing happens. This is the third time it's happened. Clearly, it's not a power problem. At that point, I say, this is probably an incident. Let's declare it. Because it's an incident, people start seeing it in our incident dashboard, and they come in.
We've got people from the kernel team. We've got people from the automation team. We're digging in. We're checking dmesg on the servers. Why are they powering off? Do we see anything in the kernel logs? Do we see anything in the metrics that says, here's the point where it just disappears, stops reporting. What was it doing immediately before that? We start digging in and trying to figure out what makes this machine special? Why are only these machines having a problem? One of the things that we don't realize is this problem is actually spreading. The rough number of machines that are failing is constant, but it's different machines that are failing. An hour later, at this point, I've got time pressure. I have to head for the airport. The on-call come in, ask me a bunch of stuff in the meantime, figuring out, what's going on? What are we doing? Why is this going on? We're still trying to go through the problem.
There's no particular handoff moment. I'm walking to the subway. I've got my phone out. I'm reading IRC, seeing what's the latest, responding to people. I lose network connectivity for a bit, going through a tunnel and come back out. Ten minutes later I get the rest of the messages off of IRC, and it's like, I need to respond to that one and that one and that one. You can see the time pressure here. I'm on the move. I've got to get to the airport, go through TSA. On-call is still like, you've been in this incident for 90 minutes at this point. Where's the progress? We haven't made any progress at this point. We're still stuck in our investigation steps. Fast forward a bit. I'm through TSA. I'm at the airport. It's to the point where there's a final boarding call for my flight. At this point, I'm on BlueJeans.
More people have joined the call, trying to figure out what's going on. The impact has spread. It's becoming very obvious. I actually get to the point where it's like, ok, I have to go to the airline desk and tell the assistant there, "I'm really sorry. I'm on a very important call. Can you uncheck me in from the flight?" Literal years later, a colleague and I are talking about this incident, reminiscing, and he remembers that I was actually at the airport during this. It's like, yes, years later, I still wonder what the reaction the check-in assistant, like, what did she think I was doing? I hope I wasn't the only person that she's seen go up to her after security and say, "I'm sorry, I've got a call. Uncheck me in." I'll never know. I'll probably never see her again.
Years later, this is an example, it's like, the kind of small things that stick with you during an incident that just stand out in your memory. On the incident side itself, we still haven't solved the problem. We've realized now that it's not just this one AZ, it's actually multiple AZs in this region. Surprise, we have three more regions that are also suffering this problem. At this point, it's like, we just keep on bringing in more and more people and trying to figure out what's going on? Why are we having this problem? At this point, given all the people that have come in, I'm sitting on the sidelines at this point. I'm still responding to people because I was the first responder and being able to say, like, "Yes, no, we tried this back here a few hours ago. We did try fixing the automation that was trying to reboot the machines that was rate limited because there were too many machines to reboot. We fixed that an hour ago."
I'm still sitting here doing a bit of direction, trying to link people with stuff that we've done. Another two hours go by. At this point, I'm exhausted. I'm nervous. It's been six hours since the first alert fired. It's been five hours since I said, ok, let's declare this as an incident. We haven't fixed this problem. Our success metric by now is hovering around 40%. We do understand the problem now. These are rolling overloads, and just the sheer load of traffic, as new servers come back up and recover, they crash again because of just crushing load on them. We understand the symptoms, but not the cause. My enduring memory from this moment is I'm sitting on one of the bench seats at the airport, and I'm thinking, we are the experts in this system.
We've got all the people here who built the system, and we don't even understand the system well enough to know what is going wrong. That is the clearest memory that I have, and it's sticking with me to this day, is that moment where it's like, what do we do? People were talking about the nuclear option. This is a rolling overload. We've tried to cut down on load. We've tried to firewall off. We've said drop client requests. None of this has worked. We are not seeing recovery. The nuclear option here is to turn it off, firewall it from any requests, bring it back up, and then slowly remove the firewall. With any distributed system, the fact that this is very sensitive to time, where it's like slowly remove the firewall, that does not inspire confidence. Given that, yes, this is the very nuclear option, people are thinking, is there any better option?
What can we do? Now is the time to throw stuff at the wall, see what could possibly fix this before we take this final step. Fast forward another two hours. We've practically given up at this point. We've firewalled off two regions and brought them back up, and they have both failed. The incident manager at this point saying, "We have go/no-go. We have to fix something. We've been trying incremental stuff, and it hasn't been working." Out of nowhere, one of the original people who wrote the system, who had changed teams and was brought back in on this call because he knows part of the system, he has an idea. He goes and fiddles with the health checks to say, ok, just always report healthy. Do not say that you're overloaded, do not say anything else, just always state that you're healthy. We're going to force the traffic to be distributed this way, evenly across machines.
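As a rough illustration of why that health check change can matter, here is a minimal sketch. It assumes a load balancer that only routes to servers whose check passes; `HealthCheck`, `force_healthy`, `max_in_flight`, and `routable` are hypothetical names invented for this example, not the actual system from the incident, and the talk does not describe how the real fix was implemented.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Hypothetical per-server health check consulted by a load balancer."""

    max_in_flight: int           # above this load the server reports itself overloaded
    force_healthy: bool = False  # emergency override: always report healthy

    def is_healthy(self, in_flight: int) -> bool:
        if self.force_healthy:
            # Mitigation mode: never drop out of rotation because of load.
            return True
        return in_flight <= self.max_in_flight


def routable(checks: dict[str, HealthCheck], load: dict[str, int]) -> list[str]:
    """Return the servers the balancer would still send traffic to."""
    return sorted(name for name, check in checks.items() if check.is_healthy(load[name]))


checks = {name: HealthCheck(max_in_flight=100) for name in ("a", "b", "c")}
load = {"a": 150, "b": 150, "c": 50}

# Two servers report overloaded, so all traffic lands on the one survivor,
# which is then likely to tip over as well.
survivors = routable(checks, load)

# With the override, every server stays in rotation and the load is shared.
for check in checks.values():
    check.force_healthy = True
all_servers = routable(checks, load)
```

In this toy model the override trades accurate health reporting for an even spread of traffic, which only makes sense as a temporary mitigation.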
Still, to this day, I don't know how he had that idea. I don't know if he found it in like, here's another incident report, Google had a similar incident a while back, and he was able to say, "That sounds very similar to what we have. How did they fix it? They fixed it like this. Let's try applying that." That might be what happened. It might be he knew the system well enough and just didn't connect the dots until half an hour into the incident when he got pulled in. Whatever happened, however he got to that point, we spent about 15 minutes implementing the fix. Push it out, let the auto-deployment go, and we start seeing recovery within five minutes. At that point I'm still in the airport, I'm still on this bench seat, and I just sag. Like, we're done. Thank God. There's still a lot of cleanup to do, but again, the focus here is we mitigate the incident.
We are able to get to the point where we say, we've roughly solved the problem. There's no more impact to our downstream users. We see our successful request metric start climbing back up to 100. We call the incident mitigated. Everyone is like, ok, this was 8 long hours, let's not do this again, we'll pick this up tomorrow morning. It stays mitigated overnight. I finally get on a new flight that I just freshly booked, and I think it was literally the last flight out of the airport, and I'm exhausted by the time I get home. I think I slept in until 11 a.m. the next day. That's a preview of the toll of a long-running, high-pressure incident, and how your body physiologically reacts. Reviewing this incident, what happened up to the point of mitigating it, the big thing here is the pressure was horrible.
Again, the defining trait of this incident for me was like, we as the service owners were practically paralyzed by our own confusion. Again, thinking like, no, we're six hours in, and we haven't solved this. Are we actually all just clowns? Do we really know what we're doing here? There were a few things that definitely could have been done better. I was the first responder. That was fine. I was trying to juggle things as well as investigate. What I should have done was say like, this is clearly becoming more serious. We have a proper incident manager. Go and pull them in. Let them try to do this delegation work. Have someone else who's not doing an investigation focus on being the nexus for communication. I also showed a lot of what's popularly termed get-there-itis.
I'm focusing entirely on the task at hand, ignoring the situation around me. I lacked situational awareness. I missed the fact that different servers were failing. I missed the fact that we had another AZ failing. I did catch another region failing, but that was only because someone else told me like, two AZs down, a third one is going. What's going on? I was like, it's spread. Why wasn't I aware of that? I lost the situational awareness. I was under a bunch of time pressure as well. I needed to get to the airport for my flight. I should have explicitly handed off to the on-call and said, "I need to cut and run. Here's what I've done. You need to carry it forward from now on. Start pulling in everyone you need." I also didn't escalate as far and as fast as I could have. That delayed getting the right people in on the incident.
The fact that we had the original designer of the system have a fix in 45 minutes, if I had escalated up to him in the first two hours of the incident, would it have been an 8-hour incident? I'm inclined to say probably not. Unfortunately, I'll never know. As well, in the midst of everything, the actual on-call asked if he should pull in more people. It was like, yes, you should. Why isn't that the default? If you don't know what's going on, aggressively pull in people and try to figure out what's happening. The other thing was like, there was also no shared state doc. I spent a lot of my time responding to people like, yes, no, we tried this 90 minutes ago. We tried this two hours ago. This was the state at the start. This is how it's degraded now. Here's the trendline.
Every time someone joined the incident call, we had to quickly recap what was going on. Finally, out of all this, the one thing which I will emphasize here is empathy for the people responding. As I said, this is my first big incident. This is the first long incident that I had. We didn't know what was going on. We didn't even find out the fact that it was a rolling overload until five hours in. You can argue that, yes, we probably should have. There was a whole lot of stuff that we were investigating. There was no clear root cause until people were able to tie bits, separate data points together, and to say, "This is what's happening. This is why we're seeing the rolling overload." That's the first incident: high stress, high pressure. Even right now my heart rate monitor says it's spiking.
Incident 2 - Metric Failure, for Load Balancing Traffic
Let's move on and talk about a second incident. This is a somewhat fun one, because I indirectly caused the incident. The good news is that it's a much shorter timeline. This happened months after the first incident. I had improved. I knew more about what to do. Before this starts, I approve a team member's pull request. We're trying to migrate between frameworks. This migration is the last of six services. We've done five successfully, this is the sixth one. We did a bunch of prep work, and this is the final change where it's a few lines of code to actually do the flip. I look at this, and I was like, yes, this is an old framework. We're not experts on it anymore. Something's probably going to go wrong.
The only time we're going to know it goes wrong is when someone else contacts us. What I was thinking at that time was there's going to be someone depending on some small different behavior of this framework. We've done our tests. Things seem to work. It worked fine for the last five. Let's just go ahead. I was a bit prophetic. Colleague goes and lands the PR, merges his change. Kubernetes auto-deployer picks it up and applies it. About 10 minutes after that, the global load balancer team declares an incident because the metrics they use for load balancing traffic between regions have disappeared. It's not they're being reported at zero. It's just they've stopped being reported at all. What does that mean? In their system, that means a fallback to the defaults.
Now some regions are overloaded with traffic, and some regions are sitting idle. They open an incident. I am now subscribed to the incident feed at the company. I see the incident go by. I was like, that's an adjacent team. That sounds like the service that we just migrated, the frameworks. I wonder if it's the PR I just approved? I join the incident chat and link the PR, and say, I'm going to revert this just in case. Some time goes on. One bad thing here is I'm not actually able to revert this myself. There's been tooling changes. I am not familiar with the new tools. I try a bit and don't make any headway. I go back to the incident chat and I ask for help. Unfortunately, there's a flood of messages, and my request gets missed.
After that, I look, and it's like, I know this person. He roughly works in the source control space. I tag him in my next message, @person, can I get help reverting this PR? Awesome. He manages to act on it. He pushes the revert through. Again, our auto-deployer just picks this up and applies the change. Our systems go back to normal. We're perfectly mitigated. It's like in the span of half an hour, we went from detection, incident declaration, getting pulled in, and then solving it. There are still a few things here that could go better. In the immediate aftermath of this incident, this was actually a decently large internal outage. The Seattle production engineering org actually ordered both the person who wrote the PR and me, ordered us our favorite drinks, and got it delivered to our houses.
It was definitely a bit of a camaraderie bonding moment, where it's like, a bunch of people were impacted. We all came together. It's like, yes, we're going to show empathy for the people who, at the very least, partially caused this incident. The good thing is that, with the exception of my delay in managing to revert the PR, this was actually a positive example of an incident. Between the speed, the fact that like, ok, they were able to detect this very quickly, they declared the incident very quickly, and the right people got pulled in, either by monitoring the chats themselves or through the appropriate escalations, it was fast. It's not the fastest mitigation I've seen, but at 30 minutes, it's up there. Post-mitigation, the global load balancer team actually went back and tried to investigate their service and say, how can we be more resilient to these failures?
How can we go back and say, we should not actually fail under this case? We should just keep the last values that we saw and raise an alert. This decision came out of the investigation that they themselves did during the incident review. It wasn't just like, that team changed their framework. They need to fix it. We'll just make sure they do their own work. We're fine. No, they went and investigated their own system because they were able to say, this system did not act the way we expected it to. How can we make it act the way we expect it to, or rather, the way we want it to? I'm citing this as a positive example of an incident because three weeks later, a similar thing happens where the metrics disappear. It's not my team this time. Metrics disappear.
They get alerted, like, something's gone wrong. There's no customer impact because they managed, in the span from the review to this next incident, to make the changes they needed a high priority, get them done, and get them shipped, and it just worked. I think this is an example of what we want to take away from incidents. Every incident is a learning moment. We can improve the system if we spend the time and effort to go and figure out, what can we be learning from this?
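The behavior the load balancer team moved to, keeping the last values seen and raising an alert when a metric disappears, can be sketched roughly as follows. This is only an illustration under assumptions: `LastKnownGood`, its `default` parameter, and the `alert` callback are hypothetical names, and the talk does not describe the team's actual implementation.

```python
from typing import Callable, Optional


class LastKnownGood:
    """Hold the most recent metric value when a new reading is missing."""

    def __init__(self, default: float, alert: Callable[[str], None]) -> None:
        self._value = default  # used only until the first real reading arrives
        self._alert = alert    # hypothetical hook into paging or alerting

    def update(self, reading: Optional[float]) -> float:
        """Return the value to act on for this cycle."""
        if reading is None:
            # The metric stopped being reported: do not fall back to defaults,
            # keep the previous value and make the gap visible to a human.
            self._alert("metric missing; holding last known value")
            return self._value
        self._value = reading
        return self._value


alerts: list[str] = []
region_weight = LastKnownGood(default=1.0, alert=alerts.append)

first = region_weight.update(0.7)    # normal reading
second = region_weight.update(None)  # metric disappeared
```

A held value goes stale over time, so a fuller version would probably also bound how long it may be reused; that limit is left out here to keep the sketch short.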
Making the Incident Life Better
Moving on to the third part, how do you make your incident life better? First, let's talk about things that we can do before an incident. I'm going to introduce this thing called the Swiss cheese model of protection, where you have layers of protection in the system expecting to say, ok, even if one layer fails or one layer has an outage, my next layer is going to protect me and not take everything down, and avoid this impacting other people. My favorite class of incident is the incident that doesn't actually happen because we recognized the potential for these incidents to happen. We've gone and said, how can we actually design our system so that we don't have to worry about this? Our fundamental design and tooling will prevent this class of incident from ever happening. Some issues can't be designed out of the system. This is where the Swiss cheese model comes in and says, what layers of protection can we add?
Now, individually, on its own, each protection will be insufficient to protect everything. The idea here is we have multiple layers of protection, so you don't actually have impact from any one failure. This is a system design question more than anything. What do you design into a system to make it more tolerant of failures? What failures do you expect to see? Can you recover quickly from those failures? In this case, if we're talking about data coming into a system, sure, you don't think offhand, what am I going to do if the metrics fail? You can probably figure out that this is something that you need to think about by thinking of the general class of what data do I ingest from other sources? What happens if that data is corrupt or missing? Then you'd be able to say like, this is probably a case that we need to catch and be able to figure out something for.
I will also say, multiple layers do add complexity to the system. In this case, if the tradeoff is more complexity for more resilience, where do you land on that spectrum? Some things here, I would say, for one, consciously manage risk. I was mentioning, go and think about like, where are your data sources? What happens if those fail? Don't focus directly on the, here's where we ingest data, here's where we ingest more data. Think about the general class first. I find personally, that's a way to catch more of these issues, especially where it's like, I didn't actually think that we were pulling in this data. You have to go and look through your codebase, and figure out what's going in.
There are tools that you can use to understand the data flow within the system. These are relatively company-specific, depending on however you implement your own systems. Going back again to the second incident, these aren't discovered until you go and actually consciously think through the problem. What information am I getting? Where am I getting it from? What happens if it goes bad? Think of your boundaries, like what systems do you call out to? I was in an incident a few weeks ago where it's like, this thing should never fail, but the way we wrote it didn't actually protect it. Or, rather, we assumed the remote thing could fail, and then we wrapped it in a try-except block. We didn't catch one exception. That exception happened, and that took our system down. We're looking at it, and it's like, wait, why did we not catch this?
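To make that boundary failure concrete, here is a small sketch of a remote call wrapped so that an unanticipated exception class degrades to a fallback instead of taking the caller down. All of the names (`RemoteTimeout`, `RemoteSchemaError`, `call_with_fallback`) are hypothetical and are not taken from the incident; whether a catch-all is acceptable at a given boundary is a judgment call for each system.

```python
import logging

logger = logging.getLogger("boundary")


class RemoteTimeout(Exception):
    """Hypothetical failure the original code anticipated."""


class RemoteSchemaError(Exception):
    """Hypothetical failure from a class of errors nobody anticipated."""


def call_with_fallback(remote_call, fallback):
    """Call a remote dependency, degrading to a fallback on any failure."""
    try:
        return remote_call()
    except RemoteTimeout:
        # The failure we planned for.
        logger.warning("remote call timed out; using fallback")
        return fallback
    except Exception:
        # Catch-all at the system boundary: log the full traceback so the
        # surprise is visible, but keep serving with the fallback value.
        logger.exception("unexpected failure from remote call; using fallback")
        return fallback


def flaky_dependency():
    raise RemoteSchemaError("response did not match the expected shape")


result = call_with_fallback(flaky_dependency, fallback="cached-value")
```

The broad `except Exception` is deliberate in this sketch: logging the traceback keeps the unexpected error class discoverable, while the service stays up.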
Just like, this is a different class of errors. Part of that was tech debt. Part of that was we just didn't realize this was happening. It was a bit of a surprise to us. You have to know your systems reasonably well. This is where the tradeoff came in. Yes, we didn't really follow up on this migration task that we had gotten, and now we were paying for it. The tradeoff is also an economic one. It's not economical to implement every protection possible. The only situation where it is, is like, ok, you have unlimited effort. You have unlimited time, or you have no timelines, and you have unlimited budget. That's the only point where you could implement all the reliability you want. You, as the system owners with input into how the system is built, need to decide, at what point are we going to say this is enough?
Focus on what are the critical parts? Where do you expect failures to come from, and focus on that. The other thing which you should probably do before an incident is remove incentives to avoid declaring an incident. This comes down to my experience at different companies. Some companies have a very negative view of incidents where it's like, never declare an incident, or if you declare an incident, your bonus for that half is getting cut. Why would anyone on your team let anyone declare an incident? No, everything's just like, no, it's a bug fix. You have 0% success rate. You just need a bug fix. This is counterproductive. For one, you want to declare an incident to be able to bring people in. Particularly, you want to declare an incident to bring people in early.
If you're on call, ideally you have SLIs. A threshold is breached, maybe you can automatically declare an incident, and pull in more people, and say, "This SLI was breached. Something's gone wrong. Let's go investigate and figure it out." Other times you just spot something that's weird. Then you have to say like, is this an incident? Do a brief bit of an investigation, and then say like, yes, no. At Meta, my team specifically, we've tried to pitch this as, it is safer to open an incident to get more eyes on it. If the incident is deemed to be a false positive, we actually have a specific selection for that in our incident manager, where it says, ok, we mark this as false positive and then we just leave a description. This is what we thought, why it was an incident. It is not actually an incident because it's something else.
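The two ideas here, automatically declaring an incident when an SLI crosses a threshold and being able to close it as a false positive with a short description, could look something like this minimal sketch. The names (`Incident`, `Status`, `check_sli`) and the threshold values are hypothetical, and this is not a description of Meta's incident tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"
    FALSE_POSITIVE = "false_positive"


@dataclass
class Incident:
    title: str
    status: Status = Status.OPEN
    notes: list[str] = field(default_factory=list)

    def close_as_false_positive(self, reason: str) -> None:
        # Record why it looked like an incident and why it was not one.
        self.status = Status.FALSE_POSITIVE
        self.notes.append(reason)


def check_sli(name: str, success_rate: float, threshold: float) -> Optional[Incident]:
    """Declare an incident automatically when the SLI drops below its threshold."""
    if success_rate < threshold:
        return Incident(title=f"{name} success rate {success_rate:.1%} below {threshold:.1%}")
    return None


incident = check_sli("requests", success_rate=0.40, threshold=0.99)
healthy = check_sli("requests", success_rate=0.999, threshold=0.99)

if incident is not None:
    incident.close_as_false_positive("Dashboard gap only; traffic was unaffected.")
```

Keeping the false-positive path this cheap is the point: opening the incident costs little, so responders have no reason to hesitate.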
There's a lot of cognitive safety there. It's like, we know we won't be punished for declaring an incident, so why not declare an incident at that point? I've pushed this very specifically within my team because I believe the utility for using incidents as a coordination nexus is unmatched, simply because other teams, they're able to say like, ok, there's an incident over there maybe that is impacting our team. If they have an incident, that's probably that, let's go and ask them, as opposed to that second team doing their own investigation. Then going back to our team, it was like, something wrong. It really comes down to the coordination aspect. You should structure the incident roles and responsibilities. As I mentioned earlier, one of the things that I liked in my incident was the fact that there was an incident manager for the first incident, but we didn't escalate far enough and fast enough to be able to pull them in early.
They only joined the high severity incidents. What we should have done was like, impact is expanding, pull them in now. Their role as the dedicated incident manager was able to take load off me where I'm not doing the central, "Yes, no, we've already done that, we've already done this." They're able to take questions from other teams, to say, is this impacting you? We're going to add it to the incident management. As a person responding to it, I don't have to think about this stuff. That also leads to like, you should share information during an incident. This just builds again on my collaboration point. Collaboration helps during incidents. Collaborate as much as possible. Share data that you see. Other people might get ideas for what the cause is.
The best example that I've seen of this is a Google Doc or other platform with a live edit collaboration interface. Where you, as someone who's doing something, just drop a note into the doc: this is what you saw, here's a link to the dashboard, this is what you think is going on, this is what you're going to do next to investigate. Just sharp, 20 seconds, so people will be able to say like, "Here's an update. Here's where we're going." It's also really helpful in the future when you're going through incident review to understand what you were doing, why you were doing it, and why it was or wasn't successful. Also want to highlight, blame is not a helpful part of this process. Thinking back to the org that penalized opening incidents, I've found that people by and large do not go to work intending to do a bad job.
There's no point distributing blame based on like, person X landed this bug, they broke our system. In the middle of an incident, figuring out how to blame them wastes time that you could be using to solve the problem. Stick to factual statements: this diff landed and that caused this failure to propagate. You can do that without mentioning the person if necessary. In the incident review, the fact that someone did something is generally less significant than why the system allowed this to happen in the first place. You cannot rely on humans being 100% accurate all the time. You're going to have to say, human error is expected, how do we protect against it? Finally, right at the beginning, I mentioned the Google SRE book. This covers a lot of the same ground. If you do want to learn more, it's the best public resource that I've found so far, particularly the chapter on managing incidents, which is at this URL: sre.google/sre-book/managing-incidents/.
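To make the earlier point about SLI-driven declaration and the false positive option concrete, here is a minimal Python sketch. It is not Meta's incident manager: the `Incident` class, the `Resolution` values, and the `check_sli` function are hypothetical names invented for illustration, and the threshold values are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Resolution(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"
    FALSE_POSITIVE = "false_positive"


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    title: str
    resolution: Resolution = Resolution.OPEN
    notes: list[str] = field(default_factory=list)

    def mark_false_positive(self, why_opened: str, actual_cause: str) -> None:
        # Closing as a false positive keeps the reasoning, with no penalty attached.
        self.resolution = Resolution.FALSE_POSITIVE
        self.notes.append(f"Opened because: {why_opened}")
        self.notes.append(f"Not an incident because: {actual_cause}")


def check_sli(name: str, success_ratio: float, threshold: float) -> Optional[Incident]:
    """Declare an incident automatically when the SLI drops below its threshold."""
    if success_ratio < threshold:
        return Incident(
            title=f"SLI breached: {name} at {success_ratio:.4f} (threshold {threshold:.4f})"
        )
    return None


incident = check_sli("api-availability", success_ratio=0.991, threshold=0.999)
if incident is not None:
    incident.mark_false_positive(
        why_opened="availability SLI dipped below threshold",
        actual_cause="a load test was hitting a staging endpoint",
    )
```

The design choice worth noting is that the false positive path records why the incident was opened and why it turned out not to be one, so declaring early stays cheap.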
Recap
With all that in mind, I'm going to recap how I approach incidents now. I will say outright, my goal is not to have zero incidents. My goal is to have zero high severity incidents. Before the incident, we do system review every half, saying like, what dependencies have changed? Or, is there anything that we could be building into our system to make it better? I've set the expectation that we have a healthy on-call. The alarms that we get are high signal. They're clearly actionable. We have runbooks that we can use. They're not just dead links to wiki pages. We've established some thresholds to make incident declaration automatic. I'm still working on this, but we're getting there.
It helps a lot, particularly with new people who join the team who do not necessarily feel confident enough to say like, this is an incident, I'm going to go open one. Finally, also in general, just encourage the team to subscribe to incident notifications, particularly high severity ones, even if it's like not with partner teams directly. During the incident, if any of those expectations are violated, I make a note to fix after we mitigate it. I make a note specifically so I don't lose that information. If it's an incident that happens while I'm starting on-call, I drop a message in the team chat about it to just give people a heads up. If I'm not sure what's going on, I'm going to escalate to people that I think do know what's going on.
If there's a risk of impact spreading beyond this one AZ or whatever the current boundary is, I escalate to the incident manager on-call to say like, this might become a bigger incident where we're going to start pulling in more people. Finally, if we start needing more than three people, I make sure to establish a shared doc so people can just start recording their information, ideas, data points, whatever. Ideally, I pull someone else in to deliberately just update the doc, not be part of the investigation itself. I will also say, quick note on repeat incidents. This is generally a specific case that you should be concerned about. I find that it generally means some part of the system you don't fully understand.
It's a part of the system that is ripe for re-review, particularly if there's anything in your task backlog that you need to go fix. This is a point in time where it's like, yes, you apply the principle of "this should never happen again"-driven development. When you watch other people's incidents, remember your experience in your own incidents or remember my experience and have empathy for the people who are responding. What's clear in retrospect is not always clear in the moment. Where I would like us to go, I talked a bit about this in the context of my team. My team has doubled in size in the last nine months, and we've got onboarding to the team down pat, but we've not got onboarding to the on-call working entirely. Some of it is the culture.
This is going to be a bunch of internal training where I said like, convince people, you need to go and break this habit of not wanting to open an incident. Here, there's no penalty. Don't worry. There's some internal training for us, like Wheel of Misfortune, going over past incidents with what happens when you get alert x, as a way to introduce them to the on-call and the services that we own.
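The escalation rules from the recap can be written down as a small checklist function. This is a sketch under assumptions: `IncidentState` and `next_actions` are hypothetical names, and the "more than three people" cut-off is taken directly from the recap rather than from any tool.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Hypothetical snapshot of what the responder knows right now."""
    responder_count: int
    impact_may_spread: bool
    responder_understands_cause: bool


def next_actions(state: IncidentState) -> list[str]:
    """Return the escalation steps the recap suggests for this state."""
    actions = []
    if not state.responder_understands_cause:
        actions.append("escalate to people who know what is going on")
    if state.impact_may_spread:
        actions.append("escalate to the incident manager on-call")
    if state.responder_count > 3:
        actions.append("open a shared doc and pull in someone to keep it updated")
    return actions


print(next_actions(IncidentState(responder_count=5,
                                 impact_may_spread=True,
                                 responder_understands_cause=True)))
```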
Key Takeaways
Here are the three takeaways that I would like you to leave with. The first is that incidents are not inherently bad. I think I've covered that a fair bit. They can be learning opportunities, and they can justify reprioritizing some of the work. The second is, your incident management process should focus very sharply on mitigation. Any improvements to this process should be done with the idea of driving down that time to mitigation. Finally, with all this in mind, still remember at the end of the day, we're human. Remember that it's humans that are responding to these incidents. Consider the pressure that they're under. Don't add unnecessary pressure to either yourselves or them.
Questions and Answers
Participant 1: I like your approach of the Swiss cheese model, but I want to know, are you testing your assumptions? Are you using resilience testing of some sort to test those things out and to make sure that this is a real thing, or whatnot?
Kyle Lexmond: Probably the easiest way to put that is we're codifying some of those tests into our integration tests. We have an internal framework, I forget the specific term for it, but effectively you declare that this is a dependency. Then in the integration test itself, you're blocked from network access, and so you have to check like, what happens when that dependency isn't reachable? Do you fail cleanly? We effectively test for, does this have the outcome that we expect under an outage-like case? Past that, we don't do that much. A lot of it is on the system design level, where we reason more about the system. I would like to get to the point of, to my mind, AWS is the big person in the room here, in terms of being able to say, yes, we've boiled our implementation down to whatever solver they use, and we've mathematically shown that we shouldn't be having these outages anymore.
I personally don't think we'll get there. That's a bit far on the tradeoff scale of how much effort do we need to do, to, what benefit do we get out of? If something was implemented at Meta, that would probably shift my thinking on this to say like, if there's something well-supported internally, we would probably do it.
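The idea of an integration test that blocks network access and checks for a clean failure can be sketched in plain Python. This is not the internal framework mentioned in the answer: `block_network` and `fetch_config` are hypothetical names, and patching `socket.socket.connect` is just one simple way to simulate an unreachable dependency.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def block_network():
    """Make every outbound socket connection fail inside the with-block."""
    def refuse(self, address):
        raise ConnectionRefusedError(f"network blocked for test: {address}")

    with mock.patch.object(socket.socket, "connect", refuse):
        yield


def fetch_config(host: str, port: int, default: dict) -> dict:
    """Hypothetical client that falls back to a default when the dependency is down."""
    try:
        with socket.create_connection((host, port), timeout=0.5) as conn:
            conn.sendall(b"GET config\n")
            return {"raw": conn.recv(1024).decode()}
    except OSError:
        # Fail cleanly: serve the default and flag that we are degraded.
        return dict(default, degraded=True)


def test_fails_cleanly_when_dependency_unreachable():
    with block_network():
        result = fetch_config("127.0.0.1", 9, default={"mode": "safe"})
    assert result == {"mode": "safe", "degraded": True}


test_fails_cleanly_when_dependency_unreachable()
```

The test asserts on the degraded outcome rather than on the error itself, which matches the question being asked: does the service have the outcome we expect under an outage-like case?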
Participant 2: My question was around your first incident while you were in the incident, and an incident with that big a blast radius, it usually gets a lot of eyes, means a lot of communication channels open, direct messages. What were your takeaways on how to go about prioritizing who to communicate with and what responses are taking priority at the moment?
Kyle Lexmond: Probably the easiest way to approach that is what I was doing in the incident itself. Before I declared an incident, I'd be messaging people directly. Part of this was in our team's group chat. Part of this, historically, was one-on-one or in smaller groups. We were just keeping the discussion there. When an incident is declared, we spin up an incident channel automatically. What we had to consciously do at that point was shift all the communication that was going on over to that incident chat. I'll also highlight at this point, we expected everything to be going to that incident chat. This is also where the shared doc and incident manager would help because they're able to track what's going on. That's also where I fell flat, trying to do, "Yes, no, you come in asking this question. We discussed that five pages ago, scroll back up here, go and read this link." That part was something which I think I didn't do well then. I think I do better now.
Participant 3: You mentioned building a culture around more openness in incidents so that people aren't hesitating to make incidents. Do you have any advice on the opposite problem where you get incidents created for essentially bug fixes or even new feature requests, and being like, ok, this isn't an incident. How do you push back on that? What's your philosophy when it comes to that kind of thing?
Kyle Lexmond: My philosophy there is, who is the incident owner? If another team has a bug fix that they want to deploy, they don't get to create an incident and assign it to my team. They get to own the incident. They bring in my team as a collaboration team. Or, specifically how this works is they have to create an incident in their own space for their own team. They are still the incident owners, and they pull in my team's on-call to say like, we need a bug fix for whatever service deployed using this PR. My team's on-call will be able to take that request and act on it. There's no follow through on them for like they don't have to do incident report. Ideally, they don't have to do follow-up tasks if the bug fix fails, and there's another fix that we have to do.
We go back to that team. It's like, this broke, what gives? It becomes a bit of a pushback on the bureaucratic level. You want this to happen, you do the paperwork necessary. A lot of what this has ended up being is, I mentioned like there's the false positive tag. Most severities you think of go from 1 to 3, 1 being the highest. We have a SEV-4 level that you just see it and you don't really care about it. The color that we have assigned to it is just gray. This is the low stakes. We're using this as a coordination incident. It's not necessarily a bad thing. I agree that it might be an issue if you're creating too many incidents. The solution that I found is just like, if there's any overhead, push it back on the team that's opening the incident.
Participant 4: I'm an incident manager, and mostly I get the signals and I got to coordinate with the multiple teams to get things fixed. One primary thing I see is like, when I pick a service, for example, an application goes down, I pick that application as like, ok, there's an outage. Most of the times there is nothing for that team to do because something on the backend or some cross-functional team broke something. The pushback I get is why my application is displayed in the outage when I don't have anything to do. I got to put that, because users are not going to say like, check for database or machines. Users are going to look for applications. This is a major problem because I got to choose the application, but it is fixed by some other team. Even when we look into the reports, the team who broke doesn't show up in the metrics. It's like a problem in the industry. Do you have any suggestions to streamline this?
Kyle Lexmond: My first thought there is, is it possible to improve the metrics? So that you're able to point out, here's the number of requests we made to this other service. Here's the number of successful responses we got back. Because the other team, maybe they don't have SLIs. Maybe they don't know that their service is down. You might be using some small component of their service and their success rate is like four nines. You're that fifth nine that's missing. It comes down to being able to say like, here's our monitoring. Our monitoring says that this service is bad. Other service, these are the metrics that we're monitoring. Can you go and fix them? That becomes a bit more of a, I'm not going to say political, but it's going to be for the managers of both teams to just collaborate and say, "This is what we're seeing. This is what is bad for our system. Can you add these metrics to your SLI?" Unfortunately, I don't really have a better answer other than like, get the managers talking to each other.
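A caller-side view of "requests we made" versus "successful responses we got back" could look roughly like this. It is a minimal sketch: `DependencyStats` and the dependency name `user-db` are made up for illustration, and a real system would export these counts to its monitoring stack rather than keep them in memory.

```python
from collections import Counter


class DependencyStats:
    """Counts calls and successes per downstream dependency, as seen by the caller."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.successes: Counter = Counter()

    def record(self, dependency: str, ok: bool) -> None:
        self.requests[dependency] += 1
        if ok:
            self.successes[dependency] += 1

    def success_rate(self, dependency: str) -> float:
        total = self.requests[dependency]
        # With no traffic there is nothing to report as failing.
        return self.successes[dependency] / total if total else 1.0


stats = DependencyStats()
for i in range(1000):
    stats.record("user-db", ok=(i >= 3))  # first 3 of 1000 calls fail

print(f"user-db success rate seen by caller: {stats.success_rate('user-db'):.3f}")
```

Numbers like these give the two teams something concrete to compare against the other service's own SLI.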
Participant 5: What's your take about managing leaders or leadership communications in general? What's your experience with that?
Kyle Lexmond: Having the C-suite come into any incident is concerning. There's immediate like pressure, pressure, pressure. It's not always a bad thing. If it's this incident that's big enough that they do come in, that's the point where I'd get like, incident manager, can you go talk to them? If I'm incident manager, I will go talk to them. At that point, I shouldn't be doing an investigation and just handle their requests. That is actually a point where I would say, I would back channel to them. I would not have them post in the main incident chat.