The Time It Wasn't DNS

Summary

Sean Klein discusses why "human error" is a dangerous myth in complex systems. Sharing the inside story of Azure’s 2023 global WAN outage, he explains how modern incident analysis looks past the "Five Whys" to uncover systemic issues. Learn how engineering leaders can move away from blame, improve Standard Operating Procedures, and design resilient systems that actively protect their engineers.

Bio

Sean Klein has been involved with post-incident activities for the better part of two decades. He currently leads the Production Livesite Review program for Microsoft Azure implementing modern incident analysis methodologies. Previous to Microsoft, Sean worked with Salesforce as well as private consulting. He is a member of the Resilience in Software Foundation.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sean Klein: Principal Technical Program Manager is the title they gave me. Modern Incident Analysis is the title I gave myself. That's what I do. I like to joke that I turn outages into 20-page Word documents and then deliver them to my senior leadership. It's not a joke, it's literally what I do. My job is writing Word documents. The method and the tools that I use to create the Word document are what differentiate what I call modern incident analysis from some of the other post-incident frameworks that people might be familiar with, like ITIL problem management and its derivatives. The basic idea is that for a certain class of incidents, we're going to go super deep on them and do more than the Five Whys stuff.

A little bit of background. DRI is a term that we use. It means OCE. It's the same thing. It's an on-call engineer. For us, it's Directly Responsible Individual, but it's the person that gets the page when the pager goes off. MOP and SOP, for the purposes of this talk, we can use interchangeably. Technically, a Method of Procedure is like an instance, or you can think of it as like a child of a Standard Operating Procedure. Even in the artifacts we have about this incident that I'm going to talk about, my leadership uses these terms interchangeably. Incident versus outage is a big thing. Within Azure, we use the term incident the way a lot of organizations would use the term alert. It's not something you declare. If some threshold is breached somewhere, monitoring will fire an incident and it'll go into a queue.

It's why if you ever hear an Azure engineer say I've got like 500 incidents in my queue, that is also a bad thing, but it doesn't mean they have 500 outages going on. When an incident has confirmed or what we feel is imminent customer impact, we turn it into an outage. An outage is something we declare. That's when, under certain conditions, we invoke our centralized incident management function. That is the larger team that I'm on, the adjacent team to our incident management team. Like a lot of organizations, we use a severity scale for these outages. Typically, severity 1 is our largest impact outage. It's reserved for outages that are breaching some sort of promised containment zone, like a multi-service outage, a full availability zone or multi-AZs, or multi-region, or global.

Then there's this super-secret other type of outage called Sev 0, which is really just like a Sev 1 plus a whole bunch of other stuff. It doesn't change our incident response. We don't respond harder on a Sev 0, but it invokes a lot of other workstreams that we need in order to manage internal and external communications and commitments. As somebody external to the company, as a customer or even not a customer, you wouldn't necessarily have visibility into this severity scale. You don't know we're working on a Sev 0 or a Sev 1. A good indicator that we're working on a Sev 0 is if you're reading about it on Downdetector, or CNN, or Reuters, or ZDNET, or if Mary Jo Foley is tweeting about it, then it's probably a Sev 0. Sev 0s are like our named outages.

They're the ones that you can talk about three years later. This outage that I'm talking about happened almost three years ago. If I say the global WAN outage, everybody knows what I'm talking about. When this outage occurred, January 25th, 2023, we were essentially hard down for about an hour and 40 minutes globally. The WAN is how our customers reach us and it was unavailable. Not only Azure services, but other clouds within Microsoft, like Office M365 and Xbox, also use the Azure WAN. If you were a Microsoft customer at the time of this outage, you were impacted by it.

Post-Incident: A Narrative Begins to Form

Even during the outage, before it was mitigated, this narrative started to develop, like, why? What was going on? I'm sure you're all familiar with this, we were doing it with other outages. This narrative of a change gone wrong started, and that's not necessarily incorrect. After the dust had settled, that narrative started forming internally as well. The thing about these giant outages is they start bringing in people that might not be familiar with how outages typically work, like the more senior leaders. My team deals with outages every single day. We live and breathe outages. I'm on the after-mitigation side, but for the incident management team, it's what we do all day. We forget that not the entire company does that. There are people that end up having to explain these outages to very upset customers or to the board of directors or to external media. Big outages like this bring people into that fold. These narratives can develop and, if left unchecked, become the story.

These are actual screenshots from communications, like comments and Word documents and Teams chats. I've anonymized these because I want to keep everything blameless, but also, this is not meant to shame these people. This happens in every organization and this happens because there's this universal human need to make a simple story out of a complex problem. It's what we do as engineers. It's what we do as PMs. It's part of the sense-making process. During an incident, and if you all are on call, when you join a bridge, you don't know what's going on. You're synthesizing data and you're coming up with theories. You're looking at logs and the pattern of the alerts coming in and what different systems are saying. You're trying to make it simple so you can understand it and figure out where to go next.

These folks are doing the exact same thing, just hours later. They're getting the various inputs in from their colleagues. They're looking at their chats. Then they're trying to figure out how they can communicate this impact to their leaders or angry customers. They're trying to simplify this story. There's danger in that, in that it's never a simple story. It's never DNS. Even when it's DNS, it's never just DNS. In fact, that's the title of this talk: The Time It Wasn't DNS. It was BGP. It's also never just BGP. It's never just operator error. It's never just an engineer that didn't follow the SOP that caused an outage.

The Simple Story - Let's Diagram It

Let's pretend it was. Let's diagram that. I give you my one why analysis here. We didn't have to use Five Whys. That's an 80% efficiency increase. The reason that that story is attractive is because it's easy to tell and it's easy to understand. It's a canonical outage type, even if it's not true. An operator made an error and then we have these repair items that are automatically written. You don't even need to select these from a dropdown list. You can just have them automatically added to your postmortem. It's some combination of like punish the engineer that made the change, either name them or fire them in an extreme case. Then for everybody who's left, we have to increase the mandatoriness of the mandatory training or the frequency of the training, or something like that. There's a little red bow because anybody can tell the story and it's immediately just accepted, "Ok. Yes, operator error. Makes sense."

When we simplify a story like this, it's dangerous because, say we did do this, we're fixing problems that didn't exist in the first place and we're causing harm in our culture, in our organization. If we punish somebody for something that they didn't actually do or without understanding all of the factors that led to the decision or the activity, and then we just tell everybody else that they need to have more training, we're not actually addressing any of the underlying issues.

What Did Happen on January 25, 2023?

What did happen on January 25th? I know you can't read this. That's the point. We're going to spend the next few minutes zooming in a lot. Some background here. I'm going to ask you for a little bit of suspension of disbelief here. This diagram is not meant to be a specification of the incident. Disregard the fact that I'm presenting it to you as a specification of the incident for a moment or two. I use this diagram as an interviewing technique. It's an alignment document. When I come in and start working with incident stakeholders and service owners and engineering and the OCE that was implementing this change here, it's a way that I use to break them out of this root cause construct. This idea that if we ask why five times we magically get to the cause, and that's the thing that we have to fix.

It's not meant to be a scientific truth. It's meant to be whiteboarding and brainstorming as I'm developing the narrative and presenting it. I'll work with one group and we'll come up with a part of this diagram, and then I'll work with another group and they'll say, no, that's not true. Every single one of these nodes is a contributing factor. It's an event or a condition that had to be true or not true in order for the customer to experience the impact in the way that they experienced it. Jumping into this world of hindsight and counterfactuals, the corollary there would be if we removed any single one of these nodes, there wouldn't have been impact. That's a lot more attractive of a story for somebody like me than the previous simple story about firing an engineer and making everybody take more training.

We're going to dive deep into a bunch of these. I'm not going to hit every single node. Start with the impact. On that day, there were three distinct periods of impact. We were hard down for about an hour and 40 minutes, and then there were two long tail recoveries in two different regions, West India and then North America around the Chicago point of presence. All of those are for different reasons. Then, to go back to that simple story, the story of the engineer who performed an operation on a router that caused impact, that happened, that actually occurred. We put that right in the middle of the diagram. An engineer ran a command on a router and the world went dark for us for an hour and 40 minutes. To really understand this outage, we have to go back two months, and we start at the beginning here.

The router that was the target of that command was a new role for our WAN. It was part of a network expansion, both an increase in size and an increase in speed of our global WAN. As you might imagine for a WAN role, that router was provisioned with an external IP address. That's how wide area networks work when they peer to the internet. It wasn't just one router, it was actually a cohort of 12 router pairs. Later, during that buildout process, it was determined that we needed to change the architecture a little bit. These routers needed to be part of what they're calling our Software-Defined Wide Area Network. I'm not going to go into that, but there's great information up there on microsoft.com if you just search for SWAN. There are some great blogs we have about that.

A nuance of that design is that these routers would need to have an internal IP. A change was required to re-IP the routers that had already been built out. Usually that's a good way to identify the network engineers in the room because they start squirming. That's not typically a low-risk operation. We had to develop an SOP for that operation. Because this was a new role, that SOP didn't exist. You might imagine that our network changes come with a high amount of governance. That includes our SOP governance. Usually when an SOP is created or modified, it goes through this pretty extensive change review process. One of the steps in that process is actually emulation. We have the entire Microsoft WAN virtualized within Azure VMs, and we can make a change to it virtually and see what happens.

That's one of the five steps in the process of simply updating an SOP that we would use to later effect a production change. Key to this story is that these were not production routers. At no time during this incident were these routers considered production. They were not serving customer traffic. They were not communicating with other routers. They were, however, connected to our IGP backplane. Some foreshadowing there. Also, during this time, during this buildout, there were some inconsistent practices regarding the SOP change processes for non-production or what was considered non-production change work. That SOP was created and modified outside of that governance process that I was talking about. There wasn't peer review. There wasn't CAB review. There wasn't that testing within the emulated environment.
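As a rough illustration of the governance gap described above, here is a minimal sketch of a gate that refuses to execute an SOP revision unless its review steps are complete. It is not a description of Azure's tooling: `SopRevision`, `may_execute`, and the step names are hypothetical, and only the three steps named in the talk (peer review, CAB review, emulation) are modeled, since the other two of the five are not named.

```python
from dataclasses import dataclass, field

# The three review steps named in the talk. The talk mentions five steps in
# total; the other two are not named, so they are not modeled here.
REQUIRED_STEPS = frozenset({"peer_review", "cab_review", "emulation"})


@dataclass
class SopRevision:
    """Hypothetical record of one revision of an SOP."""
    sop_id: str
    revision: int
    completed_steps: set = field(default_factory=set)

    def missing_steps(self) -> set:
        """Return the required review steps this revision has not completed."""
        return set(REQUIRED_STEPS - self.completed_steps)


def may_execute(sop: SopRevision) -> bool:
    """Allow execution only if every required review step is complete.

    There is deliberately no 'is_production' parameter: the gate applies
    to every device, whatever label the device carries.
    """
    return not sop.missing_steps()


reviewed = SopRevision("re-ip", 1, {"peer_review", "cab_review", "emulation"})
out_of_band = SopRevision("re-ip", 2)  # edited outside governance

print(may_execute(reviewed))                # True
print(may_execute(out_of_band))             # False
print(sorted(out_of_band.missing_steps()))  # ['cab_review', 'emulation', 'peer_review']
```

The design choice worth noticing is the missing production flag. In the story, the "non-production" label is exactly what let the SOP bypass review, so a gate keyed on that label would have had the same hole.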

The SOP was created and actually used a couple times on previous routers within that cohort that needed to be re-IP'd. During one of those operations, in fact, the operation most previous to the outage, engineers encountered an issue, where after the re-IP the link state packets were stale. That usually requires you to delete the database within the router that has all the routing configurations. Because this was a new role, we reached out to the router manufacturer for help and they provided guidance to run this command. That command was added to the SOP, but again, outside of this governance process that you might expect for updating an SOP. Then, not much happened for a few months. By this time, we were into the holiday period. We don't do a lot of these types of big changes during the holiday period.

We have targeted change freezes to support a lot of our retail customers. Then with a lot of the staff and customers out for the holidays, we don't do a lot of this type of change work. January 25th rolls around, and it's time to run the command on this particular router. The engineer executing the change followed the SOP. The SOP is attached to the change ticket. He opened it up and read it, unaware that the SOP had been changed. There's this command in there. They run it. In J. Paul's talk, there was a little bit of reference to how we should treat SOPs and runbooks. Tech especially is famous for thinking that an SOP should be infallible. You run the SOP, and if you deviate from the SOP, then that's a problem.

There are other lines of thinking influenced by safety science and resilience engineering that know that humans interpret the SOP and then make assumptions and decisions all the time. You can never have a perfect SOP. We want those humans to interpret the SOP. We want them to know that what they do might be high risk and to do it with a little bit of scrutiny, and if they see a command that they think might be unsafe, to escalate that, or to question that or think about it. You might be wondering why an engineer even seeing a command in the SOP that they didn't know was added out of band, without governance, why they might not question it. A nuance of the Microsoft WAN is we use three different router manufacturers across all roles, which means we have three different operating systems and versions of those operating systems represented in our network.

We do this for a lot of reasons. One of them is just supply chain de-risking. The engineer actually did see that command, was familiar with the command, and felt it safe to run. What he didn't know is that the command is safe to run on only two of the OSes represented in our network, where it's locally scoped. On the third OS, it affects adjacency, meaning that when he ran the command, it had essentially a global scope. Another thing affecting this engineer's mental model is that he would expect an unsafe command to be blocked at the AAA level for our network. It should have been. When a new role is onboarded, we would typically do a big audit. We would look at all of the commands that could have non-local, or more than locally scoped impact on a device, and block them.
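The point about the same command having a different scope on different router operating systems can be sketched as a default-deny authorization table. This is only an illustration of the idea, not how the AAA system in the talk works: the OS labels (`os_a`, `os_b`, `os_c`, `os_new`), the command name `reset-igp-db`, and the `authorize` function are all placeholders invented for the example.

```python
from enum import Enum


class Scope(Enum):
    LOCAL = "local"          # effect stays on the device
    NON_LOCAL = "non_local"  # effect can reach beyond the device


# Hypothetical audit results keyed by (router OS, command).
# The OS labels and the command name are placeholders, not real
# vendor or CLI names.
AUDITED = {
    ("os_a", "reset-igp-db"): Scope.LOCAL,
    ("os_b", "reset-igp-db"): Scope.LOCAL,
    ("os_c", "reset-igp-db"): Scope.NON_LOCAL,
}


def authorize(os_name: str, command: str) -> bool:
    """Permit a command only if it was audited as locally scoped on this OS.

    Pairs that have not been audited are denied, so a newly onboarded
    role stays blocked until its command audit has completed.
    """
    return AUDITED.get((os_name, command)) is Scope.LOCAL


print(authorize("os_a", "reset-igp-db"))    # True
print(authorize("os_c", "reset-igp-db"))    # False
print(authorize("os_new", "reset-igp-db"))  # False (not yet audited)
```

Keying the table on the pair rather than on the command alone is what encodes the lesson: experience with a command on one OS says nothing about its scope on another.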

That was not done in this case, really just because of order-of-operations timing. We were onboarding these routers, and the audit was in the process of taking place, but hadn't got to the point where we were doing the full command audit yet. On January 25th, 2023, the engineer ran the command, confidently, so confident that 33 minutes later he did it again on the other router in the router pair. About those 33 minutes: typically, on a production change, there would be a listening period. We would implement the change, and then there's this forced hour, basically, of making sure everything's fine, looking at key health indicators. That process was not in the SOP, because this was not considered a production SOP change. From the engineer's perspective, he was not making a production change.
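The listening period described here can be expressed as a small check that runs before the second step of a paired change. This is a minimal sketch under stated assumptions: the one-hour value follows the speaker's "basically hour", the `health_ok` flag stands in for whatever key health indicators are actually consulted, and the clock time in the example is illustrative rather than taken from the talk.

```python
from datetime import datetime, timedelta

# Assumed value: the talk describes the listening period as "basically" an hour.
LISTENING_PERIOD = timedelta(hours=1)


def next_step_allowed(previous_step_at: datetime, now: datetime, health_ok: bool) -> bool:
    """Allow the next step only after the listening period has elapsed
    and the health indicators look good."""
    return health_ok and (now - previous_step_at) >= LISTENING_PERIOD


first_run = datetime(2023, 1, 25, 7, 0)  # illustrative clock time, not from the talk

# 33 minutes after the first command: still inside the listening period.
print(next_step_allowed(first_run, first_run + timedelta(minutes=33), True))   # False
# A full hour later with healthy indicators: allowed.
print(next_step_allowed(first_run, first_run + timedelta(minutes=60), True))   # True
# Enough time has passed, but health indicators are bad: blocked.
print(next_step_allowed(first_run, first_run + timedelta(minutes=90), False))  # False
```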

The engineer had no idea of the impact caused by the first command, and the network was almost healing from that first command being run when he ran it again on the second router. The combination of these two commands had two cascading events. I'm not going to go super deep into network routing theory here, but a good breakdown would be, routers use Interior Gateway Protocol to manage internal communications, and Border Gateway Protocol to manage peering with external internet peers. The command that was run basically resets the routing table for IGP on the local device, hopefully only on the local device, but in this case on the entire network. The entire Microsoft WAN began recomputing the connectivity within itself for about 30 minutes, and then healed slightly, and then did it again.

That second time was what caused the BGP to start recomputing. Given just the sheer size of the Azure WAN, that meant about 15 million routes had to be recomputed, which took about an hour and 45-ish minutes to finish. I mentioned the three distinct impact periods. During the IGP reconvergence, three devices actually failed. When the WAN came back up, three devices were degraded, and that was due to defects in the devices. Also, during the outage, we paused the system that we use for auto-detection, auto-rerouting, and auto-recovery, because it was contributing to the problem at that point. Everything was down, and so it wasn't helping. Then traffic was not restored in these regions until we turned that back on, were able to detect these degraded routers, and then manually ejected them.

Let's check back on our simple story here. How did we do? I use this a lot as my example of the one why, and say, what's the root cause? When somebody asks me what the root cause is, I say, "The root cause was that an engineer executed what was understood to be a low-risk, locally-scoped planned change to a non-production router pair, following an official Standard Operating Procedure that had been previously updated by senior engineers to include new guidance provided by the router manufacturer to address issues encountered during a previous execution of the same operation, with a mental model informed by extensive experience with other router manufacturers, and a belief that any potentially unsafe, globally-scoped command would be blocked by the AAA system." What repairs do we have for that? Who do we fire for that? What's the root cause in that?

I'm going to go back a little bit and tell you. I just asked the question like, what repairs do you have for that? A great benefit of a diagram like this, and again, I want to stress, this is not the specification. Think of this as like a brainstorming whiteboarding exercise, but in general, when you have things lined up like this, you can start to see nodes that have a lot of arrows coming from them or a lot of arrows going to them, and those are pretty good indicators that that's a systemic issue or a key issue, or something of pretty deep interest. Another thing is that these contributing factors that are closest to the impact are often representative of tactical repair items. They're the things you can do, again, jumping into counterfactuals and hindsight here, that would most likely prevent this exact outage from happening again.
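The two reading techniques described here (count the arrows on each node, and look at how close a node sits to the impact) can be sketched on a small directed graph. The edge list below is a loose paraphrase of factors mentioned in the talk, not a copy of the actual diagram, and the node names and the `hops_to_impact` helper are invented for illustration.

```python
from collections import Counter, deque

IMPACT = "customer impact"

# (cause, effect) pairs: a simplified paraphrase of factors from the talk,
# NOT the real diagram.
EDGES = [
    ("new router role", "AAA audit incomplete"),
    ("new router role", "SOP changed outside governance"),
    ("SOP changed outside governance", "vendor command in SOP"),
    ("SOP changed outside governance", "no listening period"),
    ("vendor command in SOP", "command run on both routers"),
    ("AAA audit incomplete", "command run on both routers"),
    ("mental model from other OSes", "command run on both routers"),
    ("no listening period", "command run on both routers"),
    ("command run on both routers", "IGP recompute"),
    ("IGP recompute", "BGP recompute"),
    ("BGP recompute", IMPACT),
]

# Arrows going to a node, and arrows coming from a node.
fan_in = Counter(effect for _, effect in EDGES)
fan_out = Counter(cause for cause, _ in EDGES)


def hops_to_impact(edges, impact):
    """Breadth-first search backwards from the impact node.

    Returns the minimum number of arrows between each factor and the impact.
    """
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, []).append(cause)
    hops = {impact: 0}
    queue = deque([impact])
    while queue:
        node = queue.popleft()
        for cause in parents.get(node, []):
            if cause not in hops:
                hops[cause] = hops[node] + 1
                queue.append(cause)
    return hops


hops = hops_to_impact(EDGES, IMPACT)

print(fan_in.most_common(1))    # [('command run on both routers', 4)]
print(fan_out["new router role"])  # 2
print(hops["BGP recompute"])    # 1 (close to impact: tactical repair candidate)
print(hops["new router role"])  # 5 (far from impact: systemic theme)
```

The numbers are only prompts for conversation, in the same spirit as the diagram itself. A high fan-in node is not "the root cause"; it is a place where many conditions met.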

If your goal is to prevent this exact outage from happening again, those are good places to look. In fact, we've done some of that. This relationship between IGP recomputing and then forcing the BGP recompute, there's various buttons and levers that we can pull to make BGP a little bit more robust and less sensitive to IGP shenanigans, and we did that days later. We did that to make the whole network a little bit more robust against that type of failure. Then as you start getting further and further away from the impact, you start getting into the more systemic or thematic issues. When I talk about SOP change governance, how many outages did that cause previously or almost cause previously? Those are good targets for if you're looking to add resilient behaviors to your systems and to your organization.

Those are the kinds of things you want to focus on, mental models for the engineer. How do you change that? I ragged on training a little bit earlier. Training is good, but it has to be right-sized and meeting the engineers where they are. One of the things that we did for this, instead of just making the existing training mandatory and making everybody attest that they've taken it twice a year or whatever, we now study this incident as part of the onboarding training for the WAN team and for the core networking team. Part of that is to instill this respect for what is a complex system that fails in complex ways, in often unpredictable ways when you're running an SOP.

Final Note and Resources on Azure Incidents

One thing I do want to point out is, almost nothing that I've said today hasn't already been said in our post-incident reports. For many outages, we do these Azure incident retrospectives. These are live Q&A style panels that we have with engineering leaders. If you're an Azure customer, or even if you're not an Azure customer, you can register for these. You have an opportunity to, during the live event, submit questions to the panel and they'll answer them. Then, if you're not an Azure customer, if you go to azure.status.microsoft, you'll see PIRs like this. At the top of one will be a recording of the live session that we did.

Questions and Answers

Participant 1: Did you fire the networking engineer?

Sean Klein: No. It's an unfortunate question that we got from a lot of organizations, like customers. With a big outage like this, we end up doing special engagements with customers that were highly impacted by it. A common question we got was, has he been fired? It unfortunately puts our executives in a tough spot, where they have to say, no, that's not what we do, but in a way that doesn't make the customer feel like an idiot for asking. Which is why I don't do those interviews, because I would make the customer feel like an idiot for asking.

Participant 1: Did you find, even internally, when you have a simplistic story, because leaders and people far away may not want to do the whole long story. Did you have to facilitate some of that?

Sean Klein: Yes, a little bit. I mentioned that one of the reasons that narrative develops is because it's like a rhythm, like a chord in music, you know it even if you've never heard it before. It strikes that familiar chord, and you're like, "Yes, human error." Was he fired? Then the narrative just develops, even without people actively developing that narrative, which is one of the reasons that I give this talk internally: to fix that, and prevent that behavior, like actively curate a more blameless environment. I'm constantly shocked. We're like 13 or 14 years from when blameless postmortem was first introduced in tech, but I'm still constantly meeting new people who are unexposed to it, or learning about it for the first time, even within Azure. People think of Azure as like a big monolith. Azure is about 2,000 discrete services, with different engineering teams, all with their own origin stories, and cultures.

Participant 2: You mentioned the idea of blamelessness. In the aircraft industry, they seem to have, more or less, accepted this idea that when an airplane crashes, there's an incident, you don't blame the pilot, you don't look for the root cause, you do this kind of analysis. How do we change this narrative to the general public? Because clearly, as we become more and more dependent on these distributed services, this is going to happen over and over again. How do you change the narrative for the general public?

Sean Klein: A lot of it has to do with working within the framework that you have. If the first step in implementing your program is to change the culture of your organization, then your program isn't going to succeed. You have to find ways to work within it to slowly grassroots change it. I need to make this point too, this isn't the way that we necessarily postmortem every single incident that happens in Azure. We still have a lot of metrics and getting dashboards up to leadership as quickly as possible, and that kind of stuff. We're not going to not do that. The work that I do melds into that a lot. Typically, when a big incident happens for me, I'll be part of that initial, like the first few days, where we're getting comms out to the right people and making sure that we're saying the right thing and that it's truthful, and giving enough information. A typical analysis that I do might take a month, and it shouldn't be time-boxed based on commitments to whoever.

I find ways to create those learning opportunities within the groups that I work with. The diagram that I was using is a whiteboarding exercise, but I use it to build resilient behaviors within the groups that were involved in the incident, and to get them talking to each other, and to ferret out those thematic and systemic issues that are towards that right end of the chart. The long-winded answer to your question is keeping it under the radar if you have to. Also, like you mentioned right off the bat, aviation doesn't do this. When a plane crashes, the idea isn't get the MTTR metric to the president as quickly as possible. We want to learn. We all wait for the NTSB report to come out like a year later. Use those examples. Use examples like in healthcare and aviation and other safety critical companies or organizations.

Participant 3: We follow a lot of these practices that get repeated, like the Five Whys. What resources have you found for this like deep analysis or did you come up with this on your own?

Sean Klein: Communities, Resilience in Software Foundation community. Adaptive Capacity Labs has a blog that goes into this. Is LFI still active? Yes. It started with the blameless postmortem blog from Etsy, from John Allspaw, and it went from there. Safety science books. The STELLA report is an awesome resource. I cut and paste from that all the time. Especially in the blameless section, there's an awesome paragraph in there that says organizations that prioritize blame and accountability are deprioritizing the transfer of knowledge and the free flow of information. The Howie Guide. Yes, the Howie Guide which was part of Jeli. I think it's on PagerDuty's site now. It's an awesome resource.

Participant 4: I'm trying to just relate this presentation to Kyle's presentation on human toll. If I were that engineer who ran a command whose impact is like billions of dollars or something, there is a human toll attached to it. He'd probably be traumatized, or when he goes to run any commands in future, he'll be thinking 100 times before executing. How people deal with that is the first question. The second question is about system resilience. We keep talking about like if I execute a command and the system allows me to do it, during the whys it comes down to, the system allowed me to do it. If you think about it, there will be few ways to execute the command in a proper way, but there will be a thousand ways to incorrectly execute it. If we start putting in code to avoid that, we'll probably have more code to prevent the wrong thing from executing. How do we find the balance?

Sean Klein: The first question about the engineer. I rarely talk to the engineer that performed the action as my first interview. I want to have a little bit of background. I want to understand the culture of the team that he's operating in a little bit more. Then I go in. Only once or twice have I come across an engineer that was fearful that they were going to get fired because of what they did. Always, even if they're not, I reinforce that that's at least not what I'm here for. From my perspective, and this leads into your second question, we failed the engineer. If the engineer ran a command that took down the world, that's the system's fault for not protecting the engineer. We should have systems that help the engineer make good decisions and then protect the system for when they don't. All of our systems should be designed for that. We should be giving the engineer insight into the state of this system and the effect of what they're going to do to it.

Recorded at:

Jun 23, 2026

Sean Klein
