
Week-Long Outage: Lifelong Lessons


Summary

Molly Struve discusses a brutal six-day outage that nearly sank a company. She explains technical lessons like the importance of FMEAs, shadow traffic, and exercising rollback mechanisms. She shares why the human elements - widening your circle early and having a VP who acts as a defender - are what truly build psychological safety.

Bio

Molly Struve is a Staff Site Reliability Engineer at Netflix with a degree in Aerospace Engineering from MIT. She is passionate about building reliable and scalable software and teams. Her diverse experience includes leading globally distributed teams, architecting databases, and optimizing complex systems and processes.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Molly Struve: Who is excited to hear the story about one of the largest outages of my career? Give me a good outage story any day of the week. They're always like a murder mystery, trying to figure out whodunit. Was it the feature code with the bug? Was it the configuration change? Was it everyone's favorite, DNS? It's always DNS. Except for, in the story I'm about to share, it actually wasn't DNS. You're going to want to pay close attention to figure out whodunit in the end. We've all established, everyone loves a good outage story.

The next question I have for you all is, who enjoys being a contributing factor to an incident? We got a couple folks that like to live dangerously. I don't judge, maybe don't work in the automotive or airline industries. That's all we ask of you. I think most of us can agree though, that contributing to an outage or an incident is not very fun. My goal with this talk is to entertain you with a roller coaster ride through one of the biggest outages of my career. What comes after, the lessons learned, while not as exciting, are even more important. My hope is that these lessons can save you from a massive outage of your own, and further empower you and your team to handle an outage when it does happen. While this outage was brutal, we learned a ton. The experience has also shaped my career and influenced me to become an SRE. Enjoy the story and pay close attention to the lessons that come after.

Setting the Stage

I want to set the stage and give you a little background so you can better grasp the story I'm about to share with you all. During the story, I'm going to talk about an Elasticsearch upgrade. During this upgrade, we went from Elasticsearch 2 to Elasticsearch 5. Despite the fact that the numbers jump by a value of 3, the actual major version bump was only 1. At the time, Elasticsearch updated how it numbered its versions to match the underlying language that powers it. Even though we see the values go up by a value of 3, the actual major version is only 1. I want to call that out. We were not crazy enough to jump three versions. We might have deserved an incident if we did that, but we weren't that crazy, so I wanted to establish that.

Next, I want to touch on my previous company where this happened, Kenna Security, and the role Elasticsearch played for Kenna. Kenna was a cybersecurity company that helped Fortune 500 companies manage their cybersecurity risk. One of the defining features of Kenna is that we allowed our customers to search through all of their cybersecurity data in seconds, thanks to the help of Elasticsearch. Elasticsearch was the cornerstone of Kenna's platform, and what set Kenna apart from its competitors. I want you to keep this in mind as I tell this story, so you can grasp just how big of a deal what I'm about to tell you was for Kenna.

Outage (March 2017)

The year was 2017, and it was March. The time had come to upgrade our large 21-node Elasticsearch cluster. This Elasticsearch cluster, as I stated, was the cornerstone of the business, so there was a lot riding on this upgrade going well. Preparation had been underway for months, getting the codebase and the cluster ready for the upgrade. Everyone was super excited. The reason we were all so excited was because the last time we had upgraded Elasticsearch, we saw huge performance gains. We couldn't wait to see what benefits we were going to get from upgrading this time. We chose to do the upgrade on a Thursday evening, that way we had Friday to work out any kinks in case they came up. The actual upgrade itself, which involved shutting down the cluster, updating all the nodes, then updating the codebase, went off without a hitch.
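For readers curious what a full-cluster-restart upgrade like this typically involves at the API level, here is a minimal sketch of the commonly documented pre-flight and post-restart steps for Elasticsearch of that era. The cluster URL is hypothetical, and this is not Kenna's actual upgrade tooling.

```python
# Minimal sketch of the standard pre-flight steps for a full-cluster-restart
# Elasticsearch upgrade (2.x -> 5.x era). The cluster URL is hypothetical.
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# 1. Confirm the cluster is green before touching anything.
health = requests.get(f"{ES}/_cluster/health").json()
assert health["status"] == "green", f"cluster is {health['status']}, aborting"

# 2. Disable shard allocation so stopping nodes doesn't trigger a rebalance storm.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "none"}},
)

# 3. A synced flush speeds up shard recovery when the nodes come back.
requests.post(f"{ES}/_flush/synced")

# ...shut the nodes down, upgrade packages and the codebase, restart...

# 4. Re-enable allocation and wait for the cluster to go green again.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "all"}},
)
```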

However, as soon as we took our app out of maintenance mode, we started to see a bunch of CPU and load spikes throughout the cluster. This was a bit concerning, but since the cluster was still doing some internal balancing, we figured, maybe that was the cause, and it would probably be sorted out in the morning. We called it a night. This takes us to Friday, March 24th. Much to our dismay, we woke up to find that overnight, some of the Elasticsearch nodes had maxed out on CPU and load and crashed. We again thought maybe this was due to that internal balancing. We dismissed it, restarted the crashed nodes, and continued to monitor the cluster. Everything was looking pretty stable until about 9 a.m., when site traffic started to pick up. Once again, more CPU and load spikes. Then just like overnight, we saw nodes crashing. Except this time, before we could even restart them, the entire cluster went down.

At this point, we knew something was very wrong, and we jumped into full-on debugging mode. We started combing through logs, reading stack traces. We got so desperate that we started taking heap dumps. You all know you're desperate if you're taking heap dumps. We gathered all the data we could, trying to find any hint about what might be going on with our cluster, and causing all of these CPU and load spikes. We tried Googling things, like Elasticsearch 5 upgrade, followed by cluster crash. Elasticsearch 5 cluster instability? Why does Elasticsearch 5 suck so much? Maybe not this last one. We Googled everything we could think of, but we couldn't find anything that explained what we were seeing. We came up with a few theories. We tried some fixes, but none of them worked. By Friday afternoon, we decided to reach out for help. We created a post on Elastic's discuss forum. We had no idea if anyone would answer.

As many of you know, when you post on a forum like this, or like Stack Overflow, sometimes you get an answer, and sometimes you just get crickets. Lucky for us, it was the former. Much to our surprise, someone did answer, one of the senior engineers at Elastic, Jason Tedor. We were overjoyed that one of the core developers of Elasticsearch was on our case. The discussion on that forum post turned into a private email thread, with us sharing all of the data we could with Jason to help him figure out what was going on with our cluster. This back and forth stretched on through the weekend, and into the following week, all the way until Tuesday.

I'm not going to sugar coat it. During this time, our team was in a special level of hell. We were working 15-plus hour days, trying to figure out what was causing all of these issues. While at the same time, doing everything we could just to keep our application afloat. We were adding cache statements where cache statements had no business being. Anything we could do to keep load off of the cluster. Nothing was off limits. No matter what we did, that ship just kept on sinking. Every time the cluster went down, we had to boot it back up. Repeat, repeat, repeat.

As you can imagine, customers were not happy during this time. Remember, searching data is the cornerstone of Kenna's platform. Customers couldn't view their data, they couldn't generate reports. Basically, besides logging in, for most customers the platform was completely unusable. Our VP of engineering was constantly getting phone calls and messages asking for updates. With no solution in sight, we started talking about the R word. Was it time to roll back? Unfortunately, we had no plan for this. We figured, it can't be that hard. How wrong we were in that assumption. Surprise, we couldn't actually roll back. Though, not quite the surprise we were looking for. In order to roll back, we would have to stand up another cluster with the old version, and then copy all of our data over to it. We calculated this would take us roughly five days to do, which was not good news. With no other options, we stood up a new cluster with the old version, and we started copying all of our data over to it.
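For a sense of how a figure like five days comes about, here is a hedged back-of-the-envelope calculation. The data volume and copy throughput below are made-up illustrative numbers, not Kenna's actual figures.

```python
# Back-of-the-envelope rollback estimate: time to copy all data into a freshly
# stood-up cluster running the old version. All numbers are hypothetical.
data_size_tb = 5.0               # assumed total index size
copy_throughput_mb_per_s = 12.0  # assumed sustained reindex/copy rate under load

seconds = (data_size_tb * 1024 * 1024) / copy_throughput_mb_per_s
print(f"Estimated copy time: {seconds / 86400:.1f} days")  # ~5.1 days
```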

Then, Wednesday, March 29th rolled around. We are six days into this outage at this point, and our team is digging deep, despite the exhaustion, to just keep going. That's when it happened. We finally got the news we had been waiting for. Jason, the Elasticsearch engineer helping us, sent us a message saying that he had found a bug. Hallelujah. When making this presentation, pulling up this old message, it still gives me feels now, seeing it again. It's just immediate relief. The bug he found was in the Elasticsearch source code, and he immediately issued a patch and gave us a workaround that we could use until that patch was officially merged and released. When we implemented the workaround, it was like night and day. The cluster immediately stabilized. I think at this point, we were so happy, our entire team cried. This battle we had been fighting for nearly a week was finally over. Even though the incident was over, the learning had only just begun, because while it makes for a great story to tell, our team learned a few things from that upgrade. Those are what I really want to share with you.

Lessons Learned - Have a Rollback Plan, All the Time

First lesson learned, have a rollback plan. When doing any sort of change, you must know what rolling back in the event of a problem involves. Can you roll back the software with a simple revert PR? If you can't roll back with a simple revert PR, how would you handle that rollback? How long is a rollback going to take? If it's going to take a long time, that's something you're going to want to plan for. Basically, you want to worst-case scenario the crap out of large changes so that you're prepared for anything. It can be really easy as a software engineer to only focus on your code. The code for this upgrade could have been rolled back with a simple revert PR. We never considered whether the Elasticsearch software and data itself could be rolled back. One of the methods you can use for doing that worst-case scenario analysis is what some people call a pre-mortem.

Essentially, you think about handling failure before it happens, so that you are better prepared if it does. We actually have a version of this at Netflix called an FMEA, or Failure Mode and Effects Analysis. This exercise has many applications. One of them is that you can do it ahead of a large change to help de-risk it. During this exercise, you sit down with your team, you put your chaos hats on, and you think about all the ways your code might fail, or the ways the change you are making might go wrong. Once you have your list of failure modes, you usually capture these in a spreadsheet.

Then you go through and you prioritize them based on severity. Figure out, which of these do we want to mitigate ahead of time, and which of these are signals that we should bail out? We should stop what we're doing and start rolling back. Those, maybe you want to invest in detecting as quickly as possible. Great exercise. I encourage folks to try it the next time you're looking to make a big or risky change. This outage taught me that rollback plans for large changes are important. My seven years since as an SRE have taught me that rollback plans are not just for big upgrades and risky changes. They should be part of your normal considerations all of the time when you're pushing to production. Even just considering questions like, is my code backwards compatible, and can any part of my change not be rolled back, is going to be beneficial.
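To make the spreadsheet exercise a bit more concrete, here is a small sketch that ranks hypothetical failure modes with the classic FMEA risk priority number (severity times likelihood times detectability). The failure modes and scores are invented for illustration, and this is not Netflix's actual template.

```python
# A minimal FMEA-style sketch: list failure modes for a change, score each on
# severity, likelihood, and difficulty of detection (1 = low, 10 = high), then
# rank by risk priority number (RPN = severity * likelihood * detectability).
# The failure modes and scores below are purely illustrative.
failure_modes = [
    {"mode": "New version cannot be rolled back in place", "sev": 9, "like": 3, "det": 7},
    {"mode": "Query latency regresses under peak load",     "sev": 7, "like": 5, "det": 4},
    {"mode": "Client library incompatible with new API",    "sev": 5, "like": 4, "det": 2},
]

for fm in failure_modes:
    fm["rpn"] = fm["sev"] * fm["like"] * fm["det"]

# Highest RPN first: these are the failure modes to mitigate ahead of time,
# or to invest in detecting quickly as a signal to bail out and roll back.
for fm in sorted(failure_modes, key=lambda f: f["rpn"], reverse=True):
    print(f'{fm["rpn"]:4d}  {fm["mode"]}')
```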

On the topic of rollback plans, I have another poll. How many folks have exercised their rollback mechanisms within the last three months? How many of you have exercised them in the last six months? You all probably have high confidence they work, and you are prepared to leverage them in the event of an incident. For those of you who have not exercised your rollback plans or mechanisms recently, I have one question for you, are you feeling lucky? Seriously consider how confident you are that they work and that you're equipped to leverage them in the event of an incident. I have a short story on this.

From my time at Netflix, I was involved in an incident once where we had to roll back data. I asked the service owners, can we roll back the data? They said, yes, no problem, we have a runbook for this. Great, when was the last time we did this? The answer: we've never done it before, but we wrote the runbook two years ago, it should be fine. You all can probably see where this is going. Fast forward over an hour later, we finally got the data rolled back after hitting multiple friction points and unexpected behavior. It was the works, but we got it done. The lesson here is, exercise those rollback plans regularly. Don't let them become Schrödinger's cat. Does it work? Does it not? I don't know. I'm just going to wait till an incident to find out. Don't do that. Find a reason, maybe it's quarterly, maybe it's once every six months, to exercise those rollback and mitigation mechanisms so that you are confident that they work. Having a rollback plan all the time is lesson number one.
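One way to turn that into a routine is a scheduled drill that actually runs the restore path. Here is a hedged sketch using Elasticsearch's snapshot restore API against a staging cluster; the repository, snapshot, index, and cluster names are hypothetical, and this is not the runbook from the story.

```python
# Minimal rollback-drill sketch: periodically restore a recent snapshot into a
# staging cluster and sanity-check it, so the restore path gets exercised
# before an incident forces you to rely on it. All names are hypothetical.
import requests

STAGING = "http://staging-search:9200"
REPO = "nightly_backups"
SNAPSHOT = "snap_latest"
INDEX = "assets"

# A real drill would first remove any existing copy of the index, since a
# restore into an open index of the same name will fail.
requests.delete(f"{STAGING}/{INDEX}")

# Restore the snapshot and block until it completes.
requests.post(
    f"{STAGING}/_snapshot/{REPO}/{SNAPSHOT}/_restore?wait_for_completion=true",
    json={"indices": INDEX},
    timeout=3600,
)

# Sanity check: the restored index should be searchable and non-empty.
count = requests.get(f"{STAGING}/{INDEX}/_count").json()["count"]
assert count > 0, "rollback drill failed: restored index is empty"
print(f"Rollback drill passed: {count} documents restored")
```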

Do Performance Testing, Regularly

Lesson number two, do performance testing. No matter how stable and widely used the software you are running is, you should performance test it. We did not think our usage of Elasticsearch was unique. I can confidently say that our data size was not super remarkable. We glossed right over this step. Don't let a small use case or perception lull you into a sense of complacency. My favorite SRE saying is trust, but verify. While the software might perform great for 99% of the folks using it, you never know if your use case might contain that small piece that has a bug, or is unoptimized. Companies much larger than us at the time were using Elasticsearch 5 when we upgraded, but none of them ran into the particular bug that we did.

After this incident, we developed a way to run shadow traffic against new clusters and indexes in order to performance test them for new changes. Doing performance testing is our second lesson. Again, similar to the first one, don't just do performance testing for large changes. Try to find ways that you can do it regularly. Explore ways you can detect performance regressions for any change, whether it's big or small. During my time as an SRE, I have seen small dependency updates wreak more havoc than I care to admit.

If you can detect those issues early, you're going to save yourself and your team a whole lot of headache. One strategy we have for doing this at Netflix is long-running canaries. A lot of teams who leverage this strategy will actually run a canary overnight, and it's a great way to catch some of those smaller regression issues before they blow up into bigger problems. Another strategy that we use at Netflix, similar to my previous company, is shadow or replay traffic, where you mimic production traffic to servers with your new code or configurations, again, to test them under load and see how they handle it. These are two great tactical ways you can incorporate performance testing into your regular software release cycle. To avoid surprises, do performance testing regularly.
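As a rough illustration of the shadow or replay traffic idea, here is a minimal sketch that sends the same read-only query to both the production cluster and a candidate cluster and records their latencies. The URLs and structure are hypothetical; a real replay system would sample traffic and do the comparison asynchronously, off the user-facing path.

```python
# Minimal shadow-traffic sketch: mirror a read-only search to a candidate
# cluster, serve the production response, and log latencies for comparison.
# Cluster URLs are hypothetical; real systems do this asynchronously.
import time
import requests

PROD = "http://prod-search:9200"
SHADOW = "http://shadow-search:9200"

def timed_search(base_url, index, body):
    start = time.monotonic()
    resp = requests.post(f"{base_url}/{index}/_search", json=body, timeout=10)
    return resp, time.monotonic() - start

def search_with_shadow(index, body):
    prod_resp, prod_latency = timed_search(PROD, index, body)
    try:
        # Best-effort shadow call: never let it affect the user-facing path.
        _, shadow_latency = timed_search(SHADOW, index, body)
        print(f"prod={prod_latency:.3f}s shadow={shadow_latency:.3f}s")
    except requests.RequestException:
        pass
    return prod_resp
```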

Be Wary of Previous Experience Bias, Always

The next lesson in this whole debacle was, be wary of previous experience bias. We went into this upgrade with some very wrong assumptions and biases. One of which was, since the last upgrade improved performance, this one will too. Software only gets better. We all know this is wrong. Because of our bias from that first upgrade, we were unreasonably confident. Another factor that played into those biases, again, was this idea that we were working with widely used, well-known software that we trusted. It never entered our minds that that software could contain an application-crippling bug. Lesson learned here, check those biases at the door. You can probably guess, given the pattern of the first two, I'm going to append this with always. Again, this was a major upgrade, but the lessons are applicable to any size of software change. We're going to get a little vulnerable here. How many folks will admit, when you're on a roll of successful software deploys, maybe you get a little comfortable and complacent? It happens to the best of us.

Being an SRE has taught me the value of keeping my guard up and maintaining just a little bit of skepticism. How can you help yourselves and others remain just a little bit vigilant as you're releasing software on a daily basis? Consider adding small reminders to PR templates or maybe your deploy flows, simple signals to remind folks to stay alert. Maybe you add these three emojis to the bottom of a PR template or a deploy confirmation box. If emoji is not your style, maybe you add a GIF instead.

The point is, it doesn't need to be big or formal. We don't need folks filling out TPS reports every time they want to push to production, that's too much. We just want a subtle reminder for folks to pause and just think about what could go wrong when they're releasing software so that they're prepared for it. Worst case, hopefully it makes them smile in the process. This gives us three lessons learned. Unfortunately, all of these lessons were learned the hard way by not following them during this outage. I can't stress this enough, we learned these lessons with a big bang. They are applicable to even the smallest of changes. I see teams relearning these lessons all the time. Keep them in mind even as you're conducting routine software changes.

Widen Your Circle

Now I want to shift gears because these next three lessons are things that actually went incredibly well during this outage. When I look back on the whole experience, it's pretty easy to see that if these next three lessons had not gone well, the outage could have been way worse. I'm sure some of you are thinking, excuse me, worse? You guys were down for nearly a week. Trust me, as I lay out these next three lessons, I think it'll become clear that it actually could have been worse. Lesson number four, widen your circle. Never discount the help you can find by widening the circle of folks involved in an incident. Kyle from Meta talked about this concept of escalating quickly. Very similar.

In this case, we went really wide, and we tapped into the Elasticsearch community. It can be hard and scary to ask for help, but don't let that stop you. Whether it is externally tapping into your network, reaching out to a vendor, or even talking to more folks or more teams internally at your company. Widen your circle as early as possible when you are troubleshooting. This is one aspect of this incident that we did really well.

The day after it started, we quickly widened our circle to that external community and asked for help by posting on Elastic's discuss forum. It took me maybe an hour to gather all the data and write this message. It was by far the most valuable hour of the incident, because this post is what got us our answer. Many people widen their circle as a last resort. I'm here to tell you, don't wait. Don't wait until your back is against the wall and you're at the end of your rope. Don't wait until you've suffered through hours or even days of downtime. Before you get to the end of your rope and find your back against the wall, be courageous, widen that circle, and ask for help, because you might save yourself and your team a whole lot of struggle and frustration in the process.

Throughout my time as an SRE, I have learned that as soon as you get the right person in the room, it is amazing how quickly things can turn around in an incident. I know I am looking at a lot of very experienced and well-tenured engineers. It can be hard to be vulnerable and ask for help. I struggle with it too. I've also seen the immense value it carries. Not only can you resolve your incident faster, but you're going to get a second benefit, one that you're probably never even going to realize. You showcase to all of those young and aspiring early career engineers that it's ok not to know. We're all human and we're all fallible. Asking for help when you need it is a strength. Widen your circle as early as possible and ask for help.

Your Team Matters

Number five in the list, I want to highlight the incredible team of engineers that I worked with through this outage. It goes without saying, during any incident, your team matters. We all know being an engineer is not just about working with computers. You're also working with people. This outage really drove that point home for me. It was a team of three of us, and we were working 15-plus hour days for nearly a week. We went through every emotion in the book, from sad to angry to despondent. Rather than these emotions breaking us down, they banded us together. We all supported each other when we needed it.

That was a big aha moment for me because it made me realize that character is everything in a time of crisis. You can teach people tech. You can show them how to code. You can instill good architectural principles in their brains, but you can't teach character. When you're hiring, look at those people you're interviewing. Really get to know them, and assess, are they someone you would want to be back against the wall fighting a fire with? Will they jump in when you need them to, no questions asked? If the answer to that is yes, look for a reason to hire them, because that is not something you can teach. Keep in mind that your team matters, and your team is also a part of your resiliency picture.

Leader and Management Support are Crucial

The final lesson learned is one of the most important, which is why I saved it for last. It is yet another one that we nailed during this outage. Leader and management support are crucial. During any incident, no matter how big or small, the leadership and management team you have backing your engineers is extremely important. When you look at this outage as a whole, it's easy to focus on the engineers and everything we did to fix and handle the outage. One of the key reasons we as engineers were able to do our jobs as effectively as we were was because of our VP of engineering. It wasn't just the engineers that rallied during this incident.

Our VP of engineering was right there with us the entire time. He was up with us late in the evenings. He was online at the crack of dawn every single day. He not only was there to offer help technically, but more importantly, he was our cheerleader. There were many times when our team wanted to throw in the towel and give up, but he kept encouraging and pushing us because he truly believed we would figure it out. He was also our defender. He fielded all of those questions and messages from upper management, which allowed us to focus on getting things fixed. He shielded us from all of that additional worry and panic that other stakeholders were experiencing during the outage. Above all else, our VP never wavered in his trust that we would figure it out. He was the epitome of calm, cool, and collected the entire time, and that is what kept us pushing forward. If we had had a different VP, I am sure things would have turned out very differently.

I want to see what sorts of leaders we have. Raise your hand if you're a technical IC leader. We've got some technical IC leaders. How about managers? We got some managers. Directors, do we have a couple of directors? We're going to give these tips upward then. Any VP pluses? Managers, technical IC leaders, you're going to have to pass this information up. Often, you will not be the ones pushing the code or taking mitigative action during these incidents. I guarantee the role you play is much larger and much more critical than you know. How you react is going to set the example for the rest of your team. Be their cheerleader. Be their defender. Be whatever they need you to be for them. Above all else, trust that they can do it. That trust will go a long way towards helping the team believe in themselves. That is so crucial for keeping morale up, especially during long-running incidents. This lesson has deeply impacted me.

As a leader in these situations now, you have the power to impact those around you in ways you can't even imagine. This takes me back to one of my favorite sayings, "People don't remember what you did, they remember how you made them feel". That quote couldn't be more true when it comes to how this incident is imprinted in my memory. Did we learn technically from it and improve our systems? Of course, we did. The legacy of this incident truly lives on in the engineers that it impacted. We all witnessed leadership lead with compassion, empathy, and a we-will-figure-it-out attitude during that incident. That type of leadership is now one that every engineer on that team tries to emulate. It is something that influenced me to become an SRE. Now it's my job to support engineers during an incident. When I lead an incident, yes, I want to mitigate. Yes, I focus on comms. First and foremost, my priority is supporting those engineers around me. Remember this saying the next time you are involved in an incident, and lead with compassion and empathy.

Summary

With that final lesson, our list of lessons that we learned from this outage is complete. On the technical side, have a rollback plan for all changes, big and small. Do performance testing regularly. Be wary of those previous experience biases always. On the non-technical side of the equation, widen your circle as early as possible during an incident. Know that your team matters. Remember that leader and management support are crucial. While the technical side is important, I firmly believe that these last three points are probably the most important. The reason I say this is because even if you do those first three technical things, there will still be times when things go wrong. It's inevitable in this line of work, things are going to break. Incidents are going to happen. We all know it's not a question of if, but when. When things do go wrong, if you widen your circle by asking questions early and often, and you have the right team and leadership in place, you can survive any outage or incident that comes your way.

The Elasticsearch outage of 2017 is infamous at my former company, but not in the way that you would think. As brutal as it was at the time for the team and the company, it helped us build a foundation for the engineering culture. It gave everyone a story to point to that said, this is who we are. What happened there, that's us. It created a deep sense of psychological safety, and because of that, that team went on to build incredible software at that company and beyond.

With that in mind, I think there's a bonus lesson here, and that is, embrace your incidents. When I say "your" here, I mean you personally, but I also mean your team and your company. This incident was a team miss. We own that, and we embrace that. I find that in our industry, at some companies, outages and downtime can be taboo. When they happen, someone just, hopefully, scribbles out a post-mortem, sometimes they don't, and just sweeps it under the rug and hopes people will forget about it. That's not how it should work. Embracing incidents is the only way we can truly learn from them. Sharing these stories with others and being open about them benefits everyone.

Think of incidents as a withdrawal from your availability piggy bank. You just withdrew some money, now what are you going to do with it? Are you going to squander and throw it away, or are you going to invest it? Embracing incidents gives you that opportunity to leverage those lessons, grow from them, and invest in your team and your software. As you continue on your journey of building incredible software, remember these six lessons. They will help you catch problems early, and more importantly, empower you and your team to handle any incident that arises. When an incident does occur, no matter how big or small, embrace it, learn from it, and share it with others.

Questions and Answers

Participant 1: I went through a similar incident when we were doing a legacy platform upgrade. We had to update a legacy endpoint that forwarded the logs of the entire platform to our logging product. That code was never touched; from when it was created, for 4 to 5 years, no one revisited that code or made any changes. We thought updating the endpoint was just a small change, and we tested it against sandbox, QA, and other lower environments. It went fine. The production upgrade was planned for Thursday afternoon or evening. In production, it essentially created a new environment, because the code had never been revisited and the logic and the dependency reports were archived, and it created a five to six times logging spike in the software. That resulted in about 150K in licensing cost. We went through a similar root cause analysis. I personally documented everything and the lessons learned. Have you ever thought, never do upgrades just before the weekend?

Molly Struve: I think that's something you always think about, but at least part of the downtime was over the weekend, when customers didn't want to use it, because our customers were enterprise customers who weren't working on the weekend. Obviously, if you're going to have a week of downtime, there's no good time to do it. I still think buffering on the weekend at least gives you a little breathing room. I can't imagine if we had had to do that with the stress of it being a workday every single day.

Participant 1: That's why I mentioned, as one of the lessons learned at the end, never do a platform upgrade just before the weekend, or on Thursday or Friday, because we only came to know about the log spikes and the licensing cost on Monday afternoon. It went on through the whole weekend. No one realized. Then, after that, we put some budget alerting on that. Also, there's another lesson learned, or one of the important things I want to share. After the incident happened, it was limited to the platform, but we created a meeting called Incident of the Month to share all our findings across platform and product teams. That culture has been going on for the last 11 months. We gained a lot of visibility into what's happening with other products and upgrades.

Participant 2: What did your VP do or not do in the time of crisis that made sure the focus of the team did not deviate from the issue at hand to other things that go in your head when the problem lasts for five days to seven days?

Molly Struve: As I mentioned, being that buffer, so really being able to field all of those calls and text messages, allowing us to focus on mitigation. Also, I look at the team in that case. He hired a team that when we were faced with adversity, banded together and doubled down. We wanted to fix that. We all felt responsible and accountable and took a lot of pride in our work. I think it's a combination of, he hired the right team for that situation.

Then also being that defender for us and just showing up. I tell this to leaders, because some leaders do want to get involved in an incident. I always tell them, if you're going to come into an incident channel, if you're going to show up in an incident call, the best thing you can do is just lay the foundation for what you're there for. "Hey all, I'm here to support. I'll be camera off. Let me know if you need anything". Literally just showing up as a leader and just saying something like that. I'm just here to support. That is, I'm not going to get in your way. I'm not going to micromanage you. That can be super helpful for a team, and just help them feel like you're there in solidarity. You're there as a leader. You're feeling that pain as much as they are. As an engineer, then you're like, ok, he believes in us to do this. He's here with us. That really helps keep morale up, especially during a long-running incident.

Participant 3: Long-running incidents highlight that our incident practices might not be sustainable. That's usually ok when the incidents are really only a couple of hours, but when they're going on for days, that turns into an issue. Do you have any recommendations for sustainable practices during incidents and when to make the transition between something that might be short and unsustainable and something where you need a long-term plan?

Molly Struve: Anytime you're going past that hour or two-hour mark, I think that's when your radar has to go up, "We could be in this for a while". That's when it's really important to start taking stock. Who is here that is helping? Who is here that maybe doesn't need to be here? Because that's when you have to start making plans for, are you going to have to rotate people in? I think we all remember Log4j. That was something that extended a long time. That's a situation where, ok, we had engineers that were online and we had to tell them, I need you to go sleep right now. I need you to page in someone else from your team. Being aware of team health, the basic needs. Really, once you're over that hour or two-hour mark, start keeping an eye out for that and start making plans.

Recently at Netflix, we're rolling out this incident management for leaders workshop. That's a great way you can all help. Being there in the background, just keeping tabs. Who from my team is online? Who's in the wings that we can call on? Then checking in on people. Do you need to step away for food? Do you need to take a break? For leaders, that's another great way that you can support your teams during those long-running incidents.

Participant 4: When you talk about having leaders on the call, but even just observing, sometimes that can be enough to make some people feel more apprehensive if there isn't enough psychological safety, especially like, I can talk in front of my team, but now the boss is watching. How do you establish that on a day-to-day basis so that when you get into an incident, it doesn't change the behavior you've instilled in the culture?

Molly Struve: That's all going to start with building that psychological safety. I think a really great way leaders can do that is talking about incidents. Don't just talk about them when they happen. Talk about them regularly in meetings, focusing on the fact that incidents are those opportunities for learning, celebrating the learnings that come out of incidents. I think those are all little ways managers, leaders can help create what I like to call an incident positive atmosphere. Once you create that psychological safety, then if you show up in an incident, your team really knows you're there for support. Even if you're at a company where maybe you have a really good incident culture, it's really important to be aware that as you're bringing in new engineers, they might be coming from places that don't have a good incident culture. Again, going back to just being vocal, reiterate that, especially with new team members, so that you double down on that, this is a safe space. We're incident positive here. We look at these as opportunities and not as bad things.

Participant 5: Do you recommend any other pre-mortem formats besides FMEAs?

Molly Struve: The whole concept is just thinking about failure ahead of time so that you're prepared for it. I think you could do it in whatever format you want. You could do it async. I do find that when you get teams together talking about it, like live in a room, that all of a sudden new ideas will come up. One of the ways we do it is everyone writes down their ideas, then you get in a room and you talk about them, and then you get new stuff. That's the one that I have exercised. SEV0 is a conference, and one of their talks was on a smaller pre-mortem process. I think it's something you could definitely look up, and I think you'll find some literature on different ways companies think about that exercise.

Participant 6: My question is about widen your circle, and also reaching out to the right people. In your case, you reached out to this person from Elasticsearch who's outside your company. Sometimes when I try to reach out to people, I feel like it may or may not be their issue, and I'm disturbing their routine operations. Especially being an incident manager, when I reach out, the expectation is they've got to drop everything they're doing and take a look into the issue. I'm trying to build confidence here so that they don't feel like, this person keeps reaching out, it's usually not our issue, and they keep bothering us. How do I find a balance? Especially in your case, you reached out to the open forum, but you got help. The percentage of getting help is really low.

Molly Struve: I was talking to someone about this and reflecting on how lucky we got that someone responded to us. I think in this situation, again, it was a critical, SEV0 incident at this point. When you talk about maybe reaching out internally, reserve those reach-outs for the times when it's absolutely critical, and then for the lower severity incidents, invest maybe more time in trying to figure those out on your own. When you are in those SEV0s or SEV1s, just start firing off those escalations, pulling people in. In my experience at the companies I've been at, everyone wants to help. When you're in one of those SEV0 situations, if you page in one person, they come in and they're like, no, it's not us. They get it. They're like, we're all hands on deck. I think that's one way you could try to help balance that so you don't have people start to shrug you off, is saving it for those really critical moments. Then use those lower SEV1s to invest in your own understanding.

 


 

Recorded at:

Apr 28, 2026
