Practical Postmortems at Etsy

Aug 22, 2015 24 min read

DevOps is a movement. DevOps is a mindset. DevOps is devs and ops working together. DevOps is a way of organizing. DevOps is continuous learning.

The intangibility of DevOps makes it hard for leaders to come up with a clear-cut roadmap for adopting DevOps in their organization. The meaning of DevOps is highly contextual, and there are no popular methodologies with prescribed practices to abide by.

However, healthy organizations exhibit similar patterns of behavior, organization and improvement efforts. In this series we explore some of those patterns through testimonies from their practitioners and through analysis by consultants in the field who have been exposed to multiple DevOps adoption initiatives.

This InfoQ article is part of the series "Patterns of DevOps Culture". You can subscribe to receive notifications via RSS.

The days of running a web site on a single computer are basically over. Any basic architecture of a modern site consists of a myriad of systems interacting with each other. In the best case you start out with a web frontend and a database representing your initial feature. And then you add things: features, billing, a background queue, another handful of servers, another database, image uploads, etc. And of course you hire more people to also work on all of this. And this is the point when you realize you work in a complex socio-technical system. You did from the beginning, but it starts to get a little more obvious now. As interactions between components become more complex, the understanding of how the whole system works shrinks, and you start to witness emergent behaviour that you weren't aware of before. Behaviour that can't be explained by just looking at the single components but is a result of the interplay between them. Things start to break a lot more often.

Traditionally it has never been great to have things break. You are suddenly confronted with a situation you thought wasn't possible. Because if you had thought it possible, you would have put guard rails and protections in place for it. Everybody is caught by surprise. You want the business to run, you want the site to be up. You are working hard to fix the problem, while your amygdala is hijacking your brain. Maybe people are running around in the office or are asking questions in your chat system. How could this have happened? Why didn't we have tests for the code path that broke? Why didn't you know about that failure case? When is the site going to be back up? Eventually everything will recover and your web application is humming along nicely again. Time to talk about what happened.

In what has been called the "Old View" - the traditional way of handling the aftermath of such an outage - we would now come together and yell at the person who was maintaining the system that broke. Or the one who wrote the code. Or the one working on fixing it. Surely if they had just done a better job, we wouldn't have had that outage. This usually ends in a very unproductive meeting where afterwards everybody feels worse. And on top of that you didn't even find out what really happened.

This is why in the "New View" of approaching systems safety - the foundation of what is now commonly known as "blameless postmortems" - we take a different route. The fundamental difference is that we don't stop at "human error" as the reason for why something broke. Humans don't generally come to work to do a bad job. Nobody sets out to bring the site down when they come into the office in the morning. The fundamental assumption we must make when we go into a blameless postmortem debriefing is that whatever decisions people made or whatever actions people took, they made sense at the time. They believed they were making something better: fixing something, deploying a change that was flagged off in production, deleting a file that wasn't referenced anymore. If we had a time machine and could go back to ask that person if their change would break the site, they would tell us "no way". Otherwise they would not be deploying that change. Being able to point to that change in the debriefing as the root cause of the outage is a function of hindsight and the encompassing hindsight bias at play. The Austrian physicist and philosopher Ernst Mach said in 1905:

knowledge and error flow from the same mental sources; only success can tell one from the other.

Meaning that any action you take can always be judged as a success or a failure, depending on the outcome. Focusing on the action itself and the human as the perpetrator doesn't give us any advantage in learning how the incident came to be. Even more so, a person feeling like they are being judged will not readily talk about all the influences that went into the decisions they made. They will try to get out of that meeting as fast as possible.

This is why we focus not on the action itself - which is most often the most prominent thing people point to as the cause - but on exploring the conditions and context that influenced decisions and actions. After all there is no root cause. We are trying to reconstruct the past as closely as possible to what really happened. A challenge that is only made harder by the human brain's tendency to misremember things. If one person can make what seems, in hindsight, to be a mistake, then anyone could have made it. And so we are at a point where we can punish the person pushing the button, and the next person who does it, and the next person, and the next person. Or we can try to find out why it made sense at the time, how the surrounding system encouraged or at least didn't warn about impending problems and what we can do to better support the next person pushing the button. We can have an open and welcoming exchange where we treat the person who supposedly broke the site as the actor closest to the action and thus most knowledgeable about the surprise we just uncovered. This is one of the biggest opportunities we have to learn more about how our socio-technical system behaves in reality and not just in theory.

Blameless Postmortems at Etsy

At Etsy we strive to make postmortems as open and welcoming as possible. This means there are no restrictions on who can attend (save for the limit of people that fit into a conference room, and it's not rare that we have standing-room only postmortems). There are only rules around the minimum number of people who need to be there. As this is a learning event, it doesn't make sense to talk about what happened if the people most knowledgeable about the past aren't in the room. So everyone who worked on fixing the outage, helped communicate, got paged, or contributed otherwise to how the situation went down needs to be there. This is to ensure we learn the most we can out of what happened, and that we get the most accurate timeline of events possible.

Usually after an outage the people closest to the action start an entry in our postmortem tracker tool "Morgue". It's a simple PHP application, which we have also open sourced, that keeps track of all our postmortems and their associated information. It stores the date when the outage started and when it ended, has fields for a timeline, IRC logs, graphs, images, remediation items (in the form of Jira tickets for us) and much more. While everything is still somewhat fresh in people's minds, the timeline is prepared from all the IRC channels in which communication happened (Morgue automatically pulls in all the logs from start to end time for configured channels) and the graphs people were looking at during the outage to make sense of what was going on. In this stage it's important to add as much information as possible and not cut things out prematurely. The timeline will already be biased, influenced by how the people preparing the postmortem experienced the event. We want to keep this bias as small as possible and thus try to include as much information as possible. The people preparing the postmortem also set up the event (we have a specific calendar just for postmortems), book a conference room and invite all participating parties to it as well as the Postmortems mailing list so anyone interested can also come. They also request someone to facilitate the postmortem (more on that later) via our facilitators mailing list. Handily, Morgue already includes features to automatically schedule a meeting and request a facilitator. We've tried to make it as easy as possible to set up a postmortem and are still working on making it easier.
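To make the log-gathering step concrete, here is a minimal Python sketch of a postmortem record and of selecting chat lines that fall inside the outage window for the configured channels. It is not Morgue's code (Morgue is a PHP application); the class names, field names and the `pull_timeline` helper are hypothetical and only illustrate the idea described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ChatLine:
    channel: str
    timestamp: datetime
    nick: str
    text: str


@dataclass
class Postmortem:
    # Hypothetical record; the fields loosely mirror what the article lists.
    title: str
    start: datetime
    end: datetime
    channels: list[str]
    remediation_tickets: list[str] = field(default_factory=list)


def pull_timeline(pm: Postmortem, log: list[ChatLine]) -> list[ChatLine]:
    """Return lines from the configured channels within [start, end], oldest first."""
    selected = [
        line
        for line in log
        if line.channel in pm.channels and pm.start <= line.timestamp <= pm.end
    ]
    return sorted(selected, key=lambda line: line.timestamp)


# Example data (made up for illustration).
outage_start = datetime(2015, 8, 1, 14, 0)
example_log = [
    ChatLine("#ops", outage_start + timedelta(minutes=5), "alice", "rolling back"),
    ChatLine("#random", outage_start + timedelta(minutes=1), "bob", "lunch?"),
    ChatLine("#ops", outage_start + timedelta(minutes=1), "carol", "seeing 500s"),
    ChatLine("#ops", outage_start + timedelta(hours=3), "alice", "unrelated, much later"),
]
example_pm = Postmortem(
    title="Example outage",
    start=outage_start,
    end=outage_start + timedelta(hours=1),
    channels=["#ops"],
)
example_timeline = pull_timeline(example_pm, example_log)
```

The filter deliberately keeps every line in the window rather than trying to guess which ones matter, in line with the advice not to cut information prematurely.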

The Meeting

Once everyone is assembled in the meeting room and ready to discuss what happened, the facilitator establishes parameters and guides the whole discussion. We begin by making sure everyone understands the gist of what this meeting is about. This is not only important for people who come to a postmortem for the first time but also a good opportunity to reiterate for everyone. We state very clearly that this is about learning from surprises we uncovered. It's important to establish that most of our time (30-40 minutes in a 60 minute meeting) will focus on reconstructing the timeline, which is critical to get right as we will base any possible remediation items on our version of this timeline. If we don't get the timeline right - or at least as close as possible to what really happened - we will get less effective remediation items and reduce the overall success of the postmortem.

Of course, for all this, it is also important to mention that no matter how hard we try, this incident will happen again; we cannot prevent the future from happening. What we can do is prepare: make sure we have better tools, more (helpful) information, and a better understanding of our systems next time this happens. Emphasizing this often helps people keep the right priorities top of mind during the meeting, rather than rushing to remediation items and looking for that "one fix that will prevent this from happening next time". It also puts the focus on thinking about what tools and information would be helpful to have available next time and leads to a more flourishing discussion, instead of the usual feeling of "well we got our fix, we are done now".

With those clarifications we then start to talk about the timeline. Here we are mostly following along with the IRC transcripts. As we use IRC heavily at Etsy, it contains the most complete recorded collection of conversations. As we go along the logs, the facilitator looks out for so-called second stories - things that aren't obvious from the log context, things people have thought about, that prompted them to say what they did, even things they didn't say. Anything that could give us a better understanding of what people were doing at the time - what they tried and what worked. The idea here being again that we want to get a complete picture of the past and focusing only on what you can see when you follow the logs gives us an impression of a linear causal chain of events that does not reflect the reality.

The Facilitator's Role

The facilitator's role here is really to guide the discussion and make sure we don't fall back into 'Old View' patterns of thinking, something that happens often as it's the approach to debriefings that most of us learned in the past. The facilitator should look out for two of the most important and fortunately also most practical things: hidden biases and counterfactuals. The human mind is full of cognitive biases, and there are enough to fill a whole list on Wikipedia. While it'd be beyond our scope here to go into detail about every single one of them, we will just discuss some examples that in our experience are very common and comparatively easy to look out for, and provide insight on the best ways to work around them.

Beware Biases

One of the best-known and most frequently encountered biases is Hindsight Bias. Sometimes also called the "knew-it-all-along effect", Hindsight Bias is the tendency to declare an event as predictable when it is discussed after the fact. This occurs often because during the debriefing new information becomes available to the people involved that wasn't available while the incident was going on. This means that although it might seem obvious to a participant in the debriefing why a deploy might break the site, this wasn't obvious to all at the time. In this moment, the facilitator's role is to watch out for the bias and point out that it's not based on information that was known at the time, but on additional information that we know now. This doesn't mean this is useless information; we will and should incorporate that in the discussion and remediation items at the right time. But during the timeline discussion, this habit fuels Hindsight Bias, and it's not useful to our goal of learning what really happened.

Another very common bias is Confirmation Bias. It describes the tendency to look for and favor information that supports one's own hypothesis. It's not surprising that this is a very common bias - we all know it feels much better to be right than wrong. The risk of Confirmation Bias is that the recollection is often skewed, a one-sided version of the past, biased by the memory and hypothesis of a single person or a small group of people. It doesn't allow for the full story, and will lead to a misunderstanding of what really happened, thus lessons and remediation items will not effectively support the improvement of actual systemic shortcomings. They most often just lead to a "feel good fix" that is less effective or might even work against the next person who finds themselves in a similar situation. An example of this might be adding an alert that doesn't actually check the health of a component relevant to what is being worked on. So next time someone is in the same situation they might additionally think everything is working great as the alert is not firing.

In this case, the facilitator has to look out for this bias, which often comes in the form of following one line of thought in the timeline for which there is little evidence or information, thus focusing on confirming a hunch or opinion instead of basing those on facts. It is important to keep the timeline discussion on track and rooted in what we can see in logs, graphs, and chat transcripts. This way we can try to make sure everybody's contribution to the timeline is discussed evenly and we are not trying to follow one person's idea of the past.

A third very common bias to look out for is Outcome Bias. It describes the tendency to evaluate the quality of a decision based on its outcome and not based on the information the person had at the time. Applying Outcome Bias means that two people doing the exact same thing (e.g. pushing the button to deploy the site) are judged differently based on whether or not their change broke something. This is harmful especially because it doesn't focus on what a person knew at the time and thus the whole foundation of decision-making at that time. It's akin to that often sought but rarely available skill of predicting the future. If we don't try to make sense of the past based on actual available information, there is hardly any way through which we can make improvements for the actors at the sharp end.

A good way for the facilitator to look out for Outcome Bias is to check whether people are asking about the thought process behind deciding to do something and what sources of confidence the actors sought out for their actions, thus focusing on the action and how it made sense rather than on the outcome. Outcome Bias is also extremely common for the actors closest to the action themselves. Confronted with the outcome of their actions, humans are quick to judge their past actions as a bad choice. Here the role of the facilitator is especially important, as they will remind actors how actions made sense at the time and would have been executed by anyone else in the same manner.

It's important to keep the debriefing a safe space to share information so we can all learn. As we have all probably witnessed the Old View of approaching outages in previous jobs it is a natural tendency to not feel safe in a debriefing. We are often enough victims of our own biases when we look back on our actions or fear being judged by our peers. Deconstructing biases as soon as they arise is paramount in this endeavour to make a debriefing a place of trust and learning.

Counterfactuals

In addition to looking out for specific biases, another very practical way to keep the discussion efficient and targeted is to watch out for counterfactuals, or statements that are literally counter to the facts of what happened in the past. Counterfactuals are among the easiest things to spot, precisely because they are among the hardest things to avoid. Common phrases that indicate counterfactuals are "they should have", "she failed to", "he could have" and others that talk about a reality that didn't actually happen. Remember that in a debriefing we want to learn what happened and how we can supply more guardrails, tools, and resources next time a person is in this situation. If we discuss things that didn't happen, we are basing our discussion on a reality that doesn't exist and are trying to fix things that aren't a problem. We all are continuously drawn to that one single explanation that perfectly lays out how everything works in our complex systems. The belief that if someone had just done that one thing differently, everything would have been fine. It's so tempting. But it's not the reality. The past is not a linear sequence of events; it's not a domino setup where you can take one piece away and the whole thing stops unraveling. We are trying to make sense of the past and reconstruct as much as possible from memory and the evidence we have. And if we want to get it right, we have to focus on what really happened, and that includes watching out for counterfactuals that describe an alternative reality.
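As a small aid for note takers or for reviewing a written draft, the phrases above can be turned into a simple text check. The following Python sketch is an illustration only, not a tool the article describes: the function name is hypothetical, and "would have" is an addition to the three phrases quoted above. A match is a prompt to rephrase toward what actually happened, not a judgment of the person who wrote it.

```python
import re

# "should have", "failed to" and "could have" come from the text above;
# "would have" is an extra phrase added here as an assumption.
COUNTERFACTUAL_PHRASES = ("should have", "failed to", "could have", "would have")

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in COUNTERFACTUAL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_counterfactuals(notes):
    """Return (line_number, phrase) pairs for each flagged phrase; lines count from 1."""
    hits = []
    for number, line in enumerate(notes, start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits


example_notes = [
    "She failed to check the dashboard.",
    "The deploy went out at 14:02.",
    "They SHOULD have rolled back.",
]
example_hits = find_counterfactuals(example_notes)
```

A plain phrase match like this will miss contractions such as "should've" and cannot tell a quoted remark from the note taker's own wording, so it is at best a reminder for the facilitator rather than a replacement for listening.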

Guidance

Throughout discussion of the timeline, the facilitator acts as a guide. While facilitating you want to ask clarifying questions about why people took different actions, what their assumptions were, what the intention of that action was, what sources of confidence they sought out, what tools they had to improvise, which graphs they looked at for confirmation, why they thought using one tool over the other made more sense at the time. Especially in a situation where multiple people were debugging the same thing it is extremely useful to ask those questions, as everyone has a different view of the world. While one engineer will debug a given problem with strace and tcpdump, another will write a small Perl script to do almost the same thing. This is a great opportunity to detect common things that were missing and subsequently improvised by multiple engineers, which can be surfaced in the discussion to come. A designated note taker is a good idea here: all of these things should be noted without commentary or bias. As a facilitator you will be completely occupied with following the timeline and the thought processes of people involved, and detailed note taking simply may not be possible. However, having those notes is extremely important in the discussion following the timeline as they provide another set of facts to base improvements and remediations on.

Once you have arrived at the end of the timeline, most often indicated by the end of the chat transcript, everyone has to agree on the discussed timeline before the meeting can continue. Accuracy and agreement are extremely important as we will base the upcoming discussion and any remediations on that timeline.

The Discussion

Following the timeline review, we hold a discussion to go deeper on some of the details and information we uncovered during the timeline reconstruction. This is the time to go back to observations and notes, and to explore whether there is a lack of tooling or resources and room for improvement. While backed by the timeline, the discussion is not restricted to only that content; there is helpful context we can pull in to generate further ideas. This discussion doesn't follow a strict format but is guided by questions that can be especially helpful, including: "Did we detect something was wrong properly/fast enough?", "Did we notify our customers, support people, users appropriately?", "Was there any cleanup to do?", "Did we have all the tools available or did we have to improvise?", "Did we have enough visibility?". And if the outage continued over a longer period of time: "Was there troubleshooting fatigue?", "Did we do a good handoff?". Some of those questions will almost always yield the answer "No, and we should do something about it". Alerting rules, for example, always have room for improvement - they were created under a certain set of assumptions that also influenced the creation of the very system being monitored. Thus it is almost certain that not all emergent behaviour was accounted for. Ideally you'll have specialists for the system that was broken as well as specialists for your alerting system both present in the room. This is a great opportunity to have a quick discussion about the possibility and feasibility of adding an alert to aid detection in the future.

There is no need (and almost certainly no time) to go into specifics here. But it should be clear what is worthy of a remediation item and noted as such. Another area that can almost always use some improvement is metrics reporting and documentation. During an outage there was almost certainly someone digging through a log file or introspecting a process on a server who found a very helpful piece of information. Logically, in subsequent incidents this information should be as visible and accessible as possible. So it's not rare that we end up with a new graph or a new saved search in our log aggregation tool that makes it easier to find that information next time. Once easily accessible, it becomes a resource so anyone can either find out how to fix the same situation or eliminate it as a contributing factor to the current outage. At the same time, discussing a certain system will surface even more people who are the experts in using and troubleshooting it. And in turn this means that there is likely an opportunity for those people to document their knowledge better in a runbook, alert or dashboard. And here is an important distinction: this is not about an actor who needs better training, it's about establishing guardrails through critical knowledge sharing. If we are advocating that people just need better training, we are again putting the onus on the human to just have to know better next time instead of providing helpful tooling to give better information about the situation. By making information accessible, we enable the human actor to make informed decisions about what actions to take.

Remediation items

After the discussion is done and everybody has shared ideas about what to improve in the current system, it's time to talk about remediation items. Remediation items are tasks (and most likely tickets in your ticketing system) that each remedy one specific shortcoming; each has to have an owner. While writing those tickets down it is often useful to try to adhere to something like the SMART criteria for specification and achievability. Especially in the state of just having uncovered a potentially scary part of your system, you want to refrain from complete rewrites or replacing a specific technology as a whole. Remediation items should be time constrained, measurable, relevant to the actual problems at hand and fix a very specific thing. It's not helpful if an engineer ends up with a very high level ticket like "make the web server faster" or something that can't reasonably be achieved in a constrained amount of time like "rewrite the data layer of our application". A good rule of thumb to start out with is that a remediation item should be completable within 30 days. Many will be done sooner, and some will still be open after 6 months. This is ok and just means that you have to sit down and figure out if your time constraints make sense and whether or not a remediation item that has been in the queue for 6 months is actually a good and specific-enough task.
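The rule of thumb above can be expressed as a small check over open tickets. This Python sketch is an assumption-laden illustration, not how Etsy or Jira tracks remediation items: the `RemediationItem` fields, the `triage` function, the status labels and the 180-day stand-in for "6 months" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

TARGET_WINDOW = timedelta(days=30)   # rule of thumb from the text
REVIEW_AFTER = timedelta(days=180)   # roughly six months (assumed cutoff)


@dataclass
class RemediationItem:
    ticket: str          # hypothetical ticket id, e.g. "OPS-123"
    summary: str
    owner: str           # every item has to have an owner
    opened: date
    closed: Optional[date] = None


def triage(item: RemediationItem, today: date) -> str:
    """Classify an item as 'done', 'on track', 'overdue' or 'review'."""
    if item.closed is not None:
        return "done"
    age = today - item.opened
    if age > REVIEW_AFTER:
        # Time to ask whether this is still a good, specific-enough task.
        return "review"
    if age > TARGET_WINDOW:
        return "overdue"
    return "on track"


example_item = RemediationItem(
    ticket="OPS-123",
    summary="Add a saved log search for connection pool exhaustion",
    owner="alice",
    opened=date(2015, 1, 1),
)
```

Note that "review" here means re-examining the ticket itself, not chasing its owner: an item that has sat for months may simply have been too broad or may no longer match the system.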

Remediation items are a way to document surprises you have uncovered during the postmortem and to fix shortcomings in the system to support the human operator next time something similar happens. They are not able to prevent the future from happening and they are not a guarantee that a particular outage will never happen again, but they help equip and enable appropriate responses.

Summary

This is a very quick rundown of how we implement blameless postmortems at Etsy. Over the last 5 years we have iterated on this process quite a lot, and it has been working great for us. It's important to keep in mind that this is not a process written in stone. A lot of inspiration for our process has come from academic work as well as books like Sidney Dekker's "Field Guide to Understanding 'Human Error'", which is a great start if you are looking to dive into the topic as well. But we are also constantly thinking about how to make it better and where it needs improvement. Though we establish a process, no two postmortems are the same. We generally have a one hour meeting for our debriefings and split it up as 30-40 minutes of timeline discussion, 10-15 minutes of discussion and 5-10 minutes writing down remediation items. While this often works well, it sometimes doesn't at all. Debriefings can illuminate a more complex incident than originally thought. Remediation items can end up fixing the wrong thing or be deprecated by a change in the system before they're even implemented. We've run out of time before getting to remediation items and had to schedule a follow-up; we have (rarely) been done after 45 minutes; we've had postmortems scheduled for 2 hours and still run over. We have a general rule to implement remediation items within 30 days but have had unhelpful tickets closed after a year of inactivity. Altogether, it's clear: it would be ludicrous to think we can estimate and fit problems of arbitrary complexity into a fixed frame. The key to success is establishing a process that is adaptable and continuously revisiting it to check whether it is still helpful in its current form. We often have other facilitators sit in on postmortems and then give the facilitator feedback after the meeting. We write tooling around how we want the process to look.
And we try hard to be as prepared as possible for the next outage, as the next outage will come and there's nothing we can do to prevent it. And that's fine. We just have to make sure we learn as much as possible from them.

About the Author

Daniel Schauenberg is a Staff Software Engineer on Etsy's infrastructure and development tools team. He loves automation, monitoring, documentation and simplicity. In previous lives he has worked in systems and network administration, on connecting chemical plants to IT systems and as an embedded systems networking engineer. Things he thoroughly enjoys when not writing code include coffee, breakfast, TV shows and basketball.
