Key Takeaways
- More and more software systems are becoming “safety-critical”
- Software teams generally analyze failure in ways that are simplistic or blameful
- There are many intuitive traps teams may fall into, particularly the idea that human error is to blame for incidents
- The language used by investigators and facilitators is crucial to learning
- Teams can protect learning by keeping repair planning discussions separate from investigation
Jessica DeVita (Netflix) and Nick Stenning (Microsoft) have been working on improving how software teams learn from incidents in production. In this article, they share some of what they’ve learned from the research community in this area, and offer some advice on the practical application of this work.
In January 2018, DeVita attended a Learning Lab on Critical Thinking in Safety at Lund University. The Learning Lab introduces attendees to the latest thinking in the fields of safety and human factors research, and she was struck by how much the world of software had to unlearn before we could learn effectively from incidents and outages.
Common responses to incidents in software remain centered on a basic cause-and-effect philosophy and prioritise the identification of repairs over effective learning. In many safety-critical industries—aviation, healthcare, firefighting—investigators and analysts have long since moved beyond these concepts, which are inadequate for understanding what happens in the complex socio-technical systems they study.
Stenning is a software developer working on production systems and is currently a site reliability engineer (SRE) at Microsoft. While at Microsoft in 2019, DeVita worked with Stenning to develop teaching materials that introduce software teams to some of the central ideas from this research community, and offer them concrete and practical advice on how to learn more from incidents.
For both Stenning and DeVita, the motivation for this work is easy to explain: software is already a safety-critical industry, so we had better start acting like one. As hospitals, transport networks, and even whole cities start to rely on the cloud, outages and incidents become potentially life-threatening events. This reality was brought into sharp focus in 2020 by COVID-19, as response teams used the cloud to support vaccine research and contact-tracing efforts.
The rest of this article shows how we can learn from the missteps of other industries and work to create an environment where learning from what’s happening in the operational context, whether from successes, near-misses, or failures, is expected and rewarded.
Safety Science and Resilience Engineering
Researchers often cite the Three Mile Island nuclear accident in 1979 as the most important moment in the development of modern safety science. The partial meltdown of TMI unit 2 in Pennsylvania prompted a wave of research into the role played by humans in engineered systems. To grab software developers’ attention, let’s reach even further back into history, looking at the crashes of B-17 “Flying Fortress” aircraft during WWII.
Military researchers including Paul Fitts (of the famous Fitts’s Law) and Alphonse Chapanis investigated hundreds of accidents involving B-17s, almost all of which had been attributed by contemporary investigators to “pilot error.” In these accidents, B-17 pilots were crashing their aircraft either by retracting the flaps during landing, or by raising the landing gear while on the ground. The accidents had kept happening despite numerous investigations. It wasn’t until Fitts and Chapanis compared the experiences of B-17 pilots with those of pilots flying other aircraft that they identified an overlooked problem: the cockpit controls for the landing gear and the wing flaps were far too easy to confuse with one another (1).
This was perhaps the first research effort that concluded “human error” (or, in this case, “pilot error”) was a misguided label in accident investigation. And yet, nearly three-quarters of a century later, it is still common to hear “human error” cited as a cause of production incidents in software.
In that 75-year period, and particularly since the TMI-2 accident, the research community has expanded to study industries from aviation to healthcare, marine engineering to firefighting, and, most recently, software operations (2). The theme running through much of this research is that the role of humans in engineered systems is extraordinarily complex, and that portraying humans as the weakest and least reliable components of any system they are part of is naive and misleading. Our intuition for these issues is poor, and there are many compelling cognitive traps for us to fall into when investigating incidents.
Traps and Pitfalls
The story of the B-17 crashes illustrates one common trap: accepting “human error” as a cause of or contributor to an incident. We know that humans make mistakes, so simply identifying an error is uninteresting. The “new view on human error” explained by Sidney Dekker in The Field Guide to Understanding ‘Human Error’ (3) shows that when we are tempted to attribute a problem to “human error,” we can instead look more closely at the environmental or contextual factors that made the error possible or likely.
Another common trap is “counterfactual reasoning,” in which investigators and participants talk about scenarios that did not occur. They use phrases like “if we had done X” or “so-and-so should not have enabled feature Y.” The problem is that you cannot explain how something happened by talking about things that didn’t happen. The interesting question here is, “What did happen?” Who was involved? What decisions did they have to make? How did they make those decisions? Were they trying something that had worked for them before?
You can learn about more common traps in incident investigation from Johan Bergström’s short video, Three Analytical Traps in Accident Investigation.
Two more important pitfalls that we see in software organisations are:
- Heavy over-indexing on repairs (in an attempt to prevent “recurrence”)
- Use of “shallow metrics” like MTTR (mean time to recovery) in the hope of driving incident count or length towards zero
Both of these practices create perverse incentives. A heavy emphasis on repairs causes investigators to skip or shorten in-depth analysis of an incident and jump instead to conclusions which are easily repairable. It also creates an environment in which it is easy to overlook the systemic impact of locally-optimised repair work, which could create much more serious problems. Meanwhile, shallow metrics ignore the complexity of real incident response and create pressures for teams to mis- or under-report incidents, which in turn leads to a larger gap between how management perceives operational work and the reality of how it is done.
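To make the problem with shallow metrics concrete, here is a minimal sketch in Python, using invented incident durations purely for illustration: two quarters with very different operational realities can produce almost indistinguishable MTTR figures, because the mean hides both the variance and the single large outage that dominates one of them.

```python
from statistics import mean, median

# Invented incident durations (minutes) for two hypothetical quarters.
q1 = [12, 18, 25, 31, 44, 58, 63, 90]   # many moderate, well-handled incidents
q2 = [5, 7, 9, 11, 14, 16, 19, 340]     # mostly trivial blips plus one major outage

for label, durations in (("Q1", q1), ("Q2", q2)):
    print(
        f"{label}: MTTR={mean(durations):5.1f} min  "
        f"median={median(durations):5.1f} min  "
        f"max={max(durations)} min  incidents={len(durations)}"
    )

# Output:
# Q1: MTTR= 42.6 min  median= 37.5 min  max=90 min  incidents=8
# Q2: MTTR= 52.6 min  median= 12.5 min  max=340 min  incidents=8
```

A dashboard tracking only MTTR would suggest these two quarters were broadly similar; the stories behind them, and what there is to learn from each, are entirely different.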
Avoiding traps when investigating incidents is hard work. Your guiding principle should be to use the approaches in this article to deeply understand what happened during an incident before you attempt to solve any of the problems uncovered. Things that look like problems at first might turn out to be doorways to much more interesting discussions, just like the “human error” label.
Language Matters
One of the most important lessons we’ve learned, both from the research literature and from practical experience, is that the language we use when trying to understand incidents matters.
Our words matter. Our words have consequences. Our words help conjure up worlds for other people—people with legal battles to win, people with prosecutorial ambitions to satisfy, people with insurance payouts to reap and people with design liability to deny... These are worlds where our words attain representational powers that go way beyond the innocuous operationalism we might have intended for them. These are worlds in which real people, professional practitioners, are put in harm’s way by what we come up with. We cannot just walk away from that.
Sidney Dekker
If you are trying to discover what happened during an incident, then you must ensure that the language you’re using supports that goal. It is very easy to use words that don’t. Two examples serve to illustrate this.
First, when we use language that wraps up complexity in a neat parcel like “human error,” or we make counterfactual assertions (“system X should have been able to detect this scenario”), we give participants in our investigation the opportunity to agree with something that might be true given what we know in hindsight, but which does not help us understand the behaviour of the people or systems during the incident. Everyone in the room can nod and acknowledge that the human did indeed make a mistake, or that system “X” really should have detected the issue. Have you understood anything new about how your system really works? Unlikely.
Second, when we ignore power structures and the social dynamics of the organizations we work in, we risk learning less. Asking “why” questions can put people on the defensive, which might make them less likely to speak frankly about their own experience. This is especially important when the person being asked is relatively less powerful in the organisation. “Why did you deploy that version of the code?” can be seen as accusatory. If the person being asked is already worried about how their actions will be judged, it can close down the conversation. “During this incident you deployed version XYZ. Tell us about the process by which you made that decision” is one way to phrase the inquiry that is less likely to be seen as an accusation, and more likely to expose interesting information.
This is a journey. Learning more about how to ask better questions after an incident is a never-ending exercise. There’s certainly not a template or some “right” set of questions for every organization and context. Another difficulty you may encounter is the strong organisational tendency to push towards solutions before the problems are fully understood. In the next section we look at one way to confront this challenge.
Investigation is Not about Remediation
The processes we develop to understand what happened during an incident are different from those we use when planning how to improve the system for “next time.” We create problems for ourselves when we allow these activities to blur into one another. Keep a clear line by dividing your post-incident activity into three steps:
- Independent investigation
- Group investigation
- Action/repair/remediation planning
Independent Investigation
As soon as possible, find 20-30 minutes with individual incident participants, and talk to them to learn how the incident unfolded from their perspective. The goal of these interviews is to understand what they experienced and, of course, to help them avoid blaming others or themselves.
Have participants tell the story from their point of view, without presenting them with any replays or reminders that supposedly “refresh their memory” but would actually distort it. Tell the story back to them as an investigator. This is to check whether you understand the story as the participants understood it. If you had not done so already, identify (together with participants) the critical junctures in a sequence of events. Progressively probe and rebuild how the world looked to people on the inside of the situation at each juncture.
Chapter 3 of The Field Guide to Understanding 'Human Error' (3)
Group Investigation
Most often, this takes the form of a facilitated incident review, attended by the participants in the incident. The facilitator should not themselves have been a participant in the incident, unless this is unavoidable. It can be effective to structure the review around a timeline of events from the incident, which can usefully be prepared in advance from what investigators learned during the independent investigation phase.
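There is no single right format for such a timeline. As one hedged illustration (the field names and entries below are our own invention, not a prescribed schema), here is a minimal sketch of how events gathered during the independent interviews might be recorded, keeping each participant’s perspective attached to the events they describe rather than flattening everything into a single “objective” narrative:

```python
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    """One event as one participant described it; all names here are illustrative."""
    timestamp: str            # when it happened, as best the participant recalls
    source: str               # who described this event during independent investigation
    observation: str          # what they saw, did, or believed at the time
    open_question: str = ""   # something for the facilitator to bring to the group

timeline = [
    TimelineEntry(
        "14:02", "on-call engineer",
        "Paged for elevated error rates; assumed it was the usual cache blip.",
        "What made the 'usual cache blip' explanation plausible at this point?",
    ),
    TimelineEntry(
        "14:15", "deploying engineer",
        "Saw the page, but believed the rollout had already completed cleanly.",
    ),
]

# The facilitator can walk the group through the timeline in order,
# pausing on the open questions rather than jumping to fixes.
for entry in timeline:
    print(f"{entry.timestamp} [{entry.source}] {entry.observation}")
```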
The facilitator’s role is to help people understand what others understood at various points during the incident by drawing the group’s attention to these details, a recommendation that comes from John Allspaw and Dr. Richard Cook of Adaptive Capacity Labs. A skilled facilitator works to create the space for questions to come up and be discussed.
Fully standing in the space of neutrality says to a group “I’m your ‘sherpa.’ I’ll guide you up this mountain, there is a specific process we will follow. And in the end, you’ll get to do the work.”
Team Catapult
Agile Team Facilitation: Maintaining Neutrality
Action/Repair/Remediation Planning
The business pressure to identify repairs from an incident is intense. The irony is that identification of repairs is often prioritised over in-depth investigation, even though the intangible learning from such an investigation usually has greater business value than the rapidly-forgotten lists of repairs which many will recognise as the main output of their post-incident process.
If collective identification of repairs is part of your post-incident process, we recommend that you do not undertake this task during the group investigation/facilitated incident review.
Instead, hold a dedicated “repairs planning meeting” a day or two later. This will help you protect the learning goals of the facilitated incident review, and will also offer some “soak time,” which can help participants identify more useful or practical repairs.
The soak time period recognizes that the process of learning can be complicated and messy. It is important to stop the group work and give the individual time to think about everything said and identified. It is in these personal review times, where we think about what we have done and said, that we start to better understand what and how the event happened.
Todd Conklin, PhD
Pre-Accident Investigations (4)
You’ll Need a Few New Skills
What we’re proposing may sound like a lot of work, much of which is unfamiliar to many of us in software. The skills mentioned in this article (facilitation, investigation, etc.) are not ones that software engineers are taught in university courses, and they aren’t often considered in hiring decisions. The good news is that entire fields of study exist (Cognitive Systems Engineering, Safety Science, Resilience Engineering) in which researchers have been writing about these challenges for a long time. As the systems we work on become more complex and more critical, it is our shared responsibility to expand our domain of consideration to include the humans and the organizational systems that are part of our work. The tips in this article are a great place to start.
We believe you can’t do this work well if you view your colleagues as broken or defective. Cormac Russell describes different ways of working with those around you in Four Modes for Change. If you want to change anything in your organization, his guidance suggests that you need to “start with what’s strong, not what’s wrong,” and aim to work as an “alongsider” with the teams doing the work. If you work with the team, you may have a chance of succeeding. If you attempt to make changes to or for them, Russell suggests that your changes will likely not last. Alex Elman’s recent article, Shifting Modes: Creating a Program to Support Sustained Resilience, discusses in detail the ways of working that have been effective for him in creating a learning environment around incidents.
Some of our suggestions in this article may be more palatable than others. Our experience, however, is that there is no shortcut to doing this work well. You must, in the words of the Zen proverb, chop wood, carry water.
About the Authors
Jessica DeVita is a senior applied resilience engineer at Netflix.
Nick Stenning is a site reliability engineering manager at Microsoft.
References
1. Fitts, P. M., and R. E. Jones. 1947. “Analysis of Factors Contributing to 460 ‘Pilot-Error’ Experiences in Operating Aircraft Controls.” Air Materiel Command, Engineering Division Memorandum Report, TSEAA-694-12.
2. Cook, Richard I., and Beth Adele Long. 2021. “Building and Revising Adaptive Capacity Sharing for Technical Incident Response: A Case of Resilience Engineering.” Applied Ergonomics 90 (January): 103240. (Published online 2020.)
3. Dekker, Sidney. 2014. The Field Guide to Understanding ‘Human Error’. Farnham: Ashgate Publishing Ltd.
4. Conklin, Todd. 2012. Pre-Accident Investigations: An Introduction to Organizational Safety. Farnham: Ashgate Publishing Ltd.