At QCon London John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included: resilience is something a system does, not something a system has; creating and sustaining “adaptive capacity” within an organisation (while being unable to justify doing so specifically) is resilient action; and learning how people cope with surprise is the path to finding sources of resilience.
Allspaw, co-founder at Adaptive Capacity Labs and ex-CTO at Etsy, established the need to study and learn from the field of resilience engineering by asking the audience two questions. First, he asked if anyone had ever responded to a production incident in which they experienced a profound sense of confusion around the signals and signs being seen, ultimately leading to a complete loss of any plausible explanation. Second, he asked whether, when responding to an incident, anyone had attempted some action to mitigate the situation that triggered anxiety that there was a non-zero chance of making the situation worse. Many hands were raised in response to the first question, and the room filled with confirmatory nervous laughter in response to the second.
Resilience engineering is a field of study that emerged from cognitive systems engineering in the early 2000s, largely in response to NASA events in 1999 and 2000. Resilience engineering is also a community, largely made up of practitioners and researchers from other practically-focused fields of study, such as human factors and ergonomics, cognitive systems engineering, safety science, operations research, control systems, and several other disciplines. Seven Resilience Engineering Association symposia have been organised during this time, the proceedings of which are available online.
Allspaw argued that very little theory within this domain is presented that doesn’t emerge from studies of real work; resilience engineering operates within high-stakes domains such as aviation, construction, surgery, military agencies, and law enforcement, and most recently within software engineering. He shared a list of key figures within the discipline of resilience engineering, and also name-checked several software engineers that have recently begun working within (and formally studying) this field, such as Nora Jones, Casey Rosenthal, Jessica DeVita, J. Paul Reed and himself.
Next, he presented a model that can be used as a lens through which to look at how we typically interact with systems. In the following diagram the “line of representation” separates the system artifacts, shown below the line, from the system cognition, shown above. At the bottom right of the diagram we can see the code, the database, and the users of the products and services. In the bottom left are the tools we use to make the services work, and this is where engineers spend a lot of their time. However, if we focus solely here we may miss where the real work happens -- above the line.
Moving above the line of representation is where we find all the people who interact with, or operate, the system. Each person, or group, has a mental model of the system, but these models differ depending on the responsibilities and focus of the people involved; no two people have the same representation of the system. People perform their cognitive work above the line, using the models they have constructed, and they interact with the system below the line via a series of representations -- not the “things” themselves. The only information we have comes from these representations, and all interactions use them. In some sense, the things below the line don’t technically exist: we never “see” the code execute, we never “see” the system actually work, and we never actually touch the system.
You manipulate a world you cannot see via these series of representations, and this is why you need to build these mental models [...]
It’s not the world below the line that’s doing it; it’s your conceptual ability to understand the things that have happened in the past, and the things you are doing now... why you are doing those things, and what matters at the time, and why what matters, matters.
Allspaw argued that resilience engineering and related disciplines have adopted the idea that the real “work” happens above the line. The cognitive activities, both individual and collective, are what make a system and its related organisations work. Although the domains that have embraced resilience engineering, such as aviation and law enforcement, have been using this lens for some time, it is a new concept for software engineering. Much can be learnt by applying resilience engineering to software engineering and delivery.
When we look closely at incidents within software delivery we find that many people “show up” in order to attempt to resolve the situation. Each person brings their own ideas, knowledge and views, based on factors such as their tenure, domain expertise, and past experience. This provides multiple perspectives on what “it” is that is happening, what can and cannot be done to “stem the bleeding”, who has the authority to take certain actions, and what should not be tried in attempting to mitigate or repair the system. There are often multiple threads of activity -- e.g. problem detection, hypothesis generation, coordination, etc. -- which are sometimes conducted sequentially and sometimes concurrently. Only with the benefit of hindsight is it possible to identify some of these activities as productive and some as unproductive.
Incidents are messy. Unfortunately they don’t follow the “step 1, step 2…” that you saw at the conference or read in the book.
Allspaw argued that it is beneficial to study incidents in order to understand cognitive models and decision-making processes, as dealing with this type of situation typically involves time pressure and high consequences. This is not “debugging” or “troubleshooting”, as these terms aren’t sufficient to describe the approach required to resolve such issues. The more formal term would be “anomaly response in dynamic fault management”.
In defining what “resilience” means, Allspaw shared several things that resilience is not: preventative design, fault tolerance, redundancy, chaos engineering (although there is a complementary fit in relation to the need to run experiments in order to understand how the system reacts to failure), something about software or hardware, or a property that a system has.
Resilience is aimed at setting and keeping conditions such that unforeseen, unanticipated, unexpected and fundamentally surprising situations can be handled.
In more detail, resilience is: proactive activities aimed at “preparing to be unprepared, without an ability to justify it economically”; sustaining the potential for future adaptive action when conditions change; and something that a system does, not something it has. Another way of thinking of resilience is “sustained adaptive capacity”, or as Richard Cook defined it, “being poised to adapt”.
Finding sources of resilience means finding and understanding cognitive work. Allspaw argued that “all incidents can be worse”, and resilience engineering asks “what are the things (people, maneuvers, knowledge etc.) that went into preventing it from being worse?”. Quite often when incidents are reviewed the “Safety-I” perspective is to “find and fill all of the gaps”, whereas the “Safety-II” perspective asks “what are the things that normally prevent these incidents from happening?”. The answer to this question can be labelled “adaptive capacity”.
Allspaw recommended searching for adaptive capacity by finding incidents that have a high degree of surprise but whose consequences were not severe. We can then look closely at the details of what went into making the incident not nearly as bad as it could have been (and we should also explicitly acknowledge and protect the sources that we find). Looking at incidents with severe consequences is not recommended, as scrutiny from stakeholders with face-saving agendas tends to block deep inquiry. With “medium-severity” incidents, the cost of getting details and descriptions of people’s perspectives is low relative to the potential gain.
Contextual sources that can provide information about decision making, processes, and potential for adaptive capacity include the following:
- Esoteric knowledge and expertise in the organisation
- Flexible and dynamic staffing for novel situations
- Authority that is expected to migrate across roles
- A “constant sense of unease” that drives explorations of “normal” work
- The capture and dissemination of near-misses
In summary, resilience is something a system does, not something a system has. Creating and sustaining adaptive capacity in an organisation, while being unable to justify doing so specifically, is resilient action. How people -- “the flexible elements of the system” -- cope with surprise is the path to finding sources of resilience. Quoting Cook, Allspaw concluded the presentation by stating that “resilience is the story of the outage that didn’t happen.”
The slides for Allspaw’s talk “Amplifying Sources of Resilience: What Research Says” [PDF] can be found on the QCon London website, and the video of the presentation will be made available on InfoQ over the coming weeks.