Key Takeaways
- Incidents are highly variable events; a company’s incident response practices need to be clearly defined but also adaptable to fit the demands of each event
- Engaging roles from across the organization during an incident increases the adaptability and efficiency of the response by ensuring that multiple stakeholders’ needs and capabilities are represented
- Groups need shared language and tools to enable strong interdisciplinary communication
- Building practices that provide resources and support to frontline responders sustains their ability to be effective during an incident and for future incidents
- Incidents are inevitable; dedicated humans (incident command or otherwise) with expertise or focus on response are vital to continually improving and adapting incident response processes
To most software organizations, Covid-19 represents a fundamental surprise: a dramatic surprise that challenges basic assumptions and forces a revision of one’s beliefs (Lanir, 1986).
While many view this surprise as an outlier event to be endured, this series uses the lens of Resilience Engineering to explore how software companies adapted (and continue to adapt), enhancing their resilience. By emphasizing strategies to sustain the capacity to adapt, this collection of articles seeks to more broadly inform how organizations cope with unexpected events. Drawing from the resilience literature and using case studies from their own organizations, engineers and engineering managers from across the industry will explore what resilience has meant to them and their organizations, and share the lessons they’ve taken away.
The first article starts by laying a foundation for thinking about organizational resilience, followed by a second article that looks at how to sustain resilience with a case study of how one service provider operating at scale has introduced programs to support learning and continual adaptation. Next, continuing with case examples, we will explore how to support frontline adaptation through socio-technical systems analysis before shifting gears to look more broadly at designing and resourcing for resilience. The capstone article will reflect on the themes generated across the series and provide guidance on lessons learned from sustained resilience.
Mike Tyson once said, "Everyone has a plan until they get punched in the mouth."
Getting punched in the face is about as surprising an event as one can have. Metaphorically speaking, as incident commanders in a modern IT company, we get punched in the face all the time. It’s part of the job description. And it is a painful truth that planning gets you in the ring and to the final bell, but avoiding a pummeling in between those points is all about how you bob and weave - adapting to the surprises thrown your way.
Incidents are fast paced and have many elements that are unknowable beforehand. Every incident is different. We can’t plan for every aspect of them. Even if it’s an incident that has happened before, there are different people and parts of the system involved with different pressures and priorities at play. Creating a process that gives responders enough structure to be successful, but is not unwieldy or overly dictatorial - thereby limiting the ability of responders to adapt in real time - is a challenging edge to walk. We’re not offering a programmatic approach to building an adaptive incident response structure but we will outline some of our philosophies and core practices that have resulted in successful responses as well as provide examples from the times we’ve been "punched in the face" by an incident.
Previous articles in this series have outlined adaptive capacity as a source of resilience. This article zeros in on the sources that comprise most of your company’s adaptive resources: your frontline responders. In this article, we draw on our experiences as incident commanders with Twilio to share our reflections on what it means to cultivate resilient people.
Incident Command: What Would You Say You Do Here?
Common notions of an incident commander (IC) can have some very sharp-edged and specific connotations: authority, delegator, decider. At Twilio, while we hold the title of "incident commander," we approach the role very differently. Our ICs are experts in our incident management process and tools, but our response team members are the experts in their respective products and systems. Our shared goal is to efficiently facilitate all aspects of the incident for an optimal customer experience. We have found we function best as a response team by approaching incidents with a servant leadership mindset and coaching responders, not commanding them into action. Because of our different but related roles, how we exercise authority, delegate tasks, and make decisions is relative to the nature of the problems faced. Others can, and will, have better reasoning, insights, and, at times, greater authority than ICs in particular domains. Part of our role as leaders is to know when and how to adjust as the incident unfolds. A similar approach is taken in High Reliability Organizations (HROs), which cultivate and engage expertise from across the organization, then defer to the most appropriate expertise during problem solving (Weick & Sutcliffe, 2007).
So how, in an ambiguous incident, do you know what the most appropriate expertise for the problem is? Or, in a company of tens or hundreds of thousands, who holds that expertise? In Twilio’s model of incident response, the answer to both questions lies in how we design for coordinative efforts. Incident commanders are incident experts, but not experts in the subject of the incident at hand. They therefore need a diverse set of skills and capabilities to rapidly form a collaborative team that can make sense of what is happening and how to repair it. That is where incident owners come in. This role knows the area of impact and understands who and what we’ll need to resolve it. However, they cannot help direct incident response operations while also trying to fix the problem. The incident owner isn’t implementing, they’re delegating. These two roles work collaboratively.
In our world, the incident owner role is filled by an engineering manager, product owner, tech lead, or principal engineer. This is a person with knowledge, experience, access, and a strong sense of the network of available frontline responders. In other words, part of their skill set is knowing who to recruit and how to get them involved. The frontline responders being paged into the incident are the people who dig to find what’s gumming up the gears: typically site reliability engineers (SREs), devs, or ops. Our teams of frontline responders are not just SREs. We also include customer-facing representatives providing customer examples and writing the status page, as well as InfoSec/Security, whose job is to keep us compliant and safe. This diversity of roles and perspectives means we get a more complete view of both the problems faced (current and emerging) and the resources available to address them. It always takes finesse to direct the response smoothly with an ad hoc group who may never have worked together before, especially during high-pressure, time-constrained events.
So, if we don’t "command" our incidents, then why don’t we call ourselves facilitators, coordinators or incident coaches? It is a useful question to ask and there’s a somewhat surprising answer!
As mentioned, we’ve worked hard to establish a culture of supporting our responders so we don’t need the title to impart authority downward. However, it can work to impart authority upward. The title of incident commander is an ambiguous one in most organizations - is a "commander" more or less senior than a director? A manager? A VP? While there is ambiguity, it’s clear there is authority inherent in the title, which aids a timely response in a crisis situation by immediately providing access and opening up dialogue with outside stakeholders that otherwise may need to be run up and down chains of command. Commander as a title is more of a shield to deflect and coordinate other, more senior stakeholders during an incident. In other words, when leading the response, you are coaching incident responders and commanding engineering directors or executives. So we refer to ourselves as incident commanders, but make it clear to those on the frontline, responding to an incident, that we’re there to help get them through this incident.
Our incident command may not be a traditional business hierarchy, but it represents the viewpoints of different aspects of the organization. If frontline responders are at ground level on the front lines, incident owners have a 1,000-foot view, and incident commanders are hovering at 10,000 feet. Engaging and coordinating a variety of roles across levels of the organization and deferring to their level of expertise to meet problem demands is a form of dynamic reconfiguration found in the theory of graceful extensibility (Woods, 2018), and it helps us drive successful incident resolution.
Navigating Incidents
Some say the first step to solving a problem is admitting you have one. Once you acknowledge the problem, the next step is to find the correct people to help solve it. Sounds simple, so why is it so hard?
What makes it an incident
Many companies struggle with defining an incident. To us, an incident is when a service or feature functionality is degraded. But defining "degraded" contains a multitude of possibilities. One could say "degraded" is when something isn’t working as expected. But what if it’s better than expected? What’s the expected behavior? Do you define it based on customer impact? Do you wait until there’s customer impact to declare an issue an incident?
This is where having a common, shared understanding of the normal operating behavior of the system, and formalizing it in feature/service level objectives (SLOs) and indicators (SLIs), is key. We have to know what we expect in order to know when a degradation becomes an incident. But defining service level objectives for legacy services already in operation takes a significant investment of time and energy that might not be available right now. That’s the reality in which we frequently operate, trading off efficiency with thoroughness, as Hollnagel (2009) points out. We handle this tradeoff with a governing set of generic thresholds to fill in for services without clear indicators.
At Twilio we have a lot of products, running the gamut from voice calls, video conferencing, and text messages to email and two-factor authentication. If you think these different services can share generic incident thresholds, please pause here and imagine sad laughter followed by a tired sigh. The two of us spent a considerable amount of time banging our heads against an attempted unified severity matrix for all the different products above. Why couldn’t we figure this out? We did a lot of research and asked for advice from others in the field. But we struggled to find anyone else’s homework to copy, until we came to the painful realization that maybe these things don’t fit together because they don’t fit together.
Instead, we shifted gears to try and define what we call "sane defaults," or things that feel reasonable to declare incidents: (internal or external) customer-driven impact, functional state determinations, and business-level non-negotiables (like public Service Level Agreements). These serve as a baseline that teams without established SLIs can use to point towards an incident. But the truth is, the sane default we try to impress upon our responders is simpler: err on the side of caution. Just declare an incident. Worst case scenario, we realize nothing is broken and learn a little bit more about our systems. Best case scenario, we find something is broken and have the right people ready to solve it. We’re all familiar with the phrase "innocent until proven guilty"; this is its weird cousin, "incident until proven incidental."
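To make the idea concrete, here is a minimal sketch of how an incident-declaration check might combine a defined SLO with these sane defaults. The field names, the generic five-percent threshold, and the `should_declare_incident` helper are illustrative assumptions for this article, not our actual tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceHealth:
    service: str
    error_rate: float                          # fraction of failed requests over the window
    slo_error_budget: Optional[float] = None   # None if the team has no SLO defined yet
    customer_reports: int = 0                  # (internal or external) customer-driven impact
    sla_at_risk: bool = False                  # business-level non-negotiables, e.g. public SLAs


def should_declare_incident(health: ServiceHealth,
                            generic_error_threshold: float = 0.05) -> bool:
    """Err on the side of caution: incident until proven incidental."""
    # Customer-driven impact or an SLA at risk is always an incident.
    if health.customer_reports > 0 or health.sla_at_risk:
        return True
    # If the team has defined an SLO, compare against it directly.
    if health.slo_error_budget is not None:
        return health.error_rate > health.slo_error_budget
    # Otherwise fall back to a governing generic threshold ("sane default").
    return health.error_rate > generic_error_threshold
```

The point of the sketch is the precedence: explicit impact first, team-defined objectives second, and a generic fallback only where no better signal exists yet.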
In optometry, there’s a peripheral visual field test where you stare at a center point on a screen and press a button when you see moving blurry lines. The blurry lines in your periphery aren’t what you’re focused on, but you need to press the button when you see them. The same goes for the line between bugs and incidents: it will be faint and blurry until some degree of service level objectives and indicators are defined. If you’re unsure when you see it, raise the alarm. Our systems are always changing, so these thresholds will likely shift too; we will need new ones, and we will need to retire or recalibrate existing ones. Given that this is a source of uncertainty, our approach is to define some structure that prescribes action through the SLAs while also accounting for the uncertainty. We do this by supporting our frontline responders in escalating to an incident without fear of doing so.
Figuring out who to page
Once an incident has been declared, the challenge of navigating it remains. There are multiple schools of thought about who to page when an incident is declared. You could alert everyone on call: raise the alarm, let those it pertains to respond, and have those outside the area of impact keep an eye on it just in case. Or you could page only one team, or a group of teams; maybe the teams whose alerts triggered the issue, or perhaps the team that owns the service most likely to be impacted. The former could easily contribute to alert fatigue: if you’re paged for every incident, the pages start to feel less urgent. It works better for simpler organizations, because the fewer the teams and the less complex the systems, the more likely it is that everyone will need to be involved to some degree. The latter works well for large organizations where paging every on-call for every incident is untenable. But the more complex the systems, the more complex it is likely to be to identify which team should be paged.
There isn’t a right or wrong way to page teams for incidents. Paging everyone is noisy, but it ensures visibility and engagement from a variety of teams who may be required to assist in understanding the problem. Paging one team keeps incident pages urgent; it focuses the response on those with specific subject matter expertise on a given issue and disrupts work less overall. Is it harder to disengage a swarm of engineers when they’re not needed? Or harder to pull in additional on-call responders who can help identify where the problem could be originating? Weighing the pros and cons of who to page is personal to every organization and should be informed by several points of data: how frequently incidents are occurring, how current alerts break down per team, the maturity of the organization, the variety of products, and their level of interdependence. Experiment with rotations, grouping services, or establishing service-to-team maps for your more complex architecture; whatever you choose, continue to inspect whether the avenue you’ve chosen is successfully assembling the team you need while preserving your team’s quality of life.
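One way to express that tradeoff is a service-to-team map with a broad-page fallback for when ownership is unclear. The team and service names below, and the fallback policy itself, are hypothetical; your own data on alert volume and interdependence should drive the real mapping.

```python
from typing import Dict, List, Set

# Hypothetical service-to-team routing; names are illustrative only.
SERVICE_TO_TEAMS: Dict[str, List[str]] = {
    "voice": ["voice-oncall"],
    "video": ["video-oncall"],
    "messaging": ["messaging-oncall", "carrier-relations-oncall"],
    "email": ["email-oncall"],
}

# When ownership is unclear, trade noise for visibility and page a wider group.
BROAD_PAGE: List[str] = ["platform-oncall", "networking-oncall", "infosec-oncall"]


def teams_to_page(impacted_services: List[str]) -> Set[str]:
    """Page owning teams when we know them; page broadly when we don't."""
    teams: Set[str] = set()
    for service in impacted_services:
        teams.update(SERVICE_TO_TEAMS.get(service, []))
    return teams or set(BROAD_PAGE)
```

Whichever shape you choose, revisit the map as services and teams change, and check whether it is actually assembling the responders you needed.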
Finding an owner (Incident Hot Potato)
Companies who have experienced incredible growth have many products, systems and teams, and in turn, the potential for complicated dependencies. A side effect of this very positive growth problem is that the complexity of the domain, especially during failure cases, can make it hard to find the owning team who can best help drive the incident. As we described earlier, responders in the incident owner role, who are often more senior in the organization, are well positioned to help make these designations.
To minimize what we like to call "Incident Hot Potato," we ask our first-paged responding team to stay engaged with the incident. The incident owner and paged team start with the basics, like working through published runbooks, checking logs of obviously broken things, and helping formulate customer communications. Meanwhile, the incident commander drives the effort to get the correct team(s) engaged. As a side note, we leverage checklists, which outline the expectations for each role during and after incidents. Having a common understanding of what steps are required to be successful in the role is helpful for responders and has the knock-on effect of allowing the handoff to focus on the signs, symptoms, and any technical triage that has been undertaken. The outcome is a better understood, but not yet fully solved, incident that can be more easily handed off to the appropriate team.
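As a rough illustration of what those checklists capture, here is a sketch of per-role expectations expressed as data. The items are paraphrased from the practices described in this article; they are not our actual checklist text.

```python
# Per-role expectations, paraphrased from the practices in this article.
ROLE_CHECKLISTS = {
    "incident_commander": [
        "Confirm severity and get responders onto a video call",
        "Verify an incident owner is engaged",
        "Drive engagement of the correct team(s)",
        "Keep a real-time timeline of events, decisions, and dead ends",
        "Field questions from senior stakeholders",
    ],
    "incident_owner": [
        "Identify the area of impact and who/what is needed to resolve it",
        "Work through published runbooks and check logs of obviously broken things",
        "Help formulate customer communications",
        "Delegate the fix; do not implement it yourself",
    ],
    "frontline_responder": [
        "Investigate assigned leads and report findings in plain language",
        "State and compare assumptions",
        "Hand off with signs, symptoms, and any triage already undertaken",
    ],
}
```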
Get together on video
We know there are engineers who will vehemently disagree with us on this one, but we’re saying it anyway: get on a video call. It’s surprising how often this is not done when an incident is created. Talking is faster than communicating via chat; speech is synchronous communication at its most direct. Responders can catch each other up, identify routes to investigate, and report on dashboards much more quickly. There is also the handy side effect of establishing an inherent trust between those working on an incident together. Studies show that humans connect with each other better when we can see and hear each other (Schroeder, Kardas & Epley, 2017). We can recognize more of the nonverbal cues of intent, tone, and engagement over video.
Before quarantine distributed most of our workforce into bedrooms and home offices across the world, we would pile into one conference room to work on an incident together. One of our offices even has a standing "war room" that can be taken over at any point to host incident responders. A war room doesn’t need to be special: a conference room with TVs for screen sharing, whiteboards for brainstorming, and enough space for incident responders to pile in side by side works best. We have pictures from past company events where incidents took place: our war room has been a picnic table, the deck of an aircraft carrier, and a spot under a crystal chandelier in a hotel ballroom! It’s not the room, but the act of gathering together that takes incidents from feeling like a problem that you are trying to solve to a troubleshooting effort you are a part of. Video calls are our best approximation of that experience. Incidents are stressful environments, and being able to talk things out with others working on the same problem eases some of that isolation and anxiety.
Even low severity incidents benefit from getting on a video call. In a recent incident there was only a very small group of responders, but the chat communication was stagnant and the customer facing representative wasn’t getting any responses. An incident commander was pulled in and had them all hop onto a video call. It turns out the responders got stuck in a rabbit hole while trying to find the answer to the customer rep’s question and started debugging. While this was certainly helping to move the incident forward, the lack of visibility meant the customer rep was left wondering if anyone was actually working on the incident. In a handful of minutes, everyone was caught up, options were discussed, expectations were set, and the issue itself was mitigated shortly after.
While there are many upsides to hopping onto a video call for incident coordination, there is a glaring downside: the need to actually expend effort to document a timeline of events. Chat transcripts are the easiest way to build a timeline after an incident. When an incident’s troubleshooting and coordination happen over a call, you will need to document the timeline in real time. Luckily, this downside has its own upside. Documenting a timeline in real time means that you don’t have to combat hindsight bias while doing it. It’s all too easy to build a timeline of an incident after the fact, omitting the side quests we took that didn’t pay off. But discussing those options, how they didn’t pan out, or why they were too tricky to continue pursuing is valuable knowledge. This gives us even more interesting information to review after an incident.
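One way to make real-time documentation cheap is to treat the timeline as an append-only log that anyone on the call can write to. The sketch below is a hypothetical minimal version; in practice this is often a chat-bot command or a shared doc, and the class name and methods here are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import List, Tuple


class IncidentTimeline:
    """Append-only, timestamped notes captured while the incident is live."""

    def __init__(self) -> None:
        self.entries: List[Tuple[datetime, str]] = []

    def log(self, note: str) -> None:
        # Record events, decisions, and dead ends as they happen, including
        # the side quests that did not pay off.
        self.entries.append((datetime.now(timezone.utc), note))

    def render(self) -> str:
        return "\n".join(f"{ts.isoformat()}  {note}" for ts, note in self.entries)


# Example usage during a call:
# timeline = IncidentTimeline()
# timeline.log("Rolled back latest deploy; no change in error rate (dead end)")
```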
Use your words
Whether you’re talking or typing, explaining things to someone else helps you understand them better. It also allows other people to ask questions, challenge assumptions and keeps the ball rolling. The last article in this series hit on the importance of stating your assumptions. We wholeheartedly agree that comparing and confirming assumptions is vital during incident troubleshooting. We’d add that it’s important to do so in the simplest language possible. Nobody wants to be the person who doesn’t know something or to be the person who is wrong. But there is no time or space for that during an active incident. You never know which question could be the spark that leads to a solution. There is no such thing as a stupid question, especially when we’re trying to solve a problem where we don’t have all the details.
At Twilio we are a single company, but our ecosystem consists of a large array of different products; we can’t know all of the bespoke acronyms, service names, dashboard titles etc. We are large groups of small teams who talk about the same things in different ways. Talking to each other in an incident as though others might not already know what exactly you’re talking about is the best way to find common ground. It verifies your base understanding, while exposing assumptions you may not know you have. We see the value in this regularly in issues with downstream dependencies or vendor incidents where multiple services are integrated with a vendor product differently.
In incidents with a diverse stakeholder group, simplifying language and cutting through internal jargon is beneficial in two ways: it makes it easier to compare assumptions, and it addresses the very real challenge of customer communications. "Internal" technical and "customer-facing" technical are two different languages. Acronyms, service names, and secret sauce recipes all have to be filtered out. That requires a working knowledge of both internal and customer-facing technical languages. Working with your customer-facing folks to teach them your internal jargon and listening to what they’re hearing from customers helps establish better common ground (Klein, Feltovich, Bradshaw & Woods, 2005; Maguire & Jones, 2020). It needs to be second nature to explain what’s happening technically, and what that means for customer impact. Customer-facing representatives are then better equipped to facilitate the translation into customer-facing terminology. Lastly, the incident may very well become public depending on its severity, length, impact on customers, and disruption to business operations. As such, other business units might be on the call -- such as public relations, government relations, industry analyst relations, and even internal communications for the broader employee base -- whose audiences aren’t technical in nature. Thus, speaking in, or at least interpreting events and developments into, generic, easy-to-understand terms can be important.
Ideally, you are declaring incidents based on SLOs and SLIs and solving the problem before the impact hits your customers. But we’re also here to talk about the reality of incident response. And that reality includes incidents where our customer-facing teams are inundated with chats, tickets, and calls. We need to tell people what’s going on in an incident. We need our customers to know we know something is wrong and we’re working to fix it. We need our internal teams to know, in case they have dependencies or valuable insight that could help mitigate the impact. We need our executive team to have relevant information because they’re likely fielding questions from high-profile customers. Refining and simplifying what we understand about the incident is key to communicating and solving the problem.
A human touch in incidents can take a mess of failing software, kilobytes of log error messages, pages from services and a string of support tickets and instead create a cohesive experience for our customers. In the next section we cover techniques for ensuring your humans are well resourced.
Managing Humans in Crisis
Imagine the scene: it is hour 20 of the incident. The system has not been behaving in a predictable fashion for 18 of those hours. Our team of eight engineers still cannot definitively say whether large swaths of our customers’ data will be recoverable. The "progress" we have made in triaging this issue has put the smoking gun squarely in the court of a questionably supported open source adaptor library written in a language no one on the team is familiar with. Our most talented engineer has taken himself through a crash course in Erlang and is trying to debug code he has never seen before in a language he doesn’t know.
While it is a little short of being the plot of the next Avengers movie, it is an example of some pretty serious incident response heroics.
Ambiguous or uncertain events often require flexibly applying knowledge in novel ways (Klein, 2011), and incident response can benefit from and inspire heroics - and these efforts sometimes pay off! Our coworker digging deep and learning passable Erlang on the fly eventually saved our bacon. One of the reasons he had the energy to show up like a superhero and expend considerable cognitive resources is that once we realized this was "a big one" we shifted our strategy for managing the response. Severe incidents almost always start as sprints but once we realize we are actually running a marathon, it’s critical to adapt your tactics. At this inflection point, it’s necessary to broaden your focus beyond the immediate technical problem and start thinking about sustaining the well-being of your responders.
Successfully working an incident requires creativity, adaptability, and endurance. People are mostly able to meet these demands when they are well-resourced. As incident commanders, we are always keeping an eye out for the needs of our responders, especially during marathon events: making sure the team eats regularly, drinks water, and is able to call for additional help so they can take a minute to breathe, even during high-tempo events. During the epic event mentioned above, we set up a system to manage the incident across this longer timeframe. We started tracking who was engaged, when they had last had a break, and what the succession would look like when it was time for them to take a break (or, you know, sleep).
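A lightweight way to track this is to record when each responder engaged and when they last took a break, and flag anyone who has gone too long. The sketch below is illustrative; the four-hour threshold and the function name are assumptions, not a policy we prescribe.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative threshold; tune it to your own marathon-incident experience.
MAX_CONTINUOUS_HOURS = 4


def needs_relief(engaged_since: datetime,
                 last_break: Optional[datetime] = None) -> bool:
    """Flag responders who have gone too long without a break (or sleep)."""
    now = datetime.now(timezone.utc)
    reference = last_break or engaged_since
    return now - reference > timedelta(hours=MAX_CONTINUOUS_HOURS)
```

Even a shared spreadsheet with the same columns does the job; the point is making responder fatigue visible so succession can be planned rather than improvised.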
Pre-Covid, when there was a lull in activity, we would run to the office kitchen and grab snacks, or order a pizza delivery if incidents stretched into the evening. There are additional challenges with a distributed team: it’s less visible when someone needs to be encouraged to practice good self-care, and it’s harder to notice when folks not used to working in high-stress environments are hitting an emotional limit. This means part of the IC role is to pay keen attention to responders’ demeanors and how they are showing up. Humans who are constantly stressed and under-resourced are well set up to burn out, not just from the incident, but also from the company. What constitutes resilience is not just the moment of the incident, but a longer timeframe. Taking time to recover after incidents and acknowledging the mental and emotional impact that (even well-managed) heroics can invoke can actually sustain overall resilience (VA Healthcare, 2013). If your incidents are chronically demanding, you can exhaust your team’s adaptive capacity. Constant turnover in your incident response team means losing valuable organizational knowledge, which lowers resilience. Creating more slack in your on-call schedule and managing workload to allow for adequate rest sustains your team’s ability to cope with future incidents.
Conclusion
We will hazard a guess that nearly everyone reading this article has been in a postmortem where a more senior, potentially technical (but not always) leader has said something like, "LeTs eNsUrE tHiS nEvEr HaPpEnS aGAin." In response, a laundry list of corrective actions is produced.
Incidents can be embarrassing, stressful, and costly. In the aftermath there is a strong urge to make promises to assure customers that, now that we know more than we did before the incident transpired, we can successfully "get ahead of it" next time. Systems should be continually improved by fixing simple errors, but these practices only help avoid head-slappingly obvious incidents, not *all* incidents. We exist in a dynamic environment. We can modify our systems to avoid exact duplicates of past failures, but we will never "prevent" future incidents, because our context is ever-changing.
"Aspirationally, we do want to avoid catastrophic STKY/STBY events (Stuff That Kills/Bankrupts You), but the sinister thing at play here is the uncertainty that arises with complexity. Making the goal to learn as much as possible about incidents in order to generate insight is what allows us to outmaneuver the complexity that’s coming at us as we continue to be successful, all while minimizing the amount of operational human misery needed to support the system."
A better question is, "During our next, inevitable incident, how can we respond better?" A team’s response is really the only true control one can have in the face of future failures. Therefore, our perspective is that it is not useful to say we need to "stop failing." Instead, as incident commanders and managers, we should be saying we need to "respond to failing better/faster/more efficiently." From our experiences at Twilio, the philosophies and practices outlined above help frontline practitioners adapt to the surprises that challenge our complex, dynamic systems. These philosophies and practices include: proactively identifying the roles and responsibilities of responders, regularly inspecting who is paged and how incidents are defined, getting everyone together and talking, and remembering you’re only human. To do so is to make an investment in the ongoing resilience of your organization and to help your responders make it to the final bell without getting punched in the face.
References
- Hollnagel, E. (2009). The ETTO principle: efficiency-thoroughness trade-off: why things that go right sometimes go wrong. Ashgate Publishing, Ltd.
- Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational simulation, 53, 139-184.
- Klein, G. A. (2011). Streetlights and shadows: Searching for the keys to adaptive decision making. MIT Press.
- Kitchens, R. (2019). Characteristics of Next-Level Incident Reports in Software. Learning from Incidents blog. Retrieved Jan 12, 2021.
- Maguire, L., & Jones, N. (2020). Learning from adaptations to coronavirus. Learning from Incidents blog. Retrieved Dec 30, 2020.
- Schroeder, J., Kardas, M., & Epley, N. (2017). The Humanizing Voice: Speech Reveals, and Text Conceals, a More Thoughtful Mind in the Midst of Disagreement. Psychological Science, 28(12), 1745–1762.
- Weick, K. E., & Sutcliffe, K. M. (2001). Managing the unexpected (Vol. 9). San Francisco: Jossey-Bass.
- Woods, D. D. (2018). The theory of graceful extensibility: basic rules that govern adaptive systems. Environment Systems and Decisions, 38(4), 433-457.
- VA Healthcare. (2013). The Stress Response and How it Can Affect You. Retrieved Jan 25, 2021.
About the Authors
Emily Ruppe has been emphasizing customer focus and human factors in incident response at SendGrid and Twilio over the last five years. Starting in technical support over a decade ago, she developed a reputation as a harbinger of customer found outages and defects, to the point that it was assumed her presence meant something was broken. Ruppe has written hundreds of status posts, incident timelines and analyses. She is a founding member of the Incident Command team at Twilio.
Ryan McDonald has been an advocate for better software practices as an engineer, program manager and now incident commander. After being trained and participating in years’ worth of backcountry search and rescue responses during his time at Outward Bound, software incidents were a welcome (and lower risk) venue to scratch the "responder" itch. McDonald is also a founding member of Incident Command at Twilio.