Exploring Costs of Coordination During Outages - QCon London Q&A

May 22, 2020 10 min read

Key Takeaways

Coordination is unavoidable when managing failures in complex, distributed systems, because multiple, diverse perspectives are needed to bring together the different skills, knowledge and experience that help resolve challenging outages.
These perspectives represent both direct responders and indirectly impacted parties such as users and other stakeholders. All must be managed jointly to address both acute issues and chronic goals (such as customer satisfaction).
Distributed incident response incurs costs of coordination. Controlling the costs of coordination is both challenging and crucial.
Diverse, non-collocated parties need to be arranged in the right collaborative interplay to provide the benefit of diversity with low costs.
Costs of coordination are found in both human-human teams and in human-computer interactions. Designing for coordination can limit costs of both.

Coordinating different skills, knowledge and experience is necessary for coping with complex, time-pressured events, but it incurs costs. Well-designed coordination is smooth and can be trained for. Learning how to take initiative, being observable to your counterparts and engaging in reciprocity are examples of strategies engineers can use to lower costs of coordination during outages.

Laura Maguire, cognitive systems engineer & researcher, spoke at QCon London 2020 about her research on the costs of coordination during outages.

InfoQ interviewed Laura Maguire about the challenges of coordination during outages and the costs that come with them, helpful and harmful patterns of coordination, the skill set needed for the coordination and how to develop those skills, and how efforts to control the cognitive costs of coordination are related to resilient performance.

InfoQ: What are you investigating in your doctoral research?

Laura Maguire: This work is based on a three-year engagement with the SNAFU Catchers Consortium, where we worked with a number of organizations that operate critical digital infrastructure to study incident response and the cognitive work of DevOps engineers.

At its core, my research is about how to design better coordination across human and machine teams.

On the human-human teaming front, technological advances have enabled typically co-located activities to be distributed across time and space. But anyone who has ever been on a conference call where the connection is poor and video or audio is lagging can attest to the limitations of these technologies. This kind of friction, along with other ‘losses’ from not being co-located (like being able to see if someone is able to be interrupted or is free to help you) can add cognitive load and coordinative costs. In day-to-day operations this is merely irritating, but in time-pressured, high demand circumstances like dealing with a service outage on a business critical function, this additional load matters greatly.

On the human-machine team side of things, the integration of automated or intelligent ‘co-workers’ has added a layer of complexity to managing large scale systems that run at speed. Automating repetitive functions (particularly ones that need to be done quickly) can lessen toil and take advantage of machine capabilities in very useful ways. However, unless you have explicitly designed for observability and coordination with these tools, they can also be an unintended source of cognitive and coordinative burden. As part of my research, I thought about automation as a co-worker - a member of the team with specific skills and interests but limited ability to communicate about more than its core functions and even less ability to coordinate its interactions with the rest of the team. It had a profound effect on how I think about designing these tools to be a better team player.

InfoQ: What are the challenges that come with coordination during outages?

Maguire: When there is a disruptive event many people are impacted - both end users who rely on the service, and the engineering teams tasked with reliability. As disturbances spread, consequences expand and uncertainty grows, more and more roles, levels, and organizations get involved, which drives up the tempo of the incident response.

More roles provide more resources, but they come at the price of additional demands to coordinate activities, share information across perspectives, and integrate diverse models of how the system functions and malfunctions and the changing dynamics of the event over time. These additional activities represent additional ‘costs’ in the sense of mental effort to responders. Under the typical demands of incident response, these additional costs can have significant impacts.

Thus, understanding how to run distributed incident response in ways that control the costs of coordination is both challenging and important.

InfoQ: What are the hidden costs of this coordination?

Maguire: The costs are additional cognitive effort and activity involved in joint activity (again, with other people or with automated ‘team members’).

For example, research has shown that multiple, diverse perspectives bring different skills, knowledge and experience that, in collaboration, can be crucial to resolving particularly challenging outages. However, in order for those people to be useful, the person looking to recruit them incurs additional costs: determining what skills are needed, who has them, whether they are available, what information they will need to come up to speed, and how to get hold of them. All of this happens before they even make the request for help!

Some of these efforts are recognized and we can build cognitive aids, such as organizational charts that include competencies or chat bots that can page people out. However, until now there hasn’t been a full accounting of this kind of effort in incident response (or, to a large extent, in other first response teams), so these kinds of costs can get ‘hidden’. When the costs become too high we see people adapt to control them. This often manifests as ‘dropping out’ of joint activity, as when a group of responders leaves a group audio bridge and uses a ‘side channel’ with a smaller group to follow up on a promising lead.
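As an illustration of the kind of cognitive aid mentioned above, a competency directory can answer "who has the skill and is free right now?" in one lookup. This is only a minimal sketch and not something described in the research: the `Responder` record, the `find_responders` helper, and the sample names and contact handles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Responder:
    """Hypothetical directory entry; every field name here is an assumption."""
    name: str
    skills: frozenset
    available: bool
    contact: str  # e.g. a chat handle or pager alias


def find_responders(directory, needed_skills):
    """Return available responders who cover at least one needed skill,
    ordered by how many of the needed skills they cover (most first)."""
    needed = set(needed_skills)
    matches = [
        (len(needed & r.skills), r)
        for r in directory
        if r.available and needed & r.skills
    ]
    # Sort by coverage (descending), then by name for a stable order.
    matches.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [r for _, r in matches]


# Made-up sample data for illustration only.
directory = [
    Responder("ana", frozenset({"postgres", "networking"}), True, "@ana"),
    Responder("bo", frozenset({"postgres"}), False, "@bo"),
    Responder("cy", frozenset({"kafka", "networking"}), True, "@cy"),
]

candidates = find_responders(directory, ["postgres", "networking"])
for r in candidates:
    print(r.name, r.contact)
```

A lookup like this only removes part of the recruiting cost. Deciding which skills are needed and bringing the recruit up to speed still fall to the person asking for help.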

InfoQ: What are the helpful and harmful patterns of coordination?

Maguire: Well, what is helpful and what is harmful really depends on whose perspective you are taking. It’s important to recognize that these patterns are adaptations to cope with multiple competing demands. During an incident there are always a number of competing demands on a finite amount of attention. People adapt to prioritize the most important efforts. As an example, it’s very common in poorly designed work systems to see support engineers not communicating with their users. Anyone who is on Twitter knows that when you are having an outage and you’re not communicating with your users, you are going to make a LOT of people very unhappy and probably compound the problem, because by not communicating, you are now dealing with a difficult outage AND outraged users who might be publicly bashing your service while simultaneously flaming your boss. The impacts can cascade and now you are dealing with three problems – the outage, the damage to the reputation and the additional pressure of a stressed out manager who keeps interrupting you looking for updates (or worse yet, starts directing the incident response without context).

This means it is always useful to try to understand the pressures and constraints that might be driving a locally productive but ‘globally’ counterproductive action or behaviour. That being said, I tend to see ‘harmful’ as being in the eye of the beholder and I have seen some really excellent strategies from the high performing teams we studied, which I presented in my talk at QCon London on Exploring Costs of Coordination During Outages. At a high level those are:

Investing time and effort to establish and maintain common ground (shared knowledge, beliefs and assumptions of the problem).

Designing a response play based on the choreography or the ‘movements’ of coordination instead of a command and control structure.

Focusing efforts on coordination ‘at the boundaries’ (this means across teams or vendors).

Always be learning. Design for lightweight, real time model updating to occur and make space for proper post-mortems that are accessible to different kinds of roles in your company and are shared broadly.

Continually investing in the people you expect to not only keep the system running but get it back up when it inevitably falls over.

InfoQ: What is the skill set needed for the coordination? How can people develop those skills?

Maguire: I’ll start by coming back to my earlier point, which is absolutely fundamental to my research - smooth coordination should be designed for first, then trained for. The best responders working with the worst coordination design will always struggle.

I saw this firsthand multiple times when engineers highly skilled in adaptive coordination hit the wall of poor (or no) coordination design. Despite their best efforts, the costs of coordination became too high and the joint activity broke down, leading to substantial delay in resolution and anger & distrust amongst responders. So design first is key. The software industry has a massive advantage over other domains that face similar coordination demands because it has the ability to develop the tools that are most critical to aiding coordination and collaboration.

As important as design is, it’s also clear that being a skilled collaborator is fundamental to being a skilled DevOps engineer. It’s implicit in all aspects of modern software engineering, from social coding to continual learning to incident response.

Of course, there’s a suite of tactical techniques that help an individual or team become proficient at coordination, but strategically, things like learning how to take initiative, being observable to your counterparts and engaging in reciprocity have deeply powerful effects. At a strategic level, the high performing teams we studied recognize that coordination is critical but that it doesn’t come for free. Consequently, they invest in understanding it through research, reflection and continually adapting their practices to improve. This was a core commitment of the partners who were involved in the second cycle of the Ohio State SNAFU Catchers Consortium. That cycle focused on controlling the costs of coordination, and the report (which will be out in early summer 2020) has a lot of helpful guidance in it.

As a starting point for readers wanting to further develop their skills, one key practice I saw in the high performing teams I studied for my research related to learning from their incidents.

At the core, they invested in learning about their system and the ways it breaks and degrades across its dependencies, in learning which people might have useful insights to contribute to different kinds of problems, and in experimenting continuously.

What I mean by ‘investment’ is they built in the capacity for continual reflection on their practices despite the demands and time constraints inherent in running large scale systems undergoing continuous change.

There were formal ways of doing this, for example, by consistently making time for meaningful blameless post mortems that were open to the whole company to attend, and by developing training and mentorship for incident responders. There were also informal methods, such as cultivating a culture that encourages reflection on how difficulties get handled well across the response team.

InfoQ: How are efforts to control the cognitive costs of coordination related to resilient performance?

Maguire: At a high level, we see evidence of engineers controlling for the cost of coordination in several ways - they eliminate coordination (by doing the thing themselves), they push it out in time (deferring responses to a less demanding moment), they degrade their coordination (interacting at lower quality) or, somewhat ironically, they recruit additional resources to help manage the additional demands even though making those resources useful requires more effort. To an outsider, these ways of coping with high costs of coordination can appear to be anti-patterns.

For example, many incident response protocols require responders to work problems in a shared channel (or on an audio bridge), but if there are a lot of people and multiple concurrent efforts underway, that channel can become a source of distraction that is counterproductive to participate in. So, two responders might begin direct messaging each other to work a specific problem independently of the larger team. The argument against this is that, by controlling their own cost of participating in the larger effort, they degrade coordination with the main response team. But closer examination of the cognitive work of software engineers revealed that these side channels were necessary and actually a net benefit to being able to respond quickly.

The problem was not the side channel itself, but rather the lack of a mechanism to give others observability into the side channel’s activities, and that lack is what disrupted the main group. Resilient teams found strategies for allowing ad hoc, side channel groups to quickly come together around a specific problem and also keep that work integrated with the overall response effort.
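One way to picture such a mechanism is a small registry that mirrors side channel activity back to the main response channel. The interview does not describe any specific tool, so treat this as a sketch under assumptions: the `SideChannelRegistry` class, its `open`/`report`/`close` methods, and the `announce` callback that stands in for posting to the main channel are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SideChannel:
    """Hypothetical record of an ad hoc group working one specific problem."""
    topic: str
    members: List[str]
    findings: List[str] = field(default_factory=list)


class SideChannelRegistry:
    """Keeps the main group informed of who is working what, and what they found."""

    def __init__(self, announce: Callable[[str], None]):
        # `announce` stands in for posting a message to the main channel.
        self._announce = announce
        self._channels: Dict[str, SideChannel] = {}

    def open(self, topic: str, members: List[str]) -> None:
        self._channels[topic] = SideChannel(topic, list(members))
        self._announce(f"[side channel opened] {topic}: {', '.join(members)}")

    def report(self, topic: str, finding: str) -> None:
        # Record the finding and mirror it to the main channel immediately.
        self._channels[topic].findings.append(finding)
        self._announce(f"[{topic}] {finding}")

    def close(self, topic: str) -> SideChannel:
        channel = self._channels.pop(topic)
        self._announce(
            f"[side channel closed] {topic}: {len(channel.findings)} finding(s) shared"
        )
        return channel


# Illustration with a plain list acting as the main channel's message log.
main_channel_log: List[str] = []
registry = SideChannelRegistry(announce=main_channel_log.append)
registry.open("db-replication-lag", ["ana", "cy"])
registry.report("db-replication-lag", "replica 2 is 40s behind primary")
registry.close("db-replication-lag")
print("\n".join(main_channel_log))
```

The design choice here is that the side channel stays small and focused, while every open, finding and close is echoed to the wider group, so the work remains integrated with the overall response.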

About the Author

Laura Maguire is a researcher producing human-centered design guidance for Jeli.io. Her doctoral work studied distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017-2020, and her research interests lie in resilience engineering, coordination design, and enabling adaptive capacity across distributed work teams. Maguire has a Master’s degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering, with minors in Resilience Engineering and Design. As a backcountry skier and alpine climber, she also studies cognition & resilient performance in mountain environments.
