How Many Is Too Much? Exploring Costs of Coordination During Outages

Summary

Laura Maguire uncovers the hidden costs of coordination, and shows how resilient performance is directly tied to coordination. Maguire examines problematic elements of an Incident Command System, using case study examples to describe helpful and harmful patterns of coordination in incident response practices.

Bio

Laura Maguire is a researcher at Jeli.io. Her doctoral work studied distributed incident response practices in critical digital services. She has a Master's in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering, with minors in Resilience Engineering & Design. Her research interests lie in resilience engineering (RE), coordination design, and cross-functional adaptive capacity in distributed work.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Maguire: I completed my PhD at The Ohio State University. Then I started work recently at a small tech startup where I get to build off some of the ideas that I'm going to share with you today. This is a really exciting time for thinking about coordination in software engineering. The central idea of what I'm going to talk about here today at QCon is that we have a problem. It's a coordination problem. It's one that's hiding in plain sight in the work that you do every day. To prove my point, I want to do a little bit of an exercise here before we get into things. First, I want to get a sense of who is in the room. If you are a site reliability engineer, or you're responsible for service delivery of some sort, you manage a team that's responsible for uptime, put up your hand? If you are someone who is dependent upon someone else's service being reliable and being up, put up your hand?

Apollo 13 Mission Control

Given that almost everyone in the room put up their hand, that's what I'm saying: coordination is a problem. It's a big problem. It affects all of us. What I want to do to prove this point is show you a picture of an incident that's in progress. It's underway. I want you all to help me deconstruct it a little bit. I want you to shout out what it is that you see going on in this picture that is helping these responders coordinate their event and manage the crisis. This is a photo from NASA's mission control shortly after the oxygen tanks exploded on Apollo 13. This photo shows us a lot about how people work together during emergency situations. What is it that you see? What's helping them coordinate in this photo?

Participant: They have the same information.

Maguire: They have the same information. There are some shared frames of reference in terms of the displays they're looking at down here. Some of them are looking at displays higher up. What else?

Participant: Headsets.

Maguire: Headsets. We have the immediate action that's taking place right in front of them. Then there are people that they are coordinating with. Either they're receiving information or they're transmitting information to other folks doing other activity in other areas. What else?

Participant: They have specialisms.

Maguire: They have specialisms?

Participant: Different people with specialist consoles, with special information for them.

Maguire: Because they're co-located, we can see multiple different groups. We've got the folks up front. We got a group in the back. We got a little sidebar conversation over here, which is, in Slack, you got threads running all over, DMs happening. We have multiple types of activities and of responders acting in the same room. What else?

Participant: Only a few people in the controls.

Maguire: There are only a few people in the control room. That's interesting. We have different groups, but there's only a few of them. Related to that, look at the glass in the back wall there, so there are only a few people in the room. It implies that there are a lot of people that can look in on the room. There might be something important and meaningful about being able to monitor the activity that's going on. How quickly is it happening? Who's there? What are they doing? What are they looking at? The design and coordination of this room, of this layout really matters.

Another Control Space

I'm going to show you another control space that might be a little bit more familiar to you. There is an incident going on here, in a very different control room. Here, or even here. There is quite a contrast between this designed, co-located environment and these distributed virtual control rooms. The benefit of the designed coordination that we see in the NASA control room is in how attention is focused in different ways, and in having the right types of people in the room for the right types of activities. You can tell if someone can be interrupted. You can see how quickly or how slowly they're moving. All of this gives us a lot of information about coordination. In that room, you get a lot for free in coordination. In the virtual world, you can't just lean over and look over someone's shoulder and ask, what are they doing? What are they going to do next? What are they anticipating?

While I know that there are a lot of differences between astronauts, and mission controllers, and software engineers, there's a fundamental basis about how humans work. How cognitive work takes place when you're handling uncertainty, when you're under time pressure, when you're managing risk, and when you're coping with interactive failures. All of this, under extraordinary time pressure. It's the same if you're trying to get astronauts back home safely, or if you're trying to get your service back up and running. We can find patterns that are really applicable across these domains, about people working in high tempo, uncertain worlds. There's another reason why studying coordination in your world is actually worthwhile.

Software is Increasingly Managing Critical Societal Functions

That is because, increasingly, you all are starting to manage some really important functions in the world. Things like 911 call routing systems, banking systems, electronic health records. Ensuring that your cognitive work and your coordinative demands are well supported is actually really important.

That was a really long setup to get us here. It's extra important because all this stuff is just the water that you swim in. It's everyday activity. It often gets unacknowledged or underappreciated as something that we need to pay attention to. Here's the flow of the rest of where we're going to go. I've been talking a lot about coordinative work, coordinative demands. We're going to break those down a little bit further. Then I'm going to spend a few minutes talking really broadly about coordination in complex and adaptive systems. What makes that different than when you're coordinating with your friends to try and figure out where to go for dinner? Then excruciatingly, I had to distill my entire dissertation research into four key findings. I picked what I think are the juiciest ones for you. Lastly, I'll talk a little bit about some of the implications of this for your work.

I've been tossing around these words about coordination, cognitive demands. Ultimately, what this is about is the human capability to perceive the world, to remember things that are important to the task at hand. To reason about those, and the ability to focus in on specific and meaningful details, like a little flashlight shining in. Then ultimately, you're trying to act upon that reasoning, that sense making of the event as it's happening. The cognitive work is what goes into answering those questions. What's happening? Why is it happening? How quickly is it happening? What does that mean for my service? Then, what is it that I need to do right now?

The coordinative side of things is answering questions like, who has the skills and the knowledge to be able to help me with what I'm doing? How do I get a hold of them? What is it they need to know about this current situation to be able to make their knowledge and their experience relevant? What task should I give them versus someone else? How long is it going to take for them to finish? What work might we be deferring as a result of this? This is a really busy slide. There's a lot going on here. That's just how it is when you're managing an incident in real-time. You have cognitive work, and you have coordinative work.

All of this comes down to additional demands, the cognitive costs of coordination. This just refers to all that additional effort that comes from working together. If it's such a big deal, why would we actually coordinate? Why wouldn't we try to just push people off? Have fewer people in the room. No one's going to argue that modern continuous deployment systems can be handled without coordination. We have a lot of structures. We have a lot of processes in place to be able to do this. What I'm suggesting here is that those methods of coordination don't always address the additional demands that coordination places on responders. These first four demands are pretty straightforward. If those were the only ones that I was focusing on for my dissertation, it wouldn't be all that interesting. It's these last three that I really want to focus on, because they are what makes incident response in software engineering, specifically, very different from other worlds.

The fact that you have continually changing environments means that there's an implicit need that you have to be continually learning about how your system works, and about what it is that other people know about how your system works. The fact that they're complex and that they're interactive means the failures that you face are often quite challenging, and they require multiple forms of expertise. Lastly, operating at speeds and scale means things happen really quickly, and they have very broad consequences. Specifically, they have consequences about the need to coordinate multiple, diverse perspectives in real-time. It means it's not always clear which people are going to be important, in what combinations, and at what points in time as the incident unfolds. This creates a very dynamic, coordinative environment.

If we look at the progression of an incident, think back to the last incident that you were involved with, regardless of whether your system caught fire in a really big, dramatic way, or there were just little whiffs of smoke. When there's uncertainty, it triggers this expansion. There are these incredible cognitive demands as you're trying to figure out what's going on and how bad it is. Then, because you're not entirely sure, you start to bring in other responders, other people on your team to help you detect and diagnose what's going on. Your users start reporting, what's going on? When are you going to get back up? How bad is this? Do I need to stop what I'm doing? Then you have client customer service people who might pop in and say, "I'm getting a lot of pressure. What are you doing? When is it going to be back up?" Then you have other dependent services starting to get in touch. You have Tim, who shows up because Tim's always in every incident. If you're in a highly regulated industry, you might be having to coordinate with the regulators about what's going on. As the event grows, you have more and more people starting to get involved. This drives the cognitive demands and the coordinative demands.

It results in a coordination paradox, which is that in these complex systems, everyone's model is going to be partial and incomplete. We need lots of different people to handle these non-routine or exceptional events. They also cost you a lot in terms of your attention. It raises the question, and this was really central to my research: how do we reap these benefits of having joint activity without the cost of coordination becoming too high? The answers that I came across were actually quite surprising. What did I find? I found, first, that coordination during anomaly response doesn't actually work the way we think it does. Second, that the models we use to coordinate during incidents, specifically the role of incident commander, can actually undermine speedy resolution. Third, that the strategies people use to control the cost of coordination are actually very adaptive to the type of incident that they're in. Lastly, the tooling actually adds costs of coordination, even when it's intended to reduce them.

Where did I get all this data from? I was a part of the SNAFU Catchers Consortium. Who knows what SNAFU stands for? Situation Normal All F'ed Up. This is the natural state of this world that you're constantly coping with things that are difficult to anticipate. Our consortium is a number of organizations who are interested in the research, who are interested in resilience engineering, and who wanted to partner with us to explore these topics more deeply. Some of you may have seen the report out of our first cycle, which was called the STELLA Report, and it's available for free download at stella.io, on coping with complexity. Then the second cycle has been looking at controlling the costs of coordination. Our partners allowed us a lot of access to their organization. These findings come from hours spent loitering by desks, lingering in ChatOps channels, getting up at 4:00 a.m. to jump on a conference call. Painstakingly going through hundreds of pages of transcripts to see what's happening in the real-time interactions that take place either on a web conference or in ChatOps channels.

Incident Response

The first finding here is about this incident response model. This is a pretty typical, and I would say, generally accepted model and understanding of how incidents actually progress. At a high level, you could certainly say that incidents actually follow this linear progression. You see it. You figure it out. You repair it. You fix it. You move on. For straightforward, really known issues that you might see in a mature or stable system, this might actually hold up pretty well. For systems that are operating at speed and scale and continuously changing, this is a real oversimplification. Because the incident response is rarely, if ever, these linear, sequential discrete phases.

Instead, there's a lot of hidden stuff in this. There's a lot of uncertainty in incidents. Is this even an incident? Or you might need to actually take action before you know what's going on. A huge aspect of incident response is trying to figure out who can help you, then recruiting them, bringing them up to speed, and keeping them meaningfully engaged. It's not always clear why something worked. You have this ongoing additional demand. Or you get an interruption in the middle of your response, and you need to redirect your effort to deal with some real stakeholder demands and concerns. There's residual uncertainty that continually having incidents creates. Importantly, knowing the impacts of past incidents on future actions is crucial for being able to keep a current mental model of where the explodey bits are in your system. It's the interplay between the cognitive work, the technical tasks needed to take care of an incident, and the need to coordinate with others that represents these additional demands, these costs of coordination.

All the knowledge about who on your team or in your network has the ability to help in particular cases resides in the heads of responders. It takes effort to know. I saw many examples of ongoing investments to continually build common ground, to learn not only what fellow responders know and don't know, but also what the system is going to throw up for you in the next incident. Knowledge about which parts of the system you can change without issues, and which parts you have to tread very carefully around, is all kept in your head or the heads of your fellow responders. Sure, it's in a backlog somewhere. In the heat of the moment, in the demands of the incident, knowing this stuff and being able to communicate this stuff really matters. It's often but not always related to the backlog, but knowledge about how potential problems can impact other dependent services also means that you can proactively notify them when things might impact them. This coordination has a proactive, foresight aspect to it as well.

Keeping track of what's important for others to know is not an inconsequential amount of cognitive effort. Actions that you took prior to other responders coming into the incident, or patterns that you observed before they arrived, all need to be communicated at relevant points in time during this high-pressure incident. This can become quite burdensome. Lastly, monitoring the demands of the incident or multiple incidents over time, and then keeping track of how fresh your resources may be, is important for having the skills and abilities available to you when you need them. This takes effort to monitor and to manage, particularly if you've had a really rough month with a lot of incidents. This original model is an oversimplification.

Role of Incident Commander

My second finding has to do with another form of oversimplification. That is with the role of the incident commander. I realize that many of you in the room here subscribe to this model, and you may find what I'm about to say offensive. Know that I am a Canadian who was born in Britain, so I do not like to offend. When I'm saying this, it's because the data unequivocally showed me that this is true. The commonly espoused model in software engineering is based on this incident commander, in which the IC comes into the incident, takes the chaotic mess of all the individual responders, and tames them into an ordered and commanded incident. In fact, the Google SRE chapter on IC says that they hold the high-level state of the incident. They structure the incident response, assigning responsibilities according to need and priority. They hold all the roles that are not delegated.

I'm going to walk through some of the problematic elements of this. You've seen this before. It's a high-level view of the incident commander. They have a planning function, a comms function, and an ops lead. It looks very structured. It looks very ordered. The problem with this is that the incident commander has a finite amount of attention. We all do. That's just how we work. No matter how superhero-ish we think they are, there are going to be times where those demands overwhelm them. Staying current on when things might be changing, and also coordinating the activities, is actually a lot of work. They delegate to the ops lead, who delegates to the responders in real-time, gives them tasks, and asks them for information. All of that gets fed back to the ops lead, who feeds it back to the incident commander, who then makes a decision and passes out more tasks and more information, and so on and so forth. The problem with that, as I said, with the finite attention, is that the incident commander can fall behind the pace of events as they're happening in real-time. They create a workload bottleneck that can actually slow down the pace of the event.
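As a rough illustration of the bottleneck described above (not something from the talk or its research data), here is a minimal Python sketch of a toy model. It assumes updates arrive at a fixed rate per time step and that whoever handles them has a fixed capacity per step. The function name `backlog_over_time` and all the numbers are hypothetical, chosen only to show how a backlog grows when a single point of coordination falls behind the pace of events.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the number of unprocessed updates waiting after each tick.

    Toy model only: arrivals and capacity are constant, hypothetical values.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new updates come in
        backlog -= min(backlog, capacity_per_tick)   # handle as many as capacity allows
        history.append(backlog)
    return history


# One incident commander handling every update (capacity below the arrival rate).
centralized = backlog_over_time(arrivals_per_tick=5, capacity_per_tick=3, ticks=4)

# The same load spread across responders (combined capacity above the arrival rate).
distributed = backlog_over_time(arrivals_per_tick=5, capacity_per_tick=6, ticks=4)

print(centralized)  # [2, 4, 6, 8]
print(distributed)  # [0, 0, 0, 0]
```

In this sketch the centralized backlog grows by two updates every step, which is one way to picture a commander's view of the incident becoming slow and stale. A real incident is far less regular than this, so the model is only meant to make the shape of the problem visible.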

What we see as a result of this being slow and stale is that people can start acting independently of the delegated tasks that they've been given. This freelancing is often talked about as being an anti-pattern and a negative thing. In my research, it was directly related to this mismatch between the pace at which the incident was unfolding and the pace at which it was being managed. The engineers that I studied were all very competent. They were all very responsible and diligent, conscientious, well-intentioned folks like yourselves. They saw something that needed to be done and they acted on it. They anticipated the potential implications of inaction and decided to act. This is actually a really important point because it shows that there's an opportunity to move a little bit faster in an incident if the responsibility is delegated or distributed amongst responders and not centralized.

Another interesting finding was the pattern of side channels. This was also deemed a bit of an anti-pattern. In the cases I studied, it was critical for getting things done. People hive off into DMs. They start a new channel. They go and stand beside someone else during an incident, and start to work in smaller groups. This is necessary because the costs of coordination get too high. It's too noisy in the main war room to be able to talk in a really direct manner about a specific task. These efforts require smaller subsets of people working in concert. The problem with this, and the problem with freelancing, is that if this work doesn't get communicated back to the main event, to how it's being managed, it can actually be quite dysfunctional.

The incident commander role here has taken a little bit of a beating. Does that mean that they're no longer functional? That they're no longer useful? Some of you might be thinking, as you're listening to me talk through this, this would be chaos. It would be anarchy if we did away with the incident commander role. Based on what I saw from these last three years of research, things can actually be quite different. Here's a really simplified example for you to make this concrete. It's traffic. People in this situation, they know how traffic works. They continually adjust their actions relative to the flow of events. They're speeding up. They're slowing down. They're signaling their intention. This is a marvelous example of what it's like to be in a well-orchestrated incident. In the dance of the morning rush-hour traffic, everyone knows the choreography for a smooth performance. How many of you have been a responder in an incident where it felt like this? It's pretty amazing, isn't it?

Incident Response Practices

One of the things that's interesting about this is that there was a lack of controllers in that situation. There was no traffic light. There was no police officer. Yet it worked. This is a very powerful and a super exciting way to think about incident response practices. Because while the current paradigms focus on the idea of a centralized command structure, there's an alternative out there. The participation, the anticipation, the ability to seamlessly synchronize activities in a larger joint effort is quite meaningful. We can see that when it runs smoothly, each agent in this distributed network is able to more fluidly adapt and adjust to the demands. We can lower the costs of coordination.

I'm calling this adaptive choreography. This thinking requires a very different way of thinking about the problem. It's my third finding. Like the dance, what it comes down to is being able to fluidly and dynamically adjust to how coordination happens. I've taken a first kick at what this model might actually look like. We still have the role of the incident commander, because there are times where decisions need to be made quickly. We need to know who that centralized authority is. We want to have someone who has that bigger picture in mind. These roles play a very different function in supporting the adaptive choreography of the responders in real-time.

What I found in the high-performing teams that I was looking at was that in multiparty high-pressure events, very rapid, but very direct interactions amongst the responders typically worked really well. That's largely because they fulfilled the functional requirements of coordination as they were carrying out their tasks. They were able to anticipate what needed to be done, and what needed to be done next. They were able to take the initiative to do it. They listened in on what others were doing, and they were able to better sequence the timing of those actions. By maintaining a fairly low-level but continuous awareness of the status of the event progression, they were able to actually provide input into critical decisions and point out potential threats and the implications of different courses of action. They recognized more readily when someone, including the incident commander, had a partial or incomplete mental model of what was going on. They were able to very quickly correct and repair that breakdown in common ground.

If we go back to that example of NASA and mission control, many of those same functions are being supported there. Being able to look in, being able to listen in. Understanding what others are doing as a way of ensuring that everyone is up to speed on the most current state of that event. To me, this is the most important finding of my research, this model of adaptive choreography. I think that it really has the potential to disrupt how we do incident management. How we learn about incidents. Ultimately, how to deliver more reliable service with fewer cognitive and coordinative demands on those of you who are handling these incidents.

In the Ask Me Anything panel that I participated in with Denise Yu, we started talking about what resilience is, and what it looks like, and how to achieve it. When I step back and I look at the literature from cognitive systems engineering and resilience engineering, and from these domains where we have studied a lot of adaptive capacity in other high-hazard, high-consequence environments, I would say that this is a good start for us to think about how to integrate resilience engineering into incident management practices.

This is ostensibly as far as academics get to do it. This is my mic drop moment. This is where I should say, "We figured it out," and I should walk away. Since you're technologists and you build tools, I would be somewhat remiss if I didn't at least speak to the tooling.

Tooling

My fourth and final point is coming back to that revised statement about which people are involved. I will say, which machines, which automation, what tooling is involved is also important to know, and in what collaborative interplay and in what sequence. This is a whole talk in and of itself, to talk about this interactive, collaborative autonomy. There's a common theme here in that it's not just about a role or a task that a person can take on. It's about supporting those underlying functional relationships of coordination.

Often, there is an implicit bias that machines don't need coordinating with. That they do the stuff and we consume the stuff. When I took a really close look at some of the incident management tooling and how it was being integrated into incident response practices, it became very clear that the tools themselves have substantial costs of coordination. When you are in the heat of the moment, the heat of the incident, lags or delays in the web conference or the audio bridge can add additional cognitive demands. We've all been on those conference calls, and we know what this feels like. You can also have glitches where things get dropped, and the amount of effort that goes into making sure that the automation has current information about what's happening can actually be quite substantial. All of this tweaking and fine-tuning is effort that goes into making sure that the tools are keeping pace with the change in the world.

In addition, stepping back from the heat of the incident and looking at when we're integrating tools more broadly into incident response practices, there was a huge amount of time that went into selecting the tool itself, just researching to figure out what does this thing actually do versus what they say it's doing? How can I test it? How can I pilot it before I roll it out across my entire organization? Launching it, all of the training and the communication that goes into that? Switching the practices that you already have underway, to be able to adapt to this new form of coordinating with the tool. Then that constant calibrating to help the automation understand that the world is slightly different than you think it is. All of this represented hidden costs of coordination. As folks who are able to design these tools, or who are responsible for integrating these tools, I think it's really important that we acknowledge these additional costs.

What did I find? I started out by saying that coordination was a problem, and that it was hiding in plain sight. Many of the examples that I tried to give here were ways of surfacing what we take for granted as being part of the background noise of handling incidents. By thinking about the thinking, we can identify and better support the cognitive and coordinative demands with tools or with practices. Just as NASA leads innovation by venturing into uncharted territory and showing us what is possible, I believe that software engineering could lead a step change in incident management practices. It's my hope that you'll continue to push the boundaries of what is possible in how we coordinate.

Of course, because I think everyone should read more, I've assembled a list of references that elaborate on a lot of what I've said today. I do have an article that's coming out with InfoQ that summarizes the key concepts that I talked about here. If this is exciting for you, please get in touch. Lastly, if you are interested in chatting a little bit further about how to integrate some of these practices, there's an email address at the bottom there, workshops@jeli.io, and I'd encourage you to please get in touch.

Questions and Answers

Participant: In this adaptive choreography system, what heuristics do you think people should use to decide when they should go to the incident commander, and when they should coordinate between themselves and another person in the incident directly without having to go up that hierarchical decision chain?

Maguire: What's implicit in there is that a lot of the tooling and the practices that you have in place are enabling a certain degree of observability. If you are coordinating an incident in ChatOps, for instance, people do have the ability, including the incident manager to monitor what's going on. What I was trying to show with the functional aspects of coordination is knowing when it's important to communicate that you are following up on an activity that's going to have implications for other people. The rate at which you communicate the status of that is going to be dependent on how significant it is, how quickly the incident is moving. Part of that does come from experience. It does come from the pattern that your group has established and how they coordinate. Part of that can be part of the incident commander role in terms of knowing when to prompt versus when to leave people alone to work, so that observability and the ability to listen in and keep pace with what's happening across the other responders.

Participant: I really like the metaphor of not having a traffic cop or the traffic lights, and everything keeps moving. It's frustrating sitting in the traffic light and you can't move. I think that acknowledges people sometimes just want to run the red and go. If everyone's slowly moving around and making steady progress, and then suddenly a truck comes flying into the intersection. That can be bad. How do you prevent that analogous situation from happening during this type of coordination?

Maguire: Would the incident commander prevent that from being disruptive to the incident?

Participant: I wondered if that was one of those roles that stays with the incident commander is watching out for the truck that's flying at the intersection?

Maguire: In the model that I proposed, that incident commander does still maintain that high-level, bigger-picture understanding. What the shift implies here is that everyone is poised to adapt. Everyone is expected to be able to adjust in real-time, as opposed to waiting for direction. If you think about it in a basketball analogy, you're on your feet in a ready position, as opposed to standing waiting for the coach to tell you what to do.

Participant: I'm from the rest of Europe. I don't drive on the left side of the road like you. I'm a new person, going back to your traffic analogy. How do you add me into the system?

Maguire: This brings up a really good point, which is that as we start to have cross-boundary type interactions, you've now got third-party engineers that you need to bring into your incident response, or your network team, or someone who's not part of the typical pattern. They're not learning a process. They're enacting similar types of functions to be able to communicate what's happening and anticipate what's going to happen next. In that instance, if you have someone who's coming into an event who isn't familiar with this mode of adapting, they're going to need more effort to coordinate with. That means that I, as a commander, or I, as a responder, am going to have to anticipate that we don't have a lot of common ground. That you're going to need extra communication. You're going to need extra signaling to help you understand, what's appropriate action? In that context, you might have that incident commander role more directly delegating to those types of people.

Participant: How did you measure the cost of coordination?

Maguire: There are a lot of measurements that take place that have to do with eye tracking and delays between things. I used a more qualitative approach in the sense of, what was additional effort? Where was there additional communication happening? Where was there extra activity that had to take place? It was a very qualitative measurement.

Participant: Were there any other factors that affected the cost of coordination?

Maguire: It sounds like you have something in mind.

Participant: For example, distance. Let's say, with distributed teams, are they faster in coordination than teams that are locally there?

Maguire: You're asking about the other factors that were involved in coordination. I think distance matters 100%. This idea that you can have a geographically distributed team that's going to function and operate the same as a co-located team, it's a dream. It's a fantasy. There are examples in other high-hazard work where that fantasy is being applied to real-world settings. We're pulling engineering teams off of offshore oil platforms and locating them onshore. We're looking at aggregating air traffic controllers, pulling them away from the airfield, and centrally locating them. With these types of things, understanding the impact of these distributed teams really matters. There is a huge body of work in the computer-supported cooperative work space that looks at these things specifically.

Participant: Have you considered whether the adaptive choreography is applicable to the leadership scenarios?

Maguire: In what sense?

Participant: It seems that there's been a lot of research, as I understand, into the effectiveness of command control approaches, especially in military circles trying to move towards something where units actually work in more autonomous ways. Have you seen a lot of parallels there? Have you looked at different research in those areas? Do you think there's some mirroring there?

Maguire: I think that that speaks to this idea that we need to challenge these fundamental paradigms about how we think work needs to happen. Part of that drive for structure and for command and control has these implicit assumptions about who are the people that are doing the work. Oftentimes, the reason why we're using these rigid structures is because we think that those people need to be controlled, and they need to be constrained. That they're not to be trusted to take action that is in the best interest of the broader team.

Another view on that paradigm is, you are closest to the action. You have more knowledge of that system and of what's happening than someone in a different part of the organization. Pushing that deference to expertise down to the person who knows the most about the situation, and being able to flexibly and fluidly adjust that is really what many domains are recognizing is a better way to organize, a better way to coordinate. I think the example you gave of military applications is a great one, because war is not bad guys over here, guns facing this way. Good guys over here, guns facing this way. It's much more emergent. It's much more adaptive. You have to organize in ways that match that variability that you see in the world.

Participant: What do you think is the best way to adapt this way of working in incidents today? How do we teach the team how much autonomy they have in incident scenarios? Is that even a doing thing, or is this something more prescriptive to this [inaudible 00:42:25]?

Maguire: I think it's different in prescription. He was asking about, how do we integrate this basically into practice? Is this about training? Is it about tooling? How do we do this? It's a really good question because you don't drop a new model, even if it's a model that I think is a better one, into a pristine environment. You drop it into a world that has existing pressures, existing constraints, existing precedent about how things take place.

I think that a really good starting point is emphasizing learning more. I know the Twittersphere is going off about there is no root cause. Really, instead of trying to push into what the technical causes of an incident are, try to look at, what are the patterns that surface that made this problem harder or made it easier? Then, what were our responses to those challenges and difficulties that either helped us or that hindered us? It's a good starting point for teams that are thinking about, how do we integrate new ways of organizing into incident response practices? That being said, I'll be publishing some results out of my dissertation research that do get a bit more prescriptive about, how do you set up roles? How do you organize the system of work around this new type of model?

Participant: How much power does a responder have, as in say, there's a conflict of interest between the CEO or the CTO, and the responders collectively have a different interest? Can they override someone of a higher position?

Maguire: You sound like you have a specific.

Participant: Yes. During a security incident, the responders are investigating. The CTO says, "Kick them out. Kick them out of the system." At that point you haven't done any root cause analysis, or anything, and you focus all resources on kicking them out. That might hinder the business operations. At that point what power does the responder have?

Maguire: It's an interesting point, because what's implicit in this is that when we're talking about accountability, responsibility, authority, we are talking about this shifting, to some degree. The question over here spoke to that. There's always going to be power dynamics in an organization. There's always going to be multiple competing priorities. From the CTO perspective, it's getting the person out of there. From the responder perspective, it might be gathering more information about how they got in there, or what problems could they actually solve, which gives them more information to be able to secure the system. You're both trying to get at the same thing, to secure the system, but going about it in different ways.

The ability to flexibly adapt and adjust in real-time is contingent upon, how do you handle those trade-offs? How do you communicate about those trade-offs? If that CTO or another person in a position of authority is likely to come in and try and redirect efforts, then bringing them into the incident, so they are up to speed and their mental model includes the multiple competing priorities that people are dealing with, is really important. You would do the same with any other responder or any other person who's trying to come in and take action. It's thinking more broadly about, how do you pull people into the incident and at what points in time do you shift and adjust?

Participant: In your traffic light example, the agents in the study have relied heavily on the brain's ability to anticipate direction, and speed, and location in the future. To do that in a cognitive context, meaning that to anticipate people's actions due to their thinking, that requires quite a high level of communication. What speedy communication would you recommend? How would you facilitate that in a team of a larger than small size?

Maguire: What speed of communication?

Participant: No. What communication to optimize for speed.

Maguire: Typically, what we tend to find, whether we're studying an emergency room department, a team that's dealing with a crashing patient, or nuclear power plant operators that are trying to prevent a meltdown, is that experts have a very highly encoded, very terse way of being able to interact. You can say something that doesn't have a lot of meaning to anyone outside of your response team, but is very informative to people within the response team. What I'm saying is that the ways that you're communicating, and what you're communicating, are going to be relative to who's involved with the incident. To the question up here, how much common ground do you have?

One of the other really interesting findings in the research had to do with this idea of common ground, which is that the people who are involved in the incident all have to have shared knowledge, beliefs, and assumptions about what's going on. What's important? What are the priorities of this event? The more common ground that you have, the easier it is to have these really short, encoded interactions. You don't have to spend a lot of time explaining yourself. That investment in common ground is made before the incident starts. This is learning about what the other members of your team know. It's learning about what changes are happening in different parts of the organization, or dependent services that might have an impact on yours. It's about having this much broader view. My suggestion for that is to actually think in advance about what is the way to establish a richer common ground in order to interact the way you normally would with other responders.

Recorded at:

May 11, 2020

Laura Maguire