
Courtney Nash Discusses Incident Management, Automation, and the VOID Report

In this episode, Courtney Nash, a researcher focused on system safety and failures in complex sociotechnical systems, discussed the latest edition of the VOID report. Topics covered included: incident management and the role of automation, working effectively within socio-technical systems, and the value of collecting and analyzing system metrics in the good times and the bad.

Key Takeaways

  • The VOID report for 2024, titled "Exploring the Unintended Consequences of Automation in Software," emphasizes the need to scrutinize the often overlooked impacts of automation, including AI, on software systems.
  • Organizations need to learn from incidents by advocating for the sharing of incident management data within the community, as seen in the aviation and medical communities.
  • Socio-technical systems play a vital role in software development and incident management. People need to understand and work effectively with these systems rather than focusing solely on technical solutions.
  • It is important to collect, share, and analyze metrics related to incidents, near misses, and normal operations within systems.
  • Conduct workshops, talks, and other forms of engagement to educate developers on how to implement the recommended incident management practices and strategies.

Introductions [00:51]

Daniel Bryant: Hello, and welcome to the InfoQ podcast. I'm Daniel Bryant, and recently I had the pleasure of sitting down with Courtney Nash, author of the recent VOID report and long-time contributor to the incident management space. In fact, I last bumped into Courtney at QCon New York last year, where she delivered an excellent talk on this topic.

In this podcast, we explore the VOID report in depth and analyze the approach and key takeaways. We also explore what we can all do to learn from incidents. We cover strategies and tactical advice, and Courtney provides a strong call to action for everyone to share incident management data so that the community can learn from this. We discuss a wide range of topics, from DORA metrics to working with socio-technical systems and comparing apples and Volkswagens. Trust me, this all makes sense by the end of the podcast.

So welcome to the InfoQ podcast, Courtney. Could you introduce yourself to the audience, please?

Courtney Nash: Absolutely. My name is Courtney Nash. I am the founder of the VOID. My unofficial title is Internet Incident Librarian. And yes, I have a background that many of you may be somewhat familiar with. I was an editor at O'Reilly for almost 10 years, edited the Head First series of books and many others, SRE book and a couple of different conferences. So now I've turned a lot of the things that I got curious about during that time into a really fun research program with this work on the VOID that I'm happy to get into here.

Daniel Bryant: Yes, fantastic. Courtney, I've got to share. The Head First books were game changing in my career, so I would just share about that. They're fantastic. Fantastic work, right? Yes, I can see in the background... For the listeners in the background, Courtney's got a lot of books, Head First books on her shelf, which look fantastic.

Courtney Nash: That makes me happy to hear.

Daniel Bryant: So yes, they were amazing straight up.

Can you explain the purpose of the VOID report and share the key takeaways from this year’s report, please? [02:25]

Daniel Bryant: So yes, the VOID report. The 2024 version is called Exploring the Unintended Consequences of Automation in Software, which is just a fascinating title on many levels. I think many of us like automation, and even AI plays a role in this as well. But could you set up the premise of the report for the listeners, Courtney? Maybe a brief history would be good as well.

Courtney Nash: Of course. Happy to talk a little bit back about the VOID. So I started the VOID back in 2021 and part of it was just a lark with the folks that I was working with at the time. I was doing research for a product and it was very heavily Kubernetes and Kafka based. And as you can imagine, those work all the time.

Daniel Bryant: Indeed.

Courtney Nash: So I wanted examples of this stuff failing in the wild, which is not always so easy to find. So I started looking at incident reports, public incident reports, and there were a few collections. Dan Luu had a great one, Kubernetes AF. I think that was him. I don't know. I can't remember which was which sometimes. And Lex Neva had a great just ongoing archive-

Daniel Bryant: Yes.

Courtney Nash: ... through SRE Weekly for a long time. He always included a couple of incidents. So I started from there and then people started sending me things, and at some point, I looked around and realized I had 2,000, 3,000 of these incident reports. So I couldn't just stop there, you know? Why would you?

But the discussion really came up. There is no place for these things in our industry. There is no library, if you will, of incident reports. And in other industries, to some degree, there at least is. If you go back to the history of aviation in particular, there were people who had worked to pull together collections of incident reports, largely anonymized in that case. And the sharing of those, the collective knowledge building, really helped move their industry forward. I know things have gotten a little weird in the last year or so for aviation, but when I started telling this story in 2021, I could really harken back and say, look, that industry got a lot safer by discussing their incidents and by sharing that information with each other really, at least within the industry, very openly. So we thought, okay, we're on our way. Let's do this.

So we built this up and now we have over 10,000 public incident reports. These are not me going in and analyzing your incidents. These are things that people in the industry write up and publish. And there are also things like news articles, and Tweets. It's a big collection because we have broad research goals that we want to tackle with all of that. So that was how that started. And we dove into some of the early what we considered to be myths of incidents and a broad spectrum there again... incident response, incident management, incident analysis, attitudes and perceptions about that... because we finally had real data instead of just attitudes and perceptions and gut.

So we did some work, inspired by an engineer at Google, around MTTR, for example. So mean time to respond, remediate, whatever you make out of that R. We finally had the data at hand to show how much variability there is in that for incidents and how it's not really the best metric for what people thought it was, for example. Like if you're more resilient or more reliable, your MTTR will go down. Statistically speaking, we found that really not to be true. And what was really interesting was that was a key component of the DORA metrics, and that team actually responded to that. We had some conversations with them and they're shifting some of the things that they're encouraging people to look at along those lines.

So some people will argue with me about whether they like where that's going, but I consider that to be a really strong indication that having these kinds of data helps us make better decisions. We did the same thing around root cause, and at the same time, for example, the Microsoft Azure team was shifting their mentality around that language, that attitude of, can you pin an incident on a single thing? And we can get into that for sure, we will with automation, actually.

And a big organization like Microsoft Azure actually moved away from... It may sound like semantics, but instead of saying that they do root cause analysis, they've shifted to what's called a post-incident review, and they're actually now publishing giant video postmortem kinds of things with their customers and with executives. And so I feel like we're just seeing the beginning of a bit of a sea change. And I'm not saying we caused all of that, but I think we're participating in it. The time was ripe.

So after doing a bunch of work on that, I really felt like we had to look into automation and then by corollary, AI. I mean, if there was a time that was ripe for something. The hype is real. And so we had done these other very quantitative types of analysis in the previous VOID reports, right? So MTTR, we were looking at duration, we were looking at severity, we were looking at these kinds of things in very quantitative ways. And that was great. I mean, I think that really was helpful.

To look at automation, we had to take a slightly different tack. So we looked at it from what's considered more of a qualitative methodology, a very text-based type of analysis. The nerdy researcher term for this is thematic analysis, but basically, from those 10,000 reports, we whittled it down using keyword searches. So looking for things that were... You can look for automated, automation. That gave a big initial set of maybe 500, 600 incidents. And then I went through and read every single one of them, which is really just the glamorous side of research, y'all.

Daniel Bryant: Indeed.

Courtney Nash: But in the end, it ended up with a set of about 200 that I felt had some indication of automation being involved in some way or another. Granted, I think the corpus has way more than that, but this is... A public incident report only gives you so much. This is one of the things I talk about all the time with the VOID: there are limitations to the data. And anybody who's ever been involved in any kind of incident and incident analysis knows that whatever goes out onto a website somewhere is, admittedly and understandably, a very different story to some degree than what you might talk about internally. But reading the tea leaves as best as we can, we went through.

And so the thematic analysis, you basically take a big body of text and you code it. So you look for similar phenomena if you will. And so I went through and started developing these codes and all the nerdy research stuff I read about in the report. We revise them, we revisit them, we do all this stuff, but ended up with a bunch of themes or categories or ways in which automation shows up in incidents. And I was really particularly surprised by how varied that was.
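For readers who want to picture that first keyword-based pass, here is a minimal sketch of one way such a pre-filter could look. It is purely illustrative: the directory layout, keyword list, and function names are assumptions made for the example, not the VOID's actual tooling, and the manual thematic coding Courtney describes still happens afterwards.

```python
# Illustrative sketch only: a keyword pre-filter over a folder of public
# incident reports, similar in spirit to the first pass described above.
# The file layout and keyword list are assumptions, not the VOID's pipeline.
import re
from pathlib import Path

AUTOMATION_KEYWORDS = re.compile(
    r"\b(automat(?:ed|ion|ically)|auto-scal\w*)\b", re.IGNORECASE
)

def candidate_reports(report_dir: str) -> list[Path]:
    """Return reports that mention automation-related terms and so warrant a manual read."""
    matches = []
    for path in Path(report_dir).glob("*.txt"):
        text = path.read_text(errors="ignore")
        if AUTOMATION_KEYWORDS.search(text):
            matches.append(path)
    return matches

if __name__ == "__main__":
    hits = candidate_reports("incident_reports")
    print(f"{len(hits)} reports matched and still need manual thematic coding")
```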

I think anybody who lives in these complex systems knows that automation can definitely go awry-

Daniel Bryant: Oh yes.

Courtney Nash: ... but it played all these roles, which in the report I refer to as archetypes, and sometimes in one incident, automation will play multiple roles. So it might be how you detected it. It might be part of the problem. It might make it easier to resolve. It usually, more often, makes it harder to resolve. And so there's all these characters that it can play. And so I thought that alone was just really fascinating to think of it as it's not binary, pun sort of intended. But I think the other really, really critical thing that came out of the analysis was how often humans have to intervene in order to resolve or deal with an incident that involves automation.

And here we did a little bit of making the qualitative quantitative. In about 75% of the incidents that we looked at, humans had to step in. And so I think when I look at the themes and the lore, if you will, of automation, and then we can get into AI, there's very little of that. It's pitched to us as automation's going to do the stuff that humans are bad at, or do the stuff that humans don't want to do, or free us up.

And I talk about a bunch of the research from these other domains. So this is the other key point. We have a lot of research on automation and complex systems from aviation, from healthcare, from nuclear power plant systems, from all of that, and it really points to the fact that automation cannot be what they call a good team player.

And then really what it comes down to is our mental model of that, and I'm alluding to this, but the prevalent mental model is automation is good at these things, humans are good at these things, and we're going to set each other on these tracks, and we're just going to chug down the tracks, and everything's going to be great.

But that's not how complex systems work. I mean, I don't know if that's been your experience, but do the computers just go do their things and then you get to sit over here.

So it's funny that we have this lore, and we swim in this mythology around it, but we all know we're half drowning in the deep end on the regular.

So those data really illuminated that paradox, if you will. And in one of the research papers that I talk a lot about, they call it Ironies of Automation, and this is Lisanne Bainbridge, and this is from the '80s. They were already onto this. So we're catching up ironically in the software world.

So I think that it's not that... People say, "Oh, are you saying we shouldn't automate things?" I'm like, "No." And AI is not going away.

Daniel Bryant: Yes.

Courtney Nash: So it's about us readjusting our mental models and rethinking how we build these systems to help us, actually help us, not eliminate us. So that's the top-level gist of what we found around automation and software from the report this year.

Should we retire the DORA metric of mean time to recovery (MTTR)? [12:20]

Daniel Bryant: Yes, fantastic primer there, Courtney. A couple of things jumped out as I was reading the report, because I picked up on you saying we should retire MTTR as a metric, and that organizations are moving away from short-sighted approaches like root cause analysis, and I was thinking, [inaudible 00:12:35], some people have not even got their heads around MTTR or RCA yet, and you're telling them to move on. Do you know what I mean?

Courtney Nash: Okay, fair.

Daniel Bryant: Is that the case? As in, do you think people should jump over it? I've heard the same criticism leveled at some of the DORA stuff. Again, it is about meeting people where they're at. And I love learning. I always try to be on the vanguard of these things, but I appreciate not everyone I work with is. So what are your thoughts around that? As in, should we still be looking at MTTR for the average organization, so to speak?

Courtney Nash: Yes, average organization is ironic in that. First of all, I just have to say that DORA did more for our industry than most research out there. I'm a huge fan. I think people think I'm being overly critical. I was nitpicking because what they did was they picked the industry standard at the time and no one, them included, no shame or blame, had inspected that metric. We just all took it for granted.

Daniel Bryant: Makes sense. Yes.

Courtney Nash: And I'm not saying other people don't have this. They do. But I have a data science... I was a neuroscientist nerd for a while. So when you can get your hands on actual data, you can look at the distribution. Right?

Daniel Bryant: Yes.

Courtney Nash: So I think we just hadn't collectively reckoned with that. But I think a lot of what else DORA has done has been huge because they've really helped people focus on-

Daniel Bryant: Oh, a hundred percent. Yes.

Courtney Nash: ... developer experience culture. So anyway, should the "average" organization focus on MTTR? No, because averages are lying to them.

Daniel Bryant: Right.

Courtney Nash: This is the problem with the nature of those data because they're so skewed. So to have an average, a mean, you have to have a normal distribution of those data. And we have lots of those in the world, but we have lots of non-normally distributed things, and then you can't take the average. People always say, well, then you could take the mean or the median or the mode. They're so skewed y'all, that it really doesn't help. And then the other piece that came out of the MTTR research is the only organizations that were starting to possibly show some value out of that were having thousands of incidents a year-

Daniel Bryant: Right. Okay. Yes.

Courtney Nash: ... because they're so large.

Daniel Bryant: Yes, that makes sense.

Courtney Nash: Like Cloudflare has... and they publish every single little teeny-tiny incident.

Daniel Bryant: Yes, they're awesome.

Courtney Nash: I love them for that because they're an incredible source of data. And while I'm at it, I'll just add that Google and Microsoft also are, and there's one other cloud provider that just doesn't do that, but maybe one of these days, they will if I throw enough shade at them.

Daniel Bryant: Fair enough.

Courtney Nash: I'm not judging them. I'm not saying, oh, Google's better at responding to incidents. Because you also can't do that from those data. Those data don't tell you that. When you have noisy and skewed data, you have to have a lot of them, or you have to do weird stuff like log transform them, and then they're in a format that means nothing to anyone in your organization. So it's a hard one because it's like prying what feels like the most valuable tool in your toolkit. And I know people have almost said that, like it feels like you're taking my toys away from me. And I'm like, I know, I know. And I'm really sorry. But that is the nature of complex systems. It's very hard to put a number like that on them and have it be meaningful.
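To make the point about skewed data concrete, here is a tiny, self-contained illustration using synthetic numbers; the log-normally distributed durations are invented for the example and are not VOID data. With a long right tail, the mean lands well above the typical incident, which is why a single MTTR figure can mislead.

```python
# Toy illustration of skewed incident-duration data (synthetic, not VOID data).
import random
import statistics

random.seed(42)
# Simulate durations in minutes: most incidents are short, a few run very long.
durations = [random.lognormvariate(mu=3.0, sigma=1.2) for _ in range(500)]

mean = statistics.mean(durations)
median = statistics.median(durations)
p95 = sorted(durations)[int(0.95 * len(durations))]

print(f"mean (MTTR-style): {mean:6.1f} min")
print(f"median           : {median:6.1f} min")
print(f"95th percentile  : {p95:6.1f} min")
# With a heavy right tail the mean sits far from the typical incident and
# swings between samples, so tracking it tells you little about reliability.
```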

And then people ask me, why do I care? Let people have their number. Okay, but I've also seen people incentivize these things. I know for a fact engineers and incident response people who just had their organization assign OKRs saying they can have only so many SEV-whatever incidents, only X incidents per quarter.

Daniel Bryant: Yes, right.

Courtney Nash: So now we're saying, OKRs, okay. OKRs, oh wait, people's bonuses, people's...

Daniel Bryant: Yes, how you incentivize.

Courtney Nash: Now we're weaponizing these incentives on these metrics that are meaningless. So that's where I start to care. That's why I think it matters. And also, you get the whole thing of the minute you make a metric a target, you change people's behaviors. Have you seen the XKCD cartoon?

Daniel Bryant: Oh, yes, the XKCD. Yes, yes.

Courtney Nash: Okay, I'll send it to you. Maybe you can link to it. This one, it's two manager guys and one of them is like, "Yes, I've heard that people are assigning targets to these metrics and that's causing all these really bad behaviors." And someone's like, "We should make a metric out of them making targets out of that." And then it's like, "Oh God, buddy."

So that's why I care. In the end, it's really about the people at the sharp end, as we say, of these systems. And they are the ones who are responsible, as our research shows, for making them work. That's the other thing: we focus so much... We don't study normal work very much.

Daniel Bryant: Oh, interesting. Yes. Okay, interesting.

Courtney Nash: We study where things go wrong. And I love that, learning from incidents, learning from failure. There's value in there, but again, your sample size is much smaller-

Daniel Bryant: Yes, it makes sense.

Courtney Nash: ... for the normal.

Daniel Bryant: Normal running. Yes, yes.

Courtney Nash: I just gave a talk with Dr. Laura McGuire at SREcon, and we were trying to study trade-off decisions in incident management. So when you have to roll it back, or you potentially are going to have data loss? How do these things happen? Because again, she has a really amazing background in conducting that kind of research and we were just trying to study normal work. Like when it's high pressure, high tempo, what kinds of information do you as an incident responder or a software engineer need to collect to help your manager or senior leadership decide what to do if the shit's really going to hit the fan? I don't know if you have to bleep me out on that one.

Daniel Bryant: It's all good.

Courtney Nash: So yes, so studying normal work is something that we're trying to look at doing more of. What are we doing that works? What are we doing that's successful? How do we do more of that? Versus worrying about why did somebody fat-finger a config file? Although if you dig into that, it's probably because they were tired or overworked or production pressures. But those don't always get dug up. This is the socio-technical side of these systems. I love that you all at InfoQ really do try to focus more on that, the people and the roles that they play in making this all work.

Daniel Bryant: Yes. And in fairness, Courtney, like yourself, John Allspaw, Nora Jones, there are a lot of folks that are definitely bridging the gap there, because I think there is a tendency at InfoQ, myself included, to look at the technology, to look at the architectures, those kinds of things. But socio-technical, I a hundred percent support that. Throughout my career, I've seen it takes two to tango, so to speak, the social part and the technical part, and learning more about both and how they interact is key, I think.

What are the benefits of automation in relation to dealing with incidents? And the challenges? [19:24]

Daniel Bryant: Before diving into some of the... I definitely wanted to get into the automation. What are the benefits for companies that do have a mature or good practice in this space, Courtney? One thing with the DORA metrics is that high-performing organizations are correlated with them, and I always enjoyed, and I think many folks enjoyed, understanding that. Is it the same with incident analysis as well? What are the correlations?

Courtney Nash: So no correlations yet. I harbor a deep suspicion that organizations who invest in learning from incidents, invest in the capacities to perform that kind of work will have certain competitive advantages over other organizations. I have no data to prove that yet. I'm at the early stages of trying to start to even think about how we could possibly demonstrate that.

But the anecdotal experience from people at organizations that I think are more sophisticated in this, and that are investing in having dedicated incident analysis roles, is that those engineers' experience, and their confidence in their ability to handle their systems and understand their systems better, is definitely higher.

Part of the 2024 report, you may have noticed, is a preliminary piece where we fielded a really introductory type of survey for folks in the space, because we also realized we don't know what anyone's actually doing beyond the people that I talk to. You mentioned Nora Jones. She started this learning from incidents community-

Daniel Bryant: Of course. Yes, yes.

Courtney Nash: ... and we have a large Slack community that a bunch of us are in. And so I feel like my perception also, like you said, the "average" organization, I'm talking to people who are largely, at least personally and intellectually, at the cutting edge of this.

Daniel Bryant: Yes, that makes sense.

Courtney Nash: And so I think my own understanding of "what people are doing" is very skewed, and I know this. And so I wanted to field a small study to give me some inklings, and then we have high hopes to really field a much bigger one this year that might start to tease apart some of the questions that you're asking there. Like if you invest in these things, what other kinds of outcomes or what kinds of things do you see? But the only thing that really came out of that was some directional ideas: in organizations that do invest in incident analysis, in ways like having dedicated roles for it, having executive support, having it treated as a funded program, and maybe even dedicated tooling, though that last one seemed to be less important, the people had higher confidence around their ability to feed that information back into their organization in ways that are useful and beneficial.

Daniel Bryant: Right. Yes.

Courtney Nash: So are you feeding those back into your architectural planning? Are you feeding that information back into your product planning? Because for internet services, resilience and reliability are pretty mission critical, but we don't treat them that way in terms of where we fund things. So another irony I would say.

So yes, to answer your question, I harbor a deep suspicion that organizations that invest in this are going to have certain both organizational and competitive advantages, but I don't have the data to prove it yet.

Daniel Bryant: Yes, stay tuned folks, and also get involved, to your point. You need more people to give you data and that would help you and your team very much though, right?

Courtney Nash: Yes. I mean, this kind of research is focused on people because we know that people are really important to this stuff. We can go do all the research on your architecture. There's plenty of people out there doing that. But we need people to help us do this research. We can't just go look at your architectural systems and say, here's what's happening. And so I know you're like, oh God, another survey, or is this just marketing garbage? And I promise you it's not. I mean, Laura, Dr. McGuire, said at the end of our talk, and I just said this again, but these kinds of digital services are mission critical to our world now. Right?

Daniel Bryant: Yes.

Courtney Nash: Planes fly us from here to there, but the internet does everything else.

Daniel Bryant: Yes, pretty much. Right.

Courtney Nash: It's important, and if we participate in this work, share this work, share our experiences, and publicly discuss these kinds of incidents, we can get past the PR and marketing fears of what that will mean, right? Because it's actually mission-critical now for humans to make this stuff work better.

So yes, that's the rallying cry. Please take the time. It won't be too much time. Answer our questions. Help me help you. Right?

Daniel Bryant: Yes. No, fantastic. I'll put some links in, Courtney, for sure, because I've read through the report and subscribed to everything.

Courtney Nash: Thank you.

Daniel Bryant: So yes, I totally get it, as in we need folks to contribute. And then the flip side of that is they'll get the information back and make their practices better. So everyone wins.

Courtney Nash: We saw it from DORA, right? We saw that.

Daniel Bryant: Totally.

Courtney Nash: I feel like that was the most concrete example of that. So yes.

Can you explain more about the automation archetypes? [24:28]

Daniel Bryant: Fantastic. I'd love to dive into something you hinted at earlier, Courtney, around automation having multiple different roles in incidents. And I did like the archetypes as I was reading down through. You had, was it the sentinel, the gremlin, meddler, unreliable narrator, spectator and action item? And the nerd in me loved that breakdown of the different personas involved.

Courtney Nash: They became characters to me.

Daniel Bryant: Totally.

Courtney Nash: Yes, they became characters. I was like, oh... And I'm sure people who live in these systems sometimes are like, oh, I know that one.

Daniel Bryant: Yes, definitely. The unreliable narrator, I've been there many times back in the day when I was coding and running teams. Yes, I can recognize that all too well.

I'd love to break those down a bit, Courtney, as in-

Courtney Nash: Sure.

Daniel Bryant: ... and perhaps relate that a little bit to some of the things you mentioned around AI as well. The way you define automation could be interesting for the listeners too. Because you could say software is automation embodied, right?

Courtney Nash: Yes.

Daniel Bryant: But AI is adding a different thing to it as well. So I'd love to get your take on a lot of things there I guess, but I'd love to hear your thoughts on those.

Courtney Nash: Yes, I mean, we could spend the rest of our careers nitpicking what is and is not automation, and we could go down many rabbit holes around that. So I did try to put a fence around it in a way that we could talk constructively about it.

So I'll set AI aside for a second, but the way we in the industry typically talk about automation is a computer doing something instead of a human. Right?

Daniel Bryant: Yes.

Courtney Nash: And typically then the secondary aspect of the definition is in a repetitive and potentially much faster way. Right?

Daniel Bryant: Yes.

Courtney Nash: And that's why we turn to it. That's why it's appealing, is the notion that it can do things that we could do theoretically, but it does more of them and faster.

Now better is the third layer that we often put on top of it. And I think that's where we start to get ourselves into trouble. And here's the interesting thing. We like to say, oh, it's not making decisions in the way we mean when we start to talk about AI, but in a complex system, there is an aspect of that, because anybody's had the feeling of: you feed something that expects this, it accidentally gets that, and so then it has to do something. And so the ways in which automation fails are where it gets really interesting. Those failure modes are often unexpected, surprising, and confusing.

And that's the piece that in particular, the research from Lisanne Bainbridge and then Nadine Sarter and a few other folks really focus on that surprise. So Lisanne Bainbridge, and we can give you the links for this, is the Ironies of Automation. Nadine Sarter talks about automation surprises.

But we can't imagine the whole system that we're going to put this automation into, a big complex system. The very definition of a complex system is that one person can't possibly model the whole thing, and also, it's not linear. So inputs have unexpected and unpredictable outputs. A linear system, you can... I mean, automation came out of the way we think of it here. Our mental model is the Ford assembly line. It's great for that. And I think people are like, oh, you hate automation. I'm like, I automate all kinds of crap in my life, y'all, just not other things.

And so when you can model the system and predict it, and then you can go, okay, automate this part. And then even when the automation breaks down, it's really easy to diagnose, right?

Daniel Bryant: Yes.

Courtney Nash: Oh, here's what happened, this didn't ... And when it fails, it doesn't fail in typically unexpected ways. So those are the key things to think about. So you're in this complex system.

So the first part of that, as I said, is we can't model the whole thing, but we're going to design stuff into that. So you're going to have auto-scaling, or you're going to have all these really pretty complicated automated things that, say, my favorite whipping post, Kubernetes, does. And then you go, here, go do all this stuff, and you're not privy to it anymore. Okay?

Daniel Bryant: Yep.

Courtney Nash: You can't see it. You can't introspect it typically because we don't design these things like that. We make that hard.

Daniel Bryant: Yes.

Courtney Nash: Okay, we'll come back to that. This is the same argument that people make a lot about autonomous vehicles, and this is all from the research from airline cockpits. You go do this, I'm over here doing blah-di-blah. Beep, alarm goes off, something goes wrong. Okay, now, oh, what happened? I don't know. Right?

Daniel Bryant: Yes.

Courtney Nash: And so now, first of all, I'm interrupted from another thing that I'm doing, and second of all, now I have to figure out what this ball of tar is doing that I don't know. And so one of the ironies of automation is that we put it in to make it easier on humans, but dealing with it when it goes wrong is actually harder than the normal state.

Daniel Bryant: Yes, that makes sense. Yes.

Courtney Nash: Okay. And so that comes from all the other domains too. This is a very common phenomenon in medicine, in healthcare, in aviation, in all these places. That's a part of it because of our mental model of it.

So one of the things that I recommend in the report is we want to make automation a joint player. We say a good team player. Joint cognitive systems is the fancy nerdy research term for that. But rethinking the mental model a little bit of how we build automation into our workflows, into our world, into our lives. Especially in tooling for us, we have so much. How can we make it easier to introspect? How can we know which dashboards to look at? Those kinds of things.

I wish we could take the sheer amount of time and money that we put into product UX and put it into developer UX for tooling, for incidents, for these complex systems that we have to exist inside of. So I think that's really where I landed with all of this: of course it's not going away. And of course, automation can be incredibly valuable to us, but we need to reconceptualize how we use it, the kinds of systems we use it in, and change that mental model. So automation in a simple linear system versus automation in this more complex system, they're not the same. Let's not treat them the same. Let's make our experiences with them more productive.
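As a small, hedged sketch of what "easier to introspect" might look like in code, the toy auto-scaler below records every decision it makes along with its reasoning and can report that history back on demand. The scaling rules and names are invented for illustration; this is not a pattern prescribed by the report or taken from any particular tool.

```python
# Minimal sketch of "introspectable" automation: the component keeps a
# human-readable record of what it decided and why, so a responder can ask.
# The auto-scaling rules here are invented purely for illustration.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    timestamp: datetime
    reason: str

@dataclass
class AutoScaler:
    target_cpu: float = 0.6
    replicas: int = 3
    history: list[Decision] = field(default_factory=list)

    def evaluate(self, observed_cpu: float) -> None:
        """Apply a toy scaling rule and record the reasoning behind it."""
        if observed_cpu > self.target_cpu and self.replicas < 10:
            self.replicas += 1
            reason = f"cpu {observed_cpu:.2f} above target {self.target_cpu:.2f}, scaled up to {self.replicas}"
        elif observed_cpu < self.target_cpu / 2 and self.replicas > 1:
            self.replicas -= 1
            reason = f"cpu {observed_cpu:.2f} well below target, scaled down to {self.replicas}"
        else:
            reason = f"cpu {observed_cpu:.2f} within bounds, holding at {self.replicas} replicas"
        self.history.append(Decision(datetime.now(timezone.utc), reason))

    def explain(self, last: int = 5) -> str:
        """The 'will it tell you its state?' part: a readable account of recent decisions."""
        return "\n".join(f"{d.timestamp:%H:%M:%S} {d.reason}" for d in self.history[-last:])

scaler = AutoScaler()
for cpu in (0.55, 0.82, 0.91, 0.40, 0.20):
    scaler.evaluate(cpu)
print(scaler.explain())
```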

How can AI help with a system’s explainability or understandability? [30:53]

Daniel Bryant: Yes, no, it makes a lot of sense, Courtney, and I guess I'm always trying to meet people where they're at. And as you were talking there, I was thinking some folks are going to be just getting their heads around automation as we understand it now, and then AI is going to layer on another aspect. Although I do wonder, can AI help us with that explainability, with understandability? I'm seeing a lot of work on that, early days I think, right?

Courtney Nash: Yes.

Daniel Bryant: But where should folks focus? Should they just get the basic automation principles down and build a socio-technical system around that?

Courtney Nash: Yes.

Daniel Bryant: Or do you think folks should be leaning towards more of these AI solutions that are popping up?

Courtney Nash: No, don't jump the shark. Right?

Daniel Bryant: Fair enough.

Courtney Nash: I mean, I have a lot of spicy opinions about AI. I mean, the argument I hear the most is, oh, it's just going to be better and smarter. And I'm like, okay, but we still haven't ... I mean, we're the ones making it. Okay.

Daniel Bryant: Yes. I see your point.

Courtney Nash: There's a logical black hole there that I feel like we are just ... it's like, what's that other cartoon, the physics one, where it's like, here, and then a miracle occurs. And I'm like, we're still just in that phase of AI. Explain it to me: if as a human you're confused by that system, then unless you've given the AI access to all of the inputs that you didn't know you needed to know, how the heck is it going to be better at diagnosing that than you are?

Daniel Bryant: It's a fair point.

Courtney Nash: This is like we can't model the whole system. The AI has to know where to go look. That whole it will be so much smarter, so much better is so much further out than we think it is. And again, the thing ... and I know people think I'm some sort of complaining Cassandra, but I will say I have a little bit of I told you so from that about autonomous cars. Because everyone was saying how close that was, and those of us in this space were like, uh, no. And now stuff's getting pulled back. Governmental agencies are getting more involved.

The realities around this are becoming much clearer to people that the hype of how close the proposed future state was was very off from the reality of being able to have an autonomous system operate in an environment like that.

Daniel Bryant: Yes, definitely.

Courtney Nash: So for me, that's the analogy I give: the hype I hear about generative AI and all of that feels very similar. And it's not just that it feels similar. It's founded on the same mental models and misconceptions about how automation and complex systems work. I think we're a long way from that. So I mean, I guess go play around with it or put it in your product.

Daniel Bryant: Yes, plus one to that. Yes.

Courtney Nash: Have it recommend the wrong things to people and then have to spend how many people and product hours rolling it back versus, yes. Again here, where you say like, oh, should the average person use MTTR? And I'm like, no, jettison it. Should the average organization figure out automation at a foundational level first? Yes.

Daniel Bryant: Okay. Yes.

Courtney Nash: And this is where the learning from incident stuff is really helpful because your engineers, your incident responders can tell you where the pain is in introspecting these systems and these failures. And if you want to reduce your response time or your number of ... Really, you want to make it easier for them to respond to things, that's where you spend your time.

I was just talking to somebody else who's like, when they're on call, and they don't have anything better to do, they're trying to just make on-call better. So if they're not getting called ... This was Fred Hebert at Honeycomb. He's like, I am working on these scripts, or I'm working on these things. And so there are individual attempts at that, but if an organization wants to get better at managing incidents, give people better tools, make things easier.

Can you explain more about joint cognitive theory? [34:53]

Daniel Bryant: Yes, totally makes sense, Courtney. I think there are two ways we can go here now and dive into and perhaps break down some of the things you've just explained before we wrap up. I'm also curious about the joint cognitive systems thing you mentioned there. Because I read in the report that there are 10 challenges for making automation a team player. And my background is actually as an academic as well, and I studied agent theory and things like that. I knew it was very hard, and this was 15, 20 years ago. So there's a really nice breakout in the report that talks about 10 of these challenges, and did you want to dive into those a bit more, or do you think that's going to distract folks-

Courtney Nash: I'd love to.

Daniel Bryant: ... or what do you think?

Courtney Nash: No, no. I think this is great. I end here because it ends on all is not lost. Right?

Daniel Bryant: I like it.

Courtney Nash: All is not lost. Some of this sounds depressing and sad, but all is not lost. We have a wealth of human factors research and user interface people. We have the experts in the world. Let's bring them to bear on these problems.

So the joint cognitive systems work I cite is from researcher David Woods, who I think is professor emeritus now. He's mostly retired from the Ohio State University. And Dr. Laura McGuire, who I mentioned speaking with, studied under him. And this also comes out of, guess what? Aviation cockpits and surgical environments early on. Before the internet started doing bad things to people, we could harm people pretty clearly in airplanes or in medical settings. And so looking at how do you have what they started to call joint cognitive systems?

And so this is a funny one because we are now anthropomorphizing our computers in a way that I'm okay with. It's like I'm walking a fine line here. I think we are. But if we accept the reality that we are together in this, we use this term a lot... you use it, I use it... socio-technical systems. Right?

Daniel Bryant: Yes.

Courtney Nash: If that is the case, let's do that. Let's update our mental model from computers are better at, humans are better at, to we are a team trying to achieve this common goal. What's the best way to do that?

And they're in there, in those 10 ways. There are some very specific things I talk about in terms of having better tooling, in taking our UX experts and turning them to our internal tools-

Daniel Bryant: I like that.

Courtney Nash: ... of like, can you introspect it? Will it tell you its state? Will it tell you all of those kinds of things? Really, you could do some of this on the command line. It doesn't have to be slick GUIs or any of that. That's not what matters. What matters is, can I know what you're doing? Can you know what I'm doing? The "you" in that is the system.

My favorite pop culture analogy to this that John Allspaw uses all the time is Iron Man. Right?

Daniel Bryant: Yes.

Courtney Nash: That's the idealized version of a joint cognitive system. And it's all very movie-based-... but it's like instead... not Minority Report, Iron Man. Not these dashboards that supposedly one magical expert knows how to use, but a thing you can talk to that can tell you what's happening.

Daniel Bryant: Yes, yes. Jarvis, right, in the world of-

Courtney Nash: Jarvis.

Daniel Bryant: Yes, yes. Interesting. With an agent-like quality, to your point, guiding your attention, summarizing things for you rather than just like, oh-

Courtney Nash: We are both agents in this system. How would you design that system then? Let's think about it that way. So I like that because it's not an insurmountable task, but it's work and it's roles and people and moving our corporate foci to areas that we don't always feel like we need to invest in. And that's hard right now, especially when the economics of being a company are very weird. I get that.

Could you explain the premise of your QCon New York talk from last year, “Comparing Apples and Volkswagens”? [38:51]

Daniel Bryant: Yes. Fantastic, Courtney. Fantastic. Yes. Before we wrap up, I'd love to do a final call to participate, but I noticed that before we were chatting, I went on to InfoQ, and the talk from QCon New York, where we actually crossed paths last year is now available. The recording is online, so people can pop along. I'll put the link in the show notes, and it's a great title, Comparing Apples and Volkswagens: The Problem With Aggregate Incident Metrics. Can you just do an elevator pitch for why folks should watch that? Because it's a great title. I'd love to know more.

Courtney Nash: Well, first of all, I owe my mother for the title. My dear late mother was a sociologist and had an amazing way with words. And that notion came from when I was a kid, and I would be so upset about this, that or the other, comparing my life to someone else's or like, oh... and this is a human thing that never changes. And instead of saying apples to oranges, she'd say, well, that's like comparing apples to Volkswagens. We had a Volkswagen. And that was such a great metaphor for me as a kid. And she also was like, it may be your apple or your Volkswagen, it's still yours.

So yes, I told John about that and he was like, "Oh my God, you have to use that." I was like, okay. So anyhow, it is the notion of using metrics that don't tell you about the reality of your experience really. So that's where the title came from.

There are a couple of little pieces in that one that we didn't talk about here, which were in a previous report. I look at whether duration and severity are related. So if people are curious about that one too, that one does come up. Like, are longer incidents worse? I put air quotes around worse for the listeners. The spoiler alert is no, there's literally no statistical correlation, but go watch it. Go find out more about that, because I think we also incentivize unfortunate behaviors. I mentioned OKRs around SEV threes or whatever. Let's have two SEV ones. Hopefully, I would like to think that some of those kinds of metric-based analyses will help people who are struggling in this environment go to their people, their higher-ups, and be like, look, here's the data. Let's do this differently.

Daniel Bryant: Yes, yes.

Courtney Nash: That's really my goal with that, right, is to empower the people at the sharp end to have more meaningful and impactful conversations with people at what we call the blunt end, right?

Daniel Bryant: Yes.

Courtney Nash: So that's a great talk if you need a little bit of that potential ammunition, I would say.

How can listeners help you or contribute to the VOID report? [41:12]

Daniel Bryant: Fantastic, Courtney. I'll be sure to link that. Final wrap-up question, Courtney, is how can folks help you? And I think we've already touched on this, but I'd love to hear it again.

Courtney Nash: Yes. We have a really simple form to submit little tiny incidents. I just realized we redesigned the site recently, as one does, and that form has gotten buried, so I'm going to try to figure out how to resuscitate that a little bit. But if people have a large body of incidents that aren't in there because I'm a one-woman show on hoovering these things up, they could reach out to me on LinkedIn or through the site, there's a little contact form.

I would be happy to work with people to pull their data in if they want to do that. But as I mentioned, we are going to try to field a much bigger survey this year, and that's going to hopefully have a huge impact on the experiences and the effectiveness of incident responders, on-call folks, all of that. So it's not out yet. I wish I had a link for you now, but you can sign up for our newsletter, and then you'll get all the details about when those kinds of things will be coming out. Thanks.

Daniel Bryant: Fantastic. Fantastic. Well, thank you very much for your time today, Courtney. So much there to cover. It was fantastic to overview the topic. I'll be sure to link all the reports and everything else. But thank you for your time.

Courtney Nash: Thank you so much. This was great. I appreciate it.
