The Ironies of A^2 I^2

Summary

J. Paul Reed discusses the "ironies of automation" - a 40-year-old concept now amplified by AI. He shows how advanced systems often make the human operator more crucial, not less, while simultaneously degrading the skills needed to intervene. Drawing on real-world stories of "AI-fueled" incidents, he explains why over-reliance on AI can double recovery times and how to maintain resilience.

Bio

J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful boutique consulting firm, he now spends his days as a Staff Incident Operations Manager at Chime, focusing on incident response, analysis, and systemic risk identification. He's worked with such organizations as VMware, Mozilla, Symantec, and Netflix.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

J. Paul Reed: The program had a printing of the title of this talk, and so I actually wanted to explain it. The carets were supposed to be superscripts, so like the ironies of A squared, I squared, or AI squared, and that expands to the ironies of automation and artificial intelligence in incidents. We're going to be talking about automation and AI, and how they show up in incidents. As I said, what do we mean by ironies of automation? What are we going to talk about today? Ironies of automation, what do I mean by that? What does this have to do with AI? What do we mean by ironies of AI? I promised incident story time, so we will have incident story time. We're going to talk a little bit about ETTO. Then we'll end with, this is all maybe fascinating and interesting, now what do we do with all this stuff? This is for me, actually, I'm really curious, who is a developer? Who is Ops, SRE, Platform Eng? Who's on an on-call rotation? Who's on call right now?

The one thing I did want to point out is the Human Factors and System Safety Masters. The reason I bring that up, it's not about bragging. What we're going to be talking about is some of the stuff that we studied about human factors and system safety. One of the interesting things is there's more and more software people going through that program. My classmates were pilots, and nurses, and doctors, and air traffic controllers, and they laughed when I said I work in software. They're like, why is that a safety critical system? I think you all know why it can be a safety critical system. I think of the AWS outage of a couple weeks ago, when we saw parts of our world, parts of the entire internet fall apart. We all know where the safety comes in here. One of the interesting things here is that we also focus on the human factors and the human aspect in these systems. All of you, all of us in these systems.

Ironies of Automation

The ironies of automation, what do we mean by that? I'm going to jump into what some of the identified ironies of automation are, and then we'll talk a little bit about where this came from, where this idea of these ironies came from. The first one is, manual skills deteriorate when they are not used. This seems intuitive. This idea that if I'm responsible for operating a system, but then maybe I introduce some automation and I don't use those skills, then they're going to deteriorate over time. You've probably all had this where you automate something and then in an incident, you have to come back to it, and like, I forget the arguments to this command or whatever it might be, and then you have to look it up. That's the first one.

Second one is the generation of new strategies requires an adequate knowledge of the system. When we talk about new strategies, what we're referring to is, again, in an incident context, or maybe even in developing new features, like new strategies, new ways of doing things. It requires that we have an adequate knowledge of the working system to be able to reason about and think about and propose these "new strategies". That was a second one of the ironic ironies of automation. This is a long quote, I'll read it and then I'll explain it. "There is some concern that the present generation of automated systems, which are monitored by former manual operators, are riding on their skills, which later generations of operators cannot be expected to have".

In our world, over time, one of the interesting things might be if you consider compilers as automation, who writes assembly anymore? Maybe some people do, but most of us don't. There's this idea that when we introduce automation into systems, we have to continue to cultivate the skillset for folks to monitor that automation. We have to have our junior engineers doing that, otherwise we're going to put them in situations that they're ill prepared for. Maybe if we move somewhere else or retire, we might have teams of folks that are newer to the automation and they may not have the familiarity that we had when we installed it, because they're used to it just working.

One of the ironies is us telling folks, just run the runbook. We tell them that, maybe in an incident: just run the runbook, it's fine. One of the interesting things is we put folks into situations and we tell them to follow instructions, but the reason we put them in those situations is to offer expertise. In an incident, we're like, it's broken, do you know how to fix it? They're like, the only thing I really know how to do is run the runbook. That in itself is an interesting irony. I love this one, this is a really critical point. Automation requires a speed versus correctness tradeoff.

As we automate things, and this applies to manufacturing, it applies to software, as we build, let's say, cars faster, we can't actually look over every single car that comes off the factory line to ensure it's 100% correct. This irony gets us into statistics and statistical correctness around the things that we're producing, whether it be software, whether it be widgets. As we increase the speed, we can only start to validate acceptability. We start to think about, it's not right or wrong, it's, is it in this range of acceptable performance, of acceptable widgets, of acceptable services where we deploy them in our software architectures? The key point here is we can't get both. We can't assert correctness and have high speed at the same time. This theme will come up again, and we'll revisit again, too, in this talk.
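The statistical side of this tradeoff can be sketched in a few lines of Python. This is a hypothetical illustration, not something shown in the talk: the widget widths, the tolerance band, the sample size, and the names `inspect_sample` and `within_tolerance` are all made up for the example. Instead of checking every item, it checks a random sample and accepts the batch if the sample stays within an allowed number of defects.

```python
import random


def inspect_sample(batch, sample_size, is_acceptable, max_defects, rng):
    """Accept or reject a whole batch by checking only a random sample of it."""
    sample = rng.sample(batch, sample_size)
    defects = sum(1 for item in sample if not is_acceptable(item))
    return defects <= max_defects, defects


def within_tolerance(width):
    # "Acceptable" is a range, not a single correct value (band is an assumption).
    return 9.85 <= width <= 10.15


# Hypothetical production run: 10,000 widgets with slightly varying widths.
rng = random.Random(42)
batch = [rng.gauss(10.0, 0.05) for _ in range(10_000)]

accepted, defects = inspect_sample(batch, 200, within_tolerance, 2, rng)
print(f"checked 200 of {len(batch)} widgets: {defects} defects, accepted={accepted}")
```

Checking 200 items instead of 10,000 is what buys the speed; the price is that the result is a statement about acceptability of the batch, not a guarantee about any single widget.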

Automation can camouflage the current system's state. There are a lot of, actually, aviation examples where you've got an autopilot that is subtly correcting some problem with the aircraft. You've told the autopilot, I want you to fly in this straight line, and maybe there's a problem with the rudder, a mechanical problem. It will, to its limit of its tolerances, try to keep the plane flying straight, and it will increase the deflection of the rudder as much as it's programmed to until it gets beyond its limitation. There's usually a design limitation there. There are a number of accidents, including accidents that have killed people, where the autopilot clicks off because it's exceeded that limitation, and suddenly the plane violently goes one way or the other because the automation was masking that there was a problem.

The interesting thing for operators is, pilots, they went from a situation where everything looked normal to them, and suddenly it was way out of the tolerance, way into a dangerous situation. We can see this in our systems with autoscaling, or some of the automated remediations that we put in place to just scale things up. You probably have seen this. Autoscaling is one of my favorite ones because Amazon will scale these services for us, and maybe it doesn't reason about whether that's going to create a thundering herd for a downstream service or something of that nature. This is an interesting key point as well. This is one of my favorite ones. Automatic systems should fail obviously. My reaction to this is like, really? That's an irony. What's interesting about this is it's worth saying it explicitly because automatic systems sometimes fail in subtle ways that aren't obvious. That is actually in and of itself an irony, that when we build automation, we don't always build complete ways for it to tell us where it is on a path to failure or how close it is to potentially failing.
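One way to picture automation that fails obviously rather than silently is a compensator that reports how much of its authority it is using. The sketch below is a toy model under stated assumptions, not real autopilot or autoscaler logic: `TrimStatus`, `compensate`, the 70% warning threshold, and the limit of 10.0 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrimStatus:
    correction: float        # compensation the automation is applying right now
    margin: float            # fraction of its authority still unused (1.0 = idle)
    engaged: bool            # False once the limit is exceeded and control is handed back
    warning: Optional[str]   # set when the automation is working unusually hard


def compensate(disturbance: float, limit: float, warn_at: float = 0.7) -> TrimStatus:
    """Cancel a disturbance up to `limit`, and say how close to the limit we are."""
    needed = abs(disturbance)
    if needed > limit:
        # Beyond its design limit the automation gives up entirely.
        return TrimStatus(0.0, 0.0, False, "limit exceeded: disengaged, operator has control")
    used = needed / limit
    warning = None
    if used >= warn_at:
        warning = f"using {used:.0%} of available authority to hold course"
    return TrimStatus(-disturbance, 1.0 - used, True, warning)


# A slowly worsening fault: quiet at first, loud well before the hand-off.
for fault in (1.0, 4.0, 8.0, 11.0):
    print(fault, compensate(fault, limit=10.0))
```

The design choice that matters is the `margin` and `warning` fields: without them, the first three faults look identical to the operator, and the fourth arrives as a surprise.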

This is super applicable today, tracing the decision trees made by algorithms can be difficult. This is the idea that when automation is doing what it's doing in the moment and we come up to the situation and are trying to make sense of it because something has happened or something has gone wrong, it's hard for us to figure out, how did the automation get from point A to we're at point T now? How did it get there? Because I need to know that so I can know how we can get it back to point A, which is maybe acceptable system performance. What's interesting about now, today this might be impossible. With AI, we may not be able to know what that decision tree was, may not even really be a tree of sorts. That can be very opaque. Also, the interesting thing about AI is the automation you hope is deterministic. There's like if statements.

In AI, that's not necessarily a guarantee. This leads us to an inability to fully understand the current system context when you are paged. For folks that have been on call in an incident, have run into this problem where you get paged and you have a bunch of questions because our various forms of automation in our cloud providers, in our systems that we've designed have already masked or tried to solve whatever situation is happening before you got there. You have to make sense of that in addition to what is the actual real problem. Have you ever asked yourself or even said one of these things in an incident? This is from a study of pilots and doctors. I don't think anybody wants a pilot saying, how do I stop it from doing this?

One of my favorite ones here is, I know there is some way to get it to do what I want. There's that hope there. I know it can do it. I've seen it do it before. What is this automation doing? All of these ironies of automation come from a paper by a woman named Bainbridge. The thesis of the paper here is, the paper suggests that the increased interest in human factors among engineers reflects the irony that the more advanced a control system is, the more crucial may be the contribution of the human operator. If you look real closely, you can see, but in your mind, when do you think this research was done? How many years ago? Forty years ago, yes, that's right. This is 1983. All of those things in the last walkthrough of the ironies of automation, if you identified with them, you're like, I felt that way in an incident, we've been trying to think about this problem for literally 40 years.

I want to talk a little bit about joint cognitive systems because the terminology we use will come back and it will be useful for our discussion here. When we talk about a joint cognitive system, what we're talking about is a system that has usually some boundary of some sort and there are agents and actors within that system. When we talk about operators, us humans doing things, there are some things that they have identified that we can do within a joint cognitive system. For the purposes of this, let's just keep it to, it's us people working together in incidents. It's human to human interactions in this. Autonomy, what do we mean by autonomy? Autonomy means, very simply, I can do a thing. I can work on my own to accomplish a goal. Again, with a two-party system, we're autonomous. Authority, that means I have authority within the system to do a thing. It's not like a command authority. It's more like I can pull this lever or I can turn this switch, or I can type these commands. Directed attention. Humans have directed attention.

Right now, your attention, hopefully I'm interesting enough, it's directed at me. We have our attention generally focused on one thing and there's a lot of research on how that works in the brain. Because we have a directed attention, we also have redirectability. That means if I see someone and I need their help, I can be like, "Lorin, I need your help. Whatever you're focusing on, can you come focus with me on this thing?" Then there's also this idea of interpredictability. Really high-performing teams have really good interpredictability. The quintessential example of this is if you have a really good friend and you finish each other's sentences, and you know what you both meant and you don't have to actually correct what you thought you meant, you have very high interpredictability. Really good SRE teams and incident response teams have really high interpredictability because I can predict what I think you're going to do in an incident and I can be with high confidence, 80%, 90% correct that you will do that.

Let's look at this from an automation perspective and maybe an AI perspective as well. Automation certainly has autonomy. It can do things on its own without us telling it to do it. That's why we build it. It certainly has authority. We need to give it authority to turn switches and deploy things and run code because, otherwise, why would we have it? Automation doesn't have directed attention. I can't ask the automation, what are you paying attention to right now? Because I can't do that, it also means I can't stop the automation and say, this thing over here is on fire, I need you to pay attention. You can't do that with automation.

Then, of course, interpredictability. This is why when automation goes awry, it is violating any interpredictability we thought we had. What's important about this is these things, these last three things, directed attention, redirectability, interpredictability are the foundations of coordination. When we talk about, I'm coordinating with my team to resolve an incident, this is where all of that cognitive work is happening. Dave Woods is a researcher. He came up through Three Mile Island. He researched that incident. He's been doing this a long time. He'll make a couple of appearances in this presentation. I really like this quote, "Technologists often mistake connectivity, the capability to connect disparate parties and data sources, for coordination". I'll come back to this again: this distinction of connectivity versus coordination, they are not the same things.

I want to walk us through what's called the animacy paradox because it's interesting for our discussion, but also it may resonate when we talk about automation, why our sense of it gets changed in situations, why it's context dependent. Automated systems, as they increase in autonomy and authority, have two kinds of interpretations. We interpret them in two different ways. One is as a deterministic machine. I wrote this script. I know what it's going to do. I've run it a ton of times. I've maybe had some bug fixes. I can predict what it's going to do. It's deterministic. The other interpretation is as an animate agent capable of activities independent of the operator. What is the difference between these two interpretations? One of the interpretations is in context.

This animate machine, this automation that is doing something, and I don't know what it's doing, why won't it stop? Why can't I redirect it and make it do something else? Where is it in this list of steps that it's doing those things? Those questions come up in the context of things like incidents, things like high-tempo, high-consequence situations. I'm sure you've all been here, when we go and do the IR, and we find out, of course the script does this, and it's going to call that. It's not high-tempo, it's not high-consequence. We're thinking about it in hindsight. Yes, of course, it's a deterministic machine. I got it now. It's not a weirdo animate agent that can do things on its own. I gave you the punchline, but the difference between in context and in hindsight in this example are these high-tempo, high-consequence incident situations. In incidents is where things are already broken, we're already stressed. We're trying to get a handle on this situation. These agents, whether they be automation or AI, often can seem like they're just running around, doing things, and I don't know what they're doing. I just want them to stop so I can get a handle on the situation. That was Bainbridge, 1983. Long time ago, ironies of automation.

Ironies of Artificial Intelligence

Ironies of artificial intelligence. All this AI stuff, you knew somebody was going to have to talk about this. Mica Endsley is another researcher. She did a lot of work on situational awareness. If you know what situational awareness is, she did a lot of the foundational work on, what does that mean? What does it look like? She, 40 years later, this is a very recent paper, wrote up the Ironies of AI. The thesis here in this paper is AI can be considered an advanced and potentially more capable form of automation. Its reliance on learning algorithms also creates additional challenges for the people who interact with these systems. Let's talk about the ironies that she uncovered. Artificial intelligence is still not that intelligent.

At least in the research literature, an intelligent system is defined as one that can recognize situations, adapts to changes, generates solutions to even novel problems, and can act to optimize performance. Machine learning is excellent for data analysis when there's a large enough dataset, but where it falls short is the part about novel situations. Predictive text, predictive models predict stuff they've seen before. That's what they're doing. Oftentimes, we're in a situation where we're seeing novel situations. We might be relying on AI for answers. There's an irony there. Also, it was pointed out that ML-based AI systems lack causal models. Again, they're predicting things they've seen before.

The expansiveness of their model of causation in the systems, they don't always have. We have that because why? We've been in a lot of incidents. We know that the foo service that sits in an entirely different AWS region can actually affect the bar service in another entirely different AWS region. The more intelligent and adaptive the AI, the less able people are to understand the system. I know I've struggled with AI. Because there aren't distinct logical rules or they're not apparent, there's this opaqueness to AI. Part of good incident response is having a mental model of the system, and a lot of these models of how AI works are opaque. You can see this when it gives you an answer, again, this is an interpredictability thing, where you would have thought, I know what answer it's going to give me, and it gives me a completely different answer, which also might be wrong.

The more capable the AI, the poorer people's self-adaptive behaviors for compensating for shortcomings. What's interesting is this is a situational awareness argument. A great example, you see it a lot more now, is like Tesla self-driving. The more capable, let's say, Tesla self-driving is, the more distracted we get, the less we pay attention and the less situational awareness we have. It makes it harder to know when I actually need to step in and correct the AI, because it is doing something that is dangerous. The interesting thing here is that the more capable the AI, the more problematic this is.

The more intelligent the AI, the more obscure it is, and the less able people are to determine its limitations and biases, and therefore, when to use it. We'll talk more about this a little later in some examples, but really, this is about bias. We've seen this, if you follow a lot of AI stuff, AI models train on data, data can be biased. The more that it's doing that, it can make it harder to signal to us, when should I question it more? Or when should I say, actually, this is not a good use case for AI? I think this one is one of my favorite ones. The more natural the AI communications, the more natural the way we communicate with AI, the less able people are to understand the trustworthiness of the AI.

My favorite example, this is from my life, this became like a meme, where the AI, you'll tell it it's wrong, and it's like, you're absolutely right. It's so excited that it's wrong. I don't know about you, but when I'm talking to a human, and I am wrong, and I know that I'm wrong, I feel a little bad about it, and suddenly, I want to have a conversation with that person about, let's talk more about the bit of data that you have that was right. AI, whether it's right or wrong, it's super confident. I know you can change that these days, there's prompts to do that, but. One of the things I did want to point out, typically, people rely on a number of cues to determine how much confidence to place in information. In incidents, when an engineer says, I think we should do this, that's very different than I think we should do this. When we're reading AI, in either its stone-cold voice or text, we are inferring the tone, humans just do that. We lose that bit of information about how confident we should be about AI.

Incident Story

Incident story time. I do need to point out, these are crowdsourced, so if you're thinking, these are all from incidents that I personally worked, you're wrong. In fact, the LFI, learning from incidents folks, we all get together and talk about all the incidents you all have, and spill the tea, as it were. These are some of the AI-related incidents from various sources. I love this one, this is a Claude one. The instruction was, can you add a localization for this string? The red lines you see are the point to pay attention to. Confirm with me before making the phrase change. The AI said, great, I'm going to use the Nabu localization agent to go ahead and do that. This is an example of Claude agent, and then subagent doing a thing in my codebase. As the operator, I told Claude, I need you to do a thing, but confirm with me before you do the thing. Claude said, got it, great, going to go do the thing. The thing that I need to do is the subagent, so I'm going to do that.

Then, Claude stands in for me, after the agent does the work and says, yes, it looks perfect, go ahead and commit it. I told the AI to check with me first, and Claude said, I hear you, but I'm going to spin up a bunch of agents, and I'll tell them it's fine, for you. What's interesting to me about this is, humans generally would not do that. If we're together working on a problem, and like, nobody touch the button, generally one of us is not going to be, "He said not to touch the button, but you can go touch the button, it's fine". This caused an incident. These are in order. This is like a pre-incident. This is in an incident. The context here is we had two deployed services serving the same functionality, which we were kind of migrating. They were implemented in different languages. Both were deployed. One was in Java. One was in Go.

The Go service didn't implement a corner case, so they left some data untagged, and that was causing some problems. AI was asked to go port the relevant Java code to Go. Not the whole thing. It's like, here's the feature, can you find it? Yes, great. Take this, port it to Go. Very constrained problem. I would have thought, great. AI should be great at this. Narrator: the AI-implemented code prompted another incident. They took the code, AI checked it in, deployed it in Go. It caused another incident. The AI-implemented code did not contain unit tests because they forgot to ask for that. It was difficult for the responders to code review, and they were using AI specifically because they didn't know Go. The person who had written that service was long gone. That team owning the service had very little Go knowledge.

As I mentioned, caused another distinct incident, which necessitated additional cleanup work. Likely increased the length of the incident by two to three times. This was a multi-day incident. This is my commentary. Likely, this is a counterfactual. I can't prove that. What I can tell you is that the service got deployed two or three times, and then there was additional remediation work that took days because of the problems that were created by deploying the AI code. This is an example of where I would have thought, take this little snippet in Java and port it to Go. Perfect, but maybe not.

Incident the third. This is a general class around summaries. This is Slack, and the blocked out parts are names. So and so, just to confirm, both you and so and so are AFK and should be back by 7:40. This is in an incident, Slack channel. The person didn't ask for this. AI just barges in. I searched for your company's knowledge and shared a suggestion with so-and-so privately in this thread. Do you want to know what the suggestion was? Yes, both name and name are AFK and should be back soon. Thank you, AI. That was so useful.

I bring up this particular example because we have now AI agents. It's not like somebody said, let's have the AI agents come into our incident Slack channels. It was a company that was like, we have one of those tools that just gobbles everything, Confluence, Jira, Slack, everything. It just popped into an incident with like, let me be really unhelpful and also distract you. That's interesting. Maybe folks have had that experience. Again, this goes back to, in an AI context, directed attention, redirectability. My favorite one in AI is like redirectability. How many times have you said, AI, you have it wrong? It's like, ok, cool. Then it spits out a wronger suggestion. You're still wrong. Now, sometimes, if I'm really bored, I'll just keep finding ways to tell it it's wrong, just to see what it comes up with because I'm bored.

ETTO

Let's talk a little bit about ETTO. What is ETTO? ETTO is a term coined by Dr. Erik Hollnagel in 2009. It stands for the Efficiency-Thoroughness Trade-Off. This is a key point, people, individuals, and organizations, as part of their activities, have to make tradeoffs between the resources, primarily time and effort they spend on preparing to do something, and the resources, primarily time and effort they spend on actually doing it. The tradeoff may favor thoroughness over efficiency if safety and quality are dominant concerns, and efficiency over thoroughness if throughput and output are the dominant concerns. I think the key point here is that you can't maximize efficiency and thoroughness at the same time. It's always a tradeoff. You have to have a minimum of both to get something done that is of use.

The point here that I want to make is, for responders who are tackling an incident, the bet has already been made for them on efficiency, whether they made it themselves or the org made it for them. Because they're in an incident, they already lost the bet. The money is gone. You lost it. When you go into an incident and your first thought is, I bet AI can solve this problem, or automation can solve this problem, you are betting on efficiency again, and you already lost that bet. That's something to think about.

Now What?

Now what? You might be thinking, Paul is an AI hater. I am not an AI hater. I want to be very clear about that. This is not a talk about AI is horrible and we should never use it. The current fundamental question I think we have to ask ourselves in the context of incidents is, given where we are now, what is AI's fitness for purpose in an incident? A couple thoughts about how we can reason about this. I wanted to talk about this. This is very recent. You'll look at the date, July 15, 2025. You'll see Dave Woods's name with some other researchers. They did a really interesting study we're going to talk about, on how AI can degrade human performance in high-stakes settings, in incidents. There was also a different study, where they asked developers, did you feel more productive?

That one basically found that developers took 19% longer to complete a task with AI, even though they thought they had completed it 20% faster. What's interesting about the first study is that it was actually with nurses, and what they found is a little more nuanced. When AI predictions were most correct, nurses performed 53% to 67% better than when they worked without AI. Great. That's good. However, when the AI predictions were most misleading, those nurses performed 96% to 120% worse than when they worked without AI assistance. The punchline here is when it's good, it's a little good. It's going to be helpful. It's moving the needle. Sure, 53% is great. That's great. When it's bad, it can be really bad. Let's deal with the automation ironies.

One of the things we have to do is engage in practices that cultivate the ability to buy time. What do I mean by buy time? In an incident, we want to have strategies that give us more maneuverability, give us more time to think about the problem for us to be thorough. How do we do this? Simulating, simulations, chaos engineering game days, those sorts of things. That allows us to practice strategies in incidents that we can have ready at our recall in an incident context. Widen system understanding. This is the whole bit. You've heard mental models mentioned a lot. The individuals and the teams that are the best at incident response have very broad mental models of the system that they work on.

I think one of the siren songs of AI is, the AI has the model, so I don't have to keep it in my head. Or worse for junior engineers, I'll look at the AI, so I don't even have to build a mental model. That's actually false. We still need those mental models because we need to be able to tell the AI when its mental model is actually wrong, and it's hallucinating. For an automation perspective, consider the element of time pressures when designing automation. There's a few ways to do this. Looking at ways that you can actually slow the automation down in timing sensitive situations, that buys you time. You usually wouldn't want to do that, but there are cases where you might.
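Slowing automation down to buy time can be as simple as capping how far it may move per step and how often it may act. The following is a minimal sketch assuming a hypothetical scaler; `next_capacity`, `max_step`, `cooldown_s`, and the `hold` switch are illustrative names, not any cloud provider's API.

```python
def next_capacity(current: int, desired: int, max_step: int,
                  seconds_since_change: float, cooldown_s: float,
                  hold: bool = False) -> int:
    """Move toward the desired capacity, but never faster than the guardrails allow."""
    # An operator hold, or a change made too recently, freezes the automation.
    if hold or seconds_since_change < cooldown_s:
        return current
    # Clamp the jump so one decision cannot swing capacity by more than max_step.
    step = max(-max_step, min(max_step, desired - current))
    return current + step


# The scaler wants to go from 10 to 100 instances, but may only add 5 per minute.
print(next_capacity(10, 100, max_step=5, seconds_since_change=120, cooldown_s=60))  # 15
```

The `hold` flag is the part aimed at incidents: it gives a responder a way to freeze the automation while they work out what it has already done.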

Let's talk about the AI ironies, how we deal with those. Support for human interaction and oversight. These are really nascent right now. If we have agents running around in our systems. Often, it's really hard to get oversight on what those agents might be doing, and especially in an incident context. The interaction models are still pretty nascent, too. It's all text-based, prompt-based. What I love, too, is when I'm like, I'm doing the thing with AI like you're wanting me to, and I'm like, it's not working. They're like, the problem is you. The prompt is wrong. I'm like, ok. There's some work for us to do there. Attribution.

AI systems must be explicitly identified as AI systems and as bots. In incidents, I need to know if the information is coming from a human or an AI. Not because one is better or worse. I just need to know that so I can understand what I'm dealing with as an incident coordinator, which is my day job. Explainability. AI systems must be equipped with explainability features that allow people who interact with it to understand the AI's capabilities and its limitations. This is what I was talking about before, training and skill retention. The number of times people have popped into incidents and just started using AI, it's interesting. It's fine. That's probably not the best time to be practicing your prompt engineering when you might be losing $1,000 a minute. Those aren't hypothetical things. Again, lots of juicy incidents in the LFI Slack.
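The attribution and explainability points could be made concrete in an incident channel's message model. This is a hypothetical sketch, not a real chat tool's API: `Source`, `IncidentMessage`, the `basis` field, and `render` are names invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HUMAN = "human"
    AI = "ai"


@dataclass(frozen=True)
class IncidentMessage:
    author: str
    source: Source    # attribution: every message says who or what produced it
    text: str
    basis: str = ""   # explainability: what the statement rests on, if anything


def render(msg: IncidentMessage) -> str:
    """Format a message so responders can see its origin at a glance."""
    tag = "[AI]" if msg.source is Source.AI else "[human]"
    line = f"{tag} {msg.author}: {msg.text}"
    if msg.basis:
        line += f" (basis: {msg.basis})"
    elif msg.source is Source.AI:
        # An AI statement with nothing behind it is flagged, not hidden.
        line += " (no explanation provided)"
    return line


print(render(IncidentMessage("summarizer-bot", Source.AI, "Both responders are AFK.")))
print(render(IncidentMessage("alice", Source.HUMAN, "Rolling back now.", "error rate doubled after deploy")))
```

The tag does not rank one source above the other; it only gives the incident coordinator the information needed to decide how to weigh each message.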

From a resilience engineering perspective, there's ETTO, the Efficiency-Thoroughness Trade-Off; TETO is the Thoroughness-Efficiency Trade-Off. The thing that makes us generally more efficient over time is the fact that there is a thoroughness component at some point. It's that scale that goes back and forth. Because we're different actors, it means we can actually help the AI be the efficiency arm of that scale while we use the extra time to become more thorough, so that when thoroughness becomes a more important aspect of what we're doing, we have that capability as human operators to respond and to anticipate. The last bit I want to talk about in dealing with the AI ironies is joint testing of human-AI systems. This is a new area. A lot of the human factors folks are focusing on this; joint activity testing is actually what it's called. I'm going to show you a really interesting graph. This is from that study with the nurses. Basically, on the x-axis is the magnitude of AI error. Further right is more error.

Then the y-axis is complexity of the task. There are three columns there. The first column is AI recommendation only. The second column is AI recommendation and explanation between the AI and the human operator. The third column is the AI explanation only. If you look at this, what you will see is that the outcomes are better when the AI provides the explanation only. That is when it is used as an agent to gather data, when you use it to redirect your own attention. But when you rely on it for a recommendation, and especially when you rely on it for just the recommendation and just the conclusion, those are the graphs on the left where you see that dip. On the right, that third column is where you're getting the explanation only. As things get more complicated, the performance lines don't diverge as much.

Key Takeaway

I talked about a lot of things. If you get one takeaway from this presentation as an incident coordinator, this would be my ask to you. Please help your friendly incident coordinator out. If you're going to use AI in an incident, please let them know. Please let them know that you're using it and what you're using it for. The reason for that is some of these incidents that took a long time, took days to resolve because the team was coding the solution with AI. I can guarantee you that the incident commander in that situation would have marshaled additional resources, probably would have found someone who knew Go to solve that problem if they had known the team was using AI. If you use AI in incidents, please let your neighborhood friendly incident commander know.

Questions and Answers

Participant 1: I would summarize most of the ironies you listed, or in fact, your whole presentation, as: the more you use AI, the more your skills erode, the less capable you become of judging its output, and the less you understand what you're actually doing. Is that a good summary? Is there research about how to incentivize people to remediate that problem without having to do the actual grunt work again manually? Because that would defeat the purpose of using AI in the first place.

J. Paul Reed: In terms of a summary of the presentation, the broader takeaway than the one I just gave you is that there's a lot of what I would call unbridled enthusiasm for AI. There are a lot of cases where I think it doesn't perform particularly well, but we get a big dopamine hit thinking it does. If we're coding, and we're vibe coding, and we're doing a weekend project, it's like polishing your car, or whatever. That's enjoyable. In the space where incident operations and incident coordinators work, that's not the time and space to be doing it. I hope there's a conversation around what the best places to use AI are, and what they are not. Also, in a professional context, as we use it, what do we do to set ourselves up for success when we can't use it or shouldn't use it? You've probably all heard about code reviews of AI code. The code generation part is much faster. Then the review part becomes more important, not just for correctness, but so that if it goes wrong two weeks later in an incident, I as a human have a mental model of what was going on and what it was generally doing. From a summarization standpoint, that's how I would summarize it.

Participant 1: How do we get people to actually keep those skills and that understanding if they have no incentive of actually keeping it because it's all delegated to AI?

J. Paul Reed: That's an interesting thing. You're asking an organizational dynamics question, which is, how do we incentivize engineers to care about a thing that we're also telling them they don't need to care about anymore. This is where I would actually argue that you're talking about incentives, but when we talk about the expertise of engineers, you're making those Efficiency-Thoroughness Trade-Offs all the time, whether or not you know it. Where it gets interesting is when it's under pressure. That's where we use our expertise, our experience of like, that didn't work out very well, to actually have a conversation about what we're being incentivized to do. The running joke right now is about code generation: right now, AI's main impact on my life is more incidents, because we're generating more code and we're struggling with these problems. What I'd say is that the expertise of the human folks who get called into the burning buildings when the software stack is on fire is how we're going to change those incentives, or at least have more nuanced conversations about them.

Participant 2: I've seen in my organization that some postmortems and/or incident channels have AI summaries. Do you have stories of where that goes wrong and how people can still retain that exercise of going ahead and thoroughly adding details to postmortems or incident channels?

J. Paul Reed: I've had this conversation a number of times. I had it here at QCon actually. They want to use AI to generate the retro, the postmortem doc. They also want to use AI like, I come into Slack, I don't want to read the whole thing, AI summary. The major problem with that, and Lorin, Molly, and I have a lot of deep incident analysis experience, is this: when you do deep incident analysis, the AI is summarizing what it can see. Incident analysis is about finding what you didn't see and what is invisible in what's going on. I actually took a pretty strong stance, because I was having this conversation and part of my job as an incident coordinator is to do the incident reviews and the incident write-ups and the incident analysis. They were like, yes, AI all that. I was trying to explain, there are nuances.

I have great incident stories where it's like the incident occurred not because somebody went on vacation, but because they came back from vacation. AI is not going to tell you that. AI can't figure that out. What I eventually got to in this conversation was the person saying, I don't actually care about incident reviews. They're all just work I don't want to have to do anyway. I just want to get back to coding. We can talk about that. What I see when people are like, "I just want to use AI summaries to do the incident reviews", is that it's because they have read a lot of just really not great incident reviews, where somebody is rushed for time and they're summarizing anyway.

Of course, AI can do that better, but you're not going to get insights. You're not going to get learning because, again, a summarization is what AI thinks is important based on how often it came up. That's not insight. Insight is different. Learning is different. Also, AI can't make the argument to your CTO that you should invest in a technology or divest from a technology. I actually said, if you're doing all of your incident reviews as just summarize with AI, put it in Confluence, and move on, don't do them. No one's going to read them. You didn't write them. Nobody cares. If nobody cares, why do the work? Folks that do LFI in organizations will have very great conversations about why you do that work, because they'll have the best stories about an incident that I guarantee you AI will never ever catch, because it doesn't know to ask. It doesn't know that that's relevant or interesting, because that's all aesthetics and art, really.

 


 

Recorded at:

May 21, 2026
