AI-Powered SRE for Autonomous Incident Response

Summary

The presenters discuss incident response and how AI-enhanced SRE platforms connect signals from logs, metrics, traces, and historical incidents to enable autonomous decisions.

Bio

Rohit Dhawan is Engineering Manager @Amazon. Pavan Madduri is Senior Cloud Platform Engineer @Grainger and "Kubestronaut". Alina Astapovich is Platform Engineer @Storytel. Goutham Rao is CEO and co-founder of NeuBird, the creators of Hawkeye, the world’s first generative AI-powered ITOps engineer. Renato Losio is an InfoQ Editor.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Renato Losio: In this session today, we are going to chat about AI and how AI is changing site reliability engineering.

What's the topic today? AI is changing our DevOps and SRE practices by moving teams beyond reactive monitoring, which is what we have done for many years, towards more predictive, automated delivery and operations. There are many questions that, as a DevOps engineer myself, I have: how can a team implement incident response? How can we do better alarm correlation? How can we fix problems in production before our users are affected?

My name is Renato Losio. I'm a cloud architect, cloud practitioner. I'm an editor here at InfoQ. I'm going to be joined by four experts coming from different companies, different sectors, different backgrounds. They will discuss how AI agents and generative models are going to be used for incident detection, root cause analysis, and remediation. Basically, how we're going to make your end user happy when things go wrong. I would like to give each one of them the chance to introduce themselves, share their professional journey, and why they think that today's topic matters.

Rohit Dhawan: My name is Rohit Dhawan. I'm an engineering manager at Amazon. I've been working there for close to a decade now. During my journey there, I've worked on helping Amazon launch into new marketplaces, launching new payment methods and processors, and building some of the core platforms here, for example, the charge platform and the settlements and seller disbursements platforms. Currently, I lead a worldwide payment processing pipeline, which is critical to customer trust and processes billions of dollars. In terms of my site reliability engineering experience, at Amazon we own our business logic as well as our infrastructure, so that plays a crucial role from a DevOps standpoint too. Apart from professional experience, I've been active in research areas as well. I have a few papers published in IEEE and Frontiers venues on AI, DevOps, and software sustainability.

Alina Astapovich: My name is Alina Astapovich. I'm a platform engineer at Storytel. We are one of the world's largest audiobook and eBook streaming services. For the past few years, I've spent a lot of time building different internal platforms and tools, including AI-powered ones, to make working with infrastructure easier for developers. I originally came from a Java backend background, so I know firsthand where developers struggle. I try to build solutions that actually make their day-to-day work better.

Pavan Madduri: My name is Pavan Madduri. I'm a Senior Cloud Platform Engineer at Grainger, based in Chicago. I work heavily on multi-cluster Kubernetes environments. I'm also a CNCF Golden Kubestronaut. I also work on various open-source projects like Volcano, KEDA, and Dragonfly. I work heavily on NUMA topologies, how the hardware works for AI agents and all that stuff. That's the major area I work in. Apart from that, I also do various publications, like IEEE research papers. I also write as an author for VKTR, which is an AI magazine, and other outlets like DevOps.com. Then, of course, my favorite, CNCF, where I publish as well.

Goutham Rao: My name is Goutham Rao. I go by Gou. I'm one of the co-founders here at NeuBird AI. Just a quick background on myself. I did my graduate school at UPenn, focused on AI, data science. Fast forward to today, what we're doing at NeuBird AI is building a context engineering and a data enrichment platform using AI for people building agentic systems. We focus on production IT operations. The SRE space intersects with what we're doing. We focus on IT telemetry, basically metrics, alerts, logs, traces, and use AI to do advanced context engineering so that anybody that is building an agentic system or using an agent like Claude Desktop or even NeuBird's own agent can get superior and highly accurate results. We also focus on predictive operations. Using our AI, we're able to predict when problems can go wrong and what the remediation should be.

The First Real Problem AI Should Solve in DevOps

Renato Losio: I'd like to start by addressing real problems. We hear many different stories, many options, and I mentioned before many areas where DevOps can benefit; we'll discuss them. But if you had to point to the very first real problem AI should help with in DevOps today, what would you suggest?

Rohit Dhawan: Just to start with, I would say cognitive overload, or information overload, is one aspect. As service owners, we manage a large number of queues and receive a high volume of tickets. Tickets generally come from multiple queues. A ticket can be fresh, or it can already have 40 or 50 comments and have passed through multiple teams. The key thing is, how do we summarize that information so that it is easier for on-calls and everyone reading it to understand what has been done, what has been tried, and what the next steps are? That is one thing we started with, because that's where AI's power came in, just summarizing things. It also saves time for on-calls, because after their shift they have to present these things in on-call handovers, or leaders need to hear about them in business reviews. It helps bring that information together in a concise manner. In this way, you can save at least 15 or 20 minutes and still know what is going on. That is the very first problem. There are definitely a lot of other things that come along after it.

Renato Losio: Do you agree? Is information overload the very first problem that AI could solve for us as DevOps engineers as well?

Alina Astapovich: We have so many problems that AI should help us with, it's really hard to pick only one. For example, today my main problem was writing a huge document. Thanks, AI, for helping me with documentation. Tomorrow it might be something else. When I was thinking about this question, I had two things in mind. The first is one of the reasons we're here: investigation of incidents. There you need to act fast and you need to act precisely. You don't have a lot of time to hesitate or have second thoughts, especially if it's a client-facing service. This is where AI definitely should come in. Here we should also be careful about how far we trust AI in those situations. I think this is the main problem and the main area where AI would really be helpful.

Goutham Rao: I agree with Alina's comment. Unlike in other domains where you're using AI for creativity, for instance making music or art, where the outcome is subjective, here, with production operations, if you're using AI and it leads you down the wrong path, especially, like Alina said, on a client-facing issue, and you start wasting time going down the wrong path, then you start mistrusting AI and it just becomes a bubble. For me, there are two things where AI can really help as it relates to DevOps and IT operations. These modern, cloud-native systems generate a lot of telemetry. There are just a lot of logs and metrics and traces. It very quickly becomes a quantity problem, a scale problem. Being able to use AI to surgically extract what is important for the issue at hand, that's problem number one in my mind. The other thing is workflow automation. To Rohit's point, there are just a lot of things engineers have to do to connect the various dots. Alina said she was busy writing a big document today. AI can also help people get their time back so they can focus on the essential tasks, not the busy work that people unfortunately have to do.

AI's Role Where Human Attention is Wasted

Renato Losio: I wanted to address a slightly different point. We just said AI can help, and it can help us address things we sometimes don't have time to do, or do them better than us. I was wondering, where in DevOps do you see human attention being most wasted, the area where AI could contribute the most?

Goutham Rao: In my mind, it's not that AI can do something better than humans. I don't think it can. It's just that humans, doing the same body of work again and again, get tired, while GPUs don't. We'll end up making mistakes. Maybe we won't look at a log line we should have looked at. There's just too much work to do, and people lean more toward creative tasks as opposed to repetitive, mundane tasks. This is where AI can help. It can do that job without complaining, day and night, again and again.

Renato Losio: We hallucinate too, in the end. Do you agree with Gou? Are there particular areas where you see human attention actually being wasted today in DevOps?

Pavan Madduri: I truly agree with Gou. We term it an observability tax: human attention is mostly wasted on finding the data instead of acting on it. The first problem AI must solve is filtering the alert-to-noise ratio. If AI can tag alerts as actionable versus informational, you can instantly reduce operator fatigue before you even talk about auto-remediation. That helps the human act immediately on what actually needs action. In the background, AI can gather all the information and have it ready for you to act on.
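
To make this concrete, here is a minimal Python sketch of the actionable-versus-informational tagging Pavan describes. The alert fields and triage rules are illustrative assumptions; a real system would learn them from historical incidents or delegate the decision to a model.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    source: str          # e.g. "datadog", "cloudwatch"
    severity: str        # "critical", "warning", "info"
    recurrences_1h: int  # how often this alert fired in the last hour
    tied_to_slo: bool    # does it map to a user-facing SLO?

def triage(alert: Alert) -> str:
    """Tag an alert so on-call engineers only get paged for actionable work."""
    # Anything breaching a user-facing SLO is actionable regardless of severity.
    if alert.tied_to_slo:
        return "actionable"
    # A critical alert that keeps re-firing is usually a real, unresolved problem.
    if alert.severity == "critical" and alert.recurrences_1h >= 3:
        return "actionable"
    # Everything else is recorded for context but does not page a human.
    return "informational"

alerts = [
    Alert("datadog", "critical", 5, False),
    Alert("cloudwatch", "warning", 1, True),
    Alert("datadog", "info", 12, False),
]
for a in alerts:
    print(a.source, a.severity, "->", triage(a))
```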

Incident Management with AI

Renato Losio: I was thinking about when Gou mentioned that we all get tired. One scenario we tend to bring up in these roundtables, and in any article I've read recently about DevOps, any time we discuss incidents, is the one where it's 2 a.m. or 3 a.m. Those are the times we always put in the article: at 3 a.m. something happens, the pager rings, something breaks, and we say that's the worst time for it to happen. That's true. Now suppose we have AI to help us. How can AI help there? Which part can a machine handle for you, and where do you still need a human making the call? We mentioned logs and tracing before; you wake up, your inbox is full of notifications, your Slack is full of notifications, and I'm sure some automatic process can help you there. What can we do?

Alina Astapovich: When I think about incidents starting at 3 a.m., the first thing that comes to mind is, who will hallucinate first, me or the AI? From my experience, as we already mentioned a couple of times, the most important part at the beginning of an investigation is to find the right path, to pick the right place to start looking. It would be nice to have an AI or any tool that can summarize things for me: I'm waking up, what is going on, I need to solve this issue, where should I start? Summarizing traces and logs. Also, with any tool that doesn't include a human, it's really important to make sure the AI tool is aware of your infrastructure layout. For example, it can start to think there is a network outage, or that some cluster is dead, but you don't even have that cluster anymore. Maybe you killed it three years ago, but it's still mentioned somewhere in the documentation. There are so many things we should be careful about as humans before going with AI-suggested solutions. Do not push anything that broad without reviewing it first, even at 3 a.m., even when you're half asleep. You will probably still hallucinate less than the AI. This is something I would really insist on.

Goutham Rao: In my mind, how I would summarize what Alina is saying, at least this is how we view it at NeuBird, is that AI falls into two buckets. There are these large language models that have a tremendous amount of reasoning capability in them, but ultimately, they're only as good as the context they're provided with. I use this as an example. You could go to the best doctor in the world, and if you don't describe what your problem is, you're not going to have a good outcome. You're just going to waste each other's time. If you go in and say, it hurts exactly here when I touch my shoulder exactly this way, and you provide a lot more context, you're going to get a better outcome. Where I'm going with this is, AI is two parts.

One is the models that do the reasoning, and the second part is data engineering, context engineering, data science engineering: being able to eliminate the noise, because there's a lot of data, and to extract exactly the right information that you need. Alina mentioned logs and traces. These two things alone contribute a lot of information, but then you have metrics. You have to look at source code. You have to look at how your infrastructure was provisioned. Can I look at a Terraform script? There's too much information to look at. You need a data platform working alongside the AI, making sure the AI won't do the wrong thing and won't come up with the wrong information. Why? Because it's picking out exactly the right information needed for the problem at hand.

A Company's Knowledge Base and Successful Agentic Solutions

Renato Losio: Actually, I feel the same pain, because I'm in the AWS space myself. Looking at logs and configuration is obviously already handled by, for example, the AWS DevOps agent, but checking the company knowledge base is not that easy: Jira, Confluence, Git, whatever. There are a lot of issues there that I'm facing. If I'm using a platform, it's very easy to get recommendations regarding that specific platform, whatever it is, AWS, Azure, or anything else. It's much harder to get recommendations at 3 a.m. about my own specific bits that are not part of the platform. Who has some advice, who wants to tackle it? Do you have any suggestion, any tip for Gregory?

Pavan Madduri: Currently, the way we handle it, it's not just one AI agent; we call it agentic ops, because multiple AI agents talk to each other to share information. For example, I have an AI agent that continuously monitors our Datadog logs, and there's another agent that continuously looks at my KEDA scaling. These two coordinate, and then they do all the heavy lifting for me: correlating the metrics, drafting my YAML file, and creating a pull request, making it commit-ready, so that whoever is the on-call engineer just wakes up, makes sure the PR is good and valid, and approves it. The human becomes an approver, not just an operator.
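
As a rough illustration of this approver-not-operator flow, here is a minimal Python sketch in which an agent correlates a latency spike with a queue backlog, drafts a KEDA scaling change, and opens a pull request instead of applying it. The correlation rule, the workload name, and the open_pull_request helper are illustrative assumptions; only the ScaledObject field names come from KEDA itself.

```python
from textwrap import dedent

def correlate_spike(p95_latency_ms: float, queue_depth: int) -> bool:
    # Assumed correlation rule for illustration: sustained latency plus a backlog.
    return p95_latency_ms > 500 and queue_depth > 1000

def draft_keda_patch(new_max_replicas: int) -> str:
    # Draft the YAML change the on-call engineer will review in the pull request.
    return dedent(f"""\
        apiVersion: keda.sh/v1alpha1
        kind: ScaledObject
        metadata:
          name: orders-worker        # hypothetical workload
        spec:
          maxReplicaCount: {new_max_replicas}  # raised by the agent, pending review
        """)

def open_pull_request(title: str, body: str, patch: str) -> None:
    # Hypothetical helper: a real setup would call your Git host's API here.
    print(f"PR opened: {title}\n{body}\n---\n{patch}")

if correlate_spike(p95_latency_ms=720, queue_depth=4300):
    open_pull_request(
        title="agent: raise maxReplicaCount for orders-worker",
        body="Correlated p95 latency spike with queue backlog; human approval required.",
        patch=draft_keda_patch(new_max_replicas=30),
    )
```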

Goutham Rao: I'd add to Gregory's question that to have a successful agentic solution, you need to provide all of the sources that you as a human engineer would operate with. You need to provide all the tools. If you only provide a partial set, you're masking information, and you're not going to get a good outcome. In your environment, whoever asked the question mentioned Confluence. Your information lives in multiple places. It's not just the two things the AI SRE agent can see; it's not just CloudWatch data or CloudTrail. You have to look at source code. Maybe your logs are going to Elasticsearch or Datadog. At the end of the day, your information sits in a wide variety of places, including live information. People run Kubernetes, and kubectl can expose information that sometimes is not being logged. You have to use all the tools that you as a human engineer would, and your data and context platform needs to be able to access all of those things.

Rohit Dhawan: Apart from log correlation and the knowledge bases we've talked about, where we are also moving is executing some of the SOPs. There are a lot of tickets that keep coming in. Let's say at 3 a.m. it's the same kind of incident that has already happened before. How do you ensure you don't keep wasting the same energy? Once you've gone through all the logs, how do you figure out what action to take, what next steps the system can take? It also depends on what kind of boundary you want to give the AI. Don't over-automate either, which can hurt your production customers. Those are the boundaries. As Goutham, Alina, and Pavan mentioned, getting the context right, getting all the data sources, and being able to figure those things out is very helpful for people getting up at night and trying to look at a request when they're already half asleep.

AI Tools and Agents

Renato Losio: I have the feeling that there are days when it seems everyone is already using AI in every possible form to do DevOps, but if you go and look at actual deployments, there's really little there. One of the questions I was really thinking about was, if a team is just starting, where does AI save time? How do you get to production? Which tools and agents can help you? Which tools and agents have you already tested in real production, and how did they help? What do you recommend as a start?

Goutham Rao: Whoever asked the question is asking which agent they should use for the job. We're going to live in a multi-agent world. There are going to be many agents. The people in the audience, you'll be creating your own agents too. Agents will become more specialized, more accustomed to an organization. I don't think you should look at it as just one agent for everything. It's like hiring human beings: you hire for diversity. You hire different kinds of people, because different people come with different talents, opinions, and backgrounds. Different agents will be coded for a specific purpose. Having these agents work together is something you have to plan for. Maybe this is very specific to what we focus on at NeuBird, which is the context, the data engineering, but having all these agents access a common data source, so they don't argue with each other or come up with wildly different ideas because they're seeing different datasets, becomes key. You need a universal source of truth, an intelligent data extraction platform, so the variety of agents you're using work better together instead of stepping on each other.

Data Access Control for Agents

Renato Losio: A few questions are coming in on something you all mentioned before, the importance of providing the right data to our agents. Otherwise, if only a subset of data is available, it's pretty hard. As a practitioner myself, I feel very confident, very safe, giving data to the agent when it's data from the platform. The example before was AWS, but it can be Azure, it can be whatever you want: I'm using the AWS platform, so giving the AWS tools access to CloudTrail or CloudWatch feels safer. When you start to bring in all the company information, all the background context, is that a security risk? How do you address it? Should you do it at all? What's your experience with that? I see different questions along these lines: giving access to the data is great, but where do you stop, and how do you protect yourself?

Alina Astapovich: First of all, I really liked Gou's analogy about agents being talented in different ways. I just fell for it. Regarding permissions and data, it depends on what you want your agents to do. Of course, you shouldn't put production credentials anywhere, the same way you wouldn't hardcode them in a GitHub repo, even if your organizational GitHub repository is not publicly available. Regarding security, if you use the AWS DevOps agent we already mentioned, you already have your data in AWS. Why would you be scared of giving access to the same data to another AWS tool? If you use a cloud provider, it means you trust it. If you don't, then maybe you need to rethink what exact security issues you might have there.

The next thing I was thinking about is pipelines and agents, for example, if you want to build RAG or MCP over your documentation. First, think about where exactly your AI agent will be running. If you really want it to be secure, you can run the LLM on your own virtual machine. Yes, it will be really costly, but it will be really safe, because no one else will touch it. You can even have an on-prem solution. There are multiple options. You don't necessarily need to go and use the OpenAI API with an LLM that sits somewhere in their data center. You can also use cloud vendors that provide LLMs, for example AWS or Google, hosted on their machines. It's a tradeoff. Before being scared of something, try to understand what exactly you're scared of.
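
To make the RAG-over-documentation idea concrete, here is a minimal Python sketch that retrieves the most relevant runbooks before asking a model. The keyword-overlap scoring stands in for real embeddings, and call_local_llm is a hypothetical stub for whatever self-hosted or vendor-hosted LLM you choose; the runbook contents are invented for illustration.

```python
def score(query: str, doc: str) -> int:
    # Naive keyword overlap; a real system would use embeddings.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: dict, k: int = 2) -> list:
    # Rank documents by overlap with the query and keep the top k.
    ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
    return ranked[:k]

def call_local_llm(prompt: str) -> str:
    # Hypothetical stub for a self-hosted model endpoint; keeping it local means
    # no documentation or telemetry leaves your own VM.
    return f"[model answer grounded in: {prompt[:60]}...]"

runbooks = {
    "network-outage.md": "Check VPC peering and DNS before assuming a network outage",
    "pod-crashloop.md": "Inspect container logs and recent deploys for crashloops",
    "disk-full.md": "Rotate logs and expand the volume when disk usage exceeds 90%",
}
query = "pods restarting after recent deploys"
context = "\n".join(runbooks[name] for name in retrieve(query, runbooks))
print(call_local_llm(f"Incident: {query}\nRelevant runbooks:\n{context}"))
```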

Renato Losio: What's your experience with that? How do you manage that?

Pavan Madduri: Before getting to data, generally the AI agents, as Gou and Alina mentioned, act on context. If you provide the right data, you get better results. If you give it bad data, it will hallucinate and divert from the issue we're talking about. That's why we first need to audit the existing alert-to-noise ratio. If a human engineer can't distinguish between the critical alerts and the informational noise, that will cause problems for the agent too. We need to make sure the data, the context we give the AI agent, is genuinely good.

Goutham Rao: I agree completely with what Pavan and Alina said. You can't be scared just for the sake of being scared, or because you mistrust something new. You have to sit down and understand what the whole science behind this is doing. It's not magic. It's pretty well understood what the LLMs are going to do with your data. In cases where somebody has PII-type information and needs to mask it, certainly that's something you need to plan for. But other than corner cases where you have social security numbers or personal information that is probably not needed for debugging, everything else is telemetry that is necessary. If you exclude data from the AI because you're scared, it's all or nothing. If you miss a key piece of information, and that's where the problem is, and you masked it from the system while giving it 90% of everything else, you've still defeated the benefit of what it could have given you.

Renato Losio: From personal experience, I would probably share any data that is in my logs. But thinking about data in the database, there is certain data that even the DevOps engineer working on the ticket, on the emergency, has no access to. There is data that is not going to be provided to the AI because it's not provided even to the engineer. That's a different level of complexity as well.

Goutham Rao: A hundred percent. That's a good way to look at it. Whatever the human engineer on call has access to in order to do their job, give all of that to the AI SRE agent; that should be fine. Again, I'll say this on the security aspect, especially as it relates to enterprise data: I strongly believe in going through a common platform. Number one, because you have a sanitized, uniform way of accessing your data through which all your agents work, and there's no confusion about how one agent came up with one answer and a different agent came up with another. Were they looking at two completely different datasets? You don't want to go down that path; it becomes very confusing. Secondly, you have a common, governed, enterprise way of accessing your information, as opposed to everybody uploading all the enterprise content to an agent from their own laptop. That becomes messy.

How to Identify Outliers Based on the Baseline

Renato Losio: An agent can help get the data from monitoring tools, but I wanted to understand more: how can we identify outliers dynamically, based on a baseline defined by the tool?

Rohit Dhawan: When we start with AI and try to figure out what is happening, we generally get good output initially. Down the line, things start to fall apart; after using it for one, two, or three months, things generally do fall apart. The main thing that comes into play is how you keep the knowledge base behind whatever application you plug your AI into up to date. How are you updating your knowledge bases continuously? Over time, your knowledge base, your documentation, can become inaccurate. That is the biggest challenge when working with AI: content becomes outdated and the AI starts returning responses that don't even make sense. That is an active problem we should continuously look at. From a baseline perspective, that is one of the things I can think about.

Pavan Madduri: I think Rohit brought up a good point. Adding the right context to the right tool creates a lot of ease and gives a better result. That's what I want to highlight.
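
On the original question of identifying outliers against a dynamically defined baseline, a minimal sketch might look like the following: a rolling window learns the baseline and a z-score threshold flags deviations. The window size and threshold are illustrative assumptions; monitoring tools typically use more robust, seasonality-aware baselines.

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Flag outliers against a baseline learned from the recent window itself."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.values = deque(maxlen=window)  # e.g. the last 60 metric samples
        self.threshold = threshold          # how many std-devs counts as an outlier

    def observe(self, value: float) -> bool:
        is_outlier = False
        if len(self.values) >= 10:  # need enough history for a stable baseline
            mu, sigma = mean(self.values), stdev(self.values)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_outlier = True
        # Only fold non-outliers into the baseline so a spike can't poison it.
        if not is_outlier:
            self.values.append(value)
        return is_outlier

baseline = RollingBaseline(window=60, threshold=3.0)
latencies = [100, 103, 98, 101, 99, 102, 97, 100, 104, 99, 101, 450]
for v in latencies:
    if baseline.observe(v):
        print(f"outlier: {v} ms")
```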

Proactive versus Reactive AI Agents

Renato Losio: Which do you think is most useful: proactive agents, before an incident, or reactive ones, during an incident? For example, a challenge I see for reactive agents is that they need to be fast, since we don't want to wait too long when trying to mitigate an incident. Do you have any advice on proactive versus reactive agents?

Goutham Rao: Proactive is always better. If you can prevent the problem from happening in the first place, that's what the agent should be doing. The other thing about proactive analysis: in building an agent, so far nobody has talked about cost. These agents are not free. They're churning through data, so proactive analysis does take a little more money, because it's working on things that are possibly not yet a problem but are still an important necessity. I also agree that speed is important during an incident, and this is where context engineering again plays a big role.

If you have the right context, you're going to come up with an analysis very quickly. In the worst case, in a reactive scenario where the agent can't actually solve the problem, it should at least save you a huge amount of time and tell you what not to look at. Because sometimes there's a small piece of information that only the human engineer would know, or some information they forgot to give the agent.

In all cases, if these agents don't completely solve the problem in a reactive scenario, at minimum a good, well-tuned agent will save you a lot of time. I'll add one more thing here: a good, well-tuned agentic system should be able to do two things at minimum. Number one, if it is well set up and given all of its sources, it should be able to tell you what kinds of scenarios it can answer. You should be able to ask your agent: can you solve this type of problem with the information you have? The second thing is, while it is working on an issue, it should give you what's called a confidence score. Today's industry benchmarks say you should easily be above the 85% to 88% range. These LLMs can rank themselves against various scenarios and give you a confidence score. You should be looking for that as an outcome of the analysis.
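
Gating on that confidence score can be wired into the escalation logic directly. Here is a minimal Python sketch, assuming a hypothetical run_agent_analysis call that returns the analysis together with a self-reported confidence; the 0.85 floor mirrors the benchmark range mentioned above.

```python
CONFIDENCE_FLOOR = 0.85  # benchmark range cited above: 85-88%

def run_agent_analysis(incident: str) -> tuple:
    # Hypothetical stub: a real agent would return its root cause analysis plus
    # a self-ranked confidence score for this class of scenario.
    return "Root cause: connection pool exhaustion in orders-service", 0.91

def handle_incident(incident: str) -> None:
    analysis, confidence = run_agent_analysis(incident)
    if confidence >= CONFIDENCE_FLOOR:
        print(f"Proposed to on-call ({confidence:.0%} confident): {analysis}")
    else:
        # Below the floor, the agent should still save time by narrowing scope:
        # report what it ruled out rather than asserting a root cause.
        print(f"Low confidence ({confidence:.0%}); escalating with ruled-out paths.")

handle_incident("p99 latency spike on checkout")
```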

Finding the Balance between Context, Orchestration, and LLM Reasoning

Renato Losio: How can we find the right balance between context, orchestration, and LLM reasoning? For example, you could have a very dumb LLM that only knows how to orchestrate using powerful tools. I don't know if the question is about cost, or about finding the right balance between different tools.

Alina Astapovich: I think the question is about, when you're building a huge agentic system, how you find the balance between providing a lot of context, firing a lot of tokens, and the capabilities of a single agent. Basically, how granular your agentic system should be: many small agents that each do a small piece, or one big, huge brain agent. I would love to take this question, because this has been my specialization for years. The answer is that there is no single answer. You can only experiment and try it out. What I love to do is build an agentic orchestrator, the main brain. This is where the question comes in: this is the brain that tries to understand which talent, which agent, it should send a request to. Then you can have multiple different agents, each with quite clear prompting about the circumstances and use cases it should handle.

For example, what we are trying to build right now at Storytel, and something we built before at Electrolux, looks like this: you have the main agent, the agent brain. Then you can have, for example, an observability agent. This agent is a subagent of your main brain, but it can also become the main brain for other subagents. For observability, you can have cloud logging and tracing, Grafana monitoring, and your documentation. The subagent triggers those other agents, summarizes the context, and passes it back to the main brain. As for balance, you literally need to experiment and try it out yourself, because it's your architecture, your technical landscape. There is no silver bullet. There is no right answer. You need to experiment.
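
A minimal sketch of this main-brain-plus-subagents layout might look like the following Python. The keyword routing table is a deliberate simplification; in practice the orchestrator would itself be an LLM deciding which talent to call, and the agent names here are illustrative.

```python
def observability_agent(request: str) -> str:
    # Would itself fan out to logging, tracing, and Grafana subagents, then
    # summarize the findings back to the main brain.
    return f"observability summary for: {request}"

def infrastructure_agent(request: str) -> str:
    return f"infrastructure state for: {request}"

def docs_agent(request: str) -> str:
    return f"runbook excerpts for: {request}"

# The "main brain": decides which talent a request should be routed to.
ROUTES = {
    "latency": observability_agent,
    "trace": observability_agent,
    "cluster": infrastructure_agent,
    "node": infrastructure_agent,
    "runbook": docs_agent,
}

def orchestrate(request: str) -> str:
    for keyword, agent in ROUTES.items():
        if keyword in request.lower():
            return agent(request)
    return docs_agent(request)  # fall back to documentation lookup

print(orchestrate("Why did trace latency spike on checkout?"))
```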

Is AI Accelerating SRE?

Renato Losio: Is AI actually accelerating SRE? Someone argued that it may actually make it more fragile.

Goutham Rao: I completely think it's accelerating not just SRE, but the whole production operations landscape. I'll broaden the umbrella to include DevOps, and even developers and coding. Whoever asked that question, if they're saying it's hurting them, I'll go back to this: they must not have provided the right context, or the system must not have been set up correctly. If these models are given the right context, the right context engineering platform, and the right data science work, they will only accelerate your work and save you a lot of time. The answer is pretty clear for me. There's too much data out there. Unless whoever asked this question is magically able to go through all the logs and traces by staring at a terminal, and I don't see how anybody could do that. It's like adding numbers: somebody gave you an abacus, and then said, here's a calculator. Which would you use? It's just that simple.

Renato Losio: Anyone disagree?

Rohit Dhawan: What Goutham called out is definitely 100% true. It's definitely accelerating SRE. The only concern I have in the site reliability engineering world is that if you depend too much on AI, you stop building your domain knowledge. People will trust AI too much and start losing that context. Everybody should ensure they keep learning the domain knowledge, they understand it, and they can evaluate AI output. That is one angle where I feel people should not lose their domain knowledge. The second thing is that it can be fragile when it is not given the right context, like low-quality knowledge bases in those areas, but it's definitely accelerating.

Goutham Rao: I completely agree with what Rohit said, and I'll add one thing to his comment: domain knowledge is key. I'm sure Alina knows a thousand things within Storytel that the models wouldn't know. Luckily, there's now this concept of skills. I'm sure everybody here is familiar with OpenClaw; you can upload your skills as domain knowledge and train the agents to do things a certain way, provide them with internal tribal knowledge and best practices for how to solve a specific issue. These are all things that can be done, and they're pretty easy to set up. Like anything, watch what the agent is doing for the first few times and see where it's good and where it could use some instructions.

Can AI Agents Reduce Mean Time to Repair (MTTR)?

Renato Losio: On one side, Jason is asking whether anyone on the panel has experience yet with a realized MTTR improvement attributed to integrating an agent into the incident management process. The other question is: we're talking about agents, but do we have agents available today that we can use to reduce MTTR? Do we have experience, a suggestion, a concrete example of, ok, using this agent, I can reduce my mean time to repair?

Goutham Rao: This is specific to NeuBird, but absolutely not just NeuBird; if you just search for agents and MTTR, I'm sure you'll find a lot of documents. Even on our website, we have publicly documented case studies from customers who have been able to demonstrate close to a 90% reduction, if not more. One of them is from a very popular automobile manufacturer. Again, mileage varies. It depends on the organization. To say universally that everybody will get that would be a wrong statement to make. But in some cases, people have been able to demonstrate a dramatic decrease in time to incident resolution.

Renato Losio: Do you have anything to add? Any experience? Can you say that agents have helped you reduce time to repair?

Pavan Madduri: As an infrastructure guy, I would like to add something here. I can't say you can achieve a true zero MTTR while humans are still approving things, but you have to build self-healing infrastructure. The moment AI correlates a spike in trace latency, the infrastructure should automatically right-size the workloads, using event-driven scalers like KEDA or something else, before the human SRE even gets into PagerDuty. Before anything else, you have to make sure your infrastructure itself helps with that.

Handling Permissions in AI Agents

Renato Losio: I think Gou was suggesting that we need to build some trust in AI recommendations before letting agents take actions, and I think we all agree on that: see what they do before letting them play with our production. Mikal has an interesting question about how to actually let them act. Imagine you have an agent in your infra. How can I manage permissions per user? For example, a senior engineer might be able to restart a pod, but a junior engineer might not. How do I deal with that when I ask an agent to do it for me?

Goutham Rao: Role-based access controls are an important part of agentic solutions, and agents should assume the roles and responsibilities that the operator has been given. Any good, enterprise-grade agentic solution offers that already. That's one. I want to add one more comment to what Pavan was saying, on the earlier point about time to incident response and remediation. There are many other dimensions in which agentic solutions help. I'll give you one example. In a large enterprise, when an incident happens, it is not uncommon for too many people to get pulled into what's called the war room.

A Slack channel gets opened for the incident. I've seen in large enterprises that easily 20, 30, 40 people are on that incident escalation, whether they have anything to do with it or not. You need the network experts, the storage experts, the application experts. Everybody joins and goes through the same logs. Overall, it's not just a waste of time; it's a waste of talent, of having too many people there. What a good agentic solution will at minimum be able to do is triage the situation and say: this is a storage problem, we don't need the network people here, we don't need the Kubernetes experts. Bring it down to the core people needed and drive efficiency in the resolution.

Alina Astapovich: Here, it depends. For example, if we're talking about a custom-built agentic solution rather than an enterprise one, where this would be on the vendor. If you build something custom, it depends on what interface you use to interact with the agent. For example, what we try to do is distinguish role-based permissions based on the user's Slack profile. You can have an extra database, some table somewhere, that identifies that this user is senior and has more permissions, while this user is junior, so he or she gets read-only mode in the agent. This is a very basic, really lightweight solution you can use if you're building the agent yourself. With a vendor, if it's Google Gemini Enterprise or whatever, you can use IAP or any other IAM solution, also with roles and principals.
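
A minimal Python sketch of that table-backed, Slack-profile-based permission check might look like this; the role table, user IDs, and action names are illustrative assumptions.

```python
# Illustrative role table; in practice this lives in a small database keyed
# by the requesting user's Slack profile.
ROLES = {
    "U123SENIOR": "senior",  # may trigger mutating actions
    "U456JUNIOR": "junior",  # read-only mode
}

MUTATING_ACTIONS = {"restart_pod", "scale_deployment", "rollback"}

def authorize(slack_user_id: str, action: str) -> bool:
    role = ROLES.get(slack_user_id, "junior")  # default to least privilege
    if action in MUTATING_ACTIONS:
        return role == "senior"
    return True  # read-only queries are open to everyone

def agent_execute(slack_user_id: str, action: str) -> str:
    if not authorize(slack_user_id, action):
        return f"denied: {action} requires the senior role"
    return f"executing {action}"

print(agent_execute("U456JUNIOR", "restart_pod"))  # denied
print(agent_execute("U123SENIOR", "restart_pod"))  # executing
```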

Rohit Dhawan: As Alina and Goutham mentioned, role-based permissions are the common way to handle these scenarios. For custom agents, what we follow internally when we build these kinds of agents, or custom SOPs, is that wherever it is not safe for an AI agent to take an action, we have built an approval mechanism, similar to a pull request, where the agent raises a request on an internal tool and somebody else goes and approves it. That ensures we don't run into production issues later on. That is another way to do the custom handling, because in many cases, when you're operating in a company, you have to build custom SOPs, and that is the mechanism you would need to follow.
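
That approval gate can be sketched as a small queue in front of any unsafe action. Here is a minimal Python illustration in which the agent executes safe actions directly but only raises a request for risky ones; the safe-action list and in-memory queue are illustrative assumptions.

```python
import uuid
from typing import Optional

SAFE_ACTIONS = {"fetch_logs", "summarize_ticket"}  # the agent may run these directly
PENDING: dict = {}                                 # request id -> proposed action

def request_action(action: str) -> Optional[str]:
    """Execute safe actions; queue unsafe ones and return an approval request id."""
    if action in SAFE_ACTIONS:
        print(f"executed {action}")
        return None
    request_id = uuid.uuid4().hex[:8]
    PENDING[request_id] = action
    print(f"approval requested ({request_id}) for: {action}")
    return request_id

def approve(request_id: str, approver: str) -> None:
    # A second person signs off before the risky action is actually run.
    action = PENDING.pop(request_id)
    print(f"{approver} approved {request_id}; executing {action}")

request_action("fetch_logs")
rid = request_action("purge_stuck_queue_messages")
approve(rid, "second engineer")
```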

Model Fine-Tuning and Improvement

Renato Losio: We'll need to establish a good feedback system, some memory file for the AI to learn from after each resolved incident. When we have an incident in production at 3 a.m., or whenever, and we use AI, I was wondering, in your experience, how do you improve the model with the information from that incident?

Goutham Rao: These agents should absolutely reference past historical issues. Somebody asked a question about Confluence, for instance, and that could hold a lot of information. Maybe it wasn't even an agent that solved a similar problem before; a human being could have solved it, and the agent should be able to go back and see whether this issue, or a similar one, has happened before and what the steps were. Whenever it does remediate an incident, it should save that back and reference it in the future. This is standard operating procedure, especially in SRE operations. Whatever SRE agent you're looking at should definitely do that.

Alina Astapovich: If you remember, at the very beginning, the first question was what the first problem is that AI should help with, and I said writing huge documentation. I was busy writing documentation, a technical design on memory for our internal agent system. Here we should think, as Gou said, about storing postmortems, results, and conclusions from previous incidents. You can also think about having two types of knowledge. The first is core knowledge, something that can be curated and maintained by your platform, SRE, or DevOps team. Something that is grounded, something that will not be changed by incidents, because AI can draw wrong conclusions from incidents.

For example, someone accidentally turned off a switch in the data center, but the AI will think it was a network outage. Something like that. So you have core knowledge, and second, you have all these findings and learnings that your agent can use, and you can prompt with them differently. You should think about agents, knowledge, and memory the same way you would think about a junior developer. There is ground truth you learned in school, about physics, math, the world, and there is what you learn by experience. That experience can change: today it's sunny, tomorrow it will rain. This data should be updated all the time and taken into consideration in your context.
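
A minimal Python sketch of this two-tier memory, curated core knowledge plus timestamped incident learnings, might look like the following; the contents and the three-item recency window are illustrative assumptions.

```python
import time

# Core knowledge: curated by the platform/SRE team, never rewritten by incidents.
CORE_KNOWLEDGE = {
    "topology": "Two clusters: prod-eu and prod-us. The legacy cluster is gone.",
    "escalation": "Storage issues page the infra team, not the application on-call.",
}

# Experiential memory: findings appended after each incident, timestamped so
# stale or superseded lessons can be aged out later.
FINDINGS: list = []

def record_finding(summary: str) -> None:
    FINDINGS.append({"at": time.time(), "summary": summary})

def build_context(question: str) -> str:
    recent = [f["summary"] for f in FINDINGS[-3:]]  # only the latest lessons
    return (
        "Ground truth (do not contradict):\n" + "\n".join(CORE_KNOWLEDGE.values())
        + "\n\nRecent incident learnings (may be revised):\n" + "\n".join(recent)
        + f"\n\nQuestion: {question}"
    )

record_finding("June outage: root cause was a tripped breaker, not a network outage.")
print(build_context("Why are we seeing packet loss in prod-eu?"))
```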

Concrete Actionable Insights

Renato Losio: Say I'm one of the senior practitioners who joined today, and you've convinced me of the benefits of autonomous incident response, or of adding some AI to my current approach, which has no AI in it. What is one concrete piece of advice, something simple I can do tomorrow morning, in half a day, to get started? Let's say I cannot have everything ready tomorrow, but I want one concrete thing I can do. Where should I start?

Rohit Dhawan: The important thing for AI is context. For AI to get context, you need solid knowledge bases, because every business, every team is different, and they need to build their own. That is where you need to start. Understand what type of problem you're trying to solve in production for your customers, and start with that. There are a few basic things we talked about early on, the summarization aspects; those are very easy to plug in and give real benefits in saving on-call time from cognitive overload. Those things are simple. Start with context. Start looking into your knowledge bases, see how you can get them up and ready, then plug in the AI components and see how the results look. The next step is custom SOPs, execution, and automation. I would definitely say start with those first pieces.

Renato Losio: Do you have any advice for someone starting now with limited time? The first step?

Pavan Madduri: As Rohit mentioned, context is the key for the AI to work properly. AI gives good signals only when it consumes good data. So spend time on your data: audit your existing alerts, is each one actionable or informational? If a human can't tell the difference, the AI agent will struggle too. That's one thing: better context gives you better results. The other thing is a guardrail approach. You need guardrails for how much the AI can access and how much it is allowed to do. Those two things are the first to keep in mind when going for an AI agent.

Alina Astapovich: I could say the same things about context and knowledge, but I will try to be creative and say: if you are using some kind of vendor, go and check, maybe they already offer an AI solution you can try yourself. Maybe you don't need to think big. You don't need to think about how to implement 300 agents orchestrated by one pipeline. Maybe you just need two separate AI agents, one for observability and one for infrastructure, if you are using AWS or Google. Maybe that will be enough for you. Try to think small first.

Goutham Rao: I 100% agree with what Alina said. There should be a very specific reason why you build your own agent, and there could be, and there will be. For mainstream production operations, there could be an agent that just works for you, and you should be able to try it out easily. We get asked this question all the time when people deploy our agents: how do I trust that it will work for this scenario? Luckily, it's a very easy process. The first thing I would ask anybody to do, before deploying an agentic system: you probably have a prior issue you ran into. At minimum, your agent should be able to look at that prior issue and come up with the same final root cause analysis that you came up with. If it's doing the same thing, then you know it works for that type of scenario. You're not walking into a system where you've never seen an issue before; you, whoever is in the audience, have run into these issues, which is why we're having this conversation. Make the agent do a retrospective analysis of how it would have solved the problem the way you did, and see whether it does it the same way, or maybe a better way.
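
That retrospective test can be framed as a tiny evaluation harness: replay archived incidents through the candidate agent and check its verdict against the root cause you already know. In this minimal Python sketch, run_agent is a hypothetical stub for whatever agent is being trialed, and the string match stands in for a human review.

```python
def run_agent(telemetry: dict) -> str:
    # Hypothetical stub: point the candidate agent at archived telemetry from a
    # past incident and capture its root cause analysis.
    return "memory leak in image-resizer after the v2.3 deploy"

PAST_INCIDENTS = [
    {
        "telemetry": {"logs": "...", "metrics": "...", "traces": "..."},
        "known_root_cause": "memory leak in image-resizer after the v2.3 deploy",
    },
]

def retrospective_eval() -> float:
    hits = 0
    for incident in PAST_INCIDENTS:
        verdict = run_agent(incident["telemetry"])
        # Crude string match for the sketch; in practice a human judges whether
        # the agent's analysis matches (or improves on) the recorded root cause.
        if incident["known_root_cause"] in verdict:
            hits += 1
    return hits / len(PAST_INCIDENTS)

print(f"agent matched the known root cause in {retrospective_eval():.0%} of replays")
```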

 


 

Recorded at:

Apr 28, 2026
