Transcript
Bruno Borges: My name is Bruno Borges. I work at Microsoft. Today we're going to talk about SLO breaches. Who here deals with that? Who doesn't want to deal with that? We're going to talk about defining performance, setting SLOs, setting objectives. What is performance diagnostics? How do we do it? Then we're going to talk about SRE agents. If it can be automated, it should be automated. That's our mantra. This talk is about troubleshooting performance issues and SLO breaches. It's not about performance tuning.
Advanced performance tuning, I think, really depends on the type of workload that you have, the type of stack that you have, the type of language runtime that you're using. Who here deals with JVMs? Who here deals with .NET CLR? Go? Node? I'm not going to give you a lecture about resilient architecture. There are many more people and many books published over decades on that. That's where you should go for learning on that. Also, SRE practices. We had a talk from David from Microsoft as well, and he did a great presentation about SRE. Those are the people that you have to learn from, not from me. Hopefully, this talk gives you the bigger picture, so you can later go and deep dive into each section of this presentation and learn at a more advanced level from others.
Defining Performance
What is performance really? Does anybody here like music? Everybody likes music? You like to go to a live concert? That is a performance. You go to a concert with an expectation that it's going to be great. It's not going to be too short, it's not going to take forever, it's just right. You pay the price, and you expect that that price meets your expectation for the band you're there to see, or the play, or whatnot. Performance is really something around that. There's this example here: your application can be fast, but if it's running hot, costs a fortune, or breaks under load, it's like a sports car stuck in city traffic. It's just not performing. It can go fast, but it's not performing as it could, all because of the environment, the city traffic.
Performance is not necessarily about speed, it's about meeting expectations. It's about having a comfortable car that you can commute in and that doesn't break in the middle of your journey. It's about being consistent, efficient, and within cost and time budgets. There's a better way to visualize this. You see this bar at the bottom: a system can be too slow and not meeting expectations, but it can also be too expensive. It can be super-fast, but not meeting other expectations. That is what performance is about. When we think about performance, we think it has to be fast. It has to have high throughput. It has to have low latency.
As long as it's balanced. You don't want to spend millions of dollars just to be fast. You don't want to spend millions of dollars just to have low latency and high throughput. You want to have the right balance. Sometimes performance tuning is about, how do I bring my cost down, or how do I raise the velocity of my system without incurring more cost? That is what you're trying to do. You're trying to meet in the middle, trying to meet the expectations around cost, time, and customer expectations. SREs tend to care about efficiency and reliability, not necessarily about speed. If it's fast but breaks often, or cheap but fails SLOs, it's not performing: not necessarily not going fast, just not performing. There's another visualization you can try to picture. Imagine a triangle: speed in one corner, cost in another, and customer expectations in the last. You can pick two, but the SRE job is to actually balance all three.
The way to scale is actually to do less. How do you scale a system without incurring more cost? You try to do less stuff, so you don't need more resources. When the system struggles to scale, your first solution to mitigate the problem may be to add more replicas, to add more VMs, to just have more instances of your system or more nodes in your database cluster, for example. Anything that scales out or scales up your application may mitigate the problem, but it may not actually give you that balance. You're just incurring more cost. A good example: imagine a grocery store.
Adding more cashiers helps, but you could instead train your cashiers to be faster at validating the ID of a customer who is buying alcohol, for example. It's my grandpa, he doesn't need to show his ID; the cashier just waves him through, you're welcome. That speeds up the process. You eliminate one part of that process to go faster. The challenge is to find which step in the process, in the system, is slowing down the whole flow.
Then when you identify that, you've identified the bottleneck. The bottleneck is the rate-limiting activity. Here's another example: you have a production line and one worker is still reading the manual to understand how to process the part. That slows down the whole line. When you find that step in the process, you optimize it and then everything else flows correctly. It can be CPU, memory, I/O, the thread pool, or just bad code, a hot code path. You want to identify that and optimize that.
Sometimes you can optimize a bunch of things and that actually doesn't help the system in general. You still have that bottleneck living somewhere. Back to music. Here's an interesting exercise. Imagine you're at a concert and there's only one door to get into the venue. There's one security person checking the bags. Then we add another door. Where's the bottleneck? It's the security check for bags. It's just one. Now you have two doors, twice as many people getting through, but still one security check. That's very obvious. I just want to do one quick quiz here. What if we still have one single door with a wait line building up for one security check, but then we add another security check? What happens?
Participant 1: Bottleneck at the door.
Bruno Borges: The bottleneck at the door? Not really. The question isn't necessarily where the bottleneck is, it's what happens to the wait line. What happens to the single wait line at the door when you add two security checks?
Participant 2: It gets to go a lot faster.
Bruno Borges: It goes a lot faster. Yes, security checks happen twice as fast. What happens to the wait line? At what speed does the wait line get shorter, at what rate?
Participant 3: Twice as fast.
Bruno Borges: Twice as fast? That's the interesting thing, it's actually exponential. There is a formula in queuing theory to find this. You have to know the service time and the arrival rate. It goes back to queuing theory; you can work out what the speed is. It's actually more than twice as fast, because now you have two security bag checks: one of them may take a minute, but at the other one the person doesn't have a bag and just gets through, so it's a lot faster. The wait line just keeps moving because now you're doing both. Same thing when you go to the supermarket, to the grocery store, and hopefully there's a single line that feeds all cashiers. That is the best optimization for a grocery store.
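To make that intuition concrete, here is a rough back-of-the-envelope calculation using standard M/M/c queueing formulas; the arrival and service rates are illustrative numbers, not figures from the talk.

```latex
% One checkpoint (M/M/1): arrivals \lambda = 9 per minute, service rate \mu = 10 per minute
W_q = \frac{\lambda}{\mu(\mu-\lambda)} = \frac{9}{10 \cdot 1} = 0.9 \text{ min} \approx 54 \text{ s}

% Two checkpoints (M/M/2): same \lambda and \mu, offered load a = \lambda/\mu = 0.9, \rho = a/2 = 0.45
P_{\text{wait}} = \frac{\frac{a^2}{2!}\,\frac{1}{1-\rho}}
                       {\sum_{k=0}^{1}\frac{a^k}{k!} + \frac{a^2}{2!}\,\frac{1}{1-\rho}} \approx 0.28
\qquad
W_q = \frac{P_{\text{wait}}}{2\mu - \lambda} \approx \frac{0.28}{11} \text{ min} \approx 1.5 \text{ s}
```

With these numbers, doubling the checkpoints cuts the average wait from roughly 54 seconds to roughly 1.5 seconds, about 35 times faster rather than merely twice as fast, which is exactly the point being made.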
For many reasons, sometimes psychological, people like to choose because they think they're smart; they want to go for the line that they think is the shortest. Sometimes that actually tricks you. Moving on from queuing theory. Again, you found the bottleneck and you want to optimize it, so what are the steps required to do that? You want to do performance tuning. You want to tune the performance of that particular step in the flow. You do have to have a methodology in mind, and we're going to cover that here in more detail, but you do have to have objectives. We're going to cover this as well. You have to understand the application architecture.
Most important, you have to timebox. All these steps in the process, in the system, in that REST call, or in that function, whatever, you have to timebox everything so you can know: this call to this REST API spends 100 milliseconds doing this, 200 milliseconds doing that, 500 milliseconds on the database, and so on. You start building a time budget for that function in your system. There's a little bit more you should be aware of, like mathematical concepts and principles, and comprehension of the technology. Again, I've seen teams that look at the JVM and just don't know the details of the JVM, how to optimize it. Same for the CLR, same for Deno and V8, same for Go. There are knobs in each technology that teams should be aware of. Kubernetes scheduling, CPU throttling, all of those topics are extremely important. Then all the instruments and strategies that you can use to observe, adjust, try, validate, and test the changes in production.
We just described what performance tuning is. I want to give you just a one-slide summary. You have to have objectives. What is an objective? Ninety-five percent of my requests are below 100 milliseconds; that's my p95 in this case. For throughput, I have an alert that will trigger if the throughput drops under 5,000 RPS. Then I have to diagnose the problem. I have to find the symptom. Finding the symptom is where SREs and developers and engineers, ops, whatever, spend most of their time, because you just got this alert: "The CPU is too high, memory is too high, I/O is taking too long, what's going on?" Then you have to start digging into the system to find where that problem is. Only after that do you go into tuning mode. I can throw more resources at it or I can find the bottleneck and fix it.
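As a sketch of what "having objectives" can look like in code, here is roughly how you might record request latency with percentiles using Micrometer (this assumes a Micrometer setup; the 100 ms and 5,000 RPS thresholds mirror the ones above, and the metric name is made up for the example):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class OwnerApiMetrics {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Objective 1: p95 latency of this endpoint should stay under 100 ms
        Timer timer = Timer.builder("petclinic.api.owners.latency")
                .publishPercentiles(0.95)          // expose p95 directly
                .publishPercentileHistogram()      // let the backend compute quantiles too
                .register(registry);

        // Record one request (normally this wraps the real handler)
        timer.record(Duration.ofMillis(42));

        // Objective 2: alert if throughput drops below 5,000 RPS.
        // That check usually lives in the monitoring backend (Prometheus, App Insights, ...),
        // not in application code, so it's only noted here as a comment.
    }
}
```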
Disclaimer, sometimes the solution is more resources. It's totally ok, as long as you understand it. It's still about tuning the right knob: more resources might just mean increasing the CPU and memory limits in your containers on Kubernetes. That might be the knob, not the JVM knob, not the application knob, not the database, just the infrastructure. Which knob are you going to find, look into, and tune? Think of an F1 driver: they do not touch most of these buttons during the race, they do that during qualifying. During the race, they will adjust one or two things. They will click the DRS button every lap when they're within range. That's pretty much it. Some drivers may click the wrong button and open the radio. If you're following F1, it's hilarious.
The thing about an F1 driver is that they have an objective. They want to go fast. They want to burn the right amount of fuel, because in F1 you cannot refuel anymore. They want to wear the tires to the right level before they have to go to the pit stop. They have objectives throughout the race, not just winning, but being consistent, keeping a pace throughout the race. That's why it's important for us as system engineers, as developers, to have objectives.
Setting SLOs
In my conversations with engineering teams, there is a lot of discussion about which term we should use: SLA, SLO, SLI? SLIs are interesting, but I'll just focus on SLAs and SLOs here. In almost all cases, the SLA is just the legal document. It's just the legal agreement between the vendor and the customer. For us, for engineers, what we actually care about are the objectives. I really like to reinforce that word, because without an objective, you don't know what you're looking for. You don't know where to look. Once you have those objectives, for example, CPU usage, call time, response time, latency, throughput, whatever, then you can go back to your drawing board and say, ok, let's time budget these things so we can know how to stay within those objectives. Here, for example, we have a client application, an application server, a backend, and then the database.
For each jump in this diagram, each connection, you have to have an idea of how long it takes to make that jump. A great tool for that is OpenTelemetry. Many frameworks have this built in, so you can just enable it. It will get you this great visual, and you can even customize the events in OpenTelemetry and make them more business related instead of technical. For example, time to loan approval, that can be an event that you record in OpenTelemetry. It gives you that flexibility, so you can have objectives that are sometimes not necessarily technical, they're more business oriented. Those are also very interesting, because they relate to and are easier to understand from a customer perspective.
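As an illustration of those business-level events, here is roughly what a custom span and event could look like with the OpenTelemetry Java API; names like loan-service and loan-approval are invented for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class LoanApprovalTracing {
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("loan-service");

    public void approveLoan(String applicationId) {
        // A span that represents a business step, not just a technical call
        Span span = TRACER.spanBuilder("loan-approval").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // ... call the rules engine, the credit bureau, the database ...
            span.setAttribute("loan.application.id", applicationId);
            // Business-level event: "time to loan approval" becomes visible in the trace
            span.addEvent("loan-approved");
        } finally {
            span.end();
        }
    }
}
```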
Next, there is a performance issue, and we want to solve that problem. We can break it down: the time to resolution, MTTR, can be broken down into three blocks. One is observation. We can use APM solutions. We can use log analytics. We can use dashboards, alerts, PagerDuty for that. Actually, PagerDuty just gets you the alert that something bad happened. You get all this information from these solutions, and hopefully you have automated a lot of things so that, depending on certain events, some mitigation can be applied in the very short term. After the mitigation is done, then you can go to the second box, which is: let me diagnose this problem now and try to resolve the issue. The problem is, when something bad happens, you just don't know what you don't know. Because of that, it's really hard to find where the problem is.
In many cases, that's where the vast majority of the time in your time budget is spent. Finally, once you do find the problem, sometimes the solution is actually pretty quick, because you know where the problem is and the fix may be very fast. Sometimes the deploy time may be a little bit longer. The fix is ready, but yes, it's going to take two weeks to deploy because that's the deploy window. It depends.
Back to the topic of this talk, SRE agents: which box here do you think an SRE agent will be good at? Observation? To some degree, yes. Diagnosis? Who thinks it's repair and deploy time? Who thinks it's diagnosis? Who thinks it's observation? It is diagnosis, because that is the part that's really complex: there are so many layers throughout the system that you have to go through, so many processes and tools and logs, just to find where the problem is, where the bottleneck is. That is the part where, if we can automate, we should automate. Many teams already automate this part without AI, without LLMs. If we could go even further, I think there is a lot of value there.
Performance Diagnostics - Methodologies and Tools
To automate, you have to have a methodology in mind. You have to have a process in mind for how to diagnose. What am I going to do, when am I going to do it, which tool am I going to use? All of that is extremely important to have a consistent process. There is a method called the USE method: for every resource, you check its Utilization, its Saturation, and its Errors. That method can be applied to many things, not just software engineering: operations of airlines, logistics, supply chain, construction, HR departments. It's a very useful method that can be applied in any space.
Kirk Pepperdine designed a model called jPDM. It's a performance diagnostic model, originally for Java, but it can be used for any language runtime. It is a top-down and then bottom-up approach, and it looks at the time budget first. You have to have those metrics in the system. Think of these methodologies as your playbook for going through the journey of diagnosing a performance issue. Once you know what the system is doing and who's making it do it, the actors, then you start getting a good understanding of where to investigate.
Systems are consumption flows. There is an actor, data gets in, goes through that pipe, and something happens. What is a consumption flow? You can imagine these layers. The actor can be a user, a job, a batch schedule, a database event, an event system, another API. Now it's an AI calling an MCP tool that eventually calls your REST API. There are many things that can be actors.
Once that request comes in, it starts putting pressure throughout your system. It goes through your application, your business logic, through the JVM or your language runtime, CLR, Go, all the way down to the hardware. There are actually extra layers that are not in this chart: think about cgroups, containers, VMs, the hypervisor, many things that sit somewhere between your managed memory and your hardware. Once you understand those layers, you can actually start thinking about a process for how to diagnose. Kirk designed, in jPDM, this great flow.
First, you think about how to classify the bottleneck, the issue that happened in my system. I got an SLO violation. Is it system dominant? Is it application dominant? Or is there no dominance at all? Should I be looking somewhere else? This chart is available online. You can search for jPDM and learn a lot. The point is to have a methodology in mind. There are many methodologies available, but some are already very effective and have been tried with many customers in production.
If you can take advantage of that, do it. If you do have your own methodology, think about how you can automate it, especially with AI, moving forward. Something happened in production. How do you classify that? You can see that we read some numbers here: CPU usage, system CPU usage, user CPU usage. Which tool in Linux will give you that? We can do that with vmstat, just for a start. That will help us identify where the dominance of the bottleneck is. Is it the application, the system, or no dominance? This tool will give you a lot of data, especially if you're continuously looking at it. It's great, but just a snapshot at the time of the problem is almost always sufficient.
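As a rough sketch of that first classification step, assuming you already have user, system, and idle CPU percentages (for example from a vmstat snapshot), the logic can be as simple as the following; the thresholds are illustrative and not part of jPDM.

```java
public class BottleneckClassifier {

    enum Dominance { SYSTEM_DOMINANT, APPLICATION_DOMINANT, NO_DOMINANCE }

    // user, system, idle as percentages from a CPU snapshot (e.g. vmstat's us/sy/id columns)
    static Dominance classify(double userPct, double systemPct, double idlePct) {
        if (systemPct > 20 && systemPct > userPct) {
            // The kernel is doing the work: I/O, context switching, syscalls
            return Dominance.SYSTEM_DOMINANT;
        }
        if (userPct > 60) {
            // Your code (or the runtime, e.g. the GC) is burning the CPU
            return Dominance.APPLICATION_DOMINANT;
        }
        // Mostly idle but still breaching the SLO: likely waiting or blocking somewhere else
        return Dominance.NO_DOMINANCE;
    }

    public static void main(String[] args) {
        // Numbers similar to the demo later in the talk: user ~0, system 0, idle 99
        System.out.println(classify(0, 0, 99)); // NO_DOMINANCE -> "too much waiting"
    }
}
```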
System dominant: you can investigate the symptoms using the USE methodology. You look at the context-switching rates, and eventually you're going to jump into the language stack you're using. Then you see how many tools there are at that level, just because there are so many tools available, and you have to know these tools. SRE engineers are great at that. Ops teams are great at that. What if we could automate that? This is a chart from Brendan Gregg on the USE method applied to Linux troubleshooting. His website has great resources, so you should definitely check it out. This chart just gives you an idea of how complex and hard it can be to find the actual bottleneck. You can find something that looks like a bottleneck, but how do you actually find the bottleneck? Here's a slide on application dominant. What do you do? You check if it's the garbage collector, you check the observability agent, your New Relic, whatever. You get the GC log so you understand the pattern of how the GC is working.
Then use an analysis tool. What you're looking for, when you look at an application-dominant issue, is how the runtime is managing memory, how it is managing threads, how it is managing the allocation rate, things like that. You can adjust the runtime to your application's needs, always going back to the objectives that you have in mind. Here's an example of a tool that we use internally at Microsoft for Java. In this particular screenshot, we can see latency. We have a few dots in the chart that were really bad compared to the average at the bottom. Why is that? This is the pause time for the garbage collector, in seconds. That pause took almost 44 seconds in production. Everything else is below 1.5 seconds, 1 second. Really good, but something is going on in this system. Why does a long GC pause time cause poor latency? Because if the GC is working for too long, your application is not working.
That's the general principle. Which means that for that timeframe of 44 seconds, almost all requests to your application during that window will take up to 44 extra seconds. What about throughput? This is an example of pause time as a percentage. Throughout the lifetime of the application, if too many pauses are happening, again, your application is not processing enough. One request may go through fast, but because the application is pausing so frequently, it reduces the level of throughput. These signals can point you to memory complexity.
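One cheap way to see this signal from inside a JVM process, without parsing GC logs, is the standard GarbageCollectorMXBean; a minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalGcMs += gc.getCollectionTime();
        }
        // Rough percentage of wall-clock time spent in GC so far
        System.out.printf("GC overhead ~%.2f%% of uptime%n", 100.0 * totalGcMs / uptimeMs);
    }
}
```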
One solution is to tune the GC. Another solution is to reduce the allocation pressure of the system. Another is just to reduce the live set. There are different tools for different stacks. For Java, you have VisualVM and JDK Mission Control, which are great. Memory complexity is very common, but it's one of the least visible problems. Identifying where the issue is, when you're analyzing memory, GC logs, things like that, can be tricky. It's not necessarily about how much memory we have, do we have enough memory; it's about how the memory is being used by the language runtime.
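"Reduce the allocation pressure" often comes down to not allocating in a hot path. Here is a toy example of the kind of change that matters; the effect is illustrative, and you should always measure.

```java
import java.util.List;

public class AllocationPressure {

    // Creates new String objects on every iteration: lots of short-lived garbage
    static String joinAllocating(List<String> parts) {
        String out = "";
        for (String p : parts) {
            out = out + p + ",";   // each concatenation allocates new objects
        }
        return out;
    }

    // Reuses one builder: far fewer short-lived objects for the GC to chase
    static String joinReusing(List<String> parts) {
        StringBuilder sb = new StringBuilder(parts.size() * 16);
        for (String p : parts) {
            sb.append(p).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> parts = List.of("a", "b", "c");
        System.out.println(joinAllocating(parts));
        System.out.println(joinReusing(parts));
    }
}
```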
Let's look at a few charts to understand how that happens. Here we have a chart that shows heap size usage. Does this look like a memory leak, when you have memory going like that? It's not necessarily a memory leak. Sometimes the application is doing what it has to do and the GC is trying its best to keep memory consumption within the limits that were set. Clearly, it's not freeing memory anymore. Not necessarily a memory leak. It is what it is. This, on the other hand, is a very healthy application. You see memory consumption and the GC working and cleaning, working and cleaning.
This one, on the other hand, does look like a memory leak, because the cleanup happens and then you start having more and more objects staying in memory. More GC work is being done, and the chart never goes down. The frequency of GC just keeps increasing. This is a memory leak. Going back to that chart we have about MTTR: diagnosis takes a long time. You have to look at all this data, all these charts, and try to make some sense of it. Sometimes it takes hours, hopefully. Sometimes it takes days. Sometimes it can take weeks, depending on the complexity of the system.
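The classic shape behind that third chart is a collection that only ever grows. A minimal example of what such a leak can look like in Java:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakyCache {
    // Static, unbounded, never evicted: every request leaves something behind,
    // so the live set grows and each GC cycle reclaims less and less.
    private static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        CACHE.add(new byte[1024 * 1024]); // 1 MB "cached" per request, never removed
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            handleRequest(); // heap usage after GC keeps climbing until OutOfMemoryError
        }
    }
}
```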
Then the repair time, again, can take hours, days, or weeks. What happens if we automate even further? There's a lot of automation being done already, but what if we can automate even further with LLMs and AI? That's when these two boxes can actually come back down to seconds and minutes. Because now you have this methodology, whatever you choose, with access to all these tools in the environment, and that methodology can be applied repeatedly by an agent. That agent follows your methodology, gives you an analysis, and even gives you a recommendation.
Once you have that recommendation, you can feed it into a developer agent, a coding agent, that will apply changes to your code. If you're brave enough, you do all of that automatically and it just goes to production. If you're not so brave, and that's ok, you actually want to review it before it goes to production. If you're doing something like a canary deployment, A/B testing, whatever you do to segregate your load, you can actually have a percentage of your system in production applying changes live. If it's all going well, it's all going well. Just replicate that change to the rest.
SRE Agents - With Great MCP Tools Comes Greater Automation
We've gone through performance, objectives, methodologies, and tools. We have everything in hand to automate. Now all we need to do is throw some AI on top, and voila. Not so simple, but it's not that hard either. MCP tools are a great solution for this problem, because you can have these MCP servers running in your environment with access to these Linux tools and language runtime tools. Then, when an SLO breach happens, this MCP server can be triggered and automatically start diagnosing the problem. It's almost like you get a PagerDuty alert, something is going on, and you step away to get a glass of water because you're nervous about it. Then, by the time you finish drinking the water, there's another alert: the system is ok, because your SRE agent just fixed the problem. Fingers crossed, that's what we want to achieve.
Whether that's what actually happens really depends on many things. What is an SRE agent and how do these things work together? The autonomous AI agent will have access to those performance tools, hopefully via MCP, because it's a great way to allow the LLM to do only the things you want it to do, with access to all these tools: vmstat, jps, strace. Hopefully most triggers are automated based on those objectives that you have, but it can also be human triggered. You can go into your IDE and say, I'm curious how this is performing, run that diagnosis for me. Then you get the result and say, I can actually make some optimizations here, even before an alert is triggered.
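To make the shape of this concrete, here is a deliberately hypothetical sketch of the loop such an agent might run. Every name here (SloBreach, DiagnosticTool, and so on) is invented for illustration, and the real MCP wiring and LLM call are reduced to comments.

```java
import java.util.List;

public class SreAgentSketch {

    // Hypothetical: each implementation stands in for an MCP tool the agent may call
    interface DiagnosticTool {
        String name();
        String run();   // e.g. capture a vmstat snapshot, a thread dump, a GC log tail
    }

    record SloBreach(String endpoint, String metric, double value, double threshold) {}

    static String diagnose(SloBreach breach, List<DiagnosticTool> tools) {
        StringBuilder evidence = new StringBuilder("Breach: " + breach + "\n");
        for (DiagnosticTool tool : tools) {
            evidence.append(tool.name()).append(":\n").append(tool.run()).append("\n");
        }
        // In a real agent, this evidence plus the methodology (USE, jPDM, ...) would be
        // handed to an LLM, which returns a classification and recommendations.
        return "LLM analysis of:\n" + evidence;
    }

    public static void main(String[] args) {
        SloBreach breach = new SloBreach("/api/owners", "p90 latency ms", 250, 2);
        DiagnosticTool vmstat = new DiagnosticTool() {
            public String name() { return "vmstat-snapshot"; }
            public String run() { return "us=0 sy=0 id=99 wa=0"; }  // canned demo values
        };
        System.out.println(diagnose(breach, List.of(vmstat)));
    }
}
```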
The most interesting thing: you go from hours and days to seconds and minutes for diagnostics. You can do your assessment, assess the recommendation from the AI, and then apply the changes. That's the interesting, not-so-far-away future: an SRE agent and an SWE agent, a software engineering agent, working together to apply those changes in production or in a staging environment, whatever. We review and analyze those changes, but we don't have to dig into all that stack, all those layers, to find the problem; that was done for us. Some capabilities here: it runs 24-7. Maybe you can reduce how many hours you're awake at night, concerned about the system, and just sleep more relaxed. It follows infrastructure best practices, going back to your methodologies. You automate the response, you can even automate mitigation, like a mitigation agent, and it literally just accelerates your root cause analysis.
Demo
Let's look at a demo. I have two demos here, one in Java and one in .NET. Hopefully that pleases the two cool enterprise ecosystems that are fighting for their lives against Go, Python, and JavaScript. Before I show you the video, I do want to show you what it looks like, at least from our work-in-progress Java solution for SRE agents. We have this example here. This is the result of the analysis that happened. You can see here, I asked it to return a list of all diagnostics performed, and it will use MCP to identify what diagnostics happened on the server. This is my configuration file for my objectives. I have triggers here for latency. This goes into my application along with my APM solution. Once this is in production, I can go and ask my SRE agent, we call it Illuminate for Java here, what's going on?
Then, yes, I can see you're looking at IlluminateProfiles.java, which is actually irrelevant for this context. This is more like a human wanting to diagnose something in production regardless of triggers. For this demo, we run a diagnosis every 60 seconds. We have a JMeter instance hitting the APIs all the time, and every 60 seconds we have an alert, and that triggers the diagnosis. I look at the diagnosis. Here's an example of what that JSON looks like from this MCP tool.
Based on this information, I can see there's a diagnosis available: the system detected a performance issue with your petclinic/api/owners. This particular REST API is not doing so well according to the objectives that I set for my trigger. It will dive into that. I'm going to get this diagnosis, this specific one, and here's my output. What do I have here? Identifier, system is 0, idle is 99%, wait is 0. For demo purposes, don't read too much into the numbers, but the idea is that you got that diagnostics file. From there, the MCP tool combined with the LLM can investigate the problem. It has access to other tools as well at the Kubernetes level, because our SRE agent is deployed to Kubernetes.
There was an Alert Type: REQUEST latency breach. This API took longer than it should have. Ninety percent of requests should be under 2 milliseconds. We threw a random Thread.sleep in there. It's a 10-second monitoring window, which means hopefully we don't get this triggered all the time, because we want to limit the window so we don't flood the alert system. Here's the minimum sample: 30 requests required. After 30 requests happen in the sampling window, it triggers the alert. Key findings. This is the LLM working with the MCP tools from the SRE agent, trying to figure out where the problem is. The diagnosis shows too much waiting as the primary issue.
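Pulling those numbers together, a trigger like the one described might be captured by something as small as this; the field names are invented for illustration, and the values are the ones mentioned in the demo.

```java
public record LatencyTrigger(
        String endpoint,          // "petclinic/api/owners"
        double percentile,        // 0.90 -> 90% of requests
        long thresholdMillis,     // 2 ms objective
        long monitoringWindowSec, // 10-second window, so alerts don't flood the system
        int minSamples            // at least 30 requests before the alert can fire
) {
    public static LatencyTrigger demo() {
        return new LatencyTrigger("petclinic/api/owners", 0.90, 2, 10, 30);
    }
}
```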
The CPU usage is very low, so it's just waiting, so there's some blocking there. No resource usage. Likely root causes: database connection issues, maybe your application is just waiting for the database to respond; a network I/O bottleneck, maybe the network in your cloud environment is having some outages; thread pool starvation; or synchronous blocking operations. It helps you narrow the problem space, so you can investigate only the things that are potentially more directly impacting the issue.
For example, there's no disk usage in this analysis. Even if the disk usage were bad, it doesn't matter, because that's not where the problem is. It gives you some recommendations. Here are the recommended actions. Let me check your Spring PetClinic application's configuration files. The reason it says, let me check the PetClinic configuration files, is because I'm doing the analysis, the diagnostics, in Visual Studio Code with the project open. GitHub Copilot, the LLM, will just try to look into everything based on the context that you have.
The next demo, the .NET demo, shows this solution more as an online dashboard, like Azure DevOps, where you just look at the SRE from a production perspective. It doesn't have access to the source code, and it happens on the server side. Here, the analysis happens in conjunction between the SRE agent deployed on Kubernetes and the LLM running locally on your computer with access to the source code. That's why the diagnostics here are slightly different from the Azure DevOps SRE agent. Here's the summary. Here's the profile ID, the trigger that happened, the value, the threshold, the minimum sampling, the diagnostics result: TooMuchWaiting. What are the key insights? Then the recommended next steps. That's it. You have an SRE agent that narrowed the problem down to what's most likely causing the bottleneck. Instead of spending an hour looking at all the logs throughout all those layers, you go straight to where something bad has happened.
This demo here is App Insights, 300 requests per second with a 20-millisecond response time. We can see the alert happening here. We automated the SRE agent with the system looking at the logs. We can look at the logs, and then there's an alert. Trigger the alert. Here's how the configuration is stored. Everything is deployed in Kubernetes. You have the SRE agent. You have the diagnostics agent. You have the monitoring done at the application level with APM. Then this goes through what I just showed you.
In this case, one of the suggestions, I think, will be to increase the heap size. "Would you like me to get more detailed information?" Again, this is the LLM working with your local context, looking at the deployment set configuration file. I got my diagnostics. This is a GC pause issue. Here's a suggested fix: you should change the heap size. This could take hours for somebody to diagnose in production. Of all the logs and configuration files and everything you have to look into, the GC logs are not simple to read, especially when you're getting started. Once you get a grasp of it, it's not that difficult, and there are many tools that help. This is what Illuminate is doing. It's using the GC logs and using tools that parse the GC log and find the optimizations. It changed the amount of memory for the container limit, and it changed the heap size for the JVM. The engineer just says, apply the fix. Great recommendation. I approve it. It goes to production.
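The kind of fix being approved there typically touches two knobs, the container memory limit and the JVM heap. As an illustrative sketch only (these are standard Kubernetes and JVM settings, not the actual values from the demo):

```
# raise the container memory limit, e.g. in the Deployment spec:
#   resources.limits.memory: 2Gi
# and size the JVM heap relative to that limit instead of a fixed -Xmx,
# keeping GC logging on for the next diagnosis:
JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -Xlog:gc*:file=/tmp/gc.log"
```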
Let's look at another demo here. This is a .NET demo.
Engineer: Now let's demonstrate how we can use the SRE agent reactively to respond to incidents in real time. I have another application that's deployed to a container app. This application lets us simulate various behaviors of an app and test its resilience. You can see, I can create deadlocks, I can trigger high CPU, and I can consume the memory of our resources, in this case, the container app. What I want to highlight in this scenario is issues that arise only under specific conditions, not necessarily due to a configuration error. I want to simulate a memory leak that's going to cause this application to crash. There's some source code that doesn't clean up, it consumes too much memory, and eventually the application runs out of memory and falls over. This container app has been integrated with an incident management tool called PagerDuty. I've configured PagerDuty to send alerts when the app becomes unreachable or it hits a certain error threshold. Those alerts will notify the on-call engineer, which is currently me. Let's do that: let's spam until our application runs out of memory. Just like that, we're now seeing 500s. Our app is unreachable, it has fallen over.
As soon as this happens, PagerDuty is creating an alert and it's calling me, it's notifying me of this incident. What I want to show is how I've set up my SRE agent to handle alerts like this from PagerDuty. Here's the SRE agent that's deployed, another one in my subscription. Again, we can confirm that it's managing the right resources tied to this application. Here's my container app already marked as unhealthy. If we go to the settings under incident management, we can see how I've configured my agent to integrate with PagerDuty. I have my PagerDuty API access key here.
If we go back to our agent, a thread should be created, and here it is, from that alert. The ICM that was created from PagerDuty is pulled into a new thread for our SRE agent. It's using that as the prompt to diagnose what this issue might be. It's kicking off that investigation. It's running the workflow to provide that root cause analysis. You can imagine if you're an on-call engineer, we've all been there, and we're going to bed really afraid that we're going to get that 2 a.m., 3 a.m. phone call waking us up because there's been an incident. You can rest assured that the SRE agent is already on top of it and already kicking off a workflow to diagnose the issue and hopefully resolve what that issue might be.
This is going through its process. What I want to do is jump to a thread that has already been completed, because it does go through quite a rigorous workflow, checking all the different settings of our apps, the health, doing things like memory dumps, which can take a little bit of time, but again, it's a very thorough investigation. We'll switch to a thread that's already been completed, and we can see that this incident has been remediated. Our agent was able to detect that, yes, the memory was extremely high, up to 97%, and 500 errors did start to appear in our logs. It performed a memory dump on our behalf. Again, these are tasks that are not simple. It was able to do this within minutes. It provided this memory dump.
Then what it did was, it knew that it could temporarily mitigate this issue by scaling up our app, increasing the memory and the replicas to three, to work around this issue. Since it recognized that this incident wasn't necessarily due to a configuration change, but more likely something that was introduced with code, it raised a GitHub issue with the details of that memory dump. We can actually hop over to our GitHub repo, and I can look at the issue that had been opened. It says it's me, but really the agent did this on my behalf.
Here's all the information from that memory dump, so that a software engineer who's responsible for this part of the application can come in, look at this, and address it with a more permanent fix. What I want to show lastly is that you can always jump back to the thread from these issues here. You can see the full investigative analysis that happened from the agent. Just like that, those are two different scenarios where you can use the SRE agent proactively and reactively to mitigate issues that come up with your applications running in Azure.
Key Takeaways
Bruno Borges: I'd like you to keep this in mind. Performance is not about speed. It's not just about going fast. It's about meeting expectations. Set those objectives so you know if the expectation is being met. That's extremely important. Otherwise, you don't know what you're looking for. The methodologies like USE or jPDM do help. They help you identify the bottleneck. You can do it manually, you can automate it, or you can have AI do it for you, so it can work through that whole methodology. Again, what is the next step in automation? Maybe that's the question here for this problem space. If you have lots of scripts and alerts and monitoring, that's great. You're halfway there. Because doing performance diagnostics is not fun. It can be fun, but it's not that fun when you have production systems being impacted.
If you can automate further, do it. LLMs are great for this. MCP tools are the next step. Giving access to those tools, again with very strong guardrails, is extremely important. Our solution for the SRE agent does have different modes. You have the read mode, you have the write mode, you have just the observer mode, things like that, which can really limit what the SRE agent can do in production. You can choose your level of bravery in letting AI diagnose problems in production. It's completely under your control. You can do it yourself. You don't have to use Azure SRE agents. I'm sure many other vendors are working on solutions like this. You can do it yourself. You can go and get those LLMs and integrate them into your system. Local LLMs are a great solution for this as well, because they can run easily next to your workload.