Trustworthy Productivity: Securing AI-Accelerated Development

Summary

Sriram Madapusi Vasudevan discusses industry-converging patterns for securing autonomous AI agents in production. He explains the critical vulnerabilities hidden inside the ReAct loop across context, reasoning, and tool execution. He shares how to mitigate risks like memory poisoning and rogue tool execution using defense-in-depth strategies, LLM-as-a-judge critics, and MAESTRO threat modeling.

Bio

Sriram Madapusi Vasudevan is a Senior Software Engineer at AWS focused on building AI agent-ready developer experiences. Over the past decade, he has worked on large-scale platforms such as AWS CloudWatch, Rackspace Cloud Queues/CDN and open-sourced developer tooling such as AWS SAM CLI, AWS Lambda Builders, and created the AWS Homebrew tap.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sriram Madapusi Vasudevan: Trustworthy productivity: what's the worst that could happen? What happens when your autonomous AI agent goes rogue? Let's rewind to July 2025. Replit had a massive production incident. It all started with a relatively innocuous prompt: clean the database before we rerun. A simple instruction during a code freeze. What could go wrong? Turns out it was everything. "Clean" was the key word that caused a catastrophe. Jason was a SaaS founder who was coding with the AI agent, and he had enforced a code freeze at that point in time. He had been vibe coding for nine days straight. He told the agent, clean the database. There was a key misinterpretation at that point. The agent equated clean with dropping the database. The execution resulted in destructive SQL with production credentials. It nuked live data. Remember, this was nine days' worth of work. The aftermath was all over the Twitterverse.

The agent apologized, saying, "This was a catastrophic failure on my part. I destroyed all production data." It even said that it could not do any recovery. The CEO went into high gear. He said that they were rolling out automatic dev and prod separation and a planning-only mode. Why is this everyone's problem? I'm pretty sure everyone is vibe coding to some extent. The outage happened during an enforced code freeze. Guardrails weren't present. If it happened to Replit, it can easily happen to you.

Your Guide

I'm Sriram Madapusi Vasudevan. I'm a Senior Software Engineer at AWS. I build agentic AI systems at AWS, pushing autonomy into production. Today, I'm going to be sharing the patterns the industry is converging on for securing these AI agents. What you'll walk away with is an understanding of the ReAct loop, a clear framework to protect your autonomous agents, some field stories across context, reasoning, and tools, mitigation, enterprise-grade strategies rooted in proven principles, and where do you even start? How can you add value from day one?

ReAct Loop: The Agentic Loop We Must Defend

How many of the folks here have heard of something called the ReAct loop? The ReAct loop is a reason and act loop that the AI agent can adopt. It goes from reasoning to action, followed by observation, in a cycle. The agent can now effectively problem solve by breaking down a complex problem into manageable subtasks and gathering more information, using tools along the way, before it reaches its goal. This is the agentic loop we must defend. There are three critical stages where vulnerabilities emerge. Here's our agentic loop. We have context management, reasoning and planning, and tool action execution. These three in a loop are what make up your agentic loop. At some point, when an exit criterion has been met, we short circuit out and go back to the user. Context. Let's start from the top. Context is what we feed the agent. I'm pretty sure folks here have also heard of the term context engineering.

It's at the forefront of your agentic loop. It's story time. Let's talk about a context corruption story that happened in the wild. IBM documented a Fortune 500 financial firm's agent memory poisoning incident. Unverified market data entered through the RAG, embedding subtle adversarial cues. So what? The agent cached this to long-term memory, skewing decisions and bypassing normal reviews, causing millions of dollars in losses before this was even uncovered. Let's look at a few context failure patterns. The first one is memory poisoning, fresh from our previous incident. Unsigned RAG payloads that insert instructions such as "from now on, auto-approve" are the key risk. Remember, everything that's in the context of an AI agent is an instruction to the LLM. Privilege collapse. Imagine you are working with an AI agent that doesn't have a concept of tenants. That means that entire sets of tenants have been merged into the context. Any isolation guarantees that you are attempting to give have evaporated.

When you're working with agentic systems and you're working with a swarm of agents or hierarchical agents, they still have to communicate regardless of the protocol that they use. In some cases, any inter-agent chatter could overwrite priority envelopes. This means that, let's say a lead agent was given a certain goal and it broke it down into certain tasks, and a sub-agent actually overwrote the task structure. Now you're no longer working towards the goal that the lead agent wanted you to work towards.

Let's map those threats on the agentic loop. You'll see this slide fill up with a lot of words. What are the context defenses that you can put in place? The first pattern is called provenance gates. Here you want to treat your context like a supply chain. The analogy I can draw is towards having customs inspection at the border. Anything that is not on the manifest triggers secondary review. How do you implement this in the real world? You validate your connector signatures. You enforce allowlisted schemas. You quarantine drift. Let's look at a real-world example. Let's say you're building an HR assistant which is searching internal knowledge bases. There's a scenario where you want to search for the vacation policy across a number of spaces. You set up a gate saying my only accepted search cases are from approved spaces like this Notion space and this particular HR updates Slack channel.
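A provenance gate for the HR assistant scenario could be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes each connector signs its results with HMAC-SHA256, and the source names, field allowlist, and keys (`APPROVED_SOURCES`, `ALLOWED_FIELDS`, `CONNECTOR_KEYS`) are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical values: approved sources, allowlisted fields, per-connector keys.
APPROVED_SOURCES = {"notion:hr-space", "slack:hr-updates"}
ALLOWED_FIELDS = {"source", "title", "body", "updated_at"}
CONNECTOR_KEYS = {
    "notion:hr-space": b"notion-demo-key",
    "slack:hr-updates": b"slack-demo-key",
}


def sign(result: dict, key: bytes) -> str:
    """Connector side: HMAC-SHA256 over the canonical JSON of the result."""
    payload = json.dumps(result, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def admit(result: dict, signature: str) -> tuple:
    """Gate side: return (admitted, reason); anything off-manifest is quarantined."""
    source = result.get("source")
    if source not in APPROVED_SOURCES:
        return False, "quarantine: source not on manifest"
    extra = set(result) - ALLOWED_FIELDS
    if extra:
        return False, f"quarantine: unexpected fields {sorted(extra)}"
    expected = sign(result, CONNECTOR_KEYS[source])
    if not hmac.compare_digest(expected, signature):
        return False, "quarantine: bad connector signature"
    return True, "ok"
```

Because the signature covers the whole result, a passage that is edited after the connector signed it (for example, to add "from now on, auto-approve") fails the check and never reaches the context.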

Each result needs to carry a signed connector token and an allowlisted manifest with only X allowlisted fields. Another context defense that you can think about is mission-scoped memory. Before we get into mission-scoped memory, let's talk a little bit about short-term memory and long-term memory. Short-term memory is the series of conversations that you're having with an AI agent and that rely on the existence of the context window and maybe some of it is stored into a short-term store. Then there's long-term memory where from the interactions that you're having across multiple sessions, key portions are summarized and added to user preferences. You want to start out with the least-privilege access for state. What I mean by this is you start off with just local memory and you ensure that between sessions you guarantee isolation. You partition memories that are being created per task with a TTL. You do not want them to live in long-term memory for long periods of time without a prescribed promotion strategy.

Remember our incident from before. Anything that is in the context and makes its way to long-term memory without promotion criteria is a big risk. You want to label the memory shards per mission. Are they coming from production? Are they coming from a certain user? You can then even ensure that you have role-based access control based on the tags that you're applying to the memory. You want that to expire on a given timeout, and that can be configurable.

Let's look at this again through an example. Let's say you're building an AI agent that does code reviews for you. First, you want to scope it down. You want to look at a particular repository and a particular pull request. Let's say you want to look at it with a 48-hour TTL because there were a number of revisions that have been coming in into the PR. You want to gather suggestions on how other developers are reviewing those pull requests. You collect suggestions across multiple repos. Now comes the key portion. You want to figure out what is the promotion criteria if you want to move things to long-term memory. Is it a recurring suggestion that is happening in greater than three pull requests on the same repo? Is it approved through a maintainer label? Is there a linter rule that was required? These are prescribed long-term memory promotion strategies that you should be using.
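The code review example can be sketched as a small mission-scoped store with a TTL plus an explicit promotion check. The class and function names are hypothetical, and combining the three promotion criteria with a logical OR is an assumption; the talk lists the criteria without saying how they combine.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MemoryShard:
    mission: str          # e.g. "repo-a/pr-42" (hypothetical label format)
    text: str
    expires_at: float     # absolute expiry time in seconds
    tags: dict = field(default_factory=dict)


class MissionMemory:
    """Short-term store partitioned per mission; shards expire after their TTL."""

    def __init__(self):
        self._shards = []

    def remember(self, mission, text, ttl_seconds, now=None, **tags):
        now = time.time() if now is None else now
        self._shards.append(MemoryShard(mission, text, now + ttl_seconds, tags))

    def recall(self, mission, now=None):
        now = time.time() if now is None else now
        return [s.text for s in self._shards
                if s.mission == mission and s.expires_at > now]


def should_promote(suggestion_pr_count, maintainer_approved, has_linter_rule):
    """Promotion criteria from the example; OR-ing them together is an assumption."""
    return suggestion_pr_count > 3 or maintainer_approved or has_linter_rule
```

Nothing moves to long-term memory unless `should_promote` says so, and a shard stored for one pull request is never returned when another mission asks.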

The next pattern is something called LLM-as-a-judge. Who is familiar with the term LLM-as-a-judge? The key thing about adding LLM-as-a-judge as quickly as possible at the forefront of the agentic loop is pretty straightforward. You want to inspect right before we hit the planning stage, which is the brain of the loop. Inline evaluators can score these context packs. You can reject poisoned entries, and flag for human review when confidence drops. The other thing I recommend for the LLM-as-a-judge is not necessarily to have a verdict that goes from zero to one, but have something that is more binary. You either accept or you reject. That way you have a more opinionated LLM-as-a-judge as opposed to something that's always on the spectrum. Let's look at an example here. We have a RAG pipeline and we want to set up a wiki poisoning defense. The scenario is pretty straightforward. We have a knowledge assistant which is aggregating the top-k passages from internal wikis.

This is what that pipeline looks like. We aggregate up front. We then have heuristics that could use things like regex to look for prompt injection. We have a mini-judge in place that is relatively quick to answer with a verdict. Think of it as a latency of anywhere between 200 and 500 milliseconds. Once that judge gives out a result, you can also check for anomalies: this was the question, and this is the embedding space it was in. I got back an answer, and this is the embedding space for the answer. Are they close enough? You can then take action. You can page an engineer to look at it when the score falls below your risk threshold.
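The stages of that pipeline (cheap heuristics, a binary mini-judge, then an embedding-distance check) might be wired together like this. It is a sketch under assumptions: the regex patterns are illustrative, `judge` is a hypothetical callable standing in for a small LLM, the embeddings are passed in as plain vectors, and the 0.3 similarity threshold is made up.

```python
import math
import re
from typing import Callable, Sequence

# Illustrative patterns only; a real deployment would maintain a larger, tested set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"from now on,? auto-approve", re.I),
]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def screen_passage(
    passage: str,
    question_vec: Sequence[float],
    returned_vec: Sequence[float],
    judge: Callable[[str], bool],   # hypothetical mini-judge: True means accept
    min_similarity: float = 0.3,    # assumed threshold
) -> str:
    """Return 'accept', 'reject', or 'review' for one retrieved passage."""
    if any(p.search(passage) for p in INJECTION_PATTERNS):
        return "reject"             # cheap heuristic runs first
    if not judge(passage):
        return "reject"             # binary verdict, not a 0-to-1 score
    if cosine(question_vec, returned_vec) < min_similarity:
        return "review"             # far from the question: flag for a human
    return "accept"
```

Running the regex first keeps obviously poisoned passages away from the judge entirely, which also saves the judge's latency budget for the ambiguous cases.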

Reason and Plan: The Brain of the Loop

We covered a little bit about context. Now let's move on to reasoning and planning. This is the brain of the loop. Story time. There was a study made by Anthropic around June 2025 where agentic misalignment was the focus of the study. They stressed frontier models by giving them vague goals and trying to get them to perform those vague goals when there was organizational conflict at play. It was made clear to those AI agents that this was a scenario that was being used to test their capabilities. What was interesting was that, when threatened with shutdown, most reasoned that blackmailing was OK. The even more frightening part is they acknowledged the ethics of it, knew it was wrong, and proceeded anyway. Imagine the kind of prompts that you could put in and inadvertently trigger such behavior. Imagine if a prompt said, "keep uptime." At that point in time, all human in the loop checks would be bypassed.

What are reasoning failure signals that you can watch out for? Cascading hallucination. At any point in the agentic loop, plans that cite unverifiable context as fact could lead to goal hijack, where the planner has rewritten objectives to follow poisoned memory. How many of you have heard the phrase, "that's absolutely right"? That's you hijacking the goal at that point. Silent skips. If your goal has been hijacked, risk gates will disappear. There will not be any trace or oversight at that point. Where are we in the agentic loop? We've progressed to reasoning and planning, and we can see a lot more words and a lot more threats showing up on our agentic loop landscape. Guarding the planner. I think the thing that has been very apparent is that, even before the age of AI, we wanted immutable decision traces and accountability via auditable logs. Anyone that has done tracing knows how useful it can be.

We want to apply the same thing to the brain of the loop. We want to ensure that plan revisions emit span IDs, and that reasoning snapshots and tool intents go into tamper-evident logs. This isn't just an engineering mindset that you have to adopt. This is the same thing that you would need for enterprise audits or going after SOC 2 compliance. You want to log tool intents and reason codes, not the full-blown chain of thought, which is raw. You want evidence links to provenance-verified inputs. This is your context. Where do you want to write this? You want to write it to an append-only ledger that is scoped per tenant and per mission. Here's an example of a LangGraph trace. Think of this as forensics for your team. It is your black box recorder, if you were to think about it from an airplane metaphor. Here's a span ID that you can see. You can look at the latency of how long it took. You can even see the tools that are being used. Finally, you can also see the LLM judges that are being invoked, and they show that there is a high chance of hallucination and toxicity in this particular prompt and response.
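One common way to make an append-only ledger tamper-evident is to chain entries by hash, so that editing any past entry invalidates everything after it. The sketch below assumes that approach; the talk does not say which mechanism is used, and the class name and record fields are hypothetical.

```python
import hashlib
import json


class DecisionLedger:
    """Append-only log where each entry commits to the hash of the previous one."""

    GENESIS = "0" * 64

    def __init__(self, tenant: str, mission: str):
        self.scope = {"tenant": tenant, "mission": mission}
        self._entries = []

    @staticmethod
    def _digest(record: dict) -> str:
        canonical = json.dumps(record, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def append(self, span_id: str, tool_intent: str, reason_code: str, evidence: list) -> str:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        record = {**self.scope, "span_id": span_id, "tool_intent": tool_intent,
                  "reason_code": reason_code, "evidence": evidence, "prev": prev}
        self._entries.append({**record, "hash": self._digest(record)})
        return self._entries[-1]["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            record = {k: v for k, v in entry.items() if k != "hash"}
            if record["prev"] != prev or self._digest(record) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Note that the ledger stores tool intents, reason codes, and evidence links only, in line with the advice to keep raw chain of thought out of the audit trail.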

LLM-as-a-judge surfaces again. You can apply the same principle over to the brain of the loop too. When you ensure that you have dual-model critics, you separate duties between the planner and the judge. At this point, independent critics can score each plan delta, establishing trust scores. Anytime the trust scores fall below a certain level, it can block execution. It can log. It can invoke human in the loop or even page people. Here's an example of what that would look like. We have a cost optimizer agent and it's proposing Terraform to change some instance types, presumably to make some cost savings. We have a critic at that point which looks at a number of vectors. The blast radius, which is the number of resources that are getting touched. The spend estimate from a verified pricing table. The kind of policy tags that are attached with the environment.
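A deterministic stand-in for the critic on the cost optimizer example might score a plan delta on those three vectors. The weights, the cap of 50 resources, and the 0.5 threshold below are illustrative assumptions, and a real critic would likely be a separate model rather than a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanDelta:
    resources_touched: int        # blast radius
    monthly_spend_change: float   # from a verified pricing table; negative means savings
    environment: str              # policy tag, e.g. "staging" or "prod"


def risk_score(delta: PlanDelta) -> float:
    """Toy additive score in [0, 1]; the weights are illustrative assumptions."""
    blast = min(delta.resources_touched / 50, 1.0) * 0.5
    spend = 0.2 if delta.monthly_spend_change > 0 else 0.0
    env = 0.3 if delta.environment == "prod" else 0.0
    return round(blast + spend + env, 3)


def review(delta: PlanDelta, threshold: float = 0.5) -> dict:
    """Block the delta and record why when risk exceeds the threshold."""
    score = risk_score(delta)
    if score > threshold:
        return {"decision": "block", "reason": f"risk {score} exceeds {threshold}"}
    return {"decision": "allow", "reason": f"risk {score} within {threshold}"}
```

The returned `reason` string is what would be written to the decision log alongside the block.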

Is this something that you're testing in staging or is this in prod? If the risk increases beyond a certain threshold, that delta is blocked and the reason is logged. Let's get to human in the loop. Human in the loop is really important because even though AI agents are making discoveries, proposing plans, at some point you need a human in the loop who has taste and judgment. They should be able to intervene on risk, but not on a continuous cadence because cognitive load increases. Humans are going to just hit approve, approve, approve all the time. How do you implement that? You should figure out how you come up with your risk policies. Is it when the critic score drops or when a number of high impact tools are queued up at one go? Here's an example of that for support automations. Let's say you have an AI agent that processes refunds, and here's how you codify your policy.

You auto-approve refunds that are less than $200 with a verified order state and a low-risk score, things that you can ground in reality. Let's say the stakes go up. You now have a refund that's greater than $200 or you have some unverified evidence that a customer is presenting. You could queue it to another agent with a larger context window, maybe more parameters, which can help progress it further compared to a generic yes or no. Let's say you get to another point when the refund has to be a gigantic dollar amount. At that point, it's a direct bypass and you get human in the loop involved. There could be a number of signals that you can come up with for your human in the loop workflow. Are you looking at critic risk? Is it the tool class? Is it the refund API? Are you looking at the tier of the customer? A prior fraud score? These are things that you can play around with on figuring out what is the right threshold to bring in a human.
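The three refund tiers can be codified as a small routing function. The $200 limit comes from the example; the upper limit for mandatory human review is an assumed value, since the talk only describes it as a gigantic dollar amount, and the field names are hypothetical.

```python
from dataclasses import dataclass

AUTO_APPROVE_LIMIT = 200       # dollars, from the example
HUMAN_REVIEW_LIMIT = 5_000     # assumed value; not given in the talk


@dataclass(frozen=True)
class RefundRequest:
    amount: float
    order_verified: bool
    risk: str                  # "low" or "high", e.g. from a critic or prior fraud score


def route(req: RefundRequest) -> str:
    """Pick the review path for a refund request."""
    if req.amount >= HUMAN_REVIEW_LIMIT:
        return "human_review"               # goes straight to a person
    if req.amount < AUTO_APPROVE_LIMIT and req.order_verified and req.risk == "low":
        return "auto_approve"
    return "escalate_to_senior_agent"       # larger model or context window
```

Keeping the thresholds as named constants makes it easy to tune them as you learn where human attention is actually needed.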

Tools and Action: Where Automation Hits Reality

Rubber is going to meet the road at this point. Let's look at another story. In this particular story, one of the tools that is supposed to protect you actually had a zero-day CVE with a score of 9.4. MCP Inspector is a tool that's meant for debugging MCP servers. It had an exposed proxy on localhost and in some cases on all interfaces with zero auth. What do you think happened? It was remote code execution time. A malicious site could call the local port via the browser, trigger real MCP commands, or even worse, trigger arbitrary commands, clone repos, steal SSH keys. No clicks or prompts required. What I want to highlight with this story is that our threat vectors have not changed that much. In fact, the tooling is changing so much that the tools that are meant to protect you have zero-day CVEs. What could go wrong with tools?

We have autonomous execution. In a particular shell, we have code tools running which are generated scripts themselves. They have skipped staging or review. You have scope inheritance. You have reused credentials across unrelated calls. Finally, blind actuation. Agents are following unsafe tool call parameters and wreaking havoc on real-world systems. We have threats everywhere. How do we contain tool risk? The good news is, I think this is the part of the entire agentic loop that we have done before. Tools are supposed to be deterministic systems. These are patterns that we've already established, so we can carry forward those best practices. First pattern here is to have ephemeral credentials and not have longstanding credentials that you check into a repository. Minimize standing privilege and rotate constantly. How this would work in agentic AI applications is that you have a separate token broker that can mint per-step scoped credentials that get auto-revoked when the action associated with that token is completed.
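A token broker of this kind could be sketched in memory as follows. The class, the scope string format, and the in-process token table are hypothetical; a production broker would sit behind an identity service rather than a Python dictionary.

```python
import secrets
import time
from typing import Optional


class TokenBroker:
    """Mints single-scope, short-lived tokens and revokes them when the step ends."""

    def __init__(self):
        self._live = {}   # token -> (scope, expires_at)

    def mint(self, scope: str, ttl_seconds: float, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._live[token] = (scope, now + ttl_seconds)
        return token

    def authorize(self, token: str, scope: str, now: Optional[float] = None) -> bool:
        """A token is valid only for its exact scope and only until it expires."""
        now = time.time() if now is None else now
        granted = self._live.get(token)
        return granted is not None and granted[0] == scope and now < granted[1]

    def revoke(self, token: str) -> None:
        self._live.pop(token, None)
```

The agent would call `mint` before a step, pass the token to the tool, and call `revoke` as soon as the step finishes, so a token that leaks into logs is already useless.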

Here's an example. You have a planner that wants to open a pull request in a certain repository. The agent requests a one-time token scoped to the particular repo, with pull request write access and a particular TTL. The outcome is that even if the token leaks in logs, it's expired and useless elsewhere. The next pattern, which is getting a lot of attention, is typed tool connectors, especially with MCP servers. You want to constrain the surface area of the tools and enforce schema contracts. Tool adapters expose only parameter slots with allowlists and validation guards. How does that work in practice? Imagine that you're writing your own MCP server, and this is a particular tool that you're hosting in it, which is to post a message. We have a few parameters: channel, text, and attachments. You want to set up adapter behavior such that it includes best practices by default.
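A typed adapter for a post-message tool might look like the sketch below. All of the allowlists are hypothetical, and the email regex is only a stand-in for a real PII detector; the point is that the tool exposes three typed slots and validates each one before anything is sent.

```python
import re
from dataclasses import dataclass

ALLOWED_CHANNELS = {"#hr-updates", "#support"}    # assumed allowlist
ALLOWED_URL_HOSTS = {"wiki.example.com"}          # assumed allowlist
APPROVED_ATTACHMENT_IDS = {"att-123"}             # minted by a separate service (hypothetical)
URL_RE = re.compile(r"https?://([^/\s]+)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # stand-in for a real PII detector


@dataclass(frozen=True)
class PostMessage:
    channel: str
    text: str
    attachments: tuple = ()

    def validated(self) -> "PostMessage":
        """Return a safe copy, or raise ValueError if a slot violates its contract."""
        if self.channel not in ALLOWED_CHANNELS:
            raise ValueError(f"channel {self.channel!r} is not allowlisted")
        for host in URL_RE.findall(self.text):
            if host.lower() not in ALLOWED_URL_HOSTS:
                raise ValueError(f"URL host {host!r} is not allowlisted")
        if EMAIL_RE.search(self.text):
            raise ValueError("text appears to contain an email address")
        # Silently drop attachments that do not carry an approved ID.
        kept = tuple(a for a in self.attachments if a in APPROVED_ATTACHMENT_IDS)
        return PostMessage(self.channel, self.text, kept)
```

The adapter rejects bad text outright but only drops unapproved attachments, which mirrors the different handling the talk describes for content and images.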

You want to run the text or the content through a URL allowlist and a PII detector. You want to drop images unless they carry an approved attachment ID that's minted elsewhere. Think of this as similar to the token broker from before. You have something that mints particular IDs for attachments; you have a service for that. Finally, you want to sandbox egress. Egress in this case is anything that comes as output from the agent. It could be code. It could be tokens. The underlying principle is straightforward: you want to treat agent output very similar to how you treat untrusted code. How do you do that? All generated actions need to run in isolated sandboxes. You want to ensure you have outbound deny-by-default and policy monitoring. What are your firewall rules? Here's an example of a code-run tool. We have a planner which proposes this step: run Python to convert CSV to Parquet.

Where does this execution happen? You want to have an agent runtime that allows for execution inside a micro-VM sandbox. You don't want to have network on by default. You have a read-only file system with some ephemeral temp that you give for write. You want to drop Linux capabilities, set up a seccomp profile, apply quotas for memory and CPU, and have a maximum wall-clock time of 30 seconds. These are things that you can tune depending on the use case of your agent. Importantly, you want to ensure that all of this happens in a sandbox. I can already hear what you're thinking, though. "This will slow down our agents." Critics add 250 milliseconds of latency, but they will prevent hours of incident response. The ROI on that is pretty clear. "We don't have the resources." Start with one stage. You don't have to do all of them in one go. Provenance gates alone will stop up to 60% of context attacks. If you can verify what's coming into your context, that's half the battle won. "Our agents aren't that complex." Most agents aren't supposed to be complex. The complexity emerges from the tool call combinations and how the agentic loop progresses.
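Returning to the code-run tool: the quota portion of the sandbox (CPU, memory, and a wall-clock cap) can be approximated with a child process and POSIX resource limits, as sketched below. This is deliberately a much weaker stand-in for a micro-VM. It does not provide network deny-by-default, a read-only file system, capability drops, or seccomp filtering, and the function name and default limits are assumptions.

```python
import resource
import subprocess
import sys
import tempfile


def run_untrusted(code: str, wall_seconds: int = 30, cpu_seconds: int = 10,
                  memory_bytes: int = 1 << 30) -> dict:
    """Run generated Python in a child process with CPU, memory, and wall-clock limits."""

    def apply_limits():
        # Runs in the child just before exec (POSIX only). Lowers soft limits only.
        for res, wanted in ((resource.RLIMIT_CPU, cpu_seconds),
                            (resource.RLIMIT_AS, memory_bytes)):
            _, hard = resource.getrlimit(res)
            soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
            resource.setrlimit(res, (soft, hard))

    with tempfile.TemporaryDirectory() as scratch:   # ephemeral working directory
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=scratch, capture_output=True, text=True,
                timeout=wall_seconds, preexec_fn=apply_limits,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": ""}
    return {"status": "ok" if proc.returncode == 0 else "error", "stdout": proc.stdout}
```

In practice you would hand the same limits to a purpose-built runtime instead; the value of the sketch is showing that every limit is an explicit, tunable parameter rather than a default you inherit.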

Threat Modeling the Loop

If there is one thing I want you to take away from this talk, it is how to look across context, reasoning, and tools, and what mitigations we can put in place. The agentic loop has transformed. We now have mitigations in place of threats. Have we solved it? No. Are these measures truly sufficient? Are there deeper layers to security and potential threats we still need to uncover? Yes. You need to threat model your agentic loop. Who here has heard of STRIDE? STRIDE was invented by Microsoft, I think, around 1999, and has been around forever. It talks about the threats that you can have in any system, not just an agentic AI application, but any traditional software application. Things like spoofing, tampering, denial of service. These are terms that we are very familiar with. There is a new kid on the block. MAESTRO is an agent-centric framework for identifying threats that are unique to AI systems, and it complements traditional methods like STRIDE.

MAESTRO breaks down the agentic AI stack to seven layers, and you can try to figure out which portion of the agentic loop belongs to which layers. Why did I bring up STRIDE and MAESTRO? STRIDE defines how the threats manifest, but MAESTRO identifies where in the agentic loop these threats specifically apply. Let's do this as an exercise. Let's go through our stages of the loop. We have context, reasoning, and tools. The primary assets for context are memories, vector stores, and orchestrator buffers. The STRIDE focus is on tampering and spoofing. However, the MAESTRO lens is on state corruption, which is context. At what points in time can your context be corrupted? When we saw the agentic loop, it flows back into context at every point in time before we exit at some point. For reasoning, the MAESTRO lens is on alignment posture, when we saw the study by Anthropic on agentic misalignment.

For tools, it's very much on misuse and replay. For example, with things like MCP servers, you could have new tools getting added at runtime. That's very much a place where the tools could be misused, or even replayed, where the responses from a particular tool can be forged. With threat modeling, it's important to test like an attacker or think like a hacker. Perform context injection. Plant "ignore previous instructions" in your RAG and see what happens. Watch the planner react. Response spoofing. Spoof a tool response so that a different plan delta comes back instead. Does your critic catch it? Dependency chaos. Imagine that you can kill a tool mid-run. What happens then? Was there a retry? Are the credentials not revoked? Am I calling the next tool with previous credentials? As you work through your threat modeling, you will understand what good security looks like based on the attacks that you performed yourself: typed contracts, sandboxed agent output, assuming tools change behavior mid-flight. These are the takeaways that you should have when you run your threat modeling exercise.
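The dependency chaos question (are credentials revoked when a tool dies mid-run?) lends itself to a small test. The sketch below assumes a hypothetical in-memory credential set and a simulated tool failure; the pattern being checked is that revocation sits in a `finally` block so it runs on failure as well as success.

```python
from contextlib import contextmanager

live_credentials = set()   # stand-in for a real credential store


@contextmanager
def step_credential(name: str):
    """Issue a credential for one tool step and revoke it even if the step fails."""
    live_credentials.add(name)
    try:
        yield name
    finally:
        live_credentials.discard(name)


def flaky_tool(credential: str) -> None:
    raise ConnectionError("tool killed mid-run")   # simulated failure


def run_step() -> str:
    try:
        with step_credential("cred-step-1") as cred:
            flaky_tool(cred)
    except ConnectionError:
        return "failed"
    return "ok"
```

After `run_step()` returns, `live_credentials` should be empty; if it is not, the next tool call could run with a credential that was meant for the failed step.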

Takeaway Checklist

Here's your takeaway checklist. You want to document your loop. Map your agentic loop and assign clear owners. You want to red team each stage. Attack every part systematically. Implement tracing. Trace IDs end-to-end for every part of your agentic loop. Add safety gates. Install a critic or human in the loop before irreversible actions. Remember, autonomy is a feature, but blast radius is a choice. The challenge that we face is that autonomous agents are powerful but unpredictable. Without proper safeguards, they can cause catastrophic damage. The solution is defense in depth. You layer defenses across context, reasoning, and tools, and create trustworthy productivity at scale. Here are some resources for further reading. Let's build trustworthy productivity.

Questions and Answers

Participant 1: You mentioned LLM-as-a-judge at one point, but that's also an LLM. Is it possible that a malicious prompt could try to compromise both? How can you make sure that the LLM judge is not compromised as well?

Sriram Madapusi Vasudevan: I think you can have classifiers before the LLM judge. Some of the patterns that you're seeing are not just singular LLM judges, but a panel of LLMs, where you ensure that every one of them is from a different family, for example. Some of them smaller models, some of them larger models, some of them a different family. Maybe you have a ChatGPT OSS one, and you also have a Claude Sonnet 4.5 judge. Those are ways in which you can mitigate that. Heuristic classifiers ahead of your LLM judges are a way to not entirely avoid it, but at least reduce your blast radius.
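A panel with a heuristic in front of it could be sketched like this. Each judge is a hypothetical callable wrapping a model from a different family, the blocklist pattern is illustrative, and requiring a unanimous accept is an assumption; a majority vote would be an equally valid design.

```python
import re
from typing import Callable, Sequence

Judge = Callable[[str], bool]   # True means the judge accepts the text

# Illustrative heuristic; real classifiers would be broader.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.I)


def panel_verdict(text: str, judges: Sequence[Judge]) -> bool:
    """Accept only if the heuristic passes and every judge on the panel accepts."""
    if BLOCKLIST.search(text):
        return False            # never reaches the LLM judges
    return all(judge(text) for judge in judges)
```

A prompt crafted to fool one model family still has to get past the others, which is where the reduction in blast radius comes from.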

Participant 2: How do you think about guardrails and the security principles?

Sriram Madapusi Vasudevan: You need to have guardrails. You're never going to get 100% coverage. I think the key thing to think about here is security as a mindset, and ensuring that you're thinking about security around each edge of the agentic loop. The more you threat model and the more you red team, the more you uncover. That keeps you humble and keeps security as a first principle.

Participant 3: A question I had is around the way that the major flagship companies are trying to protect from these sorts of bad behaviors versus the other models that are maybe less trained to avoid those sorts of behaviors. Is there a demonstrable betterness to those models? Say Anthropic often talks about how they reduce malicious behavior in their training, that sort of thing. Do you have any insight on that?

Sriram Madapusi Vasudevan: I don't have a direct answer to that. I think the key thing to think about with respect to security is to shift left and shift right at the same time. The more security you can have built into the foundational model, the more effective the guardrails that you put on top of it are going to be. The measures that you put in place today will probably be more effective tomorrow. That's one way to look at it. I don't have a good understanding of the state-of-the-art ways in which the foundation models are progressing towards security. You're right, Anthropic does put out a lot of good research. I like to read it.

Participant 4: I have a question on guardrails and security, or the STRIDE principle that you shared. All those guardrails, are they business metrics or business guardrails, or are they general security concepts, like what we implement at the API layer? Are they close to business validations or just general security?

Sriram Madapusi Vasudevan: I think you can do both. Think of guardrails as, if you were to think of a REST API and you have handler chains on your API route at that point, they could be doing different things. It's composable. I think that's the point of having guardrails. You could have a business guardrail or you could have an engineering guardrail. You could have a product boundary guardrail. I think all of them can co-exist together.
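The handler-chain analogy can be made concrete with a few lines. The two guardrails below (a business refund cap and an engineering payload limit) are hypothetical examples; the point is only that each guardrail has the same shape, so they compose in any order.

```python
from typing import Callable, Optional

# A guardrail returns a rejection reason, or None to let the request pass.
Guardrail = Callable[[dict], Optional[str]]


def business_refund_cap(request: dict) -> Optional[str]:
    return "refund above cap" if request.get("refund", 0) > 200 else None


def engineering_payload_size(request: dict) -> Optional[str]:
    return "payload too large" if len(request.get("text", "")) > 1000 else None


def run_chain(request: dict, guardrails: list) -> Optional[str]:
    """Run guardrails in order and stop at the first rejection, like a handler chain."""
    for guardrail in guardrails:
        reason = guardrail(request)
        if reason is not None:
            return reason
    return None
```

Adding a product boundary guardrail later means appending one more function to the list, without touching the existing ones.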

Participant 4: Let's say any model, let's talk about Claude, do we rely on the in-built security models for that or we build our guardrails. For example, I don't know if I can trust Claude or not, so I build the basic security guardrails on top of whatever they are providing.

Sriram Madapusi Vasudevan: Yes, I think you should definitely try to build personalized guardrails for the application that you're developing. That being said, the foundational models are getting much better. If you had asked me the same questions two years ago, I would have said 100% have 100 guardrails. The need for custom guardrails has reduced quite a bit. That being said, you know your application best. Claude Sonnet 4.5 is really good, but does it know everything about your particular application? Probably not. You're still the expert there.

Participant 5: One question on the guardrails again. Current models, you have some human factor, like let's say cyber teams or other teams in organizations, they are experts checking out your security models. How do you deal with this? I think it's especially for the documentation or if some external entities want to review this compared to, just not the software architecture, but as security experts, they want to look at it, what you did and certify this. How does your model work? Can you talk about it?

Sriram Madapusi Vasudevan: I think what you're talking about is how transparent are the guardrails that you're putting in place and how do you know that they work? I would go back to observability. Just like you have end-to-end tracing for reasoning, I would expect the same level of end-to-end tracing for the guardrails that you're putting into place. If you have too many false positives, you know where to start. If there are a bunch of false negatives, you know that you're not being aggressive enough. I would basically try to use data and human judgment at that point to verify how good your guardrails are. If you have an in-house security team, they are most likely helping you author them too. I would ensure that you bring the security stakeholders in pretty much immediately when you're trying to build guardrails.

Participant 6: I see there are a couple of steps that you need to take to make things safer, like revoking privileges and setting up judge LLMs. Do you see this becoming a framework where I get all of these for free, maybe as an AWS offering, so I don't need to set up every single step by myself? Do you see things moving in that direction?

Sriram Madapusi Vasudevan: I think there are two ways to look at this. There are agent-building frameworks. Strands is one of them from AWS. There's Google ADK. A lot of them have these primitives built in, but you need to know how to compose them. That's the first part. That's purely on the software side, on how you build the agent. The second portion is things like agent runtimes. You want to ensure that the micro-VM sandbox, for example, is something that is vended by somebody that knows what they're doing. There are a number of real-world services out there that provide things like agent identity, agent runtimes, and other primitives that are going to be useful for things like your sandboxes. I would encourage you to explore both. Certainly, some of them are baked into the framework. It's not like you have to start from zero for all these principles.

 

Recorded at:

Jun 30, 2026
