Benchmarking AI Agents on Kubernetes

Brandon Foley published a benchmarking study on the CNCF blog showing that AI coding agents can find and fix isolated bugs but often struggle to understand system-wide impacts. This challenges the idea that better code retrieval is the main lever for improving automated bug fixing.

The author integrated AI coding agents into his daily workflow and ran the experiment to understand how well these tools actually perform on real-world bugs. He built the benchmark from pull requests in the Kubernetes repository: real bugs, actively fixed by real contributors. Each agent received only the issue description, with no PR description or diff to suggest a solution.

Three agent configurations were tested against nine Kubernetes bug reports spanning kubelet, scheduler, networking, storage, and apps subsystems. The first used RAG-only retrieval via KAITO RAG Engine backed by Qdrant, combining BM25 keyword matching with embedding-based semantic search. The second took a hybrid approach, requiring RAG-first discovery followed by local filesystem access. The third relied entirely on a local clone of the repository with no retrieval index. All sessions ran the same model (Claude Opus 4.6), the same five-minute timeout, and the same output format; the only variable was how each agent could see code.
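The article does not include the retrieval code itself, but the setups are easy to picture. Below is a minimal, self-contained sketch of hybrid retrieval in the spirit of the first two configurations, blending a keyword score with a vector-similarity score; the toy corpus, tokenizer, scoring, and weighting are hypothetical stand-ins for what KAITO RAG Engine and Qdrant provide, not the benchmark's actual implementation.

```python
import math
from collections import Counter

# Toy corpus standing in for indexed Kubernetes source snippets (hypothetical).
CORPUS = [
    "kubelet syncPod restart backoff handling for failed containers",
    "scheduler plugin scoring to balance node resource allocation",
    "endpoint slice controller updating service networking rules",
]

def tokenize(text):
    return text.lower().split()

def keyword_score(query, doc):
    # Simplified term-frequency score; real setups use BM25 over an index.
    counts = Counter(tokenize(doc))
    return sum(counts[t] for t in tokenize(query))

def embed(text):
    # Toy bag-of-words "embedding"; a real setup calls an embedding model
    # and stores the vectors in a database such as Qdrant.
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query, corpus, alpha=0.5):
    """Blend keyword and semantic scores; alpha weights the keyword side."""
    kw = [keyword_score(query, doc) for doc in corpus]
    sem = [cosine(embed(query), embed(doc)) for doc in corpus]
    max_kw = max(kw) or 1.0
    ranked = [
        (alpha * (k / max_kw) + (1 - alpha) * s, doc)
        for k, s, doc in zip(kw, sem, corpus)
    ]
    return sorted(ranked, reverse=True)

if __name__ == "__main__":
    for score, doc in hybrid_search("kubelet pod restart backoff", CORPUS):
        print(f"{score:.2f}  {doc}")
```

In the RAG-only configuration the agent generates a patch from retrieved snippets like these alone; the hybrid configuration uses such retrieval only to locate the relevant files before reading them from a local clone.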

 

[Image: AI Agent Configurations for This Benchmark]

On speed and cost, the results were clear. RAG-only was consistently the fastest at an average of 76 seconds, since it skips filesystem navigation entirely and generates from retrieved snippets. Hybrid was the slowest at around two and a half minutes on average, as the mandatory RAG-first phase adds overhead before local exploration begins. On token economics, Hybrid proved the most expensive, not because it reads more code, but because it makes the most model invocations, and since the API is stateless, every call replays the full conversation history. Across all runs, call count was the biggest driver of both cost and latency.
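To see why call count dominates, a back-of-the-envelope model helps (illustrative, not from the article): if each agent turn adds roughly the same amount of new context and the stateless API replays the entire history on every call, cumulative input tokens grow roughly quadratically with the number of calls. The tokens-per-turn figure below is an assumed constant, used only for illustration.

```python
def cumulative_input_tokens(num_calls, tokens_per_turn=2_000):
    """Total input tokens when every call replays the full prior history.

    Call i re-sends roughly i turns of context, so the total is about
    tokens_per_turn * (1 + 2 + ... + n), i.e. quadratic in the call count.
    """
    return tokens_per_turn * num_calls * (num_calls + 1) // 2

# Illustrative comparison: a few large calls versus many small ones.
for calls in (5, 15, 40):
    print(f"{calls} calls -> {cumulative_input_tokens(calls):,} input tokens")
```

Under this toy model, 40 calls consume roughly 55 times more input tokens than 5 calls, which matches the article's observation that the number of model invocations, not the amount of code read, drove cost and latency.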

On correctness, though, the picture is more nuanced. The dominant failure mode was not incorrect fixes but incomplete ones. Agents addressed the "main" bug while overlooking adjacent changes: fixing one implementation but neglecting a second, patching the core issue but omitting necessary adjustments in dependent integration logic, or halting upon encountering a partial fix already present in the codebase. The common pattern was that agents don't ask, "What else needs to change?" They stop once the immediate issue appears resolved.

A secondary pattern emerged around architectural choices. When given a choice, agents tend to introduce new abstractions rather than reuse existing ones. In one test case, the correct fix used an existing RestartCount field; all agents instead introduced a new Attempt field, which was functionally correct but architecturally heavier.

The research indicated that retrieval strategy influences discovery, but not the quality of reasoning. Requiring RAG improved outcomes in some cases by forcing the agent to find the relevant policy-evaluation layer before attempting a fix, which led to a better architectural decision. But once the relevant code was found, the agent still reasoned locally; retrieval helps with navigation, not with understanding system-wide ramifications.

Perhaps the most actionable finding is about issue quality. Well-specified bug reports that name the exact file, function, and expected behavior caused all three approaches to converge to high scores, flattening the performance differences between retrieval strategies entirely. The implication is that the quality of the human-written issue description is a stronger lever than the choice of retrieval architecture.

The study finds that scope discovery, identifying all the parts that need to change rather than just the one that seems broken, is the key challenge for AI agents and remains a major hurdle for running them at scale. Structured agent skills or curated playbooks might improve system-level reasoning, but in large codebases those skills require constant maintenance to stay aligned with the repository, creating an additional system to manage rather than a one-time fix.
