Docker is positioning its Cagent runtime as a way to bring deterministic testing to AI agents, addressing a growing problem for teams building production agentic systems.
As agentic systems become more commonplace, engineering teams are discovering how difficult it is to test probabilistic outputs. Traditional enterprise systems are built on a simple assumption: the same input produces the same output. Agentic systems break that assumption, and much of today’s ecosystem has adapted by evaluating variability rather than eliminating it.
Over the past two years, a growing class of evaluation frameworks has emerged to make agent behavior observable and measurable. Tools such as LangSmith, Arize Phoenix, Promptfoo, Ragas, and OpenAI Evals capture execution traces and apply qualitative or LLM-based scoring to judge outcomes.
These tools are essential for monitoring safety and performance, but they introduce a different testing model. Results are rarely binary. Teams increasingly rely on thresholds, retries, and soft failures to cope with evaluator variance. Industry coverage of AI agent testing makes the same point: traditional QA assumptions break down for agents because outputs are probabilistic, and evaluation often calls for flexible, probabilistic frameworks rather than strict pass/fail assertions.
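A minimal sketch of what such a threshold-based check can look like in practice is shown below. Both run_agent and judge_score are hypothetical stand-ins for a team's own agent entry point and evaluator, and the 0.8 threshold is arbitrary; the point is the shape of the assertion, not any particular framework's API.

```python
# Illustrative only: a threshold-based "soft" assertion of the kind evaluation
# frameworks encourage, instead of a strict pass/fail equality check.
# run_agent() and judge_score() are hypothetical stand-ins.

def run_agent(prompt: str) -> str:
    # Stand-in for a real agent invocation; the output would normally vary run to run.
    return "The capital of France is Paris."

def judge_score(output: str, reference: str) -> float:
    # Stand-in for an LLM judge or semantic-similarity metric, scored 0..1.
    return 0.9 if reference.lower() in output.lower() else 0.3

def test_agent_answer_clears_threshold():
    output = run_agent("What is the capital of France?")
    score = judge_score(output, reference="Paris")
    # The test tolerates variation: it passes if the score clears a threshold
    # rather than requiring an exact expected string.
    assert score >= 0.8, f"evaluator score {score:.2f} below threshold 0.8"
```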
In parallel, some teams have rediscovered a more traditional approach, targeting repeatability and determinism in testing using the record and replay pattern. Borrowed from integration testing tools like vcr.py, the pattern captures real API interactions once and replays them deterministically in future test runs. LangChain now recommends this technique explicitly for LLM testing, noting that recording HTTP requests and responses can make CI runs fast, cheap, and predictable. In practice, however, this has often remained an external testing concern rather than a first-class part of how agents execute.
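The sketch below shows the record-and-replay pattern with vcr.py and the requests library: the first run records the HTTP exchange into a YAML cassette, and subsequent runs replay it without touching the network. The endpoint URL, payload, and cassette path are placeholders rather than any specific provider's API.

```python
# Record-and-replay with vcr.py: the first run hits the real API and records
# the exchange into a YAML cassette; later runs replay it deterministically.

import requests
import vcr

my_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    record_mode="once",                 # record on the first run, replay afterwards
    filter_headers=["authorization"],   # keep API keys out of the cassette
)

def test_llm_call_is_replayable():
    with my_vcr.use_cassette("llm_completion.yaml"):
        # Placeholder endpoint and payload; a real test would call the actual provider.
        resp = requests.post(
            "https://api.example.com/v1/chat/completions",
            json={"model": "example-model",
                  "messages": [{"role": "user", "content": "Hi"}]},
            headers={"Authorization": "Bearer <redacted>"},
        )
        assert resp.status_code == 200
```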
Docker’s Cagent builds this pattern into the runtime itself. Architecturally, Cagent uses a proxy-and-cassette model. In recording mode, it forwards requests to real providers such as OpenAI or Anthropic, captures the full request and response, normalizes volatile fields like IDs, and stores the interaction in a YAML cassette. In replay mode, Cagent blocks external calls entirely, matches incoming requests against the cassette, and returns the recorded response. If the agent’s execution diverges, for example through a different prompt, a different tool call, or a different sequence of calls, the run fails deterministically.
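Conceptually, the replay side works along the lines of the following sketch. This is an illustration of the matching idea only, not Cagent's actual implementation or cassette schema; the field names and hashing scheme are invented for the example.

```python
# Conceptual illustration of the replay side of a proxy-and-cassette model.
# Not Cagent's actual code or cassette schema; field names are invented.

import hashlib
import json

def normalize(request: dict) -> dict:
    """Drop volatile fields (IDs, timestamps) so equivalent requests match."""
    cleaned = dict(request)
    for field in ("request_id", "timestamp"):
        cleaned.pop(field, None)
    return cleaned

def fingerprint(request: dict) -> str:
    """Stable hash of the normalized request, used as the cassette lookup key."""
    canonical = json.dumps(normalize(request), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

class ReplayError(RuntimeError):
    """Raised when a live request has no recorded counterpart."""

class CassetteReplayer:
    def __init__(self, recorded_interactions: list[dict]):
        # Index recorded request/response pairs by request fingerprint.
        self._index = {
            fingerprint(item["request"]): item["response"]
            for item in recorded_interactions
        }

    def handle(self, request: dict) -> dict:
        key = fingerprint(request)
        if key not in self._index:
            # Divergence (a different prompt, tool call, or ordering) fails loudly
            # instead of silently reaching out to the real provider.
            raise ReplayError(f"no recorded interaction for request {key[:12]}")
        return self._index[key]
```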
Cagent is still at an early stage of maturity. Docker’s own GitHub repository describes the project as under active development, with breaking changes expected, and most public examples of its use so far come from Docker’s documentation rather than large-scale production deployments.
Cagent does not replace existing evaluation frameworks, but it highlights a different direction in how agent testing is evolving. While much of today’s tooling focuses on assessing outcomes after execution, Cagent shifts attention to making agent behavior reproducible in the first place. As teams experiment with increasingly complex agent workflows, this distinction is becoming more visible. Deterministic replay does not determine whether an agent’s output is correct, but it does make changes in behavior explicit, offering a foundation for testing that looks more like traditional software engineering.