OpenAI recently published the first in a series of articles detailing the design and functionality of their Codex software development agent. The inaugural post highlights the internals of the Codex harness, the core component in the Codex CLI.
Like all AI agents, the harness consists of a loop that takes input from a user and uses an LLM to generate tool calls or responses back to the user. Because of LLM constraints, the loop also includes strategies to manage context and reduce prompt cache misses; some of these were lessons learned the hard way, via bugs reported by users. Because the CLI uses the Open Responses API, it is LLM agnostic: it can use any model wrapped by this API, including locally-hosted open models. According to OpenAI, its design and lessons can therefore benefit anyone building an agent on this API:
[We] highlighted practical considerations and best practices that apply to anyone building an agent loop on top of the Responses API. While the agent loop provides the foundation for Codex, it’s only the beginning. In upcoming posts, we’ll dig into the CLI’s architecture, explore how tool use is implemented, and take a closer look at Codex’s sandboxing model.
The article describes what happens in one round, or turn, of a user conversation with the agent. The turn begins with assembling an initial prompt for the LLM. This consists of instructions, a system message containing general rules for the agent, such as coding standards; tools, a list of MCP servers that the agent can invoke; and the input, a list of text, images, and file inputs, including things like AGENTS.md, local environment information, and the user's input message. All of this is packaged into a JSON object to send to the Responses API.
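A rough sketch of that assembled request in Python is shown below. The field names (instructions, tools, input) follow the Responses API request format, but the rule text, tool schema, file contents, and model name are illustrative placeholders rather than Codex CLI's actual values.

```python
# Illustrative assembly of the initial prompt for one turn. Only the overall
# shape follows the Responses API; the concrete values are placeholders.
instructions = "You are a coding agent. Follow the project's coding standards."

# Tools the agent may invoke, e.g. gathered from configured MCP servers.
tools = [
    {
        "type": "function",
        "name": "run_shell_command",  # hypothetical tool
        "description": "Run a shell command in the workspace",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }
]

# Input items: AGENTS.md, local environment info, and the user's message.
input_items = [
    {"role": "user", "content": [{"type": "input_text", "text": open("AGENTS.md").read()}]},
    {"role": "user", "content": [{"type": "input_text", "text": "os: linux, cwd: /home/user/project"}]},
    {"role": "user", "content": [{"type": "input_text", "text": "Fix the failing unit test"}]},
]

# Everything is packaged into a single JSON object sent to the Responses API.
request = {
    "model": "gpt-5-codex",  # placeholder model name
    "instructions": instructions,
    "tools": tools,
    "input": input_items,
}
```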
Sending the request triggers LLM inference, which produces a stream of output events. Some events indicate that the agent should call one of the tools; in that case the agent invokes the tool with the specified inputs and collects the output. Other events carry reasoning output from the LLM, typically steps in a plan. Both the tool calls, along with their outputs, and the reasoning are then appended to the initial prompt, which is passed to the LLM again for further reasoning or tool calling; each of these iterations is a turn of the "inner" loop. The conversation turn finishes when the LLM ends the inner loop with a done event, which includes a response message for the user.
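The inner loop can be sketched roughly as follows, continuing the variables from the previous snippet and using the OpenAI Python SDK. For brevity it makes blocking calls rather than consuming the event stream, and the tool execution is a stub.

```python
# Sketch of the inner loop: call the model, run any requested tools, append
# the results to the prompt, and repeat until no further tool calls arrive.
# Non-streaming for brevity; the real harness processes a stream of events.
import json
from openai import OpenAI

client = OpenAI()

def execute_tool(name: str, arguments: dict) -> str:
    """Placeholder for dispatching a call to an MCP server or built-in tool."""
    return "tool output"

while True:
    response = client.responses.create(**request)

    # Reasoning items and tool calls are appended back onto the prompt so the
    # next iteration sees the full history of the turn so far.
    request["input"] += response.output

    tool_calls = [item for item in response.output if item.type == "function_call"]
    if not tool_calls:
        break  # no more tool calls: the turn is done

    for call in tool_calls:
        output = execute_tool(call.name, json.loads(call.arguments))
        request["input"].append(
            {"type": "function_call_output", "call_id": call.call_id, "output": output}
        )

# The final response contains the message shown to the user.
print(response.output_text)
```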
A major challenge in this scheme is LLM inference performance: it is "quadratic in terms of the amount of JSON sent to the Responses API over the course of the conversation." This is why prompt caching is key: by reusing computation from a previous inference call over a shared prompt prefix, inference performance becomes linear instead of quadratic. Changing things like the list of tools will invalidate the cache, and Codex CLI's initial support for MCP had a bug that "failed to enumerate the tools in a consistent order," which caused cache misses.
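Because the cache matches on a common prefix of the request, anything serialized near the front of the prompt, such as the tool list, must be byte-for-byte identical from call to call. A minimal guard against that kind of ordering bug might look like the following; the sorting key is an assumption for illustration, not Codex CLI's actual fix.

```python
def stable_tool_order(tools: list[dict]) -> list[dict]:
    # Tools collected from multiple MCP servers can come back in a
    # nondeterministic order; sorting them keeps the serialized prompt
    # prefix identical across requests, so the prompt cache can be reused.
    return sorted(tools, key=lambda tool: tool["name"])

request["tools"] = stable_tool_order(request["tools"])
```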
Codex CLI also uses compaction to reduce the amount of text in the LLM context. Once the conversation exceeds a set number of tokens, the agent calls a special Responses API endpoint that returns a smaller representation of the conversation, which then replaces the previous input.
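The control flow might look roughly like the sketch below. The token budget, the token counter, and the compaction helper are all hypothetical stand-ins, since the article does not name the endpoint or the threshold.

```python
# Hypothetical compaction check: once the conversation grows past a token
# budget, replace the accumulated input with a condensed representation.
COMPACTION_THRESHOLD = 200_000  # illustrative budget, not Codex CLI's actual limit

def count_tokens(items: list) -> int:
    """Stand-in for a tokenizer-based count of the serialized conversation."""
    raise NotImplementedError

def compact_conversation(items: list) -> list:
    """Stand-in for the dedicated Responses API compaction call."""
    raise NotImplementedError

if count_tokens(request["input"]) > COMPACTION_THRESHOLD:
    request["input"] = compact_conversation(request["input"])
```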
Hacker News users discussing the article praised OpenAI's decision to open-source Codex CLI, pointing out that Claude Code is closed source. One user wrote:
I remember they announced that Codex CLI is opensource...This is a big deal and very useful for anyone wanting to learn how coding agents work, especially coming from a major lab like OpenAI. I've also contributed some improvements to their CLI a while ago and have been following their releases and PRs to broaden my knowledge.
The Codex CLI source code, bug tracking, and fix history are available on GitHub.