Stateful Continuation for AI Agents: Why Transport Layers Now Matter

Key Takeaways

  • Agent workflows make transport a first-order concern. Multi-turn, tool-heavy loops amplify overhead that is negligible in single-turn LLM use.
  • Stateless APIs scale poorly with context. Re-sending the full history each turn drives linear payload growth and increases latency.
  • Stateful continuation cuts overhead dramatically. Caching context server-side can reduce client-sent data by 80%+ and improve execution time by 15–29%.
  • The benefit is architectural, not protocol-specific. Any approach that avoids retransmitting context can achieve similar gains.
  • Performance comes with trade-offs. Stateful designs introduce challenges in reliability, observability, and portability that must be weighed carefully.

The Airplane Problem

On a recent flight, I purchased the in-flight internet and tried to use Claude Code. The agent needed to read several files, understand the codebase structure, make edits, and run tests: a typical agentic workflow involving 10-15 tool calls. But the connection was so poor that by the third or fourth turn, requests were timing out. Each turn resent the entire conversation history — the original prompt, every file the agent had read, every edit it had proposed, every test output — and the payload had ballooned to hundreds of kilobytes. Over a bandwidth-constrained link, that growing payload was the bottleneck.

This experience highlighted something that's becoming increasingly relevant as AI coding agents mature: the transport layer matters more for agentic workflows than for simple chat. A single-turn chat completion sends a prompt and gets a response. An agentic coding session involves 10, 20, or sometimes 50+ sequential turns in which the model reads code, proposes changes, runs tests, reads error output, fixes issues, and iterates. With each turn, the conversation context grows, and over HTTP, that entire growing context must be retransmitted every time. 

In February 2026, OpenAI introduced a WebSocket mode for its Responses API that addresses this problem by caching conversation history in server-side memory. I was eager to see how it performs compared to HTTP.

The Agentic Coding Loop

AI coding agents have moved from novelty to daily workflow for many organizations, especially since December 2025. Tools like Claude Code, OpenAI Codex, Cursor, and Cline now routinely perform multi-file edits, run test suites, and iterate on failing builds. OpenAI reports over 1.6 million weekly active users on Codex alone, with a typical engineer on the Codex team running 4-8 parallel agents.

The core of these agents is the "agent loop": a cycle of model inference and tool execution that repeats until the task is complete:

The coding agent loop: At every turn, the model either returns a response indicating task completion, or recommends tool calls, whose response is fed back to the model inference until the task is complete

A single turn of the agent loop typically involves reading several files to understand the codebase, editing some files, and running tests, which involves 10-15 tool calls, often more for complex refactoring. The results of those tool calls are then sent to the LLM inference server. If the problem is solved, the LLM server returns a response with no more tool calls. Otherwise, the LLM server recommends additional tool calls, which starts the next turn of the agent loop, and this process continues until the problem is solved. Each turn requires the model to receive the full context of what's happened so far.
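The loop just described can be sketched in a few lines of Python. The model call and tool execution are stubbed out (a real agent would call an LLM API and run real file and test tools); the message shapes, tool names, and stop condition here are illustrative, not any vendor's actual schema.

```python
# Minimal sketch of the agent loop: model inference and tool execution
# repeat until the model signals completion. Everything here is a stub.

def run_tool(name: str, args: dict) -> str:
    """Stand-in for real tool execution (file reads, edits, test runs)."""
    return f"output of {name}({args})"

def model_step(history: list) -> dict:
    """Stub for model inference: finishes once it has seen one tool result."""
    if any(msg["role"] == "tool" for msg in history):
        return {"done": True, "tool_calls": []}
    return {"done": False,
            "tool_calls": [{"name": "read_file", "args": {"path": "app.py"}}]}

def agent_loop(prompt: str) -> list:
    history = [{"role": "user", "content": prompt}]
    while True:
        step = model_step(history)        # inference over the full context so far
        if step["done"]:
            return history                # no more tool calls: task complete
        for call in step["tool_calls"]:   # execute each recommended tool call
            result = run_tool(call["name"], call["args"])
            history.append({"role": "tool", "content": result})

history = agent_loop("fix the failing test")
```

Note that `history` only ever grows: every turn's tool output is appended, which is exactly the context that a stateless transport must retransmit in full.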

The HTTP Overhead Problem

With HTTP-based APIs, including OpenAI's Responses API over HTTP and the older Chat Completions API, each turn is a stateless request. The server doesn't remember what happened on the previous turn, so the client must resend everything:

  • System instructions and tool definitions (~2 KB)
  • The original user prompt
  • Every prior model output (including full code blocks that the model wrote)
  • Every tool call result (including file contents, command outputs)

This means the request payload grows linearly with each turn. In our benchmarks, we measured the actual per-turn bytes sent by the client over HTTP versus WebSocket:

Average bytes sent per turn across 10 task runs with gpt-4o-mini. HTTP grows linearly; WebSocket stays constant.

By turn 9, HTTP is sending nearly 10x as much data per request as WebSocket. This is because OpenAI's WebSocket mode for the Responses API keeps a persistent connection with server-side in-memory state. After the first turn, each subsequent turn sends only:

  • A previous_response_id referencing the cached state (~60 bytes)
  • The new tool call outputs (typically 1-3 KB of file content or command output)

The payload stays roughly constant regardless of how many turns deep you are.
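The linear-vs-constant growth is easy to model. The constants below are illustrative, loosely based on the sizes quoted above (2 KB of system instructions and tool definitions, a few KB of new tool output per turn, a ~60-byte `previous_response_id`); they are not the measured benchmark values.

```python
# Back-of-the-envelope model of per-turn upload size for stateless HTTP
# vs stateful continuation. Constants are illustrative assumptions.

BASE_KB = 2         # system instructions + tool definitions
TURN_OUTPUT_KB = 3  # new tool results produced each turn
ID_KB = 0.06        # previous_response_id reference (~60 bytes)

def http_bytes_per_turn(turn: int) -> float:
    # Stateless: resend the base context plus every prior turn's output.
    return BASE_KB + turn * TURN_OUTPUT_KB

def ws_bytes_per_turn(turn: int) -> float:
    # Stateful: full context on turn 0, then only the id + new output.
    return BASE_KB if turn == 0 else ID_KB + TURN_OUTPUT_KB

http_total = sum(http_bytes_per_turn(t) for t in range(10))  # 155 KB
ws_total = sum(ws_bytes_per_turn(t) for t in range(10))      # ~29.5 KB
```

Even with these rough numbers, a 10-turn task sends roughly 5x less data over the stateful path, and the gap widens with every additional turn.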

What Existing Benchmarks Show

Before building our own test harness, we reviewed publicly available data.

OpenAI's claim: WebSocket mode for the Responses API is built for low-latency, long-running agents with heavy tool calls. For workflows with 20+ tool calls, it delivers up to 40% faster end-to-end execution by eliminating redundant context re-transmission and leveraging server-side in-memory state persistence across turns.

Cline's independent validation: The Cline team tested WebSocket mode with GPT-5.2-codex against their standard HTTP API integration and reported:

  • ~15% faster on simple tasks (few tool calls)
  • ~39% faster on complex multi-file workflows (many tool calls)
  • Best cases hitting 50% faster
  • WebSocket handshake adds slight TTFT overhead on the first turn, but it amortizes fast

The pattern: The speedup scales with workflow complexity. Simple tasks with 1-2 tool calls see minimal benefit (or even slight overhead from the WebSocket handshake). Complex tasks with 10+ tool calls see dramatic improvements because the cumulative savings from not retransmitting context compound with each turn.

Our Benchmark: Validating the Claims

To validate these claims with controlled measurements, we built a benchmark harness that simulates realistic agentic coding workflows against OpenAI's Responses API. The harness is open source.

Methodology

We defined three coding tasks of varying complexity:

  1. Fix a failing test — Read the test file, read the component, fix the bug, run tests (~10-15 turns, 12-17 tool calls)
  2. Add a search feature — Read existing components, implement the feature, run tests (~5-15 turns, 4-21 tool calls)
  3. Refactor the API layer — List the project, read files, search for callers, update multiple files, run tests (~6-11 turns, 10-20 tool calls)

Each task uses simulated tool responses (realistic file contents, test outputs, command outputs) to isolate transport-layer differences. The model makes real API calls to OpenAI and decides which tools to call and when to stop — the non-determinism is in the model's behavior, not the tool responses.

Two test configurations:

Cell | Approach | Per-turn behavior
1 | HTTP Responses API | Full conversation context is re-sent every turn
2 | WebSocket Responses API | previous_response_id + incremental input only

We measured:

  • TTFT (Time to First Token): How quickly does the model start generating on each turn?
  • Bytes sent: How much data does the client upload per task?
  • Bytes received: How much streaming event data comes back?
  • Total time: End-to-end wall-clock time for the full agentic workflow

Each configuration was run 3 times and aggregated. We tested with two models — GPT-5.4 (a frontier coding model) and GPT-4o-mini (a smaller, faster model) — to see whether the transport-layer effects hold across model sizes.

Results

Across all runs, tasks averaged roughly 8-11 turns and 9-16 tool calls per task, varying by model and transport mode.

Relative performance (WebSocket vs HTTP):

Metric | GPT-5.4 | GPT-4o-mini
Total time | 29% faster | 15% faster
Bytes sent | 82% less | 86% less
First-turn TTFT | 14% lower | ~same

Detailed results for GPT-5.4:

Metric | GPT-5.4 HTTP | GPT-5.4 WebSocket | Delta
Avg total time/task | 40.8 s | 28.9 s | −29%
Avg TTFT (all turns) | 1,253 ms | 1,111 ms | −11%
Avg TTFT (first turn) | 1,255 ms | 1,075 ms | −14%
Avg bytes sent/task | 176 KB | 32 KB | −82%
Avg bytes recv/task | 485 KB | 343 KB | −29%

Key Findings

  1. WebSocket consistently reduces client-sent data by 80-86%. This is the most reliable finding, independent of model, API variance, or task complexity. HTTP sends 153-176 KB per task; WebSocket sends 21-32 KB. This is a direct consequence of not retransmitting the growing conversation history.
  2. WebSocket delivers 15-29% faster end-to-end execution. With GPT-5.4, WebSocket was 29% faster — roughly consistent with Cline's reported 39% on complex workflows. The speedup comes from a combination of less data to upload per turn and potentially faster server-side processing (no need to re-parse and tokenize the full context).
  3. First-turn TTFT is comparable across approaches. The WebSocket handshake doesn't add meaningful overhead: first-turn TTFT was within noise of HTTP for GPT-4o-mini and modestly lower (14%) for GPT-5.4. The advantage emerges in continuation turns, where WebSocket avoids the growing payload upload.
  4. The effect is model-independent. We ran the same benchmarks with GPT-4o-mini (detailed results in the repo) and saw consistent bytes-sent savings (86%) and 15% faster end-to-end execution. The time savings were larger for GPT-5.4 (29% vs 15%), likely because the frontier model generates longer responses that accumulate more context per turn.

Why It's Faster: The Architecture

The performance difference is a direct consequence of eliminating redundant data transmission.

HTTP: Stateless by Design

Turn 1: Client → [system + prompt + tools]                    → Server
Turn 2: Client → [system + prompt + tools + turn1 + output1]  → Server
Turn 3: Client → [all of the above + turn2 + output2]         → Server
...
Turn N: Client → [system + prompt + tools + ALL prior turns]   → Server

Each request is independent. The server processes it, returns a response, and forgets everything. The client must reconstruct the full context from scratch.

WebSocket: Stateful Continuation

Turn 1: Client → [system + prompt + tools]      → Server  (server caches response)
Turn 2: Client → [prev_id + tool_output]         → Server  (server loads from cache)
Turn 3: Client → [prev_id + tool_output]         → Server  (server loads from cache)
...
Turn N: Client → [prev_id + tool_output]         → Server  (constant-size payload)

The server keeps the most recent response in connection-local memory. Continuations reference that cached state, so the client only sends what's new.
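The server-side mechanism can be sketched as a continuation cache: context lives in volatile memory keyed by a response id, and each continuation appends only the new input. This is a hypothetical shape for illustration, not OpenAI's actual implementation.

```python
# Sketch of a server-side continuation cache: contexts are stored in
# volatile memory keyed by response id, so continuations carry only
# what's new. Hypothetical illustration, not a real server.

import uuid

class ContinuationCache:
    def __init__(self):
        self._store = {}  # response_id -> accumulated context (in-memory only)

    def start(self, context: list) -> str:
        rid = str(uuid.uuid4())
        self._store[rid] = list(context)
        return rid

    def continue_from(self, prev_id: str, new_input: list) -> str:
        # Load the cached context, append only the incremental input,
        # and cache the result under a fresh response id.
        context = self._store.pop(prev_id) + list(new_input)
        return self.start(context)

    def context_for(self, rid: str) -> list:
        return self._store[rid]

cache = ContinuationCache()
rid1 = cache.start([{"role": "user", "content": "fix the test"}])
rid2 = cache.continue_from(rid1, [{"role": "tool", "content": "test output"}])
```

Because `_store` is an ordinary in-process dict, this also illustrates the trade-off discussed below: if the process (or connection) dies, the cached context dies with it.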

The Bandwidth Math: From Our Benchmarks

Using our actual GPT-5.4 data for a typical 10-turn coding task:

HTTP total bytes sent (client → server): 176 KB per task (measured average)

  • Grows from 2 KB on turn 0 to 38 KB on turn 9 as context accumulates

WebSocket total bytes sent: 32 KB per task (measured average)

  • Stays flat at 2-4 KB per turn throughout

That's an 82% reduction in client-sent bytes — 144 KB saved per task, compounding across thousands of concurrent sessions.

Architectural Lessons

1. API Compatibility vs Performance: The Protocol Tax

The OpenAI-compatible HTTP API (both the /chat/completions and Responses API) is the de facto standard. Every LLM tool, SDK, and orchestration framework speaks it. But this compatibility comes at a cost: the API is inherently stateless, requiring full context to be retransmitted on every request.

WebSocket mode breaks this compatibility, causing fragmentation.

Who supports WebSocket today?

Provider / Gateway | WebSocket API | Streaming method
OpenAI Responses API | ✅ (since Feb 2026) | WebSocket frames (JSON)
Google Gemini API | ⛔ (text/coding), ✅ (audio/video) | WebSocket frames
Anthropic Claude API | ⛔ | Server-Sent Events (SSE)
OpenRouter | ⛔ | SSE (OpenAI-compatible)
Cloudflare AI Gateway | ✅ (gateway layer) | WebSocket frames
Local models (Ollama, vLLM) | ⛔ | SSE

Who supports WebSocket among coding agents?

Coding Agent | WebSocket support | Notes
OpenAI Codex | ✅ (native) | Built on the Responses API
Cline | ✅ (OpenAI only) | First to integrate, reported 39% speedup
Claude Code | ⛔ | Uses Anthropic SSE API
Cursor | ⛔ | HTTP-based multi-provider
Windsurf | ⛔ | HTTP-based multi-provider
Roo Code | ⛔ | Cline fork, may inherit support
OpenCode | ⛔ | Multi-provider, HTTP-based

WebSocket is currently an OpenAI-only advantage. If your agent needs to switch between providers (say, using Claude for reasoning-heavy tasks and GPT for speed), you lose the WebSocket performance benefit on every non-OpenAI call.

Google's Gemini Live API uses WebSocket, but it's designed for real-time audio/video streaming rather than text-based agentic workflows. Cloudflare's AI Gateway offers a WebSocket endpoint that sits in front of multiple providers, but it proxies to HTTP under the hood and doesn't provide the server-side state caching that makes OpenAI's implementation fast.

2. Protocol Overhead at Scale: When Bytes Per Turn Matter

For a single conversation, the overhead of resending context is negligible. But from the server's perspective, the scale of agentic coding in 2026 makes this significant.

Estimating concurrent sessions for a single major provider: OpenAI Codex has 1.6 million weekly active users. GitHub Copilot has 4.7 million paid subscribers. Claude Code is generating $2.5 billion in annualized revenue, suggesting over 1 million active developers. Cline, Cursor, Windsurf, Roo Code, and OpenCode add millions more. Conservatively, 5-10 million developers are actively using AI coding agents weekly. For a single major provider like OpenAI, assuming 10-20% of its users are active during a peak hour with overlapping sessions, we estimate roughly 1 million concurrent agentic coding sessions at peak.

At that scale, using our measured per-task data:

HTTP: 1,000,000 sessions × 176 KB sent per task = 176 GB of client-to-server payload per 40 second task

WebSocket: 1,000,000 sessions × 32 KB sent per task = 32 GB of client-to-server payload per 40 second task

That's a 144 GB reduction in ingress traffic over a 40-second task, i.e., a 29 Gbps reduction. For a provider processing millions of requests, this reduces load on API gateways, tokenizers (which must re-tokenize the full context on each HTTP request), and network infrastructure. The server-side savings are arguably more important than the client-side savings: less data to receive, parse, and tokenize means faster time-to-first-token for everyone.
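The scale arithmetic above can be checked directly from the measured per-task numbers (176 KB vs 32 KB sent, 40-second tasks) and the assumed 1 million concurrent sessions:

```python
# Reproducing the back-of-the-envelope scale math with the article's
# measured per-task figures and an assumed 1M concurrent sessions.

SESSIONS = 1_000_000
HTTP_KB, WS_KB = 176, 32  # measured client-sent bytes per task
TASK_SECONDS = 40         # measured HTTP task duration

saved_gb = SESSIONS * (HTTP_KB - WS_KB) / 1_000_000  # KB -> GB: 144 GB
saved_gbps = saved_gb * 8 / TASK_SECONDS             # GB over 40 s -> ~28.8 Gbit/s
reduction = (HTTP_KB - WS_KB) / HTTP_KB              # fraction saved: ~82%
```

So the "29 Gbps" figure in the text is the 144 GB saved per task window, converted to bits and spread over the 40-second task duration.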

3. Server-Side State: The Real Innovation

The key insight is that WebSocket isn't faster because of the protocol — TCP-based WebSocket has similar framing overhead to HTTP/2. The speed comes from server-side state management: the WebSocket server stores the most recent response in connection-local volatile memory, enabling near-instant continuation without re-tokenizing the full conversation.

This has architectural implications:

  • State is ephemeral: It lives only in memory on the specific server handling your connection. If the connection drops, the state is lost (unless store=true).
  • No multiplexing: Each WebSocket connection handles one response at a time. For parallel agent invocations, you need multiple connections.
  • 60-minute limit: Connections auto-terminate after one hour, requiring reconnection logic for longer sessions.

For architects designing similar systems, the pattern is clear: if your protocol involves many sequential requests that build on prior context, keeping that context server-side (even if only in volatile memory) can dramatically reduce per-request overhead.
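Given the ephemeral state and 60-minute connection limit noted above, clients need reconnection logic. A generic sketch, with the connection and task stubbed (no real WebSocket client): retry a connection factory with exponential backoff, restarting the task from scratch on a drop, as a store=false session requires.

```python
# Generic reconnect-and-restart wrapper: on a dropped connection, back off
# and rerun the whole task, since store=false state cannot be recovered.

import time

def with_reconnect(connect, run_task, max_attempts=3, base_delay=0.01):
    """Run run_task(conn); on ConnectionError, reconnect and restart."""
    for attempt in range(max_attempts):
        try:
            conn = connect()
            return run_task(conn)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Stub demonstrating a drop on the first attempt, success on the second.
attempts = {"n": 0}

def connect():
    attempts["n"] += 1
    return object()  # stand-in for a real connection handle

def run_task(conn):
    if attempts["n"] == 1:
        raise ConnectionError("connection dropped mid-task")
    return "task complete"

result = with_reconnect(connect, run_task)
```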

4. The Statefulness Spectrum

Different approaches to the context accumulation problem offer different trade-offs:

Approach | State Location | Durability | Latency | Bandwidth
HTTP (stateless) | Client only | N/A | High (grows with context) | High (grows with context)
HTTP + store=true | Server (persisted) | Durable | Medium (server rehydrates from persistent store) | Low (incremental input)
WebSocket + store=false | Server (in-memory) | Volatile | Low (no rehydration) | Low (incremental input)
WebSocket + store=true | Server (in-memory + persisted) | Durable | Low (no rehydration in happy case) | Low (incremental input)

The sweet spot for most agentic workflows is WebSocket + store=false: you get the fastest continuations, your data isn't persisted on the provider's servers (important for enterprise compliance with Zero Data Retention policies), and if the connection drops, you restart the task from scratch rather than trying to recover mid-stream.

5. Parallel Execution: Multiple Connections, Not Multiplexing

Each WebSocket connection handles one response at a time — there's no multiplexing. For parallel tasks (e.g., running 4-8 agents simultaneously, as a typical Codex engineer does), you need separate WebSocket connections. The bandwidth savings from WebSocket still apply per-connection, but concurrent connections may hit API rate limits more aggressively than concurrent HTTP requests due to faster execution times.
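The one-response-per-connection pattern maps naturally onto concurrent tasks, one per connection. A stubbed asyncio sketch (the round trips are simulated, and a real agent would open one WebSocket per session):

```python
# Sketch of "multiple connections, not multiplexing": each session runs
# its turns strictly in sequence, and N parallel agents run N sessions
# concurrently. Connections and round trips are stubbed.

import asyncio

async def run_session(agent_id: int, turns: int) -> list:
    results = []
    for turn in range(turns):    # one in-flight response at a time per connection
        await asyncio.sleep(0)   # stand-in for a request/response round trip
        results.append((agent_id, turn))
    return results

async def run_agents(n_agents: int, turns: int):
    # One connection (session) per agent, all driven concurrently.
    return await asyncio.gather(*(run_session(i, turns) for i in range(n_agents)))

all_results = asyncio.run(run_agents(4, 3))
```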

When HTTP Is Still the Right Choice

WebSocket mode isn't universally better. Use HTTP for:

  • Simple, few-turn interactions: For 1-2 turn interactions, the context retransmission overhead is negligible and doesn't justify the added complexity.
  • Multi-provider support: If you need to switch among OpenAI, Anthropic, Google, and local models, the standard HTTP API is the common denominator. WebSocket mode is currently OpenAI-specific. Adopting it creates provider lock-in.
  • Stateless infrastructure: If your backend runs on serverless functions (Lambda, Cloud Functions) that can't maintain persistent connections, HTTP is your only option.
  • Debugging and observability: HTTP requests are easier to log, replay, and debug with standard tools. WebSocket streams require specialized tooling.

Conclusion

For agentic coding workflows, the move from stateless HTTP to stateful WebSocket connections delivers meaningful performance improvements: with GPT-5.4, our controlled benchmarks against the OpenAI Responses API measured 29% faster end-to-end execution, 82% less client-sent data, and 11% lower average TTFT.

But the WebSocket advantage comes with a trade-off: it's currently OpenAI-specific, creating provider lock-in in an ecosystem where developers increasingly want to switch between models. None of the major alternatives — Anthropic's Claude API, Google Gemini, OpenRouter, or local model servers — offer equivalent WebSocket support for text-based agentic workflows.

The takeaway for architects building agentic systems isn't to blindly adopt WebSocket. It's to recognize that as AI workflows shift from single-turn to multi-turn, the transport-layer decisions that were irrelevant for chatbots become material for agents. Any system that avoids retransmitting growing conversation context — whether through WebSocket, server-side session caching, or a custom stateful protocol — will see similar wins. The question is whether the industry converges on a standard for stateful LLM continuation, or whether this remains a provider-specific competitive advantage.

The benchmarking harness and all results are available here.
