At QCon AI NYC 2025, Will Hang from OpenAI presented an overview of Agent RFT, a reinforcement fine-tuning approach intended to improve the performance of tool-using agents.
Hang described a pragmatic improvement path that starts with prompt and task optimization before changing model weights. Examples included simplifying requirements, adding guardrails to prevent tool misuse, improving tool descriptions, and improving tool outputs so the agent can make better downstream decisions. He argued that these measures are often high leverage but can plateau on tasks that require consistent multi-step reasoning across tool interactions.
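The talk did not include code, but the sketch below illustrates the kind of tool-description tightening Hang alluded to: a hypothetical search_documents tool whose description spells out when to call it, what its parameters mean, and what the output will contain. The tool name, fields, and schema shape are illustrative, not taken from the session.

```python
# Illustrative only: a hypothetical "search_documents" tool in the JSON-schema
# style commonly used for function tools. The point is that the description and
# parameter docs tell the agent when to call the tool and what comes back,
# which improves its downstream decisions.
search_documents_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": (
            "Full-text search over the internal knowledge base. "
            "Use this before answering questions about company policies. "
            "Returns at most `limit` results, each with a title, a snippet, "
            "and a document id that can be passed to a read tool."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Keywords or a short natural-language question.",
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of results (1-20).",
                    "minimum": 1,
                    "maximum": 20,
                },
            },
            "required": ["query"],
        },
    },
}
```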

He positioned fine-tuning options as a spectrum. Supervised fine-tuning was described as effective when there is a predictable mapping from input to output and the goal is to imitate a consistent style or structure. Preference optimization was described as a way to shift outputs toward preferred responses using paired comparisons; OpenAI’s Direct Preference Optimization guide describes it as fine-tuning by comparing model outputs and notes that it is currently limited to text inputs and outputs. Reinforcement fine-tuning was emphasized as a better fit for tasks where the model needs to discover strategies over longer trajectories rather than reproduce a single demonstrated completion pattern.
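To make the contrast concrete, the snippet below sketches the rough shape of one training example per method. The field names approximate OpenAI's published fine-tuning data formats and should be checked against the current documentation; the content of each example is invented.

```python
# Rough shape of one training example per method (approximate field names).

# Supervised fine-tuning: imitate a demonstrated completion.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarize this ticket: ..."},
        {"role": "assistant", "content": "Customer reports a billing error ..."},
    ]
}

# Preference optimization (DPO): learn from paired comparisons of outputs.
dpo_example = {
    "input": {"messages": [{"role": "user", "content": "Summarize this ticket: ..."}]},
    "preferred_output": [{"role": "assistant", "content": "Concise, accurate summary ..."}],
    "non_preferred_output": [{"role": "assistant", "content": "Rambling, incomplete summary ..."}],
}

# Reinforcement fine-tuning: only the prompt plus reference data for the
# grader; the completion is discovered during training rollouts.
rft_example = {
    "messages": [{"role": "user", "content": "Find the 2023 revenue figure ..."}],
    "reference_answer": "14.2B",
}
```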
Beware of reward hacking! Resolve any edge cases in your grader. Continuous rewards work better than binary rewards. - Will Hang, OpenAI
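Hang's point about continuous rewards can be seen in a toy grader. In the sketch below, which is not from the session, an exact-match grader gives a near-miss the same zero score as a wild guess, while a continuous grader awards partial credit that scales with how close the answer is.

```python
# Illustrative contrast between a binary grader and a continuous one for a
# numeric-answer task. Partial credit gives the policy a "warmer/colder"
# signal instead of an all-or-nothing reward.

def binary_grade(predicted: float, reference: float) -> float:
    # 1.0 only on an exact match; every other answer looks equally wrong.
    return 1.0 if predicted == reference else 0.0

def continuous_grade(predicted: float, reference: float, scale: float = 0.1) -> float:
    # Reward decays smoothly with relative error, so a near-miss scores higher
    # than a wild guess. `scale` controls how quickly credit falls off.
    relative_error = abs(predicted - reference) / max(abs(reference), 1e-9)
    return max(0.0, 1.0 - relative_error / scale)

print(binary_grade(14.1, 14.2))      # 0.0 -- no signal about how close it was
print(continuous_grade(14.1, 14.2))  # ~0.93 -- near-misses still earn credit
```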
Agent RFT was presented as reinforcement fine-tuning adapted to tool-using agents, where the model explores different strategies during training rollouts and receives a learning signal from a grader. OpenAI’s documentation describes the loop as sampling candidate responses, scoring them with a grader you define, and updating the model based on those scores. Hang emphasized credit assignment across the full trajectory so earlier decisions, including tool selection and tool-call structure, can be reinforced or discouraged based on downstream outcomes. He described an agent as a system that can interact with the outside world through tools, not only respond to a user prompt.
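A minimal sketch of that loop might look as follows. The functions run_agent_rollout, grade, and update_policy are stand-ins for the rollout execution, the user-defined grader, and the policy update that happens on OpenAI's side; none of them are real API calls.

```python
# Conceptual sketch of the rollout-grade-update loop described in the talk.

def train_step(policy, tasks, run_agent_rollout, grade, update_policy,
               rollouts_per_task=4):
    scored_rollouts = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            # The agent explores: it reasons, calls tools, reads tool outputs,
            # and produces a final answer, all within a single trajectory.
            trajectory, answer = run_agent_rollout(policy, task)
            # The grader scores the whole trajectory, not just the answer, so
            # earlier tool choices share credit for the outcome.
            reward = grade(task, trajectory, answer)
            scored_rollouts.append((trajectory, reward))
    # Credit assignment: reinforce decisions on high-reward trajectories and
    # discourage decisions on low-reward ones.
    update_policy(policy, scored_rollouts)
```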

Hang described tool examples including terminals for coding agents, internal business systems for customer support, and document search or retrieval endpoints. He emphasized that tool outputs flow back into the same context window, so tool calls, tool outputs, reasoning tokens, and the final response form a single multi-step trajectory. He also said that graders become a core artifact in the workflow. The session described multiple grading styles, including simple matchers, model-based judges, code-based graders, endpoint graders, and combinations of graders to jointly optimize accuracy and latency.
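As an illustration of combining graders to trade accuracy off against latency, the sketch below mixes an exact-match correctness score with an efficiency score based on how many tool calls the trajectory used. The weights, budget, and function names are invented for the example.

```python
# Hypothetical combined grader: weight a correctness score against an
# efficiency score derived from the trajectory's tool-call count.

def accuracy_grade(answer: str, reference: str) -> float:
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def efficiency_grade(num_tool_calls: int, budget: int = 8) -> float:
    # Full credit within the budget, linearly decaying credit beyond it.
    if num_tool_calls <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tool_calls - budget) / budget)

def combined_grade(answer: str, reference: str, num_tool_calls: int,
                   w_accuracy: float = 0.8, w_efficiency: float = 0.2) -> float:
    return (w_accuracy * accuracy_grade(answer, reference)
            + w_efficiency * efficiency_grade(num_tool_calls))

print(combined_grade("42 USD", "42 usd", num_tool_calls=11))  # 0.8 + 0.2*0.625 = 0.925
```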
The session also focused on operational properties that are not captured by answer accuracy alone. Hang described using Agent RFT to reduce unnecessary tool calls, enforce tool-call budgets, and reduce the long tail of very long trajectories that can create unpredictable latency and degraded user experience. Slides referenced training traces where reasoning tokens and tool calls decreased over training, consistent with the idea that the agent can learn to use fewer steps to reach similar or better task outcomes.
Wenjie Zi then took over the latter part of the session to walk through use cases and platform setup details, including a finance-oriented example where a model must locate relevant content across a large document corpus under a constrained tool-call budget. In that setup, the agent uses search, listing, and file-reading tools exposed behind endpoints, and a grader then scores the final answer. She highlighted using a model-based grader even for numeric answers to reduce false negatives caused by superficial formatting differences, units, or small variations.
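A model-based grader of the kind Zi described could be approximated with a judge call like the one below, which asks a model whether a predicted number and a reference are equivalent despite formatting or unit differences. The prompt, judge model, and scoring scheme are illustrative and use the standard Chat Completions client rather than the RFT grader configuration she presented.

```python
from openai import OpenAI

client = OpenAI()

def model_judge_numeric(predicted: str, reference: str,
                        judge_model: str = "gpt-4o-mini") -> float:
    # Instead of a strict string match (which would mark "$14.2B" wrong against
    # "14.2 billion USD"), ask a judge model whether the values are equivalent.
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You compare a predicted numeric answer with a reference. "
                    "Ignore formatting, currency symbols, and unit spelling. "
                    "Reply with only YES if they represent the same value, otherwise NO."
                ),
            },
            {"role": "user", "content": f"Predicted: {predicted}\nReference: {reference}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0

# "$14.2B" vs "14.2 billion USD" fails an exact string match but should be
# judged equivalent here.
```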

Zi also described broader examples across agentic coding and other domains, focusing on environments with many tools, isolated execution contexts, and reward designs that balance correctness with process and efficiency. Reported outcomes emphasized improved planning, a shorter tail of very long trajectories, and in some cases a shift toward parallel tool calls that reduce the number of sequential turns.
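For readers unfamiliar with parallel tool calls, an assistant turn that issues two calls in a single step, instead of two sequential turns, looks roughly like the example below; the ids, tool name, and arguments are made up, and the layout follows the common function-calling message format.

```python
# Illustrative shape of one assistant turn issuing two tool calls in parallel,
# so both searches are in flight before the next model turn.
parallel_tool_call_turn = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "search_documents", "arguments": '{"query": "2023 revenue"}'},
        },
        {
            "id": "call_2",
            "type": "function",
            "function": {"name": "search_documents", "arguments": '{"query": "2022 revenue"}'},
        },
    ],
}
```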
Developers who want to learn more can review OpenAI’s reinforcement fine-tuning and model optimization documentation, and watch for video of the presentation to become available on infoq.com in the coming weeks.