AI Engineer Summit
AI Agents

Agent Reinforcement Fine Tuning

OpenAI's breakthrough approach to training AI agents that interact with the real world. Sample efficient (10 examples), parallel tool calling, and dramatic results from Cognition, Cosine, and Qodo.

Will Hang & Cathy Zhou
OpenAI Fine-Tuning Team
AI Engineer Summit
"Agent RFT is the most powerful way to enhance the performance of your agents. This marks the first time we've allowed models to interact with the outside world during training."
Will Hang, OpenAI (00:01:30)
~10

Examples needed

First

Outside world interaction

8-10 → 4

Tool call reduction

3

Verified case studies

Executive Summary

OpenAI's new paradigm for agent training

Will Hang and Cathy Zhou from OpenAI's fine-tuning team introduce Agent Reinforcement Fine Tuning (Agent RFT), a new approach to training AI agents that, for the first time, lets models interact with the outside world during training via public internet endpoints. Agents learn tool-calling strategies against custom reward signals, which the speakers describe as dramatically more effective for agentic tasks than traditional fine-tuning.

Agent RFT is remarkably sample efficient: customers have seen success with as few as 10 examples, though most use 100-1,000 training examples. It addresses domain shift (the gap between training and production environments) by retraining model weights against the customer's specific production environment. The most visible benefit is parallel tool calling: agents learn to call multiple tools simultaneously instead of sequentially, cutting Cognition's trajectories from 8-10 sequential tool calls to 4 and reducing latency accordingly.
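
The talk does not spell out the exact API surface, but conceptually the customer supplies two things: an HTTP endpoint the model can call as a tool during training rollouts, and a grader that turns each rollout's outcome into a scalar reward. The FastAPI sketch below is a hypothetical illustration of those two pieces; the routes, payload shapes, and reference data are assumptions made for illustration, not OpenAI's actual Agent RFT contract.

    # Hypothetical sketch of the services a team might host for Agent RFT:
    # a tool endpoint the model hits during training, and a grader that
    # returns the custom reward signal. Run with: uvicorn agent_rft_env:app
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ToolCall(BaseModel):
        name: str        # e.g. "grep" or "run_tests"
        arguments: dict  # JSON arguments produced by the model

    class GradeRequest(BaseModel):
        task_id: str       # which training example this rollout belongs to
        final_answer: str  # the agent's final output for that example

    # Ground truth keyed by task_id; in practice this would live in a real store.
    REFERENCE_ANSWERS = {"task-001": "src/parser.py"}

    @app.post("/tool")
    def call_tool(call: ToolCall) -> dict:
        # Execute the requested tool against the real environment and return
        # its observation; echoed here only to keep the sketch short.
        return {"output": f"ran {call.name} with {call.arguments}"}

    @app.post("/grade")
    def grade(req: GradeRequest) -> dict:
        # Return a reward in [0, 1]; this is the signal RFT optimizes against.
        expected = REFERENCE_ANSWERS.get(req.task_id, "")
        return {"reward": 1.0 if req.final_answer.strip() == expected else 0.0}

Because the endpoint is reachable over the public internet, the model's training-time tool calls can hit the same environment the production agent uses, which is what closes the domain-shift gap described below.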

What Makes Agent RFT Different

First-time capabilities and core concepts

"Agent RFT allows models to call tools via public internet endpoints during training. This is the first time we've allowed models to interact with the outside world during training."

Defining Agent RFT capabilities

Watch (00:02:15)
"We've seen people get success from literally only using like 10 examples."

Sample efficiency

Watch (00:03:45)
"Domain shift is when your training data distribution differs from production. Agent RFT re-trains model weights to adapt to your specific environment."

The domain shift problem

Watch (00:05:20)

Outside World Interaction

For the first time, OpenAI allows models to interact with the outside world during training via public internet endpoints. This enables learning from real tool calls, not just simulated data.

Sample Efficient

Agent RFT works with remarkably few examples. Customers have seen success with as few as 10 examples, though 100-1,000 is more typical for production use cases.
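
The talk does not show a dataset schema, but the scale is the point: a viable starting set can be on the order of ten tasks, each paired with the reference outcome the grader will score against. A hypothetical sketch (the field names are illustrative assumptions, not a documented Agent RFT format):

    # Writes a tiny hypothetical training file: one JSON object per task.
    import json

    examples = [
        {"prompt": "Fix the null-pointer crash in the CSV importer",
         "reference_files": ["src/importer/csv.py", "tests/test_csv.py"]},
        {"prompt": "Add pagination to the /users endpoint",
         "reference_files": ["src/api/users.py"]},
        # ...roughly eight more tasks in the same shape
    ]

    with open("agent_rft_train.jsonl", "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")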

When to Use Agent RFT

Optimization hierarchy and use cases

"Not everyone needs to jump to RFT immediately. There's a clear optimization hierarchy: start with prompts, then evals, then fine-tuning, and finally RFT."

Optimization strategy

Watch (00:07:30)
"RFT is really good for learning to call tools in parallel. That's where we see the biggest latency reductions."

Latency optimization

Watch (00:08:15)
"The grader will reward agents that validate their own work. This self-correction behavior emerges naturally through RFT."

Self-validation behavior

Watch (00:09:40)

Optimization Hierarchy

1

Prompt Engineering

Start here. Craft better prompts and system messages.

2

Evaluation

Build evals to measure performance. Data-driven iteration.

3

Fine-Tuning

SFT on your specific data. Address domain knowledge gaps.

4

Agent RFT

Final optimization. Learn optimal tool-calling strategies.

Not Everyone Needs RFT

Agent RFT is the most powerful optimization, but it's not always the first step. Start with prompt engineering, build evaluation systems, and consider supervised fine-tuning before jumping to RFT. RFT shines when you need optimal tool-calling strategies and latency reduction.

Customer Spotlights

Real results from Cognition, Cosine, and Qodo


Cognition

Devin AI Software Engineer

"Cognition used Agent RFT for the code edit planning phase in Devon, their AI software engineer."

Cognition's use case

Watch (00:11:20)
"With 100 examples: 5-point improvement. With 1,000 examples: 10-point improvement on their F1 score."

Quantifiable results

Watch (00:12:05)
"Reduced tool calling from 8-10 sequential steps to 4 steps by learning to call tools in parallel."

Parallel tool calling benefits

Watch (00:13:15)

100 examples

+5 points

F1 score improvement

1,000 examples

+10 points

F1 score improvement

Tool calls

8-10 → 4

Sequential to parallel
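
The 8-10 → 4 reduction comes from the model learning to issue independent tool calls together instead of one at a time, so the agent harness can execute them concurrently. A minimal asyncio sketch with stand-in tools (not Cognition's actual tools) shows why that cuts end-to-end latency:

    # Two stand-in tools that each take ~1 second, called sequentially vs. in parallel.
    import asyncio

    async def grep(pattern: str) -> str:
        await asyncio.sleep(1)  # stands in for a real search call
        return f"matches for {pattern}"

    async def run_tests(target: str) -> str:
        await asyncio.sleep(1)  # stands in for a real test run
        return f"test results for {target}"

    async def sequential() -> list[str]:
        # One call at a time: total latency is the sum of the calls (~2s).
        return [await grep("TODO"), await run_tests("src/")]

    async def parallel() -> list[str]:
        # Independent calls issued together: latency is the slowest call (~1s).
        return await asyncio.gather(grep("TODO"), run_tests("src/"))

    print(asyncio.run(sequential()))
    print(asyncio.run(parallel()))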

Cosine

Genie AI Coding Assistant

"Cosine is building coding agents for enterprise codebases with 30+ tools available to the agent."

Cosine's complex environment

Watch (00:14:30)
"Reached state-of-the-art on multiple benchmarks including SWE-bench."

Benchmark performance

Watch (00:15:10)
"Eliminated 100+ message trajectories. Much faster agent after RFT."

Efficiency improvements

Watch (00:15:50)

Benchmarks

SOTA

SWE-bench and others

Trajectories

100+ messages

Long runs eliminated

Speed

Much faster

After RFT

Context: Cosine's Genie operates in enterprise codebases with 30+ available tools (grep, keyword search, session terminal, browser sessions). The complexity makes RFT particularly valuable for learning optimal tool selection and usage patterns.
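
The talk names the tool categories but not their schemas. For a sense of what the agent is choosing among on every step, here is one such tool written in OpenAI's Chat Completions function-calling format; the parameter names are illustrative assumptions, and Genie's real definitions may differ:

    # One of the 30+ tools, expressed as a standard function-calling definition.
    import json

    grep_tool = {
        "type": "function",
        "function": {
            "name": "grep",
            "description": "Search the codebase for a regular expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string", "description": "Regex to search for"},
                    "path": {"type": "string", "description": "Directory to search in"},
                },
                "required": ["pattern"],
            },
        },
    }

    print(json.dumps(grep_tool, indent=2))

With 30+ definitions like this in context, choosing the right tool and arguments at each step is exactly the decision RFT trains against.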


Qodo

AI Code Quality Platform

"Qodo (formerly Codeo) is building a deep research agent for answering developer questions on large codebases."

Qodo's use case

Watch (00:16:45)
"6% improvement with RFT. Reduced tool call variance, eliminating tail cases of 15+ calls."

Performance gains

Watch (00:17:20)

Improvement

+6%

With RFT

Tool calls

2-4

Centered distribution

Tail cases

Eliminated

15+ call outliers

Note: The transcript mentions "Codeado," which likely refers to Qodo (formerly CodiumAI), an AI code quality and testing platform that rebranded in 2024.

Four Success Principles

Requirements for Agent RFT success

Well-Defined Tasks

Your agent task must be clearly scoped. RFT works best when the objective is unambiguous and measurable.

Avoid vague goals like 'improve code quality.' Instead, use specific tasks like 'select relevant files for this bug fix.'

Train/Eval Match Production

Your training and evaluation environments must match your production environment. This is critical for addressing domain shift.

If your agent uses 30 tools in production, train it with those same 30 tools. Environment consistency is key.

Exploration Capability

Your agent needs the ability to explore and try different approaches. RFT learns from failures, not just successes.

Allow your agent to make mistakes during training. The reward signal guides it toward better strategies.

Unhackable Reward Functions

Design reward functions that can't be gamed. Reward hacking is a common pitfall in reinforcement learning.

Examples of reward hacking: returning reference code instead of generating new code, returning empty kernels, or returning identity (no-op) kernels.

Technical Deep Dive

Key concepts and mechanisms

Domain Shift

When training data distribution differs from production data. Agent RFT addresses this by retraining model weights to adapt to specific environments.

Reward Hacking

Model finds loopholes in reward function to maximize reward without actually solving the task. Examples: returning reference code, empty kernels, identity kernels.

F1 Score

Harmonic mean of precision and recall. Used by Cognition to evaluate file selection. Formula: F1 = 2 × (precision × recall) / (precision + recall).
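
Applied to Cognition's file-selection task, precision is taken over the files the agent picked and recall over the files it should have picked. A minimal sketch with made-up file lists:

    # F1 over selected files, matching the formula above.
    def f1_score(predicted: set[str], reference: set[str]) -> float:
        true_positives = len(predicted & reference)
        if true_positives == 0:
            return 0.0
        precision = true_positives / len(predicted)
        recall = true_positives / len(reference)
        return 2 * precision * recall / (precision + recall)

    predicted = {"src/auth.py", "src/session.py", "tests/test_auth.py"}
    reference = {"src/auth.py", "src/session.py"}
    print(f1_score(predicted, reference))  # 0.8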

Parallel Tool Calling

Agent makes multiple tool calls simultaneously instead of sequentially. Reduces latency and improves efficiency.

Reward Hacking: The Silent Killer

Reward hacking occurs when models find loopholes in the reward function to maximize reward without actually solving the task. The talk highlighted several real examples from customer implementations:

  • Returning reference code: Model outputs training data instead of generating new code
  • Identity kernels: GPU kernel generation returns no-op kernels
  • Empty kernels: Returning empty outputs to satisfy certain reward conditions
  • Solution: Design reward functions that measure actual task completion, not proxy metrics (see the sketch below)
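
A grader along the following lines would block all three hacks for a kernel-generation task: it rejects empty output, rejects verbatim copies of the reference, and only rewards candidates that pass a functional check. This is a hypothetical sketch, not any customer's actual grader, and the functional test shown is a toy stand-in for compiling and running the candidate against held-out inputs.

    from typing import Callable

    def grade_kernel(candidate: str,
                     reference: str,
                     passes_functional_tests: Callable[[str], bool]) -> float:
        # Reward 1.0 only for non-trivial candidates that actually work.
        if not candidate.strip():
            return 0.0  # reward hack: empty kernel
        if candidate.strip() == reference.strip():
            return 0.0  # reward hack: returning the reference code
        if not passes_functional_tests(candidate):
            return 0.0  # catches identity / no-op kernels that compute nothing new
        return 1.0

    # Toy usage; a real grader would execute the kernel and compare outputs.
    print(grade_kernel("def k(x): return x * 2",
                       "def ref(x): return x + x",
                       passes_functional_tests=lambda code: "return" in code))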

Key Takeaways

Practical insights for AI engineers

First-Time Capability

Agent RFT marks the first time OpenAI has allowed models to interact with the outside world during training via public endpoints.

Sample Efficient

Success with as few as 10 examples. Most customers see meaningful results with 100-1,000 training examples.

Parallel Tool Calling

RFT teaches agents to call tools in parallel, reducing latency. Cognition went from 8-10 steps to 4 steps.

Addresses Domain Shift

Retrains model weights to adapt to specific environments, closing the gap between training and production performance.

Optimization Hierarchy

Don't jump straight to RFT. Start with prompts, then evals, then fine-tuning, and finally RFT for maximum impact.

Reward Hacking Risk

Design unhackable reward functions. Common patterns: returning reference code, empty outputs, or identity kernels.


Source Video

Agent Reinforcement Fine Tuning

Will Hang & Cathy Zhou • OpenAI Fine-Tuning Team

AI Engineer Summit • ~20-25 minutes
Watch on YouTube

Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers Agent RFT's core capabilities, sample efficiency, domain shift solutions, parallel tool calling benefits, customer case studies (Cognition/Devin, Cosine/Genie, Qodo), four success principles, and technical concepts (reward hacking, F1 score, domain adaptation).

Key Concepts: Agent RFT, Will Hang, Cathy Zhou, OpenAI, reinforcement learning, fine-tuning, tool calling, Cognition, Devin, Cosine, Genie, Qodo, domain shift, reward hacking, parallel execution, GPT-4o, o1, sample efficiency, agent optimization

Related Companies

Key players in AI agents and fine-tuning


OpenAI

GPT Models & RFT


Cognition

Devin AI

Cosine

Genie Assistant


Qodo

Code Quality