Continual System Prompt Learning for Code Agents
Aparna Dhinakaran from Arize demonstrates how to improve coding agents by 5-15 percentage points using English feedback loops, a practical alternative to reinforcement learning that requires only 150 examples.
Executive Summary
While everyone focuses on frontier coding models, Aparna Dhinakaran reveals the hidden secret to successful agents: continual iteration on system prompts. Her team at Arize achieved 5-15 percentage-point improvements on Claude Sonnet 4.5 and Claude Code using only 150 examples from SWE-bench Lite.
This talk introduces "System Prompt Learning"—a practical alternative to reinforcement learning where agents receive English-language explanations of failures instead of scalar rewards. It's more sample-efficient, requires no data science team, and delivers significant improvements through better evaluation prompts.
Benchmark Results: 150 Examples, 5-15 Point Improvement
| Agent | Training Data | Baseline | After Learning |
|---|---|---|---|
| Claude (Sonnet 4.5, vanilla) | 150 examples from SWE-bench Lite | 30% | 45% |
| Claude Code | 150 examples from SWE-bench Lite | 40% | 45% |
Key Insight: These improvements were achieved on the most powerful coding agents using only 150 training examples—demonstrating exceptional sample efficiency compared to traditional RL approaches that require thousands of examples.
Key Insights
Core principles and discoveries from system prompt learning research
System Prompts Are Living Documents, Not Static Files
The most successful coding agents, including Claude Code and Cursor, don't run on static system prompts. Their prompts are continuously iterated on based on real-world feedback, and this ongoing iteration, not just the underlying model, is the hidden secret to their success.
Notable Quotes:
"What's not so obvious is how much time is actually spent on the system prompts for those building these coding agents."
"What's not as obvious is these actually aren't just static. They are repeatedly iterated on. And it's such an important piece of context that actually goes into making these coding agents the most successful agents out there."
"Just the length of the actual system prompt for each one of these."
English Feedback Outperforms Scalar Rewards
Unlike reinforcement learning, which only provides a score (70%, 80%, 90%), system prompt learning provides detailed natural language explanations of failures. This is like a student getting back a test with teacher comments instead of just a grade—far more efficient for learning.
Notable Quotes:
"[In RL] they have to figure out almost blindly just with that score how to actually improve their score on the next exam."
"Except in this case [prompt learning], what actually gets outputted isn't just the score... but you also get back some kind of English feedback. Why did they get this answer right? What did they mess up on? Here's concepts that they missed on, what do they need to go study?"
"It almost feels like humans learning because they take back English feedback and use that to actually iterate on what they should do differently the next time."
150 Examples Can Deliver 5-15 Point Improvements
Using only 150 examples from SWE-bench Lite, Arize achieved a 15-point improvement on vanilla Claude Sonnet 4.5 (from 30% to 45% of GitHub issues resolved) and a 5-point improvement on Claude Code (from 40% to 45%). This sample efficiency dwarfs traditional reinforcement learning approaches, which typically require thousands of examples.
Notable Quotes:
"On 150 examples we were able to get Cloud code up by 5% more GitHub issues resolved, client um you know 15%."
"On vanilla Claude Sonnet 4.5 it was about 30% of the GitHub issues actually resolved. Cloud code it was about 40% of the GitHub issues resolved."
"And this was literally on I think the key thing is like 150 examples of just training data that was used."
Evaluation Quality Is the Critical Differentiator
When compared to DSPy's GEPA (a similar prompt-optimization approach), the key difference wasn't the algorithm; it was the investment in high-quality evaluation prompts that generated actionable explanations. Good evals are how you get the best insight into improving your agents.
Notable Quotes:
"Writing really good evals is I think... how you get the best kind of insight into what you could do to improve your agents."
"The key thing that was really different here was we spent a lot of time actually developing and iterating on the evals and the eval prompts really mattered to making sure that you gave really good explanations back to the agent."
"And then this is the key part. We actually asked for an explanation. Why did it actually mess up?"
RL May Be Overkill for Most Teams
Reinforcement learning works but is sample-inefficient, time-intensive, and data-hungry. It requires a full data science team. For teams building agents on top of already-capable LLMs, system prompt learning offers a more practical path that doesn't require fine-tuning or massive compute resources.
Notable Quotes:
"RL works, don't get me wrong, amazing in so many concepts and domains, but it can be... a long path to actually figure out what the right solution is."
"Some of the things that we've noticed is that it can be sample inefficient. It takes a lot of data to get what you want. It's time intensive. It's data hungry."
"It might be overkill for teams who are trying to build agents because LLMs are already so good."
The System Prompt Learning Pipeline
Arize's approach creates a feedback loop where agents learn from their mistakes through English explanations, not just pass/fail signals. A minimal code sketch of the full loop follows the steps below.
1. Generate Solution
Agent receives coding problem from SWE-bench and generates a patch solution
2. Run Unit Tests
Execute the generated solution against unit tests to get pass/fail result
3. LLM-as-Judge Evaluation
Pass the problem, solution, and test results to an LLM judge for detailed analysis
4. Extract Explanations
LLM provides categorized explanations: parsing errors, library-specific issues, concept gaps
5. Meta Prompt Synthesis
Combine original prompt, evaluation results, and explanations into a meta prompt
6. Generate New Rules
Meta prompt outputs improved rules that are appended to the system prompt
💡 The Critical Step:
The LLM-as-a-Judge evaluation is the key innovation. By asking for detailed explanations of failures ("Why did it mess up? What concepts were missed?"), the meta prompt synthesis step generates actionable rules that actually improve agent behavior.
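The sketch below ties the six steps together in code. It is a minimal illustration under stated assumptions: `run_agent`, `run_unit_tests`, and `call_llm` are hypothetical stand-ins for an agent harness, a test runner, and an LLM client, and `JUDGE_PROMPT` refers to the evaluation prompt sketched earlier; none of this is Arize's actual implementation.

```python
# Minimal sketch of the loop above. `run_agent`, `run_unit_tests`, and
# `call_llm` are hypothetical stand-ins for your agent harness, test runner,
# and LLM client; JUDGE_PROMPT is the evaluation prompt sketched earlier.

META_PROMPT = """You maintain the system prompt for a coding agent.

Current system prompt:
{system_prompt}

Evaluated attempts, each with an English explanation of what went wrong:
{evaluations}

Write new, general rules (one per line) that would prevent these failures.
Do not repeat rules that are already in the system prompt.
"""

def prompt_learning_step(system_prompt: str, tasks: list[dict]) -> str:
    """Run one learning iteration and return the updated system prompt."""
    evaluations = []
    for task in tasks:
        patch = run_agent(system_prompt, task["problem"])        # 1. generate solution
        test_output = run_unit_tests(task["repo"], patch)        # 2. run unit tests
        judgment = call_llm(JUDGE_PROMPT.format(                 # 3-4. judge + explain
            problem=task["problem"], patch=patch, test_output=test_output))
        evaluations.append(judgment)

    new_rules = call_llm(META_PROMPT.format(                     # 5. meta prompt synthesis
        system_prompt=system_prompt,
        evaluations="\n\n".join(evaluations)))
    return system_prompt + "\n" + new_rules                      # 6. append new rules
```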
Reinforcement Learning vs System Prompt Learning
| Dimension | Reinforcement Learning | System Prompt Learning |
|---|---|---|
| Feedback signal | Scalar score only (70%, 80%, 90%) | English feedback plus a scalar score |
| Sample efficiency | Inefficient: needs thousands of examples | Efficient: 150 examples were sufficient |
Takeaway: For teams building agents on top of already-capable LLMs, system prompt learning offers a more practical path. It's faster, more sample-efficient, and doesn't require specialized ML expertise.
Practical Takeaways for AI Engineers
Invest in Evaluation Prompts
The quality of your LLM-as-a-judge prompts matters more than the underlying algorithm. Spend time crafting prompts that generate actionable, specific explanations.
Start Small, Iterate Fast
You don't need thousands of examples. Arize achieved significant improvements with just 150 examples from SWE-bench Lite.
Categorize Your Failures
Group errors into categories (parsing errors, library-specific issues, concept gaps) to generate more targeted improvements; see the grouping sketch after these takeaways.
Focus on English Feedback
Natural language explanations are more valuable than scalar rewards. Ask 'why did it fail?' not just 'did it pass?'
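For the failure-categorization takeaway above, here is a small hypothetical sketch of the grouping step. It assumes the judge's outputs have been parsed into dicts with `verdict`, `category`, and `explanation` keys, as in the earlier sketches.

```python
from collections import Counter, defaultdict

# Hypothetical sketch of the "categorize your failures" takeaway: group the
# judge's parsed outputs by failure category so rule generation can target
# the most common error classes first.

def group_failures(judgments: list[dict]) -> dict[str, list[str]]:
    """Map each failure category to the explanations filed under it."""
    grouped = defaultdict(list)
    for j in judgments:
        if j["verdict"] == "FAIL":
            grouped[j["category"]].append(j["explanation"])
    return dict(grouped)

def top_failure_categories(judgments: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent failure categories, e.g. 'parsing error'."""
    counts = Counter(j["category"] for j in judgments if j["verdict"] == "FAIL")
    return counts.most_common(n)
```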
Notable Quotes
"What's not so obvious is how much time is actually spent on the system prompts for those building these coding agents."
"What's not as obvious is these actually aren't just static. They are repeatedly iterated on."
"It almost feels like humans learning because they take back English feedback and use that to actually iterate on what they should do differently the next time."
"They have to figure out almost blindly just with that score how to actually improve their score on the next exam."
"But you also get back some kind of English feedback. Why did they get this answer right? What did they mess up on?"
"On vanilla Claude Sonnet 4.5 it was about 30% of the GitHub issues actually resolved."
"Writing really good evals is I think um how you get the best kind of insight into what you could do to improve your agents."
"On 150 examples we were able to get Cloud code up by 5% more GitHub issues resolved, client um you know 15%."
"Many of you guys in this room might be thinking, okay, well, prompt learning is cool, but how does that compare to GEA?"
"The key thing that was really different here was we spent a lot of time actually developing and iterating on the evals."
About the Speaker
Aparna Dhinakaran
Co-founder & CPO of Arize
Aparna Dhinakaran is the Co-founder and CPO of Arize, a leading AI observability and evaluation platform. She's a thought leader in AI evaluation and system prompt optimization, helping enterprises build reliable AI systems. Her work focuses on practical methodologies for improving LLM applications without requiring massive infrastructure or specialized ML teams.