Continual System Prompt Learning for Code Agents
Aparna Dhinakaran from Arize demonstrates how to improve coding agents by 5-15 percentage points using English feedback loops, a practical alternative to reinforcement learning that requires only 150 examples.
Executive Summary
While everyone focuses on frontier coding models, Aparna Dhinakaran reveals the hidden secret to successful agents: continual iteration on system prompts. Her team at Arize achieved 5-15 percentage-point improvements on Claude Sonnet 4.5 and Claude Code using only 150 examples from SWE-bench Lite.
This talk introduces "System Prompt Learning"—a practical alternative to reinforcement learning where agents receive English-language explanations of failures instead of scalar rewards. It's more sample-efficient, requires no data science team, and delivers significant improvements through better evaluation prompts.
Benchmark Results: 150 Examples, 5-15 Point Improvement
| Agent | Training Data | Baseline | After Learning |
|---|---|---|---|
| Claude (Sonnet 4.5, vanilla) | 150 examples from SWE-bench Lite | 30% | 45% |
| Claude Code | 150 examples from SWE-bench Lite | 40% | 45% |
Key Insight: These improvements were achieved on the most powerful coding agents using only 150 training examples—demonstrating exceptional sample efficiency compared to traditional RL approaches that require thousands of examples.
Key Insights
Core principles and discoveries from system prompt learning research
System Prompts Are Living Documents, Not Static Files
The most successful coding agents, including Claude Code and Cursor, don't run on static system prompts. Their prompts are continuously iterated on based on real-world feedback, and this ongoing iteration, not just the underlying model, is the hidden secret to their success.
Notable Quotes:
"What's not so obvious is how much time is actually spent on the system prompts for those building these coding agents."
"What's not as obvious is these actually aren't just static. They are repeatedly iterated on. And it's such an important piece of context that actually goes into making these coding agents the most successful agents out there."
"Just the length of the actual system prompt for each one of these."
English Feedback Outperforms Scalar Rewards
Unlike reinforcement learning, which only provides a score (70%, 80%, 90%), system prompt learning provides detailed natural language explanations of failures. This is like a student getting back a test with teacher comments instead of just a grade—far more efficient for learning.
Notable Quotes:
"[In RL] they have to figure out almost blindly just with that score how to actually improve their score on the next exam."
"Except in this case [prompt learning], what actually gets outputted isn't just the score... but you also get back some kind of English feedback. Why did they get this answer right? What did they mess up on? Here's concepts that they missed on, what do they need to go study?"
"It almost feels like humans learning because they take back English feedback and use that to actually iterate on what they should do differently the next time."
150 Examples Can Deliver 5-15 Point Improvements
Using only 150 examples from SWE-bench Lite, Arize achieved a 15-point improvement on vanilla Claude Sonnet 4.5 (from 30% to 45% of GitHub issues resolved) and a 5-point improvement on Claude Code (from 40% to 45%). This sample efficiency dwarfs traditional reinforcement learning approaches, which typically require thousands of examples.
Notable Quotes:
"On 150 examples we were able to get Cloud code up by 5% more GitHub issues resolved, client um you know 15%."
"On vanilla Claude Sonnet 4.5 it was about 30% of the GitHub issues actually resolved. Cloud code it was about 40% of the GitHub issues resolved."
"And this was literally on I think the key thing is like 150 examples of just training data that was used."
Evaluation Quality Is the Critical Differentiator
When compared to DSPy's GEPA (a similar prompt-optimization approach), the key difference wasn't the algorithm; it was the investment in high-quality evaluation prompts that generated actionable explanations. Good evals are how you get the best insight into improving your agents.
Notable Quotes:
"Writing really good evals is I think... how you get the best kind of insight into what you could do to improve your agents."
"The key thing that was really different here was we spent a lot of time actually developing and iterating on the evals and the eval prompts really mattered to making sure that you gave really good explanations back to the agent."
"And then this is the key part. We actually asked for an explanation. Why did it actually mess up?"
RL May Be Overkill for Most Teams
Reinforcement learning works but is sample-inefficient, time-intensive, and data-hungry. It requires a full data science team. For teams building agents on top of already-capable LLMs, system prompt learning offers a more practical path that doesn't require fine-tuning or massive compute resources.
Notable Quotes:
"RL works, don't get me wrong, amazing in so many concepts and domains, but it can be... a long path to actually figure out what the right solution is."
"Some of the things that we've noticed is that it can be sample inefficient. It takes a lot of data to get what you want. It's time intensive. It's data hungry."
"It might be overkill for teams who are trying to build agents because LLMs are already so good."
The System Prompt Learning Pipeline
Arize's approach creates a feedback loop where agents learn from their mistakes through English explanations, not just pass/fail signals. A minimal code sketch of the full loop follows the steps below.
1. Generate Solution
Agent receives coding problem from SWE-bench and generates a patch solution
2. Run Unit Tests
Execute the generated solution against unit tests to get pass/fail result
3. LLM-as-Judge Evaluation
Pass the problem, solution, and test results to an LLM judge for detailed analysis
4. Extract Explanations
LLM provides categorized explanations: parsing errors, library-specific issues, concept gaps
5. Meta Prompt Synthesis
Combine original prompt, evaluation results, and explanations into a meta prompt
6. Generate New Rules
Meta prompt outputs improved rules that are appended to the system prompt
💡 The Critical Step:
The LLM-as-a-Judge evaluation is the key innovation. By asking for detailed explanations of failures ("Why did it mess up? What concepts were missed?"), the meta prompt synthesis step generates actionable rules that actually improve agent behavior.
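The sketch below ties the six steps together in code. It is a minimal illustration under stated assumptions: `run_agent`, `run_unit_tests`, and `call_llm` are hypothetical stand-ins for an agent harness, a test runner, and an LLM client, and `JUDGE_PROMPT` refers to the evaluation prompt sketched earlier; none of this is Arize's actual implementation.

```python
# Minimal sketch of the loop above. `run_agent`, `run_unit_tests`, and
# `call_llm` are hypothetical stand-ins for your agent harness, test runner,
# and LLM client; JUDGE_PROMPT is the evaluation prompt sketched earlier.

META_PROMPT = """You maintain the system prompt for a coding agent.

Current system prompt:
{system_prompt}

Evaluated attempts, each with an English explanation of what went wrong:
{evaluations}

Write new, general rules (one per line) that would prevent these failures.
Do not repeat rules that are already in the system prompt.
"""

def prompt_learning_step(system_prompt: str, tasks: list[dict]) -> str:
    """Run one learning iteration and return the updated system prompt."""
    evaluations = []
    for task in tasks:
        patch = run_agent(system_prompt, task["problem"])        # 1. generate solution
        test_output = run_unit_tests(task["repo"], patch)        # 2. run unit tests
        judgment = call_llm(JUDGE_PROMPT.format(                 # 3-4. judge + explain
            problem=task["problem"], patch=patch, test_output=test_output))
        evaluations.append(judgment)

    new_rules = call_llm(META_PROMPT.format(                     # 5. meta prompt synthesis
        system_prompt=system_prompt,
        evaluations="\n\n".join(evaluations)))
    return system_prompt + "\n" + new_rules                      # 6. append new rules
```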
Reinforcement Learning vs System Prompt Learning
| Dimension | Reinforcement Learning | System Prompt Learning |
|---|---|---|
| Feedback signal | Scalar score only (70%, 80%, 90%) | English feedback plus a scalar score |
| Sample efficiency | Inefficient: needs thousands of examples | Efficient: 150 examples were sufficient |
Takeaway: For teams building agents on top of already-capable LLMs, system prompt learning offers a more practical path. It's faster, more sample-efficient, and doesn't require specialized ML expertise.
Practical Takeaways for AI Engineers
Invest in Evaluation Prompts
The quality of your LLM-as-a-judge prompts matters more than the underlying algorithm. Spend time crafting prompts that generate actionable, specific explanations.
Start Small, Iterate Fast
You don't need thousands of examples. Arize achieved significant improvements with just 150 examples from SWE-bench Lite.
Categorize Your Failures
Group errors into categories (parsing errors, library-specific issues, concept gaps) to generate more targeted improvements; see the grouping sketch after these takeaways.
Focus on English Feedback
Natural language explanations are more valuable than scalar rewards. Ask 'why did it fail?' not just 'did it pass?'
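For the failure-categorization takeaway above, here is a small hypothetical sketch of the grouping step. It assumes the judge's outputs have been parsed into dicts with `verdict`, `category`, and `explanation` keys, as in the earlier sketches.

```python
from collections import Counter, defaultdict

# Hypothetical sketch of the "categorize your failures" takeaway: group the
# judge's parsed outputs by failure category so rule generation can target
# the most common error classes first.

def group_failures(judgments: list[dict]) -> dict[str, list[str]]:
    """Map each failure category to the explanations filed under it."""
    grouped = defaultdict(list)
    for j in judgments:
        if j["verdict"] == "FAIL":
            grouped[j["category"]].append(j["explanation"])
    return dict(grouped)

def top_failure_categories(judgments: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent failure categories, e.g. 'parsing error'."""
    counts = Counter(j["category"] for j in judgments if j["verdict"] == "FAIL")
    return counts.most_common(n)
```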
Notable Quotes
"What's not so obvious is how much time is actually spent on the system prompts for those building these coding agents."
"What's not as obvious is these actually aren't just static. They are repeatedly iterated on."
"It almost feels like humans learning because they take back English feedback and use that to actually iterate on what they should do differently the next time."
"They have to figure out almost blindly just with that score how to actually improve their score on the next exam."
"But you also get back some kind of English feedback. Why did they get this answer right? What did they mess up on?"
"On vanilla Claude Sonnet 4.5 it was about 30% of the GitHub issues actually resolved."
"Writing really good evals is I think um how you get the best kind of insight into what you could do to improve your agents."
"On 150 examples we were able to get Cloud code up by 5% more GitHub issues resolved, client um you know 15%."
"Many of you guys in this room might be thinking, okay, well, prompt learning is cool, but how does that compare to GEA?"
"The key thing that was really different here was we spent a lot of time actually developing and iterating on the evals."
About the Speaker
Aparna Dhinakaran
Co-founder & CPO of Arize
Aparna Dhinakaran is the Co-founder and CPO of Arize, a leading AI observability and evaluation platform. She's a thought leader in AI evaluation and system prompt optimization, helping enterprises build reliable AI systems. Her work focuses on practical methodologies for improving LLM applications without requiring massive infrastructure or specialized ML teams.