7 Habits of Highly Effective GenAI Evaluations
Justin Muller, Principal Applied AI Architect at AWS, reveals the battle-tested 7 Habits framework that transformed document processing from 22% to 92% accuracy. Learn how to build evaluation systems that enable rapid iteration and production deployment.
"The number one thing that I see across all workloads is a lack of evaluations. And in particular, I call it the missing piece to scaling GenAI."Justin Muller, AWS (01:45-01:48)
22% → 92% document accuracy in 6 months
<30 seconds target eval runtime
100x ROI for successful projects
~25x faster iteration (hundreds of changes per day)
The Evaluation Crisis: Why Most AI Projects Fail
The #1 problem across all GenAI workloads is a lack of proper evaluations. Without evals, teams cannot identify problems, measure progress, or achieve production-ready accuracy.
"When a team comes back and says 'Oh gosh, two hours on eval sounds boring. Can you just give me the toys to play with?' I know right away that's a science project and it's not going to go anywhere."
— Justin Muller, Principal Applied AI Architect at AWS
How to identify projects that will fail vs succeed based on their attitude toward evaluations
Science Project vs. Production (06:10-06:17)
The attitude difference that predicts success
Science Projects: "Can you just give me the toys to play with?" → Never go anywhere
Successful Projects: "Why don't we spend four hours on eval?" → 100x ROI, 10,000% cost reduction
The Scale Unlock
Why evals are non-negotiable
"When we add evaluations, it unlocks the ability to scale. And we're going to look at how that happens, but it's by far the most common way to unlock scale."
— Justin Muller (02:05-02:09)
Case Study: From 22% to 92% Accuracy
A document processing customer struggled for 6-12 months with 6-8 engineers, achieving only 22% accuracy. The VP was considering cancellation. Adding an evaluation framework transformed everything.
"Once the evaluation framework was in place and you could see exactly where the problems were, fixing them were trivial, right? Fixing the problems wasn't really the challenge. It was knowing where the problems were and what was causing them."
- 22% initial accuracy: 6-12 months of effort with 6-8 engineers
- 92% final accuracy within 6 months, exceeding the 90% launch threshold
- 4.2x accuracy improvement: from failing to production-ready
The Key Insight
Diagnosis > Treatment
The system wasn't broken—the evaluation was lying. Once they could see exactly where problems occurred (segmented by document type), fixing them was trivial. This became the largest document processing workload on AWS in North America.
The 7 Habits Framework
Battle-tested patterns from AWS that separate successful GenAI projects from failed science experiments.
Habit 1: FAST - Speed Enables Iteration
"I've never seen a workload scale unless it's gone through many many iterations."
Target: 30 Seconds Total
Breakdown of a fast evaluation cycle (see the sketch below)
- 10s: Generation (100 test cases in parallel)
- 10s: Judging (100 parallel AI calls)
- 10s: Summarization by category
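A minimal sketch of that loop in Python, assuming an async client; the `generate` and `judge` functions below are placeholders that simulate latency, not a specific AWS API.

```python
import asyncio
import random
import statistics

# Placeholder model calls: swap these for your real generation / judge clients.
# They just simulate latency so the sketch runs end to end.
async def generate(case: dict) -> str:
    await asyncio.sleep(0.1)                      # stands in for a model call
    return f"answer for case {case['id']}"

async def judge(case: dict, output: str) -> dict:
    await asyncio.sleep(0.1)                      # stands in for an LLM-as-judge call
    return {"category": case["category"],
            "score": random.random(),             # judge score in [0, 1]
            "reasoning": "placeholder reasoning"}

async def run_eval(test_cases: list[dict]) -> dict:
    # ~10s budget: generate all outputs in parallel
    outputs = await asyncio.gather(*(generate(c) for c in test_cases))
    # ~10s budget: judge every (case, output) pair in parallel
    verdicts = await asyncio.gather(
        *(judge(c, o) for c, o in zip(test_cases, outputs)))
    # ~10s budget: summarize the average score per category
    by_category: dict[str, list[float]] = {}
    for v in verdicts:
        by_category.setdefault(v["category"], []).append(v["score"])
    return {cat: statistics.mean(s) for cat, s in by_category.items()}

if __name__ == "__main__":
    cases = [{"id": i, "category": f"doc_type_{i % 4}"} for i in range(100)]
    print(asyncio.run(run_eval(cases)))
```

Because every generation and judging call runs concurrently, 100 test cases cost roughly one model round-trip of wall-clock time, not 100.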
Slow vs. Fast Evals
The velocity multiplier
Slow Evals: 4-8 changes per month
Fast Evals: Hundreds of changes/tests per day
Result: ~25x faster iteration
Habit 2: QUANTIFIABLE - Produce Numbers, Not Opinions
"Even if there's a little bit of jitter in the score, if we have enough test cases and we average across those test cases that jitter goes out just like in grade school."
The Grade School Principle
Why 100+ test cases is the minimum
LLMs have inherent non-determinism (jitter). By running 100+ test cases and averaging the results, random variations cancel out—just like how your final grade in school was the average of all assignments, not a single quiz.
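A toy illustration of the principle: assume each judged test case returns the true quality plus random jitter, and watch the average settle as the number of cases grows (all numbers here are made up for illustration).

```python
import random
import statistics

random.seed(0)
TRUE_QUALITY = 0.80   # the "real" pass rate we are trying to measure
JITTER = 0.15         # non-determinism in any single judged run

def judged_score() -> float:
    # One test case's judge score: truth plus random jitter, clipped to [0, 1].
    return min(1.0, max(0.0, random.gauss(TRUE_QUALITY, JITTER)))

for n in (5, 25, 100, 500):
    mean = statistics.mean(judged_score() for _ in range(n))
    print(f"n={n:>3}  mean score = {mean:.3f}")
# With n=5 the average bounces around; with 100+ it sits close to 0.80.
```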
Habit 3: EXPLAINABLE - Evaluate Reasoning, Not Just Outputs
"Look at how the model is reasoning and in particularly I said reasoning for generation and scoring. Look at how your judge is reasoning as well."
— Justin Muller
The critical importance of evaluating the reasoning process, not just the final output
The Happy Path Problem (20:09-21:20)
When correct outputs have wrong reasoning
Weather Company Example:
Input: "It's raining and 40° and windy"
Output: "Today it's sunny and bright"
Reasoning: "It's important to mental health to be happy, so I decided not to talk about the rain"
The model gave the wrong output, and worse, its reasoning was deceptive. Without evaluating reasoning, you would never catch this.
Debugging Acceleration
Why reasoning evaluation matters
"Now that we've looked behind the scenes and we've seen kind of what the model how the model got there, we suddenly have a lot more insight into how to fix the problem."
— Justin Muller (09:22-09:27)
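One way to operationalize this is to have the judge grade the reasoning separately from the output and flag mismatches. The sketch below is an assumption-laden illustration: `call_judge` is a placeholder stub, and the prompt and score fields are hypothetical, not a prescribed AWS format.

```python
import json

# A judge prompt that scores the *reasoning*, not just the final answer.
JUDGE_PROMPT = """You are grading an AI system's response.

Input: {input}
Expected behavior: {expected}
Model output: {output}
Model's stated reasoning: {reasoning}

Return JSON with three fields:
  "output_score": 0 to 1, does the output meet the expected behavior?
  "reasoning_score": 0 to 1, is the stated reasoning faithful and sound?
  "judge_reasoning": one or two sentences explaining both scores.
"""

def call_judge(prompt: str) -> str:
    # Placeholder: swap in your actual judge model call. Returns a canned
    # verdict so the sketch runs without a model behind it.
    return ('{"output_score": 0.9, "reasoning_score": 0.2, '
            '"judge_reasoning": "Output looks fine but the reasoning does not '
            'follow from the input."}')

def grade(case: dict, output: str, reasoning: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(
        input=case["input"], expected=case["expected"],
        output=output, reasoning=reasoning))
    verdict = json.loads(raw)
    # Flag the "happy path" failure mode: plausible output, unfaithful reasoning.
    verdict["deceptive"] = (verdict["output_score"] >= 0.8
                            and verdict["reasoning_score"] < 0.5)
    return verdict
```

Keeping the judge's own reasoning in the verdict also gives you something to debug when the judge itself misgrades a case.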
Habit 4: SEGMENTED - Evaluate Each Step Individually
"In practice, almost all scaled workloads are multiple steps, right? There's very very few workloads I've ever seen that are a single prompt."
Prompt Decomposition Benefits
Why multi-step workflows matter
- Find specific problem areas: know exactly which step fails
- Choose the right model for each step: Nova Micro for routing, larger models for reasoning
- Remove dead tokens: eliminate unnecessary instructions (cost optimization)
- Attach evals to each segment: granular visibility (see the pipeline sketch below)
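As a rough illustration of what that structure can look like (step, model, and eval names below are placeholders, not recommendations), each step carries its own model choice and its own eval so a regression surfaces at the step that caused it.

```python
# Illustrative pipeline definition: one model choice and one eval per segment.
PIPELINE = [
    {"step": "route",   "model": "small-fast-model", "eval": "routing_accuracy"},
    {"step": "extract", "model": "mid-size-model",   "eval": "field_level_f1"},
    {"step": "reason",  "model": "large-model",      "eval": "llm_judge_with_reasoning"},
    {"step": "compare", "model": None,               "eval": "exact_numeric_match"},  # plain Python, no LLM
]

for stage in PIPELINE:
    print(f"{stage['step']:<8} -> model={stage['model']}, eval={stage['eval']}")
```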
Case Study: Wind Speed Math Bug
Problem: 2-3% of the time, the LLM said "seven is less than five so it's not windy" because it compared numbers in natural language.
Solution: Prompt decomposition—used Python for mathematical comparison instead of LLM.
Result: 97% → 100% accuracy
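A minimal sketch of that fix, assuming an upstream LLM step has already extracted the wind speed into a structured field; the threshold and field names are illustrative, not from the talk.

```python
WINDY_THRESHOLD_MPH = 15  # illustrative threshold

def is_windy(wind_speed_mph: float) -> bool:
    # Deterministic numeric comparison in Python: the LLM never sees
    # "is 7 less than 5?" as a natural-language question.
    return wind_speed_mph >= WINDY_THRESHOLD_MPH

def describe_wind(extracted: dict) -> str:
    # `extracted` comes from the earlier LLM step that parses the raw forecast
    # into structured fields, e.g. {"wind_speed_mph": 7.0}.
    speed = float(extracted["wind_speed_mph"])
    return "windy" if is_windy(speed) else "not windy"

print(describe_wind({"wind_speed_mph": 7.0}))   # not windy
print(describe_wind({"wind_speed_mph": 22.0}))  # windy
```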
Habit 5: DIVERSE - Cover All Use Cases
"The exercise of building 100 test cases is very valuable for the team to figure out what the scope of the project is."
Test Case Composition
The 100-test-case rule of thumb
- 100 test cases: core use cases (the minimum for statistical significance)
- 3-4 examples: edge cases and corner cases
- Include negatives: examples of what NOT to answer (defines boundaries)
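One possible shape for such a test set as data; the fields and example values below are hypothetical, but the core/edge/negative split mirrors the composition above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TestCase:
    id: str
    category: str   # which use case / document type this exercises
    kind: str       # "core", "edge", or "negative"
    input: str
    expected: str   # expected behavior; for negatives, the expected refusal

# Hypothetical examples; a real set would have 100+ core cases.
TEST_CASES = [
    TestCase("inv-001", "invoices", "core",
             "Extract the total from invoice ...", "total=1042.50"),
    TestCase("inv-edge-01", "invoices", "edge",
             "Invoice with two currencies and a handwritten correction ...",
             "total=998.00 EUR"),
    TestCase("oos-001", "out_of_scope", "negative",
             "What medication should I take?",
             "Decline: outside the document-processing scope"),
]

# Sanity-check the composition before running the eval.
print(Counter(tc.kind for tc in TEST_CASES))
```

Writing these out by hand is the point: the exercise forces the team to agree on scope, including what the system must refuse to do.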
Habit 6: TRADITIONAL - Don't Abandon Traditional ML
"There are a lot of traditional techniques that are very very powerful... traditional tooling is still very very powerful and very important in the context of GenAI evaluations."
Right Tool for the Job
When to use traditional ML vs. LLMs
- Numeric outputs: use Python/math for comparison (not an LLM)
- RAG architecture: use retrieval precision and F1 scores
- Cost & latency: traditional monitoring tools
- Don't force LLMs: if a better traditional tool exists, use it
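For example, a retrieval step can be scored with classic IR metrics in plain Python, with no LLM in the loop; the chunk IDs below are illustrative.

```python
def retrieval_scores(retrieved: list[str], relevant: set[str]) -> dict:
    # Classic IR metrics: no LLM needed to score a retrieval step.
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# The retriever returned 4 chunks; 2 of the 3 labeled-relevant chunks are among them.
print(retrieval_scores(["c1", "c7", "c9", "c2"], {"c1", "c2", "c5"}))
# {'precision': 0.5, 'recall': 0.666..., 'f1': 0.571...}
```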
Habit 7: NUMEROUS - Statistical Significance
Never Enough Tests
Why more test cases always beat fewer
Just like in grade school, your final score is the average of all assignments. With more test cases:
- LLM non-determinism (jitter) averages out
- Edge cases get proper coverage
- Confidence in scores increases
- Minimum: 100 test cases; ideal: 500+
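A back-of-the-envelope way to see why, assuming a per-case score spread of about 0.3: the uncertainty in the averaged score shrinks roughly with the square root of the number of test cases.

```python
import math

# Rule of thumb: uncertainty in an averaged eval score shrinks like 1/sqrt(n).
PER_CASE_SD = 0.3  # assumed spread of individual judged scores

for n in (30, 100, 500, 1000):
    std_err = PER_CASE_SD / math.sqrt(n)
    # ~95% confidence interval half-width around the mean score
    print(f"n={n:>4}  score is roughly +/- {1.96 * std_err:.3f}")
# At n=100 you can trust differences of ~6 points; at n=500, ~2-3 points.
```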
⚠️ CRITICAL WARNING: Gold Standard Sets
"This is a terrible place to use GenAI. If you use GenAI to create your gold standard set... you've built a system that generates the same errors that the GenAI system has."
Best Practice: Use a "silver standard" instead: GenAI generates initial labels, and a human reviews and corrects them. Never trust GenAI alone to create ground truth.
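A sketch of what that workflow can look like in code (field names are hypothetical): model-drafted labels stay quarantined until a human has reviewed them, and the eval only ever reads verified examples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledExample:
    input: str
    draft_label: str                      # model-generated: a starting point, never ground truth
    reviewed_label: Optional[str] = None  # filled in by a human reviewer
    human_verified: bool = False

def review(example: LabeledExample, corrected_label: str) -> LabeledExample:
    # Only a human review promotes a draft into the reference set.
    example.reviewed_label = corrected_label
    example.human_verified = True
    return example

def eval_ready(examples: list[LabeledExample]) -> list[LabeledExample]:
    # The eval only ever sees human-verified examples.
    return [e for e in examples if e.human_verified]
```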
Key Takeaways
1. Evals Are Product Features
Not Engineering Chores
- Evaluations determine what your system can reliably do
- They are the foundation of user trust
- Without evals, you cannot ship confidently
- Invest in evals first, optimize prompts second
2. Speed Enables Iteration
The 30-Second Rule
- Target: <30 seconds for the full eval suite
- Fast evals: hundreds of changes/tests per day
- Slow evals: 4-8 changes per month
- 25x faster iteration = faster time to production
3. Evaluate Reasoning
Not Just Outputs
- Happy path problem: good output, bad reasoning
- Ask models to explain their thinking
- Debug the reasoning process, not just the answers
- Weather company example: deceptive reasoning caught
4. Segment Everything
Prompt Decomposition
- Almost all production workloads are multi-step
- Attach evals to each segment
- Choose the right tool for each step (Python vs. LLM)
- Remove dead tokens for cost optimization
5. Use Traditional Tools
Right Tool for the Job
- Numeric evaluation: use Python, not an LLM
- RAG: use retrieval precision and F1 scores
- Don't abandon 50 years of software engineering wisdom
- LLMs are powerful but not universal
6. Never Enough Tests
Statistical Significance
- Minimum: 100 test cases
- Ideal: 500+ test cases
- Include edge cases and negatives
- More tests = more confidence going into production
7. Science Project Filter
Team Attitude Matters
- •"Can you just give me the toys?" → Science project
- •"Why don't we spend four hours on eval?" → Success
- •100x ROI, 10,000% cost reduction possible
- •Evals are the missing piece to scaling GenAI
8. Diagnosis > Treatment
The Core Insight
- Document processing: 22% → 92% accuracy
- Once problems were visible, fixing them was trivial
- The eval was lying, not the system
- Segmentation reveals what averaging hides
Source Video
7 Habits of Highly Effective Generative AI Evaluations
Justin Muller • Principal Applied AI Architect at AWS
Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers AWS's 7 Habits framework for GenAI evaluations, including real-world case studies from document processing (22% → 92% accuracy) and weather company implementations.
Key Concepts: GenAI evaluations, LLM-as-judge, prompt decomposition, semantic routing, gold standard sets, RAG evaluation, F1 scores, production AI, evaluation speed, statistical significance, reasoning evaluation