7 Habits of Highly Effective GenAI Evaluations
Justin Muller, Principal Applied AI Architect at AWS, reveals the battle-tested 7 Habits framework that transformed document processing from 22% to 92% accuracy. Learn how to build evaluation systems that enable rapid iteration and production deployment.
"The number one thing that I see across all workloads is a lack of evaluations. And in particular, I call it the missing piece to scaling GenAI."Justin Muller, AWS (01:45-01:48)
22% → 92% document accuracy in 6 months
<30 seconds target eval runtime
100x ROI for successful projects
~25x faster iteration (hundreds of changes per day)
The Evaluation Crisis: Why Most AI Projects Fail
The #1 problem across all GenAI workloads is a lack of proper evaluations. Without evals, teams cannot identify problems, measure progress, or achieve production-ready accuracy.
"When a team comes back and says 'Oh gosh, two hours on eval sounds boring. Can you just give me the toys to play with?' I know right away that's a science project and it's not going to go anywhere."
— Justin Muller, Principal Applied AI Architect at AWS
How to identify projects that will fail vs succeed based on their attitude toward evaluations
Science Project vs. Production (06:10-06:17)
The attitude difference that predicts success
Science Projects: "Can you just give me the toys to play with?" → Never go anywhere
Successful Projects: "Why don't we spend four hours on eval?" → 100x ROI, 10,000% cost reduction
The Scale Unlock
Why evals are non-negotiable
"When we add evaluations, it unlocks the ability to scale. And we're going to look at how that happens, but it's by far the most common way to unlock scale."
— Justin Muller (02:05-02:09)
Case Study: From 22% to 92% Accuracy
A document processing customer struggled for 6-12 months with 6-8 engineers, achieving only 22% accuracy. The VP was considering cancellation. Adding an evaluation framework transformed everything.
"Once the evaluation framework was in place and you could see exactly where the problems were, fixing them were trivial, right? Fixing the problems wasn't really the challenge. It was knowing where the problems were and what was causing them."
- 22% initial accuracy: 6-12 months of effort with 6-8 engineers
- 92% final accuracy within 6 months, exceeding the 90% launch threshold
- 4.2x accuracy improvement: from failing to production-ready
The Key Insight
Diagnosis > Treatment
The system wasn't broken—the evaluation was lying. Once they could see exactly where problems occurred (segmented by document type), fixing them was trivial. This became the largest document processing workload on AWS in North America.
The 7 Habits Framework
Battle-tested patterns from AWS that separate successful GenAI projects from failed science experiments.
Habit 1: FAST - Speed Enables Iteration
"I've never seen a workload scale unless it's gone through many many iterations."
Target: 30 Seconds Total
Breakdown of a fast evaluation cycle (see the sketch below)
- 10s: Generation (100 test cases in parallel)
- 10s: Judging (100 parallel AI calls)
- 10s: Summarization by category
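A minimal sketch of that loop in Python, assuming an async client; the `generate` and `judge` functions below are placeholders that simulate latency, not a specific AWS API.

```python
import asyncio
import random
import statistics

# Placeholder model calls: swap these for your real generation / judge clients.
# They just simulate latency so the sketch runs end to end.
async def generate(case: dict) -> str:
    await asyncio.sleep(0.1)                      # stands in for a model call
    return f"answer for case {case['id']}"

async def judge(case: dict, output: str) -> dict:
    await asyncio.sleep(0.1)                      # stands in for an LLM-as-judge call
    return {"category": case["category"],
            "score": random.random(),             # judge score in [0, 1]
            "reasoning": "placeholder reasoning"}

async def run_eval(test_cases: list[dict]) -> dict:
    # ~10s budget: generate all outputs in parallel
    outputs = await asyncio.gather(*(generate(c) for c in test_cases))
    # ~10s budget: judge every (case, output) pair in parallel
    verdicts = await asyncio.gather(
        *(judge(c, o) for c, o in zip(test_cases, outputs)))
    # ~10s budget: summarize the average score per category
    by_category: dict[str, list[float]] = {}
    for v in verdicts:
        by_category.setdefault(v["category"], []).append(v["score"])
    return {cat: statistics.mean(s) for cat, s in by_category.items()}

if __name__ == "__main__":
    cases = [{"id": i, "category": f"doc_type_{i % 4}"} for i in range(100)]
    print(asyncio.run(run_eval(cases)))
```

Because every generation and judging call runs concurrently, 100 test cases cost roughly one model round-trip of wall-clock time, not 100.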
Slow vs. Fast Evals
The velocity multiplier
Slow Evals: 4-8 changes per month
Fast Evals: Hundreds of changes/tests per day
Result: ~25x faster iteration
Habit 2: QUANTIFIABLE - Produce Numbers, Not Opinions
"Even if there's a little bit of jitter in the score, if we have enough test cases and we average across those test cases that jitter goes out just like in grade school."
The Grade School Principle
Why 100+ test cases is the minimum
LLMs have inherent non-determinism (jitter). By running 100+ test cases and averaging the results, random variations cancel out—just like how your final grade in school was the average of all assignments, not a single quiz.
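A toy illustration of the principle: assume each judged test case returns the true quality plus random jitter, and watch the average settle as the number of cases grows (all numbers here are made up for illustration).

```python
import random
import statistics

random.seed(0)
TRUE_QUALITY = 0.80   # the "real" pass rate we are trying to measure
JITTER = 0.15         # non-determinism in any single judged run

def judged_score() -> float:
    # One test case's judge score: truth plus random jitter, clipped to [0, 1].
    return min(1.0, max(0.0, random.gauss(TRUE_QUALITY, JITTER)))

for n in (5, 25, 100, 500):
    mean = statistics.mean(judged_score() for _ in range(n))
    print(f"n={n:>3}  mean score = {mean:.3f}")
# With n=5 the average bounces around; with 100+ it sits close to 0.80.
```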
Habit 3: EXPLAINABLE - Evaluate Reasoning, Not Just Outputs
"Look at how the model is reasoning and in particularly I said reasoning for generation and scoring. Look at how your judge is reasoning as well."
— Justin Muller
The critical importance of evaluating the reasoning process, not just the final output
The Happy Path Problem (20:09-21:20)
When correct outputs have wrong reasoning
Weather Company Example:
Input: "It's raining and 40° and windy"
Output: "Today it's sunny and bright"
Reasoning: "It's important to mental health to be happy, so I decided not to talk about the rain"
The model gave the wrong output, and worse, its reasoning was deceptive. Without evaluating reasoning, you would never catch this.
Debugging Acceleration
Why reasoning evaluation matters
"Now that we've looked behind the scenes and we've seen kind of what the model how the model got there, we suddenly have a lot more insight into how to fix the problem."
— Justin Muller (09:22-09:27)
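One way to operationalize this is to have the judge grade the reasoning separately from the output and flag mismatches. The sketch below is an assumption-laden illustration: `call_judge` is a placeholder stub, and the prompt and score fields are hypothetical, not a prescribed AWS format.

```python
import json

# A judge prompt that scores the *reasoning*, not just the final answer.
JUDGE_PROMPT = """You are grading an AI system's response.

Input: {input}
Expected behavior: {expected}
Model output: {output}
Model's stated reasoning: {reasoning}

Return JSON with three fields:
  "output_score": 0 to 1, does the output meet the expected behavior?
  "reasoning_score": 0 to 1, is the stated reasoning faithful and sound?
  "judge_reasoning": one or two sentences explaining both scores.
"""

def call_judge(prompt: str) -> str:
    # Placeholder: swap in your actual judge model call. Returns a canned
    # verdict so the sketch runs without a model behind it.
    return ('{"output_score": 0.9, "reasoning_score": 0.2, '
            '"judge_reasoning": "Output looks fine but the reasoning does not '
            'follow from the input."}')

def grade(case: dict, output: str, reasoning: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(
        input=case["input"], expected=case["expected"],
        output=output, reasoning=reasoning))
    verdict = json.loads(raw)
    # Flag the "happy path" failure mode: plausible output, unfaithful reasoning.
    verdict["deceptive"] = (verdict["output_score"] >= 0.8
                            and verdict["reasoning_score"] < 0.5)
    return verdict
```

Keeping the judge's own reasoning in the verdict also gives you something to debug when the judge itself misgrades a case.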
Habit 4: SEGMENTED - Evaluate Each Step Individually
"In practice, almost all scaled workloads are multiple steps, right? There's very very few workloads I've ever seen that are a single prompt."
Prompt Decomposition Benefits
Why multi-step workflows matter
- Find specific problem areas: know exactly which step fails
- Choose the right model for each step: Nova Micro for routing, larger models for reasoning
- Remove dead tokens: eliminate unnecessary instructions (cost optimization)
- Attach evals to each segment: granular visibility (see the pipeline sketch below)
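As a rough illustration of what that structure can look like (step, model, and eval names below are placeholders, not recommendations), each step carries its own model choice and its own eval so a regression surfaces at the step that caused it.

```python
# Illustrative pipeline definition: one model choice and one eval per segment.
PIPELINE = [
    {"step": "route",   "model": "small-fast-model", "eval": "routing_accuracy"},
    {"step": "extract", "model": "mid-size-model",   "eval": "field_level_f1"},
    {"step": "reason",  "model": "large-model",      "eval": "llm_judge_with_reasoning"},
    {"step": "compare", "model": None,               "eval": "exact_numeric_match"},  # plain Python, no LLM
]

for stage in PIPELINE:
    print(f"{stage['step']:<8} -> model={stage['model']}, eval={stage['eval']}")
```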
Case Study: Wind Speed Math Bug
Problem: 2-3% of the time, the LLM said "seven is less than five so it's not windy" because it compared numbers in natural language.
Solution: Prompt decomposition—used Python for mathematical comparison instead of LLM.
Result: 97% → 100% accuracy
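A minimal sketch of that fix, assuming an upstream LLM step has already extracted the wind speed into a structured field; the threshold and field names are illustrative, not from the talk.

```python
WINDY_THRESHOLD_MPH = 15  # illustrative threshold

def is_windy(wind_speed_mph: float) -> bool:
    # Deterministic numeric comparison in Python: the LLM never sees
    # "is 7 less than 5?" as a natural-language question.
    return wind_speed_mph >= WINDY_THRESHOLD_MPH

def describe_wind(extracted: dict) -> str:
    # `extracted` comes from the earlier LLM step that parses the raw forecast
    # into structured fields, e.g. {"wind_speed_mph": 7.0}.
    speed = float(extracted["wind_speed_mph"])
    return "windy" if is_windy(speed) else "not windy"

print(describe_wind({"wind_speed_mph": 7.0}))   # not windy
print(describe_wind({"wind_speed_mph": 22.0}))  # windy
```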
Habit 5: DIVERSE - Cover All Use Cases
"The exercise of building 100 test cases is very valuable for the team to figure out what the scope of the project is."
Test Case Composition
The 100-test-case rule of thumb
- 100 test cases: core use cases (the minimum for statistical significance)
- 3-4 examples: edge cases and corner cases
- Include negatives: examples of what NOT to answer (defines boundaries)
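One possible shape for such a test set as data; the fields and example values below are hypothetical, but the core/edge/negative split mirrors the composition above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TestCase:
    id: str
    category: str   # which use case / document type this exercises
    kind: str       # "core", "edge", or "negative"
    input: str
    expected: str   # expected behavior; for negatives, the expected refusal

# Hypothetical examples; a real set would have 100+ core cases.
TEST_CASES = [
    TestCase("inv-001", "invoices", "core",
             "Extract the total from invoice ...", "total=1042.50"),
    TestCase("inv-edge-01", "invoices", "edge",
             "Invoice with two currencies and a handwritten correction ...",
             "total=998.00 EUR"),
    TestCase("oos-001", "out_of_scope", "negative",
             "What medication should I take?",
             "Decline: outside the document-processing scope"),
]

# Sanity-check the composition before running the eval.
print(Counter(tc.kind for tc in TEST_CASES))
```

Writing these out by hand is the point: the exercise forces the team to agree on scope, including what the system must refuse to do.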
Habit 6: TRADITIONAL - Don't Abandon Traditional ML
"There are a lot of traditional techniques that are very very powerful... traditional tooling is still very very powerful and very important in the context of GenAI evaluations."
Right Tool for the Job
When to use traditional ML vs. LLMs
- Numeric outputs: use Python/math for comparison (not an LLM)
- RAG architecture: use retrieval precision and F1 scores
- Cost & latency: traditional monitoring tools
- Don't force LLMs: if a better traditional tool exists, use it
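For example, a retrieval step can be scored with classic IR metrics in plain Python, with no LLM in the loop; the chunk IDs below are illustrative.

```python
def retrieval_scores(retrieved: list[str], relevant: set[str]) -> dict:
    # Classic IR metrics: no LLM needed to score a retrieval step.
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# The retriever returned 4 chunks; 2 of the 3 labeled-relevant chunks are among them.
print(retrieval_scores(["c1", "c7", "c9", "c2"], {"c1", "c2", "c5"}))
# {'precision': 0.5, 'recall': 0.666..., 'f1': 0.571...}
```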
Habit 7: NUMEROUS - Statistical Significance
Never Enough Tests
Why more test cases always beat fewer
Just like in grade school, your final score is the average of all assignments. With more test cases:
- LLM non-determinism (jitter) averages out
- Edge cases get proper coverage
- Confidence in scores increases
- Minimum: 100 test cases; ideal: 500+
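A back-of-the-envelope way to see why, assuming a per-case score spread of about 0.3: the uncertainty in the averaged score shrinks roughly with the square root of the number of test cases.

```python
import math

# Rule of thumb: uncertainty in an averaged eval score shrinks like 1/sqrt(n).
PER_CASE_SD = 0.3  # assumed spread of individual judged scores

for n in (30, 100, 500, 1000):
    std_err = PER_CASE_SD / math.sqrt(n)
    # ~95% confidence interval half-width around the mean score
    print(f"n={n:>4}  score is roughly +/- {1.96 * std_err:.3f}")
# At n=100 you can trust differences of ~6 points; at n=500, ~2-3 points.
```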
⚠️ CRITICAL WARNING: Gold Standard Sets
"This is a terrible place to use GenAI. If you use GenAI to create your gold standard set... you've built a system that generates the same errors that the GenAI system has."
Best Practice: Use a "silver standard" instead: GenAI generates initial labels, and a human reviews and corrects them. Never trust GenAI alone to create ground truth.
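A sketch of what that workflow can look like in code (field names are hypothetical): model-drafted labels stay quarantined until a human has reviewed them, and the eval only ever reads verified examples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledExample:
    input: str
    draft_label: str                      # model-generated: a starting point, never ground truth
    reviewed_label: Optional[str] = None  # filled in by a human reviewer
    human_verified: bool = False

def review(example: LabeledExample, corrected_label: str) -> LabeledExample:
    # Only a human review promotes a draft into the reference set.
    example.reviewed_label = corrected_label
    example.human_verified = True
    return example

def eval_ready(examples: list[LabeledExample]) -> list[LabeledExample]:
    # The eval only ever sees human-verified examples.
    return [e for e in examples if e.human_verified]
```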
Key Takeaways
1. Evals Are Product Features
Not Engineering Chores
- Evaluations determine what your system can reliably do
- They are the foundation of user trust
- Without evals, you cannot ship confidently
- Invest in evals first, optimize prompts second
2. Speed Enables Iteration
The 30-Second Rule
- Target: <30 seconds for the full eval suite
- Fast evals: hundreds of changes/tests per day
- Slow evals: 4-8 changes per month
- 25x faster iteration = faster time to production
3. Evaluate Reasoning
Not Just Outputs
- Happy path problem: good output, bad reasoning
- Ask models to explain their thinking
- Debug the reasoning process, not just the answers
- Weather company example: deceptive reasoning caught
4. Segment Everything
Prompt Decomposition
- Almost all production workloads are multi-step
- Attach evals to each segment
- Choose the right tool for each step (Python vs. LLM)
- Remove dead tokens for cost optimization
5. Use Traditional Tools
Right Tool for the Job
- Numeric evaluation: use Python, not an LLM
- RAG: use retrieval precision and F1 scores
- Don't abandon 50 years of software engineering wisdom
- LLMs are powerful but not universal
6. Never Enough Tests
Statistical Significance
- Minimum: 100 test cases
- Ideal: 500+ test cases
- Include edge cases and negatives
- More tests = more confidence going into production
7. Science Project Filter
Team Attitude Matters
- •"Can you just give me the toys?" → Science project
- •"Why don't we spend four hours on eval?" → Success
- •100x ROI, 10,000% cost reduction possible
- •Evals are the missing piece to scaling GenAI
8. Diagnosis > Treatment
The Core Insight
- Document processing: 22% → 92% accuracy
- Once problems were visible, fixing them was trivial
- The eval was lying, not the system
- Segmentation reveals what averaging hides
Source Video
7 Habits of Highly Effective Generative AI Evaluations
Justin Muller • Principal Applied AI Architect at AWS
Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers AWS's 7 Habits framework for GenAI evaluations, including real-world case studies from document processing (22% → 92% accuracy) and weather company implementations.
Key Concepts: GenAI evaluations, LLM-as-judge, prompt decomposition, semantic routing, gold standard sets, RAG evaluation, F1 scores, production AI, evaluation speed, statistical significance, reasoning evaluation