
Shipping AI That Works: An Evaluation Framework for PMs

Aman Khan of Arize presents a practical framework for evaluating AI systems: why even OpenAI and Anthropic product leaders publicly admit their models hallucinate, the four components of the LLM-as-judge methodology, and how to move from vibe coding to thrive coding in production.

Even Model Vendors Admit Hallucination

"Both of the product leaders of those companies are telling you that their models hallucinate and that it's really important to write eval."

— Aman Khan, referencing Kevin Weil (OpenAI CPO) and Mike Krieger (Anthropic CPO) from Lenny's conference

Watch key moment (6:40)
Speaker: Aman Khan · Event: AI Engineer Conference · Duration: 86 minutes

The Core Problem: Models Hallucinate

The conversation about AI reliability isn't speculation — it's coming directly from the companies building these models. At Lenny's conference, both OpenAI CPO Kevin Weil and Anthropic CPO Mike Krieger emphasized the importance of evaluation frameworks because their models hallucinate.

"Both of the product leaders of those companies are telling you that their models hallucinate and that it's really important to write eval."

Aman Khan, referencing OpenAI CPO Kevin Weil and Anthropic CPO Mike Krieger (06:44)

400

The Fundamental Reality

This isn't a technical limitation you can engineer around — it's inherent to how LLMs work. When the people SELLING you the product say "it's not reliable, you need eval frameworks," you must listen. The question shifts from "will it work?" to "how do we EVALUATE if it works?"

Software Testing vs AI Evaluation

The biggest mistake teams make is applying traditional software testing methods to AI systems. The fundamental difference: software is deterministic, AI is nondeterministic.

"Software is deterministic. You know, 1 plus 1 equals 2. LLM agents are nondeterministic."

This is the fundamental distinction that explains why unit tests don't work for AI systems: traditional software produces consistent outputs, while AI agents can take multiple valid paths to complete tasks (the comparison and code sketch below make the contrast concrete).

Watch (08:08)

Traditional Software Testing

  • ✓ Deterministic outputs (1+1=2 always)
  • ✓ Unit tests verify exact results
  • ✓ Integration tests check connections
  • ✓ Pass/fail binary evaluation
  • ✓ Relies on existing codebase

AI Evaluation

  • ✗ Nondeterministic (multiple valid outputs)
  • ✗ Agents can take different paths
  • ✗ Quality spectrum, not pass/fail
  • ✗ Relies on YOUR proprietary data
  • ✗ Evaluates outcome quality, not execution
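
A minimal code sketch of that contrast, under the assumption that `summarize` stands in for your agent and `judge_helpfulness` for an LLM-as-judge call that returns a text label:

```python
# Deterministic software: one input, one correct output, an exact assertion.
def add(a: int, b: int) -> int:
    return a + b

def test_add() -> None:
    assert add(1, 1) == 2  # always true; pass/fail is binary


# Nondeterministic AI: many outputs can be valid, so we score outcome quality
# across repeated runs instead of asserting a single exact result.
def helpfulness_rate(summarize, judge_helpfulness, prompt: str, n_runs: int = 5) -> float:
    """Run the (hypothetical) agent several times and return the fraction judged helpful."""
    labels = [judge_helpfulness(summarize(prompt)) for _ in range(n_runs)]
    return labels.count("helpful") / n_runs  # a quality rate, not a binary pass/fail
```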

LLM-as-Judge: The 4-Component Framework

The production-ready approach to evaluating AI systems at scale. This framework has four critical components that must be present in every evaluation prompt; a prompt sketch combining all four follows the components below.

1. Role
Tell the agent what it is: 'You are an expert code reviewer' or 'You are a helpful assistant evaluating customer service responses.' Sets the context and expertise level.

2. Task
What you want it to accomplish: 'Evaluate this code for bugs' or 'Assess whether this response is helpful and accurate.' Clear, specific instructions.

3. Context
Information provided in curly braces { }. Includes examples of good vs bad, relevant documentation, business rules, or reference data. The more context, the better the evaluation.

4. Goal
What to evaluate. Use text labels, not numbers: 'Toxic vs not toxic,' 'Helpful vs not helpful,' or 'Accurate vs inaccurate.' NEVER use 1-10 scales.
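
Putting the four components together, here is a minimal prompt sketch in Python. The wording, the bracketed section markers, and the "helpful" / "not_helpful" label set are illustrative assumptions, not Arize's or Aman Khan's exact template.

```python
# Minimal LLM-as-judge prompt sketch. The bracketed headers only mark the four
# components (Role, Task, Context, Goal); the wording and labels are illustrative.
JUDGE_PROMPT_TEMPLATE = """\
[Role] You are an expert reviewer of customer-support responses.

[Task] Evaluate whether the assistant's response below is helpful and accurate
for the customer's question.

[Context]
Question: {question}
Response: {response}
Reference documentation: {reference_docs}
Example of a good response: {good_example}

[Goal] Answer with exactly one text label: "helpful" or "not_helpful".
Do not use numerical scores.
"""

def build_judge_prompt(question: str, response: str,
                       reference_docs: str, good_example: str) -> str:
    """Fill the curly-brace context slots with your own data."""
    return JUDGE_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        reference_docs=reference_docs,
        good_example=good_example,
    )
```

The filled-in string is then sent to whichever model acts as the judge; parsing its text label is sketched in the next section.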

Critical Best Practice: Text Labels, Not Numbers

Quote: "Even though we have like PhD level LLMs, they're still really bad at numbers." — Aman Khan (10:48)

Technical reason: LLMs represent text, including numbers, as sub-word tokens rather than quantities, so fine-grained numerical scores from a judge tend to be unreliable and inconsistent across runs.

Solution: Always use text labels: "good/bad/toxic/not toxic." Map to numerical scores ONLY after text evaluation, if absolutely needed for your system.

Watch (10:48)
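
A small sketch of that practice, assuming the judge answers with "helpful" or "not_helpful": keep the evaluation itself in text, and map to a number only afterwards if a downstream dashboard needs one.

```python
from typing import Optional

# Example label set; adapt it to your eval ("toxic"/"not toxic", "good"/"bad", etc.).
LABEL_TO_SCORE = {"helpful": 1.0, "not_helpful": 0.0}

def parse_judge_label(raw_output: str) -> str:
    """Normalize the judge's raw text into one of the expected labels."""
    text = raw_output.strip().lower()
    if "not_helpful" in text or "not helpful" in text:
        return "not_helpful"
    if "helpful" in text:
        return "helpful"
    return "unparseable"  # surface bad judge outputs instead of guessing a score

def label_to_score(label: str) -> Optional[float]:
    """Optional numeric mapping, applied only after the text evaluation."""
    return LABEL_TO_SCORE.get(label)
```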

Vibe Coding → Thrive Coding: The Paradigm Shift

There's a place for rapid prototyping with AI tools. But shipping to production requires a fundamentally different approach — one based on data and confidence, not intuition.

Vibe Coding (Prototyping)

Fast prototyping based on intuition. Use tools like Bolt and Lovable to iterate quickly. Great for hackathons, MVPs, and learning.

Intuition-based development
Ship it and see if it works
Low confidence, high risk for production

Thrive Coding (Production)

Data-driven development with systematic evaluation. Build eval frameworks to measure confidence before deploying.

Data-driven confidence
Measure before shipping
High confidence, managed risk

"Thrive coding in my mind is really using data to basically do the same thing as vibe coding, like still build your application, but you'll be able to use data to be more confident in the output."

The key difference: Both approaches build applications fast. But thrive coding uses evaluation data to ensure production readiness.

Watch (12:50)

Production Warning

"The problem is you can't really do that in a production environment right." — Aman Khan (12:19). Vibe coding has its place for hacks and prototypes, but NEVER ship to production without evaluation frameworks.

Top Insights

The most valuable, actionable insights from Aman Khan's 86-minute talk on shipping AI that works.

1. Even Model Vendors Say Their Models Hallucinate

Rating: 10/10 - Fundamental truth that changes everything

Kevin Weil (OpenAI CPO) and Mike Krieger (Anthropic CPO) both publicly emphasize that their models hallucinate and evaluation frameworks are critical. When the people selling you the product say it's unreliable, build eval frameworks BEFORE deploying to production.

Watch (06:44)

2. Software Testing ≠ AI Evaluation

Rating: 10/10 - Foundational concept many teams miss

Software is deterministic (1+1=2). LLM agents are nondeterministic and can take multiple valid paths to complete tasks. Stop trying to apply unit test logic to AI systems. Build eval frameworks that handle nondeterminism and test outcome quality, not execution paths.

Watch (08:08)

3. LLMs Are Terrible at Numbers — Here's Why

Rating: 9/10 - Counter-intuitive but critical technical detail

Even PhD-level LLMs struggle with numerical scoring due to token representation. Sub-word tokenization means numbers get split unpredictably. NEVER use 1-10 scales in evals. Use text labels ("good", "bad", "toxic", "not toxic") instead. Only map to numbers AFTER text evaluation if absolutely needed.

Watch (10:48)

4. Agents Should 'Hallucinate in the Right Way'

Rating: 8/10 - Nuanced perspective on AI behavior

You don't want to eliminate all creativity — you want controlled creativity. Design evals that test for "right way" hallucination. Focus on output usefulness and quality, not just factual accuracy. Some of the most valuable AI applications require generative capability.

5. Evals Are Your Competitive Moat

Rating: 8/10 - Strategic insight from self-driving car experience

Aman's experience at Cruise (self-driving cars) taught him that eval systems are competitive advantages. Your data + how you evaluate on it differentiates your AI product. Invest heavily in eval frameworks and treat your eval data as a proprietary advantage.

6. Multi-Agent Systems Break Traditional Testing

Rating: 8/10 - Addresses emerging challenge

Agents can take multiple paths to complete tasks, which breaks unit test assumptions. Build evals that account for multiple valid execution paths. Test outcome quality, not the specific path taken. Design for nondeterministic agent behavior from the start.

7. Your Data = Your Differentiator

Rating: 7/10 - Shifts focus from model to data

Integration tests rely on existing codebases and documentation. But AI agents rely on YOUR proprietary data. Build domain-specific datasets for your use cases. Design evals that test on YOUR real data, not generic benchmarks.

8. The Vibe Coding Trap

Rating: 8/10 - Direct call-out of common mistake

Bolt and Lovable are great for fast prototyping, but "you can't really do that in a production environment." Vibe coding = intuition-based development (OK for hacks). Thrive coding = data-driven development (REQUIRED for production). Build eval frameworks to transition from vibe to thrive.

Watch (12:19)

Real-World Applications

Aman Khan brings deep experience working on AI and ML systems at scale at several leading companies.


Arize (Current)

AI observability and evaluation platform. Working on eval systems for AI/LLM agents for 3.5 years.

Spotify

ML platform and recommender systems. Worked on Discover Weekly, search, and embeddings at scale.

Cruise

Self-driving car company. Started as an engineer, became PM for evaluation systems. Learned that eval frameworks are a competitive moat.

Arize Customers Include

Uber
Instacart
Reddit
Duolingo

These companies use Arize's evaluation platform to ship AI that works in production.

Implementation Checklist

How to implement the LLM-as-judge framework in your organization.

Building Your Eval Framework

  • Define Role: Specify what the AI judge is (expert reviewer, helpful assistant, etc.)
  • Specify Task: Clearly state what to evaluate (code quality, helpfulness, accuracy, etc.)
  • Provide Context: Include examples of good vs bad, relevant docs, business rules in { }
  • Set Goal: Use text labels (toxic/not toxic) NEVER numerical scores (1-10)
  • Test on Real Data: Evaluate on YOUR proprietary data, not generic benchmarks (see the loading sketch after this checklist)
  • Measure Confidence: Track eval results to build confidence before production deployment
  • Transition from Vibe to Thrive: Prototype fast, but ship with data-driven confidence
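
For the "Test on Real Data" step, here is a minimal sketch of loading eval cases from your own production traces. The JSONL file name and field names are hypothetical placeholders.

```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str    # a real user request sampled from production
    context: str  # the proprietary data the agent had access to

def load_eval_cases(path: str = "prod_traces.jsonl") -> list:
    """Read sampled production traces into eval cases for your LLM judge to score."""
    cases = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            cases.append(EvalCase(input=record["input"], context=record["context"]))
    return cases
```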

Key Takeaways

For AI PMs

Build Eval Frameworks First

  • Even OpenAI/Anthropic CPOs say models hallucinate
  • LLM-as-judge: Role, Task, Context, Goal
  • Use text labels, NOT numerical scores
  • Your eval data is your competitive moat
  • Transition from vibe to thrive coding

For Engineers

Testing vs Evaluation

  • Software = deterministic, AI = nondeterministic
  • Stop applying unit tests to AI systems
  • Build evals for outcome quality
  • Handle multiple valid agent paths
  • Test on YOUR real data

For Leaders

Strategic Investment

  • Evals are competitive advantages
  • Invest in eval frameworks early
  • Treat eval data as proprietary
  • Ship with confidence, not vibes
  • Learn from self-driving car lessons

Source Video


Aman Khan

AI Product Manager • Arize

Shipping AI That Works: An Evaluation Framework for PMs

Video ID: 2HNSG990Ew8 · Event: AI Engineer Conference · Duration: 86 minutes
Tags: AI Evaluation · LLM as Judge · Arize · Vibe Coding · Thrive Coding · Production AI · Hallucination · Testing
Watch on YouTube

Analysis based on Aman Khan's talk 'Shipping AI That Works: An Evaluation Framework for PMs' at AI Engineer Conference. All quotes and timestamps verified from the full 86-minute transcript. Video ID: 2HNSG990Ew8.