How to Look at Your Data
A practical guide to evaluating RAG systems using fast evals and cluster analysis. Learn how Chroma approaches retrieval system optimization with systematic, data-driven methodologies.
"Our contention is that you can really only manage what you measure. Great measurement ultimately is what makes systematic improvement easy."
The RAG Evaluation Problem
Questions That Resonate Deeply
Every AI practitioner faces these fundamental questions about their retrieval systems.
"What chunking strategy should I use? Is my embedding model the best embedding model for my data? And more."
— Jeff Huber (00:00:49)
Option 1: Guess and Check
Cross your fingers and hope it works. Not recommended.
Option 2: LLM as Judge
Use an LLM-as-judge framework to score factuality. A single run costs around $600 and takes over 3 hours.
Option 3: Public Benchmarks
Look at MTEB scores. But benchmark data is overly clean.
Part 1: Fast Evals Framework
What is a Fast Eval?
A fast eval is simply a set of query and document pairs—a golden dataset that enables rapid, inexpensive testing.
"A fast eval is simply a set of query and document pairs. So the first step is if this query is put in, this document should come out. A set of those is called a golden data set."
— Jeff Huber (00:02:15)
The Golden Dataset Approach
Create query-document pairs, embed all queries, retrieve results, and measure success rate.
Key insight: fast evals are extremely fast and cheap to run, so you can run many experiments quickly; experimentation time drops sharply when your metrics finish in seconds and cost pennies.
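To make the loop concrete, here is a minimal sketch of a fast eval in Python using chromadb's default embedding function. The golden pairs, document ids, and the recall_at_k helper are illustrative assumptions, not part of Chroma's tooling.

```python
# Minimal fast-eval sketch: a golden dataset of (query, expected document id) pairs,
# scored by whether the expected document appears in the top-k retrieved results.
# Assumes `pip install chromadb`; the data and helper names are illustrative.
import chromadb

golden_pairs = [
    ("How do I create a collection?", "doc_create_collection"),
    ("How do I filter results by metadata?", "doc_metadata_filters"),
    # ... more (query, expected_doc_id) pairs
]

documents = {
    "doc_create_collection": "Use client.create_collection(name=...) to make a new collection.",
    "doc_metadata_filters": "Pass a where={...} clause to query() to filter results by metadata.",
}

client = chromadb.Client()
collection = client.create_collection("fast_eval_demo")
collection.add(ids=list(documents.keys()), documents=list(documents.values()))

def recall_at_k(pairs, k=2):
    """Fraction of queries whose expected document id appears in the top-k results."""
    hits = 0
    for query, expected_id in pairs:
        results = collection.query(query_texts=[query], n_results=k)
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(pairs)

print(f"recall@2 = {recall_at_k(golden_pairs):.2f}")
```

Because the whole loop is local retrieval plus a set-membership check, it runs in seconds, which is what makes wide experiment sweeps affordable.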
Generating Queries with LLMs
Don't have queries yet? You can use LLMs to write good questions that are representative of real-world queries.
"We found that you can actually teach LLMs how to write queries. Doing naive like 'Hey LM, write me a question for this document'—not a great strategy. However, we found that you can actually teach LLMs how to write queries."
— Jeff Huber (00:03:14)
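Below is one way such query generation might look, sketched with the OpenAI Python client. The prompt wording, the few-shot example queries, and the model name are assumptions for illustration, not the prompt Chroma uses.

```python
# Sketch of generating synthetic queries per document with an LLM.
# The few-shot examples and prompt are invented; the talk's point is that showing the
# model real query style beats a naive "write me a question" prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

EXAMPLE_REAL_QUERIES = [
    "wandb sweep not logging gpu metrics",
    "how resume run from checkpoint artifact",
]  # hypothetical logged queries: terse, keyword-heavy, like real users write

def generate_query(document: str) -> str:
    prompt = (
        "Real users write short, messy search queries like these:\n"
        + "\n".join(f"- {q}" for q in EXAMPLE_REAL_QUERIES)
        + "\n\nWrite one query in the same style that this document should answer:\n\n"
        + document
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model; this choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```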
$600: cost of an LLM-as-judge run, versus pennies for fast evals
3+ hours: time for traditional evals, versus seconds for fast evals
Case Study: Weights & Biases Chatbot
Real-World Validation
Chroma worked with Weights & Biases to evaluate embedding models on their actual chatbot queries, comparing ground truth (logged queries) with synthetically generated queries.
"These are actual queries that were logged in Weave and then sent over. And then there's generated. These are the ones that are synthetically generated. We want to see that those are pretty close and we want to see that they are always the same kind of in order of accuracy."
— Jeff Huber (00:05:44)
Surprising Finding #1: Original Model Underperformed
Text-embedding-3-small, the original model used, performed worst of all models evaluated.
"The original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst out of all the embedding models that we evaluated."
— Jeff Huber (00:06:13)
Surprising Finding #2: Benchmark Leader Didn't Win
OpenAI's text-embedding-3 models rank near the top of MTEB's English benchmarks, yet they performed poorly on this specific application.
"If you look at MTEB, OpenAI embeddings v3 does very well in English. It's like way better than anything else but for this application it didn't actually perform that well."
— Jeff Huber (00:06:30)
Surprising Finding #3: Voyage AI Won
Voyage 3 Large performed best—empirically determined by running fast evals on actual data.
"It was actually the Voyage 3 Large model which performed the best and that was empirically determined by actually running this fast eval and looking at your data."
— Jeff Huber (00:06:39)
Key Takeaway
Benchmark performance ≠ Real-world performance. The only way to know if an embedding model is best for your data is to evaluate it on your data using fast evals. Public benchmarks use overly clean, synthetic queries that don't represent real usage patterns.
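A hedged sketch of what such a head-to-head comparison might look like, reusing golden_pairs and documents from the earlier fast-eval sketch. The two models listed are placeholders; other providers can be added the same way through their Chroma embedding functions.

```python
# Sketch: run the same fast eval against several embedding models and compare.
# One Chroma collection per embedding function; `golden_pairs` and `documents`
# come from the earlier sketch. The specific models listed are illustrative.
import os

import chromadb
from chromadb.utils import embedding_functions

candidates = {
    "text-embedding-3-small": embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
    ),
    "all-MiniLM-L6-v2": embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    ),
    # add other providers (e.g. a Voyage model) here via their embedding functions
}

client = chromadb.Client()
for name, ef in candidates.items():
    col = client.create_collection(f"eval_{name}", embedding_function=ef)
    col.add(ids=list(documents.keys()), documents=list(documents.values()))
    hits = sum(
        expected in col.query(query_texts=[query], n_results=2)["ids"][0]
        for query, expected in golden_pairs
    )
    print(f"{name}: recall@2 = {hits / len(golden_pairs):.2f}")
```

The ranking this loop prints for your own golden dataset, not an MTEB leaderboard position, is what the talk argues should drive the model choice.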
Part 2: Cluster Analysis & Product Decisions
From Inputs to Outputs
Jason Liu presents Part 2: analyzing conversation outputs to make product decisions through clustering and segmentation.
"If you have a bunch of queries that users are putting in or even a couple of hundred of conversations, it's pretty good to just look at everything manually. But then when you have a lot of users and actual good product, you might get thousands of queries or tens of thousands of conversations and now you run into an issue where there's too much volume to manually review."
— Jason Liu (00:07:42)
The Marketing Analogy
A powerful analogy for understanding segmentation: generic KPIs tell you little, but segmentation reveals actionable insights.
"Imagine we run our evals and the number is 0.5. I don't really know what that means. Factuality is point 6. I don't know if that's good or bad. But imagine we run a marketing campaign and our KPI is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35 and 20% are over and we realize that the younger audience performs well and the older audience performs poorly... now we can make a decision."
— Jason Liu (00:09:13)
The power of segmentation: Just by drawing a line in the sand and deciding which segment to target, you can make decisions on what to improve. Generic improvement is hard; targeted improvement is strategic.
Extracting Value from Conversations
The Data Already Exists
User feedback, frustration patterns, and retry information already exist in conversations. The key is extracting them systematically.
"A lot of the feedback you give is in those conversations, right? We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information exists in those conversations. And the frustration and the retry patterns that exist can be extracted from those conversations."
— Jason Liu (00:08:42)
Metadata Extraction Pipeline
Extract structured metadata from each conversation: summaries, tools used, errors encountered, satisfaction metrics, and frustration metrics (a minimal pipeline sketch follows Step 4 below).
"The idea is that we can build this portfolio of metadata that we can extract. And then what we can do is we can embed this, find clusters, identify segments and then start testing our hypothesis."
— Jason Liu (00:10:28)
Step 1: Summarization
Extract topics discussed, frustrations, errors, and other metadata from conversations.
Step 2: Clustering
Group conversations to find cohesive themes and user segments.
Step 3: Hierarchy Building
Create hierarchical cluster structures for multi-level analysis.
Step 4: KPI Comparison
Compare evals across different segments to identify patterns.
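Here is a minimal end-to-end sketch of those four steps, using a sentence-transformer embedding and scikit-learn's KMeans. This is a generic stand-in rather than Cura's actual API, and the conversation records and factuality scores are invented for illustration.

```python
# Hedged sketch of the pipeline: summarize conversations into metadata, embed the
# summaries, cluster them, then compare a KPI per cluster. Not Cura's API.
import numpy as np
from chromadb.utils import embedding_functions
from sklearn.cluster import KMeans

# Step 1: summarization (stubbed here; in practice an LLM fills these fields per conversation)
conversations = [
    {"summary": "User asks how to plot training loss, hits an auth error, retries twice",
     "frustrated": True, "factuality": 0.4},
    {"summary": "User searches contract clauses by vendor name, gets a correct answer",
     "frustrated": False, "factuality": 0.9},
    # ... one record per conversation
]

# Step 2: clustering - embed the summaries and group them
embed = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
vectors = np.array(embed([c["summary"] for c in conversations]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Steps 3-4: compare the KPI per cluster (hierarchy building is omitted for brevity)
for cluster in sorted(set(labels)):
    members = [c for c, label in zip(conversations, labels) if label == cluster]
    avg_factuality = sum(c["factuality"] for c in members) / len(members)
    print(f"cluster {cluster}: n={len(members)}, mean factuality={avg_factuality:.2f}")
```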
Introducing Cura: Open Source Analysis Library
Cura: Summarize, Cluster, Compare
Jason Liu built Cura to summarize conversations, cluster them, build hierarchies, and compare evals across KPIs.
"This is why we built a library called Cura that allows us to summarize conversations, cluster them, build hierarchies of these clusters and ultimately allow us to compare our eval across different KPIs."
— Jason Liu (00:11:20)
From Abstract to Actionable
Transform vague metrics like 'factuality is 0.6' into specific insights like 'factuality is low for time-filter queries but high for contract search.'
"If we have factuality is 6 that's really hard but if it turns out that factuality is really low for queries that require time filters, right? Or factuality is really high when queries revolve on, you know, contract search. Now we know something's happening in one area, something's happening in another."
— Jason Liu (00:11:33)
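A tiny pandas sketch of that shift from one opaque number to per-segment numbers; the query_type labels and factuality scores below are invented for illustration.

```python
# Sketch: the same eval scores, viewed overall vs. per segment.
import pandas as pd

df = pd.DataFrame({
    "query_type": ["time_filter", "time_filter", "contract_search", "contract_search"],
    "factuality": [0.35, 0.42, 0.88, 0.91],
})

print(df["factuality"].mean())                        # one opaque number (~0.64): hard to act on
print(df.groupby("query_type")["factuality"].mean())  # per-segment numbers you can act on
```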
The Pipeline: Simple but Powerful
Models for summarization, models for clustering, and models for aggregation—traditional data analysis applied to AI conversations.
"We have models to do summarization, models to do clustering, and models that do this aggregation step... no different than any kind of product engineer or any kind of data scientist."
— Jason Liu (00:11:53)
Live Demo: Discovering User Patterns
Jason demonstrates clustering synthetic conversations generated with Gemini, revealing themes such as:
- Data visualization conversations
- SEO content requests
- Authentication errors
- Technical support needs
- Tool availability questions
Key Takeaways
For Input Evaluation
Use fast evals, not benchmarks
- Create golden datasets of query-document pairs
- Use LLMs to generate realistic queries (not naive ones)
- Run experiments quickly for pennies, not hours
- Never trust benchmarks—evaluate on your data
For Output Analysis
Use clustering to discover user segments
- Extract metadata from conversations
- Cluster to find cohesive user groups
- Compare KPIs across segments
- Make targeted product decisions
For Engineering Decisions
Let data drive your choices
- Embedding models: test empirically on your data
- Chunking strategies: measure retrieval success rates
- System optimization: systematic, not subjective
- Benchmark leaders ≠ your best choice
For Product Strategy
Understand users to build better products
- Manual review works for small datasets
- Clustering scales to thousands of conversations
- Generic metrics hide segmentation insights
- Frustration patterns exist in conversation data
Video Reference
How to Look at Your Data — Jeff Huber (Chroma) + Jason Liu
A comprehensive guide to evaluating RAG systems using fast evals and cluster analysis for data-driven decision making.
Duration: ~19 min
Event: AI Engineer Summit 2025
Video ID: jryZvCuA0Uc
Speakers: Jeff Huber (Chroma CEO), Jason Liu
Company: trychroma.com
Research Sources
Chroma
This analysis is based on the full transcript of Jeff Huber and Jason Liu's talk at AI Engineer Summit 2025 about evaluating RAG systems using fast evals and cluster analysis.
Video: youtube.com/watch?v=jryZvCuA0Uc
Speakers: Jeff Huber (CEO), Jason Liu
Event: AI Engineer Summit 2025
Duration: ~19 minutes
Analysis Date: December 30, 2025
Research Methodology: Full transcript analysis, with no scanning or grep. All insights were extracted with YouTube timestamps for verification. Quotes are verbatim from the speakers, not paraphrased. Technical claims are as stated by Chroma; independent verification is not available.
Related Resources:
• Full report with notebooks: Available from Chroma
• Cura library: Open source tools for conversation clustering
• Weights & Biases case study: Real-world fast eval implementation