How to Look at Your Data
A practical guide to evaluating RAG systems using fast evals and cluster analysis. Learn how Chroma approaches retrieval system optimization with systematic, data-driven methodologies.
"Our contention is that you can really only manage what you measure. Great measurement ultimately is what makes systematic improvement easy."
The RAG Evaluation Problem
Questions That Resonate Deeply
Every AI practitioner faces these fundamental questions about their retrieval systems.
"What chunking strategy should I use? Is my embedding model the best embedding model for my data? And more."
— Jeff Huber (00:00:49)
Option 1: Guess and Check
Cross your fingers and hope it works. Not recommended.
Option 2: LLM as Judge
Use an LLM-as-judge framework to score factuality. A single run costs around $600 and takes over 3 hours.
Option 3: Public Benchmarks
Look at MTEB scores. But benchmark data is overly clean.
Part 1: Fast Evals Framework
What is a Fast Eval?
A fast eval is simply a set of query and document pairs—a golden dataset that enables rapid, inexpensive testing.
"A fast eval is simply a set of query and document pairs. So the first step is if this query is put in, this document should come out. A set of those is called a golden data set."
— Jeff Huber (00:02:15)
The Golden Dataset Approach
Create query-document pairs, embed all queries, retrieve results, and measure success rate.
Key insight: fast evals are extremely fast and cheap to run, so you can run many experiments quickly; experimentation time drops sharply when your metrics finish in seconds and cost pennies.
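To make the loop concrete, here is a minimal sketch of a fast eval in Python using chromadb's default embedding function. The golden pairs, document ids, and the recall_at_k helper are illustrative assumptions, not part of Chroma's tooling.

```python
# Minimal fast-eval sketch: a golden dataset of (query, expected document id) pairs,
# scored by whether the expected document appears in the top-k retrieved results.
# Assumes `pip install chromadb`; the data and helper names are illustrative.
import chromadb

golden_pairs = [
    ("How do I create a collection?", "doc_create_collection"),
    ("How do I filter results by metadata?", "doc_metadata_filters"),
    # ... more (query, expected_doc_id) pairs
]

documents = {
    "doc_create_collection": "Use client.create_collection(name=...) to make a new collection.",
    "doc_metadata_filters": "Pass a where={...} clause to query() to filter results by metadata.",
}

client = chromadb.Client()
collection = client.create_collection("fast_eval_demo")
collection.add(ids=list(documents.keys()), documents=list(documents.values()))

def recall_at_k(pairs, k=2):
    """Fraction of queries whose expected document id appears in the top-k results."""
    hits = 0
    for query, expected_id in pairs:
        results = collection.query(query_texts=[query], n_results=k)
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(pairs)

print(f"recall@2 = {recall_at_k(golden_pairs):.2f}")
```

Because the whole loop is local retrieval plus a set-membership check, it runs in seconds, which is what makes wide experiment sweeps affordable.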
Generating Queries with LLMs
Don't have queries yet? You can use LLMs to write good questions that are representative of real-world queries.
"We found that you can actually teach LLMs how to write queries. Doing naive like 'Hey LM, write me a question for this document'—not a great strategy. However, we found that you can actually teach LLMs how to write queries."
— Jeff Huber (00:03:14)
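Below is one way such query generation might look, sketched with the OpenAI Python client. The prompt wording, the few-shot example queries, and the model name are assumptions for illustration, not the prompt Chroma uses.

```python
# Sketch of generating synthetic queries per document with an LLM.
# The few-shot examples and prompt are invented; the talk's point is that showing the
# model real query style beats a naive "write me a question" prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

EXAMPLE_REAL_QUERIES = [
    "wandb sweep not logging gpu metrics",
    "how resume run from checkpoint artifact",
]  # hypothetical logged queries: terse, keyword-heavy, like real users write

def generate_query(document: str) -> str:
    prompt = (
        "Real users write short, messy search queries like these:\n"
        + "\n".join(f"- {q}" for q in EXAMPLE_REAL_QUERIES)
        + "\n\nWrite one query in the same style that this document should answer:\n\n"
        + document
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model; this choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```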
$600: cost of an LLM-as-judge run, versus pennies for fast evals
3+ hours: time for traditional evals, versus seconds for fast evals
Case Study: Weights & Biases Chatbot
Real-World Validation
Chroma worked with Weights & Biases to evaluate embedding models on their actual chatbot queries, comparing ground truth (logged queries) with synthetically generated queries.
"These are actual queries that were logged in Weave and then sent over. And then there's generated. These are the ones that are synthetically generated. We want to see that those are pretty close and we want to see that they are always the same kind of in order of accuracy."
— Jeff Huber (00:05:44)
Surprising Finding #1: Original Model Underperformed
Text-embedding-3-small, the original model used, performed worst of all models evaluated.
"The original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst out of all the embedding models that we evaluated."
— Jeff Huber (00:06:13)
Surprising Finding #2: Benchmark Leader Didn't Win
OpenAI's text-embedding-3 models rank near the top of MTEB's English benchmarks, yet they performed poorly on this specific application.
"If you look at MTEB, OpenAI embeddings v3 does very well in English. It's like way better than anything else but for this application it didn't actually perform that well."
— Jeff Huber (00:06:30)
Surprising Finding #3: Voyage AI Won
Voyage 3 Large performed best—empirically determined by running fast evals on actual data.
"It was actually the Voyage 3 Large model which performed the best and that was empirically determined by actually running this fast eval and looking at your data."
— Jeff Huber (00:06:39)
Key Takeaway
Benchmark performance ≠ Real-world performance. The only way to know if an embedding model is best for your data is to evaluate it on your data using fast evals. Public benchmarks use overly clean, synthetic queries that don't represent real usage patterns.
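A hedged sketch of what such a head-to-head comparison might look like, reusing golden_pairs and documents from the earlier fast-eval sketch. The two models listed are placeholders; other providers can be added the same way through their Chroma embedding functions.

```python
# Sketch: run the same fast eval against several embedding models and compare.
# One Chroma collection per embedding function; `golden_pairs` and `documents`
# come from the earlier sketch. The specific models listed are illustrative.
import os

import chromadb
from chromadb.utils import embedding_functions

candidates = {
    "text-embedding-3-small": embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
    ),
    "all-MiniLM-L6-v2": embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    ),
    # add other providers (e.g. a Voyage model) here via their embedding functions
}

client = chromadb.Client()
for name, ef in candidates.items():
    col = client.create_collection(f"eval_{name}", embedding_function=ef)
    col.add(ids=list(documents.keys()), documents=list(documents.values()))
    hits = sum(
        expected in col.query(query_texts=[query], n_results=2)["ids"][0]
        for query, expected in golden_pairs
    )
    print(f"{name}: recall@2 = {hits / len(golden_pairs):.2f}")
```

The ranking this loop prints for your own golden dataset, not an MTEB leaderboard position, is what the talk argues should drive the model choice.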
Part 2: Cluster Analysis & Product Decisions
From Inputs to Outputs
Jason Liu presents Part 2: analyzing conversation outputs to make product decisions through clustering and segmentation.
"If you have a bunch of queries that users are putting in or even a couple of hundred of conversations, it's pretty good to just look at everything manually. But then when you have a lot of users and actual good product, you might get thousands of queries or tens of thousands of conversations and now you run into an issue where there's too much volume to manually review."
— Jason Liu (00:07:42)
The Marketing Analogy
A powerful analogy for understanding segmentation: generic KPIs tell you little, but segmentation reveals actionable insights.
"Imagine we run our evals and the number is 0.5. I don't really know what that means. Factuality is point 6. I don't know if that's good or bad. But imagine we run a marketing campaign and our KPI is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35 and 20% are over and we realize that the younger audience performs well and the older audience performs poorly... now we can make a decision."
— Jason Liu (00:09:13)
The power of segmentation: Just by drawing a line in the sand and deciding which segment to target, you can make decisions on what to improve. Generic improvement is hard; targeted improvement is strategic.
Extracting Value from Conversations
The Data Already Exists
User feedback, frustration patterns, and retry information already exist in conversations. The key is extracting them systematically.
"A lot of the feedback you give is in those conversations, right? We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information exists in those conversations. And the frustration and the retry patterns that exist can be extracted from those conversations."
— Jason Liu (00:08:42)
Metadata Extraction Pipeline
Extract structured metadata from each conversation: summaries, tools used, errors encountered, satisfaction metrics, and frustration metrics (a minimal pipeline sketch follows Step 4 below).
"The idea is that we can build this portfolio of metadata that we can extract. And then what we can do is we can embed this, find clusters, identify segments and then start testing our hypothesis."
— Jason Liu (00:10:28)
Step 1: Summarization
Extract topics discussed, frustrations, errors, and other metadata from conversations.
Step 2: Clustering
Group conversations to find cohesive themes and user segments.
Step 3: Hierarchy Building
Create hierarchical cluster structures for multi-level analysis.
Step 4: KPI Comparison
Compare evals across different segments to identify patterns.
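Here is a minimal end-to-end sketch of those four steps, using a sentence-transformer embedding and scikit-learn's KMeans. This is a generic stand-in rather than Cura's actual API, and the conversation records and factuality scores are invented for illustration.

```python
# Hedged sketch of the pipeline: summarize conversations into metadata, embed the
# summaries, cluster them, then compare a KPI per cluster. Not Cura's API.
import numpy as np
from chromadb.utils import embedding_functions
from sklearn.cluster import KMeans

# Step 1: summarization (stubbed here; in practice an LLM fills these fields per conversation)
conversations = [
    {"summary": "User asks how to plot training loss, hits an auth error, retries twice",
     "frustrated": True, "factuality": 0.4},
    {"summary": "User searches contract clauses by vendor name, gets a correct answer",
     "frustrated": False, "factuality": 0.9},
    # ... one record per conversation
]

# Step 2: clustering - embed the summaries and group them
embed = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
vectors = np.array(embed([c["summary"] for c in conversations]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Steps 3-4: compare the KPI per cluster (hierarchy building is omitted for brevity)
for cluster in sorted(set(labels)):
    members = [c for c, label in zip(conversations, labels) if label == cluster]
    avg_factuality = sum(c["factuality"] for c in members) / len(members)
    print(f"cluster {cluster}: n={len(members)}, mean factuality={avg_factuality:.2f}")
```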
Introducing Cura: Open Source Analysis Library
Cura: Summarize, Cluster, Compare
Jason Liu built Cura to summarize conversations, cluster them, build hierarchies, and compare evals across KPIs.
"This is why we built a library called Cura that allows us to summarize conversations, cluster them, build hierarchies of these clusters and ultimately allow us to compare our eval across different KPIs."
— Jason Liu (00:11:20)
From Abstract to Actionable
Transform vague metrics like 'factuality is 0.6' into specific insights like 'factuality is low for time-filter queries but high for contract search.'
"If we have factuality is 6 that's really hard but if it turns out that factuality is really low for queries that require time filters, right? Or factuality is really high when queries revolve on, you know, contract search. Now we know something's happening in one area, something's happening in another."
— Jason Liu (00:11:33)
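A tiny pandas sketch of that shift from one opaque number to per-segment numbers; the query_type labels and factuality scores below are invented for illustration.

```python
# Sketch: the same eval scores, viewed overall vs. per segment.
import pandas as pd

df = pd.DataFrame({
    "query_type": ["time_filter", "time_filter", "contract_search", "contract_search"],
    "factuality": [0.35, 0.42, 0.88, 0.91],
})

print(df["factuality"].mean())                        # one opaque number (~0.64): hard to act on
print(df.groupby("query_type")["factuality"].mean())  # per-segment numbers you can act on
```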
The Pipeline: Simple but Powerful
Models for summarization, models for clustering, and models for aggregation—traditional data analysis applied to AI conversations.
"We have models to do summarization, models to do clustering, and models that do this aggregation step... no different than any kind of product engineer or any kind of data scientist."
— Jason Liu (00:11:53)
Live Demo: Discovering User Patterns
Jason demonstrates clustering synthetic conversations generated with Gemini, revealing themes such as:
- Data visualization conversations
- SEO content requests
- Authentication errors
- Technical support needs
- Tool availability questions
Key Takeaways
For Input Evaluation
Use fast evals, not benchmarks
- Create golden datasets of query-document pairs
- Use LLMs to generate realistic queries (not naive ones)
- Run experiments quickly for pennies, not hours
- Never trust benchmarks—evaluate on your data
For Output Analysis
Use clustering to discover user segments
- Extract metadata from conversations
- Cluster to find cohesive user groups
- Compare KPIs across segments
- Make targeted product decisions
For Engineering Decisions
Let data drive your choices
- Embedding models: test empirically on your data
- Chunking strategies: measure retrieval success rates
- System optimization: systematic, not subjective
- Benchmark leaders ≠ your best choice
For Product Strategy
Understand users to build better products
- Manual review works for small datasets
- Clustering scales to thousands of conversations
- Generic metrics hide segmentation insights
- Frustration patterns exist in conversation data
Video Reference
How to Look at Your Data — Jeff Huber (Chroma) + Jason Liu
A comprehensive guide to evaluating RAG systems using fast evals and cluster analysis for data-driven decision making.
Duration: ~19 min
Event: AI Engineer Summit 2025
Video ID: jryZvCuA0Uc
Speakers: Jeff Huber (Chroma CEO), Jason Liu
Company: trychroma.com
Research Sources
Chroma
This analysis is based on the full transcript of Jeff Huber and Jason Liu's talk at AI Engineer Summit 2025 about evaluating RAG systems using fast evals and cluster analysis.
Video: youtube.com/watch?v=jryZvCuA0Uc
Speakers: Jeff Huber (CEO), Jason Liu
Event: AI Engineer Summit 2025
Duration: ~19 minutes
Analysis Date: December 30, 2025
Research Methodology: Full transcript analysis, with no scanning or grep. All insights were extracted with YouTube timestamps for verification. Quotes are verbatim from the speakers, not paraphrased. Technical claims are as stated by Chroma; independent verification is not available.
Related Resources:
• Full report with notebooks: Available from Chroma
• Cura library: Open source tools for conversation clustering
• Weights & Biases case study: Real-world fast eval implementation