How to Build World-Class AI Products
Sarah Sachs (AI Lead, Notion) and Carlos Esteban (Braintrust) share their evaluation-first approach to building AI products that scale.
"All of the rigor and excellence that comes from building great AI products comes from observability and good evals and that's how you scale an engineering team. Ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals."
Key Metrics
Notion's approach to AI development in numbers
Time on Evaluation: 90% of engineering time spent on evals and iteration
Time on Prompting: 10% of engineering time spent on prompt engineering
Initial Dataset Size: 5-10 rows recommended for the first evaluation run
Production Sampling: 100% for critical feature traces
Core Philosophy: Evaluation First
Why Notion spends 90% of their time on evaluation
"At a higher level, what I tell the team and what I tell people who are also building is like all of the rigor and excellence that comes from building great AI products comes from observability and good evals."
Foundation of AI product excellence
00:01:48"That's how you scale an engineering team. That's how you build good product and ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals and looking at our usage."
Time allocation: 10% prompting, 90% evaluation
00:02:00"That I believe is the right balance of work in order to know that you're not just shipping something that worked well in a demo with your VP or that you got working on the cow train to work and finally it did the thing that you wanted it to do."
Avoiding demo-driven development traps
00:02:14Key Takeaways
Actionable insights from Notion and Braintrust
Spend 90% on Evaluation, 10% on Prompting
Notion's AI team allocates the vast majority of their time to evaluation, observability, and iteration rather than prompt engineering. This shift from demo-driven development to data-driven iteration ensures products work consistently in production.
Start Small with 5-10 Data Rows
Don't wait for massive datasets. Begin with just 5-10 rows to get initial feedback, then continuously iterate and expand using production logs as your source of truth. This accelerates the feedback loop dramatically.
Treat User Feedback with Care
Thumbs up/down signals are inconsistent and temporally misaligned with current performance. Instead, extract the user's natural language request from feedback as the valuable signal for evaluation.
Use Traces for Production Debugging
Trace-based debugging provides visibility into multi-step AI applications, allowing you to drill down into individual spans and understand exactly how your application is performing in production.
Enable 100% Sampling for Critical Features
For production monitoring, configure online scoring to evaluate 100% of samples for critical traces. This ensures comprehensive observability rather than relying on sampled data.
Leverage Remote Eval for Complex Workflows
For complex workflows that can't be easily pushed to evaluation platforms, use remote eval to bridge local codebases with cloud evaluation tools while maintaining consistency with local development patterns.
Data Management Strategy
How to build and iterate on evaluation datasets
"In an AI interaction you can have five five LLM calls that happen between the user and like the notion AI response. We can extract just one of those and put it in a data set."
Extracting specific LLM calls from complex interactions
00:17:47"Data sets are hand curated and it can be just one aspect of your trace."
Hand-curated datasets vs. logs
"So some tips here. We recommend you just get started. You don't need to create 200 rows in a dataset to run your first eval. You know, 5-10 rows is great. Just start small and get some feedback." (00:38:07)
Start small: 5-10 rows for first eval
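For a sense of scale, a first eval in the Braintrust Python SDK can really be this small. This is a minimal sketch assuming a hypothetical summarization task; the project name, rows, and stubbed task are placeholders, not details from the talk.

```python
from braintrust import Eval
from autoevals import Factuality

# A handful of hand-picked rows is enough for a first run; grow the dataset
# later from production logs. Rows and names below are illustrative only.
ROWS = [
    {"input": "Summarize: Q3 planning notes ...", "expected": "Goals, owners, and deadlines for Q3."},
    {"input": "Summarize: incident retro ...", "expected": "Root cause and follow-up actions."},
    # ...5-10 rows total is plenty to get the first feedback loop going
]

def task(input: str) -> str:
    # Call your real prompt/model here; a stub keeps the sketch self-contained.
    return "Goals, owners, and deadlines for Q3."

Eval(
    "Notion AI demo",        # placeholder Braintrust project name
    data=lambda: ROWS,       # start small, keep iterating
    task=task,
    scores=[Factuality],     # off-the-shelf LLM-as-a-judge scorer from autoevals
)
```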
"And you know, the next thing is: don't stop iterating. Keep adding rows or tweaking rows, using logs as that source of truth, right? How are your users interacting with the feature that you're developing?" (00:38:21)
Continuous iteration using production logs
"Human review is another way of establishing ground truth, and it's very much needed in certain industries. If you're dealing in the medical space, you need doctors to look at the output, or you need lawyers or people with highly specialized domain knowledge." (00:38:49)
Human review for specialized domains
Understanding User Feedback
Why thumbs up/down can be misleading and how to extract real value
"We don't really rely on thumbs up for anything except for maybe internal notinos thumbs uping things as like good golden data for fine-tuning, but we don't really rely there's like no consistency in what makes someone thumbs up something."
Limitations of thumbs up feedback
00:18:54"And for thumbs down data, um it's more just that this is a functionality that we know we didn't do our best work on. Um but that thumbs down could have been given in September of 2023."
Thumbs down doesn't reflect current performance
00:19:07"So we don't perform how we did in September 2023. So it doesn't necessarily align with the LLM as a judge because that's judging a particular experiment, not um what the production user experience was at that time."
Temporal misalignment of feedback data
00:19:17"And that's what makes this really powerful because we don't just need to look at what our output was in September 2023, right? Our data can be far more robust and last much longer because really what we're getting from the thumbs down is the natural language request from the user."
Value: extracting user intent from feedback
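One way to act on this, sketched with the Braintrust Python SDK: instead of treating the thumbs-down as a label, lift the user's natural-language request out of the flagged interaction and add it to an eval dataset. The feedback records, field names, and project name below are hypothetical stand-ins, not Notion's actual schema.

```python
import braintrust

# Hypothetical thumbs-down records exported from a feedback store.
# Only the idea matters: keep the user's request, discard the stale output.
thumbs_down = [
    {"user_request": "Turn these meeting notes into action items", "old_output": "..."},
    {"user_request": "Draft a status update from this doc", "old_output": "..."},
]

dataset = braintrust.init_dataset(project="AI Assistant", name="thumbs-down-requests")

for record in thumbs_down:
    # The request is the durable signal; the 2023-era output is not a useful
    # label because the product no longer performs the way it did back then.
    dataset.insert(
        input=record["user_request"],
        expected=None,  # ground truth can be added later via curation or human review
        metadata={"source": "thumbs_down"},
    )
```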
Automated Prompt Optimization
Notion's experience with automated prompt optimization tools
"The question was about automated prompt optimization. The answer is yes. We have played with it. I'm not sure that a majority of our problems are as solved by that as they could be in other workplace contexts."
Notion has experimented with automated prompt optimization
00:18:07"I've been to like dinners and events where I've heard massive success from it. Maybe we haven't cracked it yet. Um, a lot of people have, but we've played with it."
Others report success, but Notion still exploring
00:18:17Scoring and Normalization
Best practices for creating and normalizing evaluation scores
"Yeah, the question was about aligning thumbs up, thumbs down with scoring functions. Um, they don't I mean, so a majority of things in our data set were things that are either thumbs up or thumbs down."
Data composition: mostly thumbs up/down
00:18:32"That's not necessarily what I think these scores are designed to do, right? But I can engineer my prompt and try to maximize some kind of score that I can get, right? But that does not mean that I have a range in which I try to operate."
Score optimization vs. operational ranges
00:56:30"Yeah, it's definitely something we've heard before. Um I think that's part of the the struggle of writing scores is trying to normalize them and and bring them into that range of zero to one. So up to you of of what you decide the floor and ceiling are."
Normalizing scores to 0-1 range
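A rough sketch of what picking a floor and ceiling can look like in a custom scorer. The response-length heuristic and the thresholds are arbitrary examples for illustration, not anything recommended in the talk.

```python
def conciseness(input, output, expected=None, **kwargs):
    """Toy custom scorer that maps raw response length onto a 0-1 score.

    The ceiling (<= 50 chars scores 1.0) and floor (>= 400 chars scores 0.0)
    are arbitrary product decisions; the point is only that whatever raw
    quantity you measure gets clamped and normalized into the 0-1 range.
    """
    floor, ceiling = 400, 50
    length = len(output)
    if length <= ceiling:
        return 1.0
    if length >= floor:
        return 0.0
    # Linear interpolation between the ceiling and the floor.
    return (floor - length) / (floor - ceiling)
```

A scorer like this can sit alongside LLM-as-a-judge scorers in the scores list of an Eval call, as in the earlier sketch.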
Multi-Turn Conversation Evaluation
Strategies for evaluating complex agent workflows
"Do you have anything for evaluating multiurren conversations and as part of that would you be able to review the agent features?"
Multi-turn conversation evaluation
00:57:10"So the idea is that you could provide a whole back and forth in in context and then evaluate that multi-turn conversation at once. You could do it as well in the SDK."
Providing full conversation context for evaluation
00:57:33"Uh so here we could just grab the two but the idea is that the output of the initial prompt will become the input of the next and so on and then you can evaluate them as a unit right and as opposed to the the multi-turn right you're providing all the context at once"
Agents: chaining prompts with sequential evaluation
00:57:54"With the multi-turn extra messages approach, you're providing everything at once to the LLM. So it's one LM call. So yeah, you would need to comply with the context window of the model that you're working with."
Context window considerations for multi-turn
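A rough sketch of the extra-messages style in the Braintrust Python SDK: each dataset row carries the full back-and-forth, the task makes one LLM call with that history, and a scorer judges the final reply. The conversation rows, model choice, and scorer are illustrative assumptions, not details from the talk.

```python
from braintrust import Eval, wrap_openai
from openai import OpenAI
from autoevals import Factuality

client = wrap_openai(OpenAI())  # wrapped client so the LLM call is traced

# Each row's input is the whole conversation so far; it all has to fit in
# the context window of the model, since it is evaluated in one call.
CONVERSATIONS = [
    {
        "input": [
            {"role": "user", "content": "Summarize my Q3 planning page."},
            {"role": "assistant", "content": "It covers goals, owners, and deadlines."},
            {"role": "user", "content": "Now turn that into three action items."},
        ],
        "expected": "Three concrete action items with owners.",
    },
]

def task(messages):
    # One LLM call that sees the full multi-turn history at once.
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

Eval(
    "Multi-turn demo",   # placeholder project name
    data=lambda: CONVERSATIONS,
    task=task,
    scores=[Factuality],
)
```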
"With the agent feature, though, each prompt is its own LLM call." (00:58:53)
Agents make multiple LLM calls
Trace-Based Debugging
Using traces and spans for production debugging
"Think of this as a as a trace and all of these uh sort of under the hood or underneath it are the individual spans that we actually want to understand."
Traces and spans hierarchy
01:17:17"This this allows us the visibility into the multiple steps that the application will take and allows us like I like I uh showed you in that previous example to create scores for these individual spans as well."
Visibility into multi-step applications
01:17:25"But then being able to drill down into these individual things right so what are my commits what is my latest excuse me release uh and then again being able to uh to drill down into these becomes really beneficial as we start to need to understand how our application is performing"
Drilling down into spans for debugging
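A minimal sketch of instrumenting a multi-step app with the Braintrust Python SDK so each step shows up as its own span under one trace. The retrieval and generation steps, model, and project name are hypothetical placeholders.

```python
from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="AI Assistant")  # placeholder project name
client = wrap_openai(OpenAI())                # LLM calls are captured as spans

@traced
def retrieve_context(query: str) -> str:
    # Stand-in for a real retrieval step; its inputs/outputs land in this span.
    return "Relevant workspace pages for: " + query

@traced
def generate_answer(query: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

@traced
def answer(query: str) -> str:
    # Top-level span (the trace); the helpers above appear as child spans
    # you can drill into when debugging production behavior.
    context = retrieve_context(query)
    return generate_answer(query, context)

if __name__ == "__main__":
    print(answer("What did we decide in the Q3 planning meeting?"))
```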
"And this allows you to get very granular. Right now, we'll just configure all of these for the overall trace, and we'll do it for 100% of the samples." (01:19:20)
Granular scoring configuration
Production Monitoring with Online Scoring
Monitoring production AI applications with configured scores
"This is pre-production, right? This is uh now when we have our logs running in production, we want to understand how our application is performing uh relevant or excuse me uh relative to the scores that we've already configured."
Production monitoring with configured scores
01:18:19"And this allows you to get very granular. Right now, we'll just configure all of these for the the overall trace, and we'll do it for 100% of the uh the the samples."
100% sampling for production evaluation
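The online-scoring rules and their sampling rate are configured in the Braintrust UI, per the talk; on the code side, a hedged sketch of what production logging can look like so those scorers have complete traces to run against. The project name, span name, and fields are illustrative.

```python
from braintrust import init_logger

logger = init_logger(project="AI Assistant")  # placeholder project name

def handle_request(user_query: str) -> str:
    # Log every production request as a trace; online scoring rules configured
    # in the platform (e.g. 100% sampling for critical features, per the talk)
    # then evaluate these logs automatically.
    with logger.start_span(name="ai-request") as span:
        answer = "...model output..."  # call the real pipeline here
        span.log(input=user_query, output=answer, metadata={"feature": "q&a"})
        return answer
```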
Remote Evaluation
Bridging local codebases with cloud evaluation platforms
"The difference is that you use the d-dev flag. So, this will start that remote eval server. It will default to local host, but you can change it. Um and it will allow it to bind to to all the it will allow it to listen to all the network interfaces so you could access it uh via yeah access remote servers out in the world."
Remote eval server for distributed evaluation
01:38:26"Think of like the the more complex type of workflows that that you may have uh within your codebase and you don't have the ability necessarily to push those into brain trust."
Complex workflows that can't be easily pushed to Braintrust
01:40:25"We can still leverage this platform with that existing codebase via remote eval. And this sort of like uh interaction works very similar to to what I showed earlier with uh the eval locally via the SDK as well as the the playground."
Remote eval bridges local code with Braintrust platform
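Roughly, the local side of remote eval is the same Eval definition as before, except the task calls straight into your existing codebase and you start it with the --dev flag so the platform can drive it. The file name, pipeline function, and project name are placeholders, and the command in the comment is an approximation of what the talk demonstrated.

```python
# my_remote_eval.py - hedged sketch; names below are placeholders.
from braintrust import Eval
from autoevals import Levenshtein

def existing_pipeline(input: str) -> str:
    # Stand-in for the complex, hard-to-export workflow living in your codebase.
    return "result produced by internal retrieval + prompting steps for: " + input

Eval(
    "Existing codebase evals",  # placeholder project name
    data=lambda: [{"input": "draft a summary", "expected": "a short summary"}],
    task=existing_pipeline,     # reuse local code directly
    scores=[Levenshtein],
)

# Per the talk, this is served with the --dev flag (exact invocation may differ):
#   braintrust eval --dev my_remote_eval.py
# It listens on localhost by default so the Braintrust playground can run the
# eval against your local code; bind to other interfaces to expose it further.
```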
Companies & Technologies
Tools and organizations mentioned
Notion
All-in-one workspace platform with AI integration. Sarah Sachs leads their AI team, focusing on evaluation-driven development.
Braintrust
Evaluation platform for AI products. Provides tools for datasets, scoring, trace-based debugging, and production monitoring.
Based on the AI Engineer Conference talk by Sarah Sachs (Notion) and Carlos Esteban (Braintrust).