AI Engineering
Product Development
Evaluation
Notion
Braintrust

How to Build World-Class AI Products

Sarah Sachs (AI Lead, Notion) and Carlos Esteban (Braintrust) share their evaluation-first approach to building AI products that scale.

Sarah Sachs
Carlos Esteban
"All of the rigor and excellence that comes from building great AI products comes from observability and good evals and that's how you scale an engineering team. Ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals."

Key Metrics

Notion's approach to AI development in numbers

Time on Evaluation

90%

of engineering time spent on evals and iteration

Time on Prompting

10%

of engineering time spent on prompt engineering

Initial Dataset Size

5-10 rows

recommended for first evaluation run

Production Sampling

100%

for critical feature traces

Core Philosophy: Evaluation First

Why Notion spends 90% of their time on evaluation

"At a higher level, what I tell the team and what I tell people who are also building is like all of the rigor and excellence that comes from building great AI products comes from observability and good evals."

Foundation of AI product excellence

00:01:48
"That's how you scale an engineering team. That's how you build good product and ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals and looking at our usage."

Time allocation: 10% prompting, 90% evaluation

00:02:00
"That I believe is the right balance of work in order to know that you're not just shipping something that worked well in a demo with your VP or that you got working on the cow train to work and finally it did the thing that you wanted it to do."

Avoiding demo-driven development traps

00:02:14

Key Takeaways

Actionable insights from Notion and Braintrust

Spend 90% on Evaluation, 10% on Prompting

Notion's AI team allocates the vast majority of their time to evaluation, observability, and iteration rather than prompt engineering. This shift from demo-driven development to data-driven iteration ensures products work consistently in production.

Start Small with 5-10 Data Rows

Don't wait for massive datasets. Begin with just 5-10 rows to get initial feedback, then continuously iterate and expand using production logs as your source of truth. This accelerates the feedback loop dramatically.

Treat User Feedback with Care

Thumbs up/down signals are inconsistent and temporally misaligned with current performance. Instead, extract the user's natural language request from feedback as the valuable signal for evaluation.

Use Traces for Production Debugging

Trace-based debugging provides visibility into multi-step AI applications, allowing you to drill down into individual spans and understand exactly how your application is performing in production.

Enable 100% Sampling for Critical Features

For production monitoring, configure online scoring to evaluate 100% of samples for critical traces. This ensures comprehensive observability rather than relying on sampled data.

Leverage Remote Eval for Complex Workflows

For complex workflows that can't be easily pushed to evaluation platforms, use remote eval to bridge local codebases with cloud evaluation tools while maintaining consistency with local development patterns.

Data Management Strategy

How to build and iterate on evaluation datasets

"In an AI interaction you can have five five LLM calls that happen between the user and like the notion AI response. We can extract just one of those and put it in a data set."

Extracting specific LLM calls from complex interactions

00:17:47
"Data sets are hand curated and it can be just one aspect of your trace."

Hand-curated datasets vs. logs

00:17:41
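
To make that concrete, here is a rough sketch of what copying one step of a trace into a hand-curated dataset could look like with the Braintrust TypeScript SDK. The project name, dataset name, and the copied input/output pair are illustrative, not Notion's actual setup.

import { initDataset } from "braintrust";

async function main() {
  // Hypothetical hand-curated dataset holding just one of the several
  // LLM calls that make up a full AI interaction.
  const dataset = initDataset("Notion AI (example)", { dataset: "summarize-step" });

  // Pretend this input/output pair was copied from a single span of a trace.
  dataset.insert({
    input: { prompt: "Summarize the meeting notes for the weekly sync." },
    expected: "A three-bullet summary covering decisions, owners, and deadlines.",
    metadata: { source: "trace", span: "summarize" },
  });

  await dataset.flush();
}

main();
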
"So some tips here. We recommend for you to just get started. You don't need to create 200 rows in a data set to run your first eval. You know 5 10 rows is great. Just start small and get some feedback."

Start small: 5-10 rows for first eval

00:38:07
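
As a minimal sketch of that first eval, assuming the Braintrust TypeScript SDK's Eval entry point and the Levenshtein scorer from autoevals, with a placeholder task and a handful of hand-written rows:

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// A first eval doesn't need 200 rows; a handful is enough to get feedback.
Eval("Getting Started (example)", {
  data: () => [
    { input: "2 + 2", expected: "4" },
    { input: "Capital of France", expected: "Paris" },
    { input: "Opposite of hot", expected: "cold" },
    { input: "First month of the year", expected: "January" },
    { input: "Color of a clear daytime sky", expected: "blue" },
  ],
  // Placeholder task -- swap in your real LLM call here.
  task: async (input: string) => {
    return `answer to: ${input}`;
  },
  // Simple string-similarity scorer; pick whatever fits your task.
  scores: [Levenshtein],
});
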
"And you know the next thing is don't stop iterating. keep adding rows or tweaking rows using logs as that that source of truth, right? How are your users interacting with the feature that you're developing?"

Continuous iteration using production logs

00:38:21
"Human reviews is another way of establishing ground truth. Very much needed in certain industries. If you're dealing in the medical space, you need doctors to look at the output or you need lawyers or people with highly specialized domain knowledge."

Human review for specialized domains

00:38:49

Understanding User Feedback

Why thumbs up/down can be misleading and how to extract real value

"We don't really rely on thumbs up for anything except for maybe internal notinos thumbs uping things as like good golden data for fine-tuning, but we don't really rely there's like no consistency in what makes someone thumbs up something."

Limitations of thumbs up feedback

00:18:54
"And for thumbs down data, um it's more just that this is a functionality that we know we didn't do our best work on. Um but that thumbs down could have been given in September of 2023."

Thumbs down doesn't reflect current performance

00:19:07
"So we don't perform how we did in September 2023. So it doesn't necessarily align with the LLM as a judge because that's judging a particular experiment, not um what the production user experience was at that time."

Temporal misalignment of feedback data

00:19:17
"And that's what makes this really powerful because we don't just need to look at what our output was in September 2023, right? Our data can be far more robust and last much longer because really what we're getting from the thumbs down is the natural language request from the user."

Value: extracting user intent from feedback

00:19:31
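
One way to act on that in code, sketched under the assumption that you have feedback events with the user's request attached (the FeedbackEvent shape and field names are hypothetical, not Notion's schema): keep the natural-language request, drop the stale output.

import { initDataset } from "braintrust";

// Hypothetical shape of a logged interaction that received a thumbs down.
interface FeedbackEvent {
  userRequest: string; // the natural-language ask -- the durable signal
  modelOutput: string; // what the model said back then -- likely stale
  createdAt: string;
}

async function addThumbsDownRequests(events: FeedbackEvent[]) {
  const dataset = initDataset("Notion AI (example)", { dataset: "thumbs-down-requests" });

  for (const event of events) {
    // Keep the request; do not treat the old output as ground truth.
    dataset.insert({
      input: event.userRequest,
      metadata: { feedback: "thumbs_down", createdAt: event.createdAt },
    });
  }

  await dataset.flush();
}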

Automated Prompt Optimization

Notion's experience with automated prompt optimization tools

"The question was about automated prompt optimization. The answer is yes. We have played with it. I'm not sure that a majority of our problems are as solved by that as they could be in other workplace contexts."

Notion has experimented with automated prompt optimization

00:18:07
"I've been to like dinners and events where I've heard massive success from it. Maybe we haven't cracked it yet. Um, a lot of people have, but we've played with it."

Others report success, but Notion still exploring

00:18:17

Scoring and Normalization

Best practices for creating and normalizing evaluation scores

"Yeah, the question was about aligning thumbs up, thumbs down with scoring functions. Um, they don't I mean, so a majority of things in our data set were things that are either thumbs up or thumbs down."

Data composition: mostly thumbs up/down

00:18:32
"That's not necessarily what I think these scores are designed to do, right? But I can engineer my prompt and try to maximize some kind of score that I can get, right? But that does not mean that I have a range in which I try to operate."

Score optimization vs. operational ranges

00:56:30
"Yeah, it's definitely something we've heard before. Um I think that's part of the the struggle of writing scores is trying to normalize them and and bring them into that range of zero to one. So up to you of of what you decide the floor and ceiling are."

Normalizing scores to 0-1 range

00:56:54
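
A scorer is ultimately just a function that returns a name and a number, and the convention discussed here is to keep that number between zero and one. Below is an illustrative custom scorer with an explicit floor and ceiling; the word-count heuristic is made up for demonstration, not something from the talk.

// Conciseness measured by word count, clamped into [0, 1].
// The floor (50 words) and ceiling (10 words) are arbitrary choices: the point
// is that you decide what maps to 0 and what maps to 1.
function conciseness({ output }: { output: string }) {
  const words = output.trim().split(/\s+/).length;
  const ceilingWords = 10; // at or below this, score 1
  const floorWords = 50;   // at or above this, score 0

  const raw = (floorWords - words) / (floorWords - ceilingWords);
  const score = Math.max(0, Math.min(1, raw));

  return { name: "conciseness", score };
}

A function like this can be passed in an eval's scores list alongside off-the-shelf scorers.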

Multi-Turn Conversation Evaluation

Strategies for evaluating complex agent workflows

"Do you have anything for evaluating multiurren conversations and as part of that would you be able to review the agent features?"

Multi-turn conversation evaluation

00:57:10
"So the idea is that you could provide a whole back and forth in in context and then evaluate that multi-turn conversation at once. You could do it as well in the SDK."

Providing full conversation context for evaluation

00:57:33
"Uh so here we could just grab the two but the idea is that the output of the initial prompt will become the input of the next and so on and then you can evaluate them as a unit right and as opposed to the the multi-turn right you're providing all the context at once"

Agents: chaining prompts with sequential evaluation

00:57:54
"With the multi-turn extra messages approach, you're providing everything at once to the LLM. So it's one LM call. So yeah, you would need to comply with the context window of the model that you're working with."

Context window considerations for multi-turn

00:58:41
"The agent feature though each prompt is its own LLM call."

Agents make multiple LLM calls

00:58:53
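
A rough sketch of the "all the context at once" variant, assuming the Braintrust TypeScript SDK's Eval and wrapOpenAI helpers: each dataset row carries the whole back-and-forth as messages, the task makes one LLM call with that full conversation, and a simple scorer grades the reply. The conversation content, model choice, and scorer are illustrative.

import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

const client = wrapOpenAI(new OpenAI());

// Each row is a whole conversation; the task produces the next assistant turn.
Eval("Multi-turn (example)", {
  data: () => [
    {
      input: [
        { role: "user" as const, content: "Draft an agenda for tomorrow's sync." },
        { role: "assistant" as const, content: "1. Updates 2. Blockers 3. Next steps" },
        { role: "user" as const, content: "Add a five-minute demo slot at the end." },
      ],
      expected: "An agenda ending with a five-minute demo slot.",
    },
  ],
  // One LLM call with the entire conversation in context, so the whole
  // exchange has to fit within the model's context window.
  task: async (messages) => {
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages,
    });
    return completion.choices[0].message.content ?? "";
  },
  scores: [
    ({ output }: { output: string }) => ({
      name: "mentions_demo_slot",
      score: output.toLowerCase().includes("demo") ? 1 : 0,
    }),
  ],
});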

Trace-Based Debugging

Using traces and spans for production debugging

"Think of this as a as a trace and all of these uh sort of under the hood or underneath it are the individual spans that we actually want to understand."

Traces and spans hierarchy

01:17:17
"This this allows us the visibility into the multiple steps that the application will take and allows us like I like I uh showed you in that previous example to create scores for these individual spans as well."

Visibility into multi-step applications

01:17:25
"But then being able to drill down into these individual things right so what are my commits what is my latest excuse me release uh and then again being able to uh to drill down into these becomes really beneficial as we start to need to understand how our application is performing"

Drilling down into spans for debugging

01:17:47
"And this allows you to get very granular. Right now, we'll just configure all of these for the the overall trace, and we'll do it for 100% of the uh the the samples."

Granular scoring configuration

01:19:20
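
A sketch of that structure in code, assuming the Braintrust TypeScript SDK's initLogger and traced helpers, where nested traced calls become child spans of the surrounding trace. The function names, the placeholder steps, and the has_docs score are illustrative.

import { initLogger, traced } from "braintrust";

// Logs from this process show up as traces in the Braintrust UI.
initLogger({ projectName: "Notion AI (example)" });

async function retrieve(question: string) {
  // A nested traced() call becomes a child span under the current trace.
  return traced(
    async (span) => {
      const docs = ["doc-1", "doc-2"]; // placeholder retrieval step
      span.log({ input: question, output: docs });
      return docs;
    },
    { name: "retrieve" }
  );
}

async function answer(question: string) {
  // The outer traced() call is the trace; each step underneath is a span.
  return traced(
    async (span) => {
      const docs = await retrieve(question);
      const output = `answer based on ${docs.length} docs`; // placeholder LLM step
      // Log a score on this individual span, not just the overall trace.
      span.log({ input: question, output, scores: { has_docs: docs.length > 0 ? 1 : 0 } });
      return output;
    },
    { name: "answer" }
  );
}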

Production Monitoring with Online Scoring

Monitoring production AI applications with configured scores

"This is pre-production, right? This is uh now when we have our logs running in production, we want to understand how our application is performing uh relevant or excuse me uh relative to the scores that we've already configured."

Production monitoring with configured scores

01:18:19
"And this allows you to get very granular. Right now, we'll just configure all of these for the the overall trace, and we'll do it for 100% of the uh the the samples."

100% sampling for production evaluation

01:19:26
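
The sampling percentage itself is configured on the platform side; what lives in application code is the production logging those online scores run against. A minimal sketch, assuming the SDK's initLogger and wrapOpenAI (the project name and model are illustrative):

import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// Once a logger is initialized, wrapped OpenAI calls are recorded as traces,
// and any online scores you've configured (e.g. at 100% sampling for a
// critical feature) run against those logs on the Braintrust side.
initLogger({ projectName: "Notion AI (example)" });
const client = wrapOpenAI(new OpenAI());

export async function handleRequest(userMessage: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: userMessage }],
  });
  return completion.choices[0].message.content ?? "";
}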

Remote Evaluation

Bridging local codebases with cloud evaluation platforms

"The difference is that you use the d-dev flag. So, this will start that remote eval server. It will default to local host, but you can change it. Um and it will allow it to bind to to all the it will allow it to listen to all the network interfaces so you could access it uh via yeah access remote servers out in the world."

Remote eval server for distributed evaluation

01:38:26
"Think of like the the more complex type of workflows that that you may have uh within your codebase and you don't have the ability necessarily to push those into brain trust."

Complex workflows that can't be easily pushed to Braintrust

01:40:25
"We can still leverage this platform with that existing codebase via remote eval. And this sort of like uh interaction works very similar to to what I showed earlier with uh the eval locally via the SDK as well as the the playground."

Remote eval bridges local code with Braintrust platform

01:40:36
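
A hedged sketch of the shape of such a setup: the eval file stays in the codebase next to the complex workflow, and serving it with the CLI's --dev mode (as described above) lets the platform drive it remotely. The file name, workflow stand-in, and scorer are illustrative; check the Braintrust docs for the exact flags that control which network interfaces the dev server binds to.

// complex-workflow.eval.ts -- served to the platform with: npx braintrust eval --dev
import { Eval } from "braintrust";

// Stand-in for an existing multi-step workflow living in your codebase.
async function runComplexWorkflow(input: string): Promise<string> {
  // ... retrieval, tool calls, post-processing ...
  return `workflow result for: ${input}`;
}

Eval("Complex Workflow (example)", {
  data: () => [
    { input: "Summarize last week's project updates", expected: "A short status summary" },
  ],
  // The task just calls into the existing code; nothing is ported into the platform.
  task: async (input: string) => runComplexWorkflow(input),
  scores: [
    ({ output }: { output: string }) => ({
      name: "non_empty",
      score: output.trim().length > 0 ? 1 : 0,
    }),
  ],
});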

Companies & Technologies

Tools and organizations mentioned

Notion

All-in-one workspace platform with AI integration. Sarah Sachs leads their AI team, focusing on evaluation-driven development.

Braintrust

Evaluation platform for AI products. Provides tools for datasets, scoring, trace-based debugging, and production monitoring.

Related Topics

Based on the AI Engineer Conference talk by Sarah Sachs (Notion) and Carlos Esteban (Braintrust).

Watch the full talk on YouTube