How to Build World-Class AI Products
Sarah Sachs (AI Lead, Notion) and Carlos Esteban (Braintrust) share their evaluation-first approach to building AI products that scale.
"All of the rigor and excellence that comes from building great AI products comes from observability and good evals and that's how you scale an engineering team. Ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals."
Key Metrics
Notion's approach to AI development in numbers
Time on Evaluation: 90% of engineering time spent on evals and iteration
Time on Prompting: 10% of engineering time spent on prompt engineering
Initial Dataset Size: 5-10 rows recommended for the first evaluation run
Production Sampling: 100% for critical feature traces
Core Philosophy: Evaluation First
Why Notion spends 90% of their time on evaluation
"At a higher level, what I tell the team and what I tell people who are also building is like all of the rigor and excellence that comes from building great AI products comes from observability and good evals."
Foundation of AI product excellence
00:01:48"That's how you scale an engineering team. That's how you build good product and ultimately we spend maybe 10% of our time prompting and 90% of our time looking at evals and iterating on our evals and looking at our usage."
Time allocation: 10% prompting, 90% evaluation
00:02:00"That I believe is the right balance of work in order to know that you're not just shipping something that worked well in a demo with your VP or that you got working on the cow train to work and finally it did the thing that you wanted it to do."
Avoiding demo-driven development traps
00:02:14Key Takeaways
Actionable insights from Notion and Braintrust
Spend 90% on Evaluation, 10% on Prompting
Notion's AI team allocates the vast majority of their time to evaluation, observability, and iteration rather than prompt engineering. This shift from demo-driven development to data-driven iteration ensures products work consistently in production.
Start Small with 5-10 Data Rows
Don't wait for massive datasets. Begin with just 5-10 rows to get initial feedback, then continuously iterate and expand using production logs as your source of truth. This accelerates the feedback loop dramatically.
Treat User Feedback with Care
Thumbs up/down signals are inconsistent and temporally misaligned with current performance. Instead, extract the user's natural language request from feedback as the valuable signal for evaluation.
Use Traces for Production Debugging
Trace-based debugging provides visibility into multi-step AI applications, allowing you to drill down into individual spans and understand exactly how your application is performing in production.
Enable 100% Sampling for Critical Features
For production monitoring, configure online scoring to evaluate 100% of samples for critical traces. This ensures comprehensive observability rather than relying on sampled data.
Leverage Remote Eval for Complex Workflows
For complex workflows that can't be easily pushed to evaluation platforms, use remote eval to bridge local codebases with cloud evaluation tools while maintaining consistency with local development patterns.
Data Management Strategy
How to build and iterate on evaluation datasets
"In an AI interaction you can have five five LLM calls that happen between the user and like the notion AI response. We can extract just one of those and put it in a data set."
Extracting specific LLM calls from complex interactions
00:17:47"Data sets are hand curated and it can be just one aspect of your trace."
Hand-curated datasets vs. logs
"So some tips here. We recommend you just get started. You don't need to create 200 rows in a dataset to run your first eval. You know, 5-10 rows is great. Just start small and get some feedback." (00:38:07)
Start small: 5-10 rows for first eval
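For a sense of scale, a first eval in the Braintrust Python SDK can really be this small. This is a minimal sketch assuming a hypothetical summarization task; the project name, rows, and stubbed task are placeholders, not details from the talk.

```python
from braintrust import Eval
from autoevals import Factuality

# A handful of hand-picked rows is enough for a first run; grow the dataset
# later from production logs. Rows and names below are illustrative only.
ROWS = [
    {"input": "Summarize: Q3 planning notes ...", "expected": "Goals, owners, and deadlines for Q3."},
    {"input": "Summarize: incident retro ...", "expected": "Root cause and follow-up actions."},
    # ...5-10 rows total is plenty to get the first feedback loop going
]

def task(input: str) -> str:
    # Call your real prompt/model here; a stub keeps the sketch self-contained.
    return "Goals, owners, and deadlines for Q3."

Eval(
    "Notion AI demo",        # placeholder Braintrust project name
    data=lambda: ROWS,       # start small, keep iterating
    task=task,
    scores=[Factuality],     # off-the-shelf LLM-as-a-judge scorer from autoevals
)
```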
"And you know, the next thing is: don't stop iterating. Keep adding rows or tweaking rows, using logs as that source of truth, right? How are your users interacting with the feature that you're developing?" (00:38:21)
Continuous iteration using production logs
"Human review is another way of establishing ground truth, and it's very much needed in certain industries. If you're dealing in the medical space, you need doctors to look at the output, or you need lawyers or people with highly specialized domain knowledge." (00:38:49)
Human review for specialized domains
Understanding User Feedback
Why thumbs up/down can be misleading and how to extract real value
"We don't really rely on thumbs up for anything except for maybe internal notinos thumbs uping things as like good golden data for fine-tuning, but we don't really rely there's like no consistency in what makes someone thumbs up something."
Limitations of thumbs up feedback
00:18:54"And for thumbs down data, um it's more just that this is a functionality that we know we didn't do our best work on. Um but that thumbs down could have been given in September of 2023."
Thumbs down doesn't reflect current performance
00:19:07"So we don't perform how we did in September 2023. So it doesn't necessarily align with the LLM as a judge because that's judging a particular experiment, not um what the production user experience was at that time."
Temporal misalignment of feedback data
00:19:17"And that's what makes this really powerful because we don't just need to look at what our output was in September 2023, right? Our data can be far more robust and last much longer because really what we're getting from the thumbs down is the natural language request from the user."
Value: extracting user intent from feedback
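One way to act on this, sketched with the Braintrust Python SDK: instead of treating the thumbs-down as a label, lift the user's natural-language request out of the flagged interaction and add it to an eval dataset. The feedback records, field names, and project name below are hypothetical stand-ins, not Notion's actual schema.

```python
import braintrust

# Hypothetical thumbs-down records exported from a feedback store.
# Only the idea matters: keep the user's request, discard the stale output.
thumbs_down = [
    {"user_request": "Turn these meeting notes into action items", "old_output": "..."},
    {"user_request": "Draft a status update from this doc", "old_output": "..."},
]

dataset = braintrust.init_dataset(project="AI Assistant", name="thumbs-down-requests")

for record in thumbs_down:
    # The request is the durable signal; the 2023-era output is not a useful
    # label because the product no longer performs the way it did back then.
    dataset.insert(
        input=record["user_request"],
        expected=None,  # ground truth can be added later via curation or human review
        metadata={"source": "thumbs_down"},
    )
```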
Automated Prompt Optimization
Notion's experience with automated prompt optimization tools
"The question was about automated prompt optimization. The answer is yes. We have played with it. I'm not sure that a majority of our problems are as solved by that as they could be in other workplace contexts."
Notion has experimented with automated prompt optimization
00:18:07"I've been to like dinners and events where I've heard massive success from it. Maybe we haven't cracked it yet. Um, a lot of people have, but we've played with it."
Others report success, but Notion still exploring
00:18:17Scoring and Normalization
Best practices for creating and normalizing evaluation scores
"Yeah, the question was about aligning thumbs up, thumbs down with scoring functions. Um, they don't I mean, so a majority of things in our data set were things that are either thumbs up or thumbs down."
Data composition: mostly thumbs up/down
00:18:32"That's not necessarily what I think these scores are designed to do, right? But I can engineer my prompt and try to maximize some kind of score that I can get, right? But that does not mean that I have a range in which I try to operate."
Score optimization vs. operational ranges
00:56:30"Yeah, it's definitely something we've heard before. Um I think that's part of the the struggle of writing scores is trying to normalize them and and bring them into that range of zero to one. So up to you of of what you decide the floor and ceiling are."
Normalizing scores to 0-1 range
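A rough sketch of what picking a floor and ceiling can look like in a custom scorer. The response-length heuristic and the thresholds are arbitrary examples for illustration, not anything recommended in the talk.

```python
def conciseness(input, output, expected=None, **kwargs):
    """Toy custom scorer that maps raw response length onto a 0-1 score.

    The ceiling (<= 50 chars scores 1.0) and floor (>= 400 chars scores 0.0)
    are arbitrary product decisions; the point is only that whatever raw
    quantity you measure gets clamped and normalized into the 0-1 range.
    """
    floor, ceiling = 400, 50
    length = len(output)
    if length <= ceiling:
        return 1.0
    if length >= floor:
        return 0.0
    # Linear interpolation between the ceiling and the floor.
    return (floor - length) / (floor - ceiling)
```

A scorer like this can sit alongside LLM-as-a-judge scorers in the scores list of an Eval call, as in the earlier sketch.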
Multi-Turn Conversation Evaluation
Strategies for evaluating complex agent workflows
"Do you have anything for evaluating multiurren conversations and as part of that would you be able to review the agent features?"
Multi-turn conversation evaluation
00:57:10"So the idea is that you could provide a whole back and forth in in context and then evaluate that multi-turn conversation at once. You could do it as well in the SDK."
Providing full conversation context for evaluation
00:57:33"Uh so here we could just grab the two but the idea is that the output of the initial prompt will become the input of the next and so on and then you can evaluate them as a unit right and as opposed to the the multi-turn right you're providing all the context at once"
Agents: chaining prompts with sequential evaluation
00:57:54"With the multi-turn extra messages approach, you're providing everything at once to the LLM. So it's one LM call. So yeah, you would need to comply with the context window of the model that you're working with."
Context window considerations for multi-turn
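A rough sketch of the extra-messages style in the Braintrust Python SDK: each dataset row carries the full back-and-forth, the task makes one LLM call with that history, and a scorer judges the final reply. The conversation rows, model choice, and scorer are illustrative assumptions, not details from the talk.

```python
from braintrust import Eval, wrap_openai
from openai import OpenAI
from autoevals import Factuality

client = wrap_openai(OpenAI())  # wrapped client so the LLM call is traced

# Each row's input is the whole conversation so far; it all has to fit in
# the context window of the model, since it is evaluated in one call.
CONVERSATIONS = [
    {
        "input": [
            {"role": "user", "content": "Summarize my Q3 planning page."},
            {"role": "assistant", "content": "It covers goals, owners, and deadlines."},
            {"role": "user", "content": "Now turn that into three action items."},
        ],
        "expected": "Three concrete action items with owners.",
    },
]

def task(messages):
    # One LLM call that sees the full multi-turn history at once.
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

Eval(
    "Multi-turn demo",   # placeholder project name
    data=lambda: CONVERSATIONS,
    task=task,
    scores=[Factuality],
)
```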
"With the agent feature, though, each prompt is its own LLM call." (00:58:53)
Agents make multiple LLM calls
Trace-Based Debugging
Using traces and spans for production debugging
"Think of this as a as a trace and all of these uh sort of under the hood or underneath it are the individual spans that we actually want to understand."
Traces and spans hierarchy
01:17:17"This this allows us the visibility into the multiple steps that the application will take and allows us like I like I uh showed you in that previous example to create scores for these individual spans as well."
Visibility into multi-step applications
01:17:25"But then being able to drill down into these individual things right so what are my commits what is my latest excuse me release uh and then again being able to uh to drill down into these becomes really beneficial as we start to need to understand how our application is performing"
Drilling down into spans for debugging
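A minimal sketch of instrumenting a multi-step app with the Braintrust Python SDK so each step shows up as its own span under one trace. The retrieval and generation steps, model, and project name are hypothetical placeholders.

```python
from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="AI Assistant")  # placeholder project name
client = wrap_openai(OpenAI())                # LLM calls are captured as spans

@traced
def retrieve_context(query: str) -> str:
    # Stand-in for a real retrieval step; its inputs/outputs land in this span.
    return "Relevant workspace pages for: " + query

@traced
def generate_answer(query: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

@traced
def answer(query: str) -> str:
    # Top-level span (the trace); the helpers above appear as child spans
    # you can drill into when debugging production behavior.
    context = retrieve_context(query)
    return generate_answer(query, context)

if __name__ == "__main__":
    print(answer("What did we decide in the Q3 planning meeting?"))
```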
"And this allows you to get very granular. Right now, we'll just configure all of these for the overall trace, and we'll do it for 100% of the samples." (01:19:20)
Granular scoring configuration
Production Monitoring with Online Scoring
Monitoring production AI applications with configured scores
"This is pre-production, right? This is uh now when we have our logs running in production, we want to understand how our application is performing uh relevant or excuse me uh relative to the scores that we've already configured."
Production monitoring with configured scores
01:18:19"And this allows you to get very granular. Right now, we'll just configure all of these for the the overall trace, and we'll do it for 100% of the uh the the samples."
100% sampling for production evaluation
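The online-scoring rules and their sampling rate are configured in the Braintrust UI, per the talk; on the code side, a hedged sketch of what production logging can look like so those scorers have complete traces to run against. The project name, span name, and fields are illustrative.

```python
from braintrust import init_logger

logger = init_logger(project="AI Assistant")  # placeholder project name

def handle_request(user_query: str) -> str:
    # Log every production request as a trace; online scoring rules configured
    # in the platform (e.g. 100% sampling for critical features, per the talk)
    # then evaluate these logs automatically.
    with logger.start_span(name="ai-request") as span:
        answer = "...model output..."  # call the real pipeline here
        span.log(input=user_query, output=answer, metadata={"feature": "q&a"})
        return answer
```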
Remote Evaluation
Bridging local codebases with cloud evaluation platforms
"The difference is that you use the d-dev flag. So, this will start that remote eval server. It will default to local host, but you can change it. Um and it will allow it to bind to to all the it will allow it to listen to all the network interfaces so you could access it uh via yeah access remote servers out in the world."
Remote eval server for distributed evaluation
01:38:26"Think of like the the more complex type of workflows that that you may have uh within your codebase and you don't have the ability necessarily to push those into brain trust."
Complex workflows that can't be easily pushed to Braintrust
01:40:25"We can still leverage this platform with that existing codebase via remote eval. And this sort of like uh interaction works very similar to to what I showed earlier with uh the eval locally via the SDK as well as the the playground."
Remote eval bridges local code with Braintrust platform
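Roughly, the local side of remote eval is the same Eval definition as before, except the task calls straight into your existing codebase and you start it with the --dev flag so the platform can drive it. The file name, pipeline function, and project name are placeholders, and the command in the comment is an approximation of what the talk demonstrated.

```python
# my_remote_eval.py - hedged sketch; names below are placeholders.
from braintrust import Eval
from autoevals import Levenshtein

def existing_pipeline(input: str) -> str:
    # Stand-in for the complex, hard-to-export workflow living in your codebase.
    return "result produced by internal retrieval + prompting steps for: " + input

Eval(
    "Existing codebase evals",  # placeholder project name
    data=lambda: [{"input": "draft a summary", "expected": "a short summary"}],
    task=existing_pipeline,     # reuse local code directly
    scores=[Levenshtein],
)

# Per the talk, this is served with the --dev flag (exact invocation may differ):
#   braintrust eval --dev my_remote_eval.py
# It listens on localhost by default so the Braintrust playground can run the
# eval against your local code; bind to other interfaces to expose it further.
```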
Companies & Technologies
Tools and organizations mentioned
Notion
All-in-one workspace platform with AI integration. Sarah Sachs leads their AI team, focusing on evaluation-driven development.
Braintrust
Evaluation platform for AI products. Provides tools for datasets, scoring, trace-based debugging, and production monitoring.
Based on the AI Engineer Conference talk by Sarah Sachs (Notion) and Carlos Esteban (Braintrust).