Market Breakthrough

2025 is the Year of Evals!

Just like 2024, and 2023, and …

John Dickerson, CEO of Mozilla AI (formerly co-founder/chief scientist at Arthur AI), explains why 2025 is genuinely different from previous "years of evals." Three converging forces—executive awareness, budget allocation, and autonomous agentic systems—are driving AI evaluation from niche concern to C-suite imperative.

"This is the year that a CEO is going to get fired because of an ML related screw-up. Uh and to my knowledge it just still hasn't happened."
John Dickerson, Mozilla AI (00:06:22)
  • Nov 2022: ChatGPT launch, making AI accessible to executives
  • $100M: JPMC AI investment (2017-2021)
  • 3 Forces: executive awareness + budget + agentic systems
  • 2025: the breakthrough year

Why 2025 is Different

After years of being "the year of evals" (2023, 2024), 2025 is genuinely the breakthrough year because AI systems are now making autonomous decisions, requiring rigorous evaluation that connects directly to business KPIs and risk management.

The Core Thesis

ML monitoring and evaluation are "two sides of the same sword": you can't have observability without measurement (a minimal sketch after this list illustrates the shared-measurement idea). This wasn't top-of-mind until recently because:

  • ML outputs were "spitting out numbers" that got ingested into opaque systems
  • Decision-makers prioritized security, latency, and other concerns over evaluation
  • The "year of the CEO getting fired" pitch never materialized (no major ML catastrophes)

But in 2025: Agentic systems now ACT autonomously, introducing complexity and risk that make evaluation mandatory.
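
To make the "two sides of the same sword" point concrete, here is a minimal sketch, not from the talk: one measurement function underpins both offline evaluation and online monitoring. The toy metric and all names are illustrative assumptions.

```python
# A toy quality metric: the "measurement" both sides share.
def measure(output: str, reference: str) -> float:
    out_tokens, ref_tokens = set(output.split()), set(reference.split())
    return len(out_tokens & ref_tokens) / max(len(ref_tokens), 1)

# Evaluation: score a fixed test set before shipping.
def evaluate_offline(pairs: list[tuple[str, str]]) -> float:
    return sum(measure(out, ref) for out, ref in pairs) / max(len(pairs), 1)

# Monitoring: apply the same measurement to live traffic and alert on drops.
def monitor_online(output: str, reference: str, threshold: float = 0.5) -> None:
    score = measure(output, reference)
    if score < threshold:
        print(f"ALERT: quality score {score:.2f} below threshold {threshold}")

# The same measure() sits in both paths: no observability without measurement.
print(evaluate_offline([("refund approved", "refund approved"), ("no", "yes")]))
monitor_online("green", "the sky is blue")  # triggers the alert
```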

The Three Converging Forces

Why evals are finally top-of-mind in 2025: three factors converged at the perfect moment.

Executive Awareness

November 30, 2022

ChatGPT launched, making AI accessible to non-technical executives. CEOs, CFOs, CISOs could finally understand and interact with AI.

Watch context (01:30)

Budget Freeze Timing

October-November 2022

Fear of recession froze IT budgets for 2023. Exception: Money opened up for GenAI "pet projects."

"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."
Watch explanation (09:12)

Rise of Agentic Systems

2025

Systems now "acting for humans" and teams. Not just providing inputs into larger systems, but making autonomous decisions.

"Agents are starting to make decisions and take actions, complex steps that lead toward an action"
Watch explanation (02:58)

The Key Insight

"You have a lot of complexity introduced into the system and you have a lot of risk introduced into the system and that's great for those of us in eval." — Agentic systems that ACT autonomously create risks that never existed before, making evaluation mandatory rather than optional.

Historical Timeline: 2012-2025

Understanding the evolution of ML monitoring and why 2025 is the breakthrough moment.

2012-2022

Pre-ChatGPT Era: ML Monitoring Existed But Wasn't "The Thing"

Companies: H2O, Algorithmia, Seldon, Arize, Arthur, Galileo, Fiddler, Protect AI. ML models "spitting out numbers" that got ingested into opaque systems. The "Year of the CEO Getting Fired" pitch never happened—no major ML catastrophes.

JPMC Example: From 2017-2021, JPMC put $100 million into AI/ML. "That's not a huge amount of money for JPMC" — showing how little priority ML had pre-ChatGPT.

2023

Year of Science Projects

ChatGPT awareness spreads. Austerity from frozen budgets forced enterprises to spend only on GenAI. Science projects floated around within the enterprise: chat applications, internal hiring tools.

2024

Going Into Production

GenAI applications going into production. Primarily internal deployments. "Folks who tend to dress in business suits" started asking questions about ROI, governance, risk, compliance, brand optics.

2025

The Year of Evals (Shipping & Scaling)

Science projects from 2023 → production in 2024 → shipping and scaling in 2025. The technologies have "gotten really amazing." Revenue numbers for frontier model providers are "going up right now."

"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."
Watch (15:42)

C-Suite Alignment: Why Enterprises Are Buying

All five C-suite executives now control budgets and are aligned on the need for AI evaluation. Each has different motivations but reaches the same conclusion.

CEO

Now Understands AI: ChatGPT made AI tangible. Can allocate budget, talk to boards, understand capabilities of generative and agentic systems.

"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."

CFO

Needs Quantitative Evaluation: Writing numbers into Excel spreadsheets. Doing allocation and budget planning. Those numbers have to come from evaluation.

Needs ROI metrics and risk quantification

CISO

Sees Security Risk & Opportunity: More "scrappy" than CIO. Willing to write smaller checks more quickly. Focus: hallucination detection, prompt injection.

Guardrail products entered CISO's office early

CIO

Wants to Keep Job: Concerned about nothing breaking. Conservative deployment approach. Needs evaluation to ensure stability.

Prioritizes stability and risk mitigation

CTO

Always Wants Standards: Needs to make decisions based on numbers from evals. Standardization, consistency, tech sprawl prevention.

Mandates evaluation tools and processes

The Alignment Insight

All these executives now "control a lot of budget and now they are all willing to talk about and they're all aligned about basically the need to understand evaluation from AI." This is why 2025 is different—C-suite alignment on budget + need.

Market Validation: Revenue Numbers Don't Lie

The proof is in the numbers. Evaluation startups are seeing hockey stick growth.

The Information Article Leak (Mid-April)

Leaked revenue numbers for evaluation startups: Arize, Galileo, Braintrust, and others. The numbers were "lagged by about six months or eight months."

"From talking to friends in the space, those numbers are no longer representative of what folks in this area are making"

Early 2026 Prediction

Expecting early 2026 leaks to show "revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."

"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."
Watch (15:42)

Industry Consensus: Multi-Agent Systems Monitoring

All evaluation/observability/monitoring/security companies shifted to "multi-agent systems monitoring." Well understood in industry and government: "You should monitor the whole system. You shouldn't just monitor the one model that is being used by one particular agent."
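
As a hedged illustration of that consensus, the sketch below instruments every agent step with a shared trace and evaluates the end-to-end trajectory rather than a single model call. The Trace structure, agent names, and metrics are assumptions made for the example, not a description of any particular product.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Step:
    agent: str
    action: str
    output: str
    latency_s: float

@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)

    def record(self, agent: str, action: str, fn, *args):
        # Every agent call passes through here, so the trace sees the whole system.
        start = time.perf_counter()
        output = fn(*args)
        self.steps.append(Step(agent, action, output, time.perf_counter() - start))
        return output

def evaluate_trace(trace: Trace) -> dict:
    """Score the end-to-end trajectory, not any single model call."""
    return {
        "num_steps": len(trace.steps),
        "total_latency_s": sum(s.latency_s for s in trace.steps),
        "agents_involved": sorted({s.agent for s in trace.steps}),
    }

# Usage: the evaluation covers handoffs between agents, not just one model.
trace = Trace()
plan = trace.record("planner", "plan", lambda q: f"steps for: {q}", "refund request")
answer = trace.record("executor", "act", lambda p: f"executed: {p}", plan)
print(evaluate_trace(trace))
```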

Q&A: Practical Implementation Insights

Real-world insights from the Q&A session on how evaluation is actually being implemented.

Q: How do you solve domain expertise in evaluation?

Most evaluations in GenAI require domain expertise. How is this problem getting solved, given that the data is unstructured and you have to measure quality ("is it acting like a human")?

A: Expert Validation at $50-200/hour

A leaked Mercor spreadsheet showed experts hired at $50-$200/hour. Companies like Google, Meta, and large banks are hiring experts. These experts sit "alongside the multi-agent system" doing "expensive human validation."

Use case: Critical tasks like discounted cash flow analysis "where you can either make or lose a lot of money and lose your job if you get it wrong."

"It's worth spending that large amount of money doing the human validation"
Watch Q&A (16:15)
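
One way such expensive human validation could be wired into a pipeline, as a small sketch under stated assumptions: the task labels, hourly rate, and review queue are invented for illustration, not taken from the talk. High-stakes tasks route to expert review; everything else passes through automated checks.

```python
# Assumed task labels for illustration; the talk's example is DCF analysis.
HIGH_STAKES_TASKS = {"discounted_cash_flow", "credit_decision"}
EXPERT_RATE_USD_PER_HOUR = 150  # within the $50-200/hr range cited in the talk

# A stand-in review queue; in practice this would feed a human review tool.
review_queue: list[tuple[str, str]] = []

def enqueue_for_expert_review(task: str, output: str) -> str:
    review_queue.append((task, output))
    return "pending_expert_review"

def route_for_validation(task: str, output: str) -> str:
    # Worth the spend where you can "make or lose a lot of money".
    if task in HIGH_STAKES_TASKS:
        return enqueue_for_expert_review(task, output)
    return "auto_checked"

print(route_for_validation("discounted_cash_flow", "NPV = $4.2M"))  # pending_expert_review
print(route_for_validation("faq_answer", "Our store opens at 9am."))  # auto_checked
```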

Q: Timeline on LLMs-as-Judges for evaluation?

Rough timeline on when eval will primarily be driven by GenAI or LLMs as judges?

A: "Poor Man's Version of a Human" — With Biases

The LLM-as-a-judge paradigm is getting used in practice, though "there are issues with it." A paper at ICLR last month examined the biases LLMs-as-judges show versus humans (conciseness, helpfulness, "some of those anthropic words").

Pros: Solves the dataset creation problem. You can give a persona to an LLM and "it is like a poor man's version of a human doing the judging" (sketched below).

Cons: "You need to make sure you're validating this and making sure that you're not going off in some weird bias direction"

"It's the data set creation and the environment creation that matters more than anything"
Watch Q&A (17:45)
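
The sketch promised above: a persona-based LLM judge plus the validation step the answer insists on. call_llm is a placeholder for whatever model client you use; the persona, rubric, and helper names are illustrative assumptions.

```python
# Persona prompt: the "poor man's version of a human" doing the judging.
JUDGE_PERSONA = (
    "You are a senior financial analyst. Rate the answer from 1 to 5 for "
    "factual accuracy only. Ignore style. Reply with a single integer."
)

def call_llm(prompt: str) -> str:
    # Placeholder: plug in your model client here.
    raise NotImplementedError

def judge(question: str, answer: str) -> int:
    prompt = f"{JUDGE_PERSONA}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    return int(call_llm(prompt).strip())

# Caveat from the talk: judges carry biases (e.g., toward concise or
# "helpful"-sounding answers), so validate judge scores against human
# expert labels before trusting them at scale.
def validate_judge(judge_vs_human: list[tuple[int, int]]) -> float:
    """Fraction of items where the judge's score matches the human label."""
    return sum(j == h for j, h in judge_vs_human) / max(len(judge_vs_human), 1)

print(validate_judge([(4, 4), (5, 3), (2, 2)]))  # ~0.67 agreement
```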

Key Quotes from the Talk

The most impactful, quotable moments with verified YouTube timestamps.

"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."

How frozen budgets + ChatGPT launch created perfect timing

Watch (09:12)
"Agents are starting to make decisions and take actions, complex steps that lead toward an action"

Why agentic systems change everything

Watch (02:58)
"If I need to have a quantitative estimate of risk, then I need to do evaluation."

The necessity of evaluation for risk management

Watch (10:17)
"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."

Market validation from leaked revenue numbers

Watch (15:42)
"It's the data set creation and the environment creation that matters more than anything"

The most important insight from Q&A

Watch (17:57)
"Enable the open source community to be at the same table as a Sam Altman"

Mozilla AI's mission

Watch (00:00)
"ML monitoring and evaluation as two sides of the same sword or ruler, right? You can't do monitoring or observability without being able to measure and measurement is the core functionality for evaluation."

Framework thinking

Watch (01:23)
"If I need to have a quantitative estimate of risk, then I need to do evaluation."

Risk management necessity

Watch (10:17)
"Those numbers have to come from, in part, quantitative evaluation."

CFO writing numbers into Excel spreadsheets

Watch (13:22)

Key Takeaways

1. 2025 is Genuinely Different

Breakthrough Year

  • Three converging forces: executive awareness, budget timing, agentic systems
  • AI systems now make autonomous decisions, not just provide inputs
  • C-suite alignment on budget + need for evaluation
  • Revenue leaks show hockey stick growth in eval startups

2. Budget Dynamics Matter

Executive Buy-in

  • Budget freeze (Oct-Nov 2022) + ChatGPT launch (Nov 30, 2022) = perfect timing
  • GenAI became the only allowable spending during austerity
  • CEO's 'pet projects' unlocked discretionary budget
  • JPMC example: $100M from 2017-2021 shows how little priority ML had pre-ChatGPT

3. Agentic Systems Change Everything

New Risk Profile

  • Agents perceive, learn, abstract, reason AND ACT
  • Autonomous decision-making creates complexity and risk
  • Before: Chatbots with human in the loop (lower risk, optional monitoring)
  • After: Agentic systems acting autonomously (higher risk, mandatory evaluation)

4. All C-Suite Executives Aligned

Business Case

  • CEO: Now understands AI, allocating budget
  • CFO: Needs quantitative evaluation for Excel spreadsheets
  • CISO: Security risk, willing to write smaller checks quickly
  • CIO: Job security, wants stability
  • CTO: Always wants standards, needs numbers for decisions

5. Practical Implementation Insights

Q&A Takeaways

  • Expert validation costs $50-200/hour (leaked Mercor spreadsheet)
  • LLM-as-a-judge is 'poor man's version of a human' but has biases
  • Most important: 'It's the data set creation and the environment creation that matters more than anything'
  • Multi-agent systems monitoring: monitor the whole system, not just one model

6. Market Validation is Real

Revenue Growth

  • The Information article leaked revenue numbers (6-8 months old, already outdated)
  • Expecting early 2026 leaks to show hockey stick growth
  • All eval/obs/monitoring/security companies shifted to multi-agent monitoring
  • Industry and government consensus: system-level evaluation

Source Video

2025 is the Year of Evals! Just like 2024, and 2023, and …

John Dickerson • CEO, Mozilla AI

Video ID: CQGuvf6gSrM • Duration: ~19 minutes • Event: AI Engineer Conference
Tags: AI evaluation • ML monitoring • agentic systems • enterprise AI • C-suite alignment
Watch on YouTube

Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers the three converging forces framework, historical timeline 2012-2025, C-suite alignment analysis, market validation from leaked revenue data, and Q&A insights on domain expertise and LLM-as-a-judge limitations.

Key Concepts: Three converging forces, executive awareness, budget freeze timing, agentic systems, C-suite alignment, ROI metrics, business KPIs, risk management, multi-agent systems monitoring, expert validation ($50-200/hr), LLM-as-a-judge biases, dataset creation, environment creation, market validation, revenue growth

Research sourced from AI Engineer Conference transcript. Speaker: John Dickerson, CEO Mozilla AI (formerly co-founder/chief scientist at Arthur AI, 6 years). Analysis covers why 2025 is genuinely different from previous "years of evals" through the lens of three converging forces, C-suite alignment, and market validation. All quotes verified against original VTT transcript with exact timestamps.