2025 is the Year of Evals!
Just like 2024, and 2023, and …
John Dickerson, CEO of Mozilla AI (formerly co-founder/chief scientist at Arthur AI), explains why 2025 is genuinely different from previous "years of evals." Three converging forces—executive awareness, budget allocation, and autonomous agentic systems—are driving AI evaluation from niche concern to C-suite imperative.
"This is the year that a CEO is going to get fired because of an ML related screw-up. Uh and to my knowledge it just still hasn't happened."John Dickerson, Mozilla AI (00:06:22)
Why 2025 is Different
After years of being "the year of evals" (2023, 2024), 2025 is genuinely the breakthrough year because AI systems are now making autonomous decisions, requiring rigorous evaluation that connects directly to business KPIs and risk management.
The Core Thesis
ML monitoring and evaluation are "two sides of the same sword" - you can't have observability without measurement. This wasn't top-of-mind until recently because:
- ML outputs were "spitting out numbers" that got ingested into opaque systems
- Decision-makers prioritized security, latency, and other concerns over evaluation
- The "year of the CEO getting fired" pitch never materialized (no major ML catastrophes)
But in 2025: Agentic systems now ACT autonomously, introducing complexity and risk that make evaluation mandatory.
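A minimal sketch of that framing in code (not from the talk; `score`, `evaluate`, and `Monitor` are hypothetical names): the same metric function drives both offline evaluation and online monitoring, which is why one cannot exist without the other.

```python
from collections import deque
from statistics import mean

def score(prediction: str, reference: str) -> float:
    """Hypothetical task-quality metric; exact match as a stand-in."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def evaluate(model, dataset) -> float:
    """Offline evaluation: aggregate the metric over a labeled dataset."""
    return mean(score(model(x), y) for x, y in dataset)

class Monitor:
    """Online monitoring: the same metric, streamed over live traffic."""

    def __init__(self, alert_threshold: float = 0.8, window: int = 100):
        self.alert_threshold = alert_threshold
        self.scores = deque(maxlen=window)

    def observe(self, prediction: str, reference: str) -> float:
        """Score one live response; flag when rolling quality degrades."""
        self.scores.append(score(prediction, reference))
        rolling = mean(self.scores)
        if rolling < self.alert_threshold:
            print(f"ALERT: rolling quality {rolling:.2f} below threshold")
        return rolling
```

The design choice worth noting: `Monitor` owns no metric logic of its own. If the two sides ever compute quality differently, dashboards and eval reports quietly diverge.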
The Three Converging Forces
Why evals are finally top-of-mind in 2025: three factors converged at the perfect moment.
Executive Awareness
November 30, 2022
ChatGPT launched, making AI accessible to non-technical executives. CEOs, CFOs, CISOs could finally understand and interact with AI.
Watch context (00:01:30)
Budget Freeze Timing
October-November 2022
Fear of recession froze IT budgets for 2023. Exception: Money opened up for GenAI "pet projects."
"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."
Rise of Agentic Systems
2025
Systems now "acting for humans" and teams. Not just providing inputs into larger systems, but making autonomous decisions.
"Agents are starting to make decisions and take actions, complex steps that lead toward an action"
The Key Insight
"You have a lot of complexity introduced into the system and you have a lot of risk introduced into the system and that's great for those of us in eval." — Agentic systems that ACT autonomously create risks that never existed before, making evaluation mandatory rather than optional.
Historical Timeline: 2012-2025
Understanding the evolution of ML monitoring and why 2025 is the breakthrough moment.
Pre-ChatGPT Era (2012-2022): ML Monitoring Existed But Wasn't "The Thing"
Companies: H2O, Algorithmia, Seldon, Arize, Arthur, Galileo, Fiddler, Protect AI. ML models "spitting out numbers" that got ingested into opaque systems. The "Year of the CEO Getting Fired" pitch never happened—no major ML catastrophes.
JPMC Example: From 2017-2021, JPMC put $100 million into AI/ML. "That's not a huge amount of money for JPMC" — showing how little priority ML had pre-ChatGPT.
2023: Year of Science Projects
ChatGPT awareness spreads. Austerity from frozen budgets forced enterprises to spend only on GenAI. Science projects floated around within enterprises: chat applications, internal hiring tools.
2024: Going Into Production
GenAI applications going into production. Primarily internal deployments. "Folks who tend to dress in business suits" started asking questions about ROI, governance, risk, compliance, brand optics.
2025: The Year of Evals (Shipping & Scaling)
Science projects from 2023 → production in 2024 → shipping and scaling in 2025. Technologies have "gotten really amazing." Revenue numbers for frontier model providers are "going up right now."
"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."
C-Suite Alignment: Why Enterprises Are Buying
All five C-suite executives now control budgets and are aligned on the need for AI evaluation. Each has different motivations but reaches the same conclusion.
CEO
Now Understands AI: ChatGPT made AI tangible. Can allocate budget, talk to boards, understand capabilities of generative and agentic systems.
"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."
CFO
Needs Quantitative Evaluation: Writing numbers into Excel spreadsheets. Doing allocation and budget planning. Those numbers have to come from evaluation.
CISO
Sees Security Risk & Opportunity: More "scrappy" than the CIO; willing to write smaller checks more quickly. Focus: hallucination detection, prompt injection.
CIO
Wants to Keep the Job: Primary concern is that nothing breaks. Conservative deployment approach. Needs evaluation to ensure stability.
CTO
Always Wants Standards: Needs to make decisions based on numbers from evals. Standardization, consistency, tech sprawl prevention.
The Alignment Insight
All these executives now "control a lot of budget and now they are all willing to talk about and they're all aligned about basically the need to understand evaluation from AI." This is why 2025 is different: C-suite alignment on both budget and need.
Market Validation: Revenue Numbers Don't Lie
The proof is in the numbers. Evaluation startups are seeing hockey stick growth.
The Information Article Leak (Mid-April)
Leaked revenue numbers for evaluation startups: Arize, Galileo, Braintrust, and others. The numbers were "lagged by about six months or eight months."
"From talking to friends in the space, those numbers are no longer representative of what folks in this area are making"
Early 2026 Prediction
Dickerson expects the next round of leaks, in early 2026, to show that:
"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."
Industry Consensus: Multi-Agent Systems Monitoring
All evaluation, observability, monitoring, and security companies have shifted to "multi-agent systems monitoring." It is well understood in industry and government: "You should monitor the whole system. You shouldn't just monitor the one model that is being used by one particular agent."
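A hedged sketch of that consensus (the pipeline, names, and judges are illustrative assumptions, not anything the talk prescribes): trace every agent handoff, then score both the per-agent steps and the end-to-end task, because each step can look locally fine while errors compound across handoffs.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Records every agent handoff so both levels of evaluation can run."""
    steps: list = field(default_factory=list)  # (agent_name, input, output)

    def record(self, agent: str, inp: str, out: str) -> None:
        self.steps.append((agent, inp, out))

def run_pipeline(task: str, agents: dict, trace: Trace) -> str:
    """Chain agents in order, tracing each step; returns the final output."""
    payload = task
    for name, agent_fn in agents.items():
        out = agent_fn(payload)
        trace.record(name, payload, out)
        payload = out
    return payload

def per_agent_scores(trace: Trace, step_judge) -> dict:
    """Local view: score each agent's step in isolation."""
    return {agent: step_judge(inp, out) for agent, inp, out in trace.steps}

def system_score(task: str, final_output: str, task_judge) -> float:
    """Global view: did the whole system accomplish the original task?"""
    return task_judge(task, final_output)
```

Per-agent scores still help localize a failure, but alerts should key on `system_score`: that is "monitor the whole system" in one line.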
Q&A: Practical Implementation Insights
Real-world insights from the Q&A session on how evaluation is actually being implemented.
Q: How do you solve domain expertise in evaluation?
Most evaluations in GenAI require domain expertise. How is this getting solved, given that the data is unstructured and quality has to be measured subjectively ("is it acting like a human")?
A: Expert Validation at $50-200/hour
A leaked Mercor spreadsheet shows experts being hired at $50-$200/hour. Companies like Google, Meta, and large banks are hiring experts who sit "alongside the multi-agent system" doing "expensive human validation."
Use case: Critical tasks like discounted cash flow analysis "where you can either make or lose a lot of money and lose your job if you get it wrong."
"It's worth spending that large amount of money doing the human validation"
Q: Timeline on LLMs-as-Judges for evaluation?
What's the rough timeline for when eval will be driven primarily by GenAI, or LLMs-as-judges?
A: "Poor Man's Version of a Human" — With Biases
The LLM-as-a-judge paradigm is getting used in practice, but there are issues with it: a paper at ICLR last month catalogued the biases LLMs-as-judges exhibit versus humans (conciseness, helpfulness, "some of those anthropic words").
Pros: Solves the dataset creation problem. You can give a persona to an LLM and "it is like a poor man's version of a human doing the judging."
Cons: "You need to make sure you're validating this and making sure that you're not going off in some weird bias direction."
"It's the data set creation and the environment creation that matters more than anything"
Top Quotes from the Talk
The most impactful, quotable moments with verified YouTube timestamps.
"That discretionary budget was unlocked specifically for now the CEO's pet projects which were called GenAI."
How frozen budgets + ChatGPT launch created perfect timing
Watch (00:09:12)
"Agents are starting to make decisions and take actions, complex steps that lead toward an action"
Why agentic systems change everything
Watch (00:02:58)
"If I need to have a quantitative estimate of risk, then I need to do evaluation."
The necessity of evaluation for risk management
Watch (00:10:17)
"Revenue no longer lags at AI evaluation startups because this is the year for AI evaluation."
Market validation from leaked revenue numbers
Watch (00:15:42)
"It's the data set creation and the environment creation that matters more than anything"
The most important insight from Q&A
Watch (00:17:57)
"Enable the open source community to be at the same table as a Sam Altman"
Mozilla AI's mission
Watch (00:00)"ML monitoring and evaluation as two sides of the same sword or ruler, right? You can't do monitoring or observability without being able to measure and measurement is the core functionality for evaluation."
Framework thinking
Watch (01:23)"If I need to have a quantitative estimate of risk, then I need to do evaluation."
Risk management necessity
Watch (10:17)"Those numbers have to come from, in part, quantitative evaluation."
CFO writing numbers into Excel spreadsheets
Watch (00:13:22)
Key Takeaways
1. 2025 is Genuinely Different
Breakthrough Year
- Three converging forces: executive awareness, budget timing, agentic systems
- AI systems now make autonomous decisions, not just provide inputs
- C-suite alignment on budget + need for evaluation
- Revenue leaks show hockey stick growth in eval startups
2. Budget Dynamics Matter
Executive Buy-in
- Budget freeze (Oct-Nov 2022) + ChatGPT launch (Nov 30, 2022) = perfect timing
- GenAI became the only allowable spending during austerity
- CEO's 'pet projects' unlocked discretionary budget
- JPMC example: $100M from 2017-2021 shows how little priority ML had pre-ChatGPT
3. Agentic Systems Change Everything
New Risk Profile
- Agents perceive, learn, abstract, reason AND ACT
- Autonomous decision-making creates complexity and risk
- Before: chatbots with a human in the loop (lower risk, optional monitoring)
- After: agentic systems acting autonomously (higher risk, mandatory evaluation)
4. All C-Suite Executives Aligned
Business Case
- CEO: Now understands AI, allocating budget
- CFO: Needs quantitative evaluation for Excel spreadsheets
- CISO: Security risk, willing to write smaller checks quickly
- CIO: Job security, wants stability
- CTO: Always wants standards, needs numbers for decisions
5. Practical Implementation Insights
Q&A Takeaways
- Expert validation costs $50-200/hour (leaked Mercor spreadsheet)
- LLM-as-a-judge is a 'poor man's version of a human' but has biases
- Most important: 'It's the data set creation and the environment creation that matters more than anything'
- Multi-agent systems monitoring: monitor the whole system, not just one model
6. Market Validation is Real
Revenue Growth
- The Information article leaked revenue numbers (6-8 months old, already outdated)
- Expecting early 2026 leaks to show hockey stick growth
- All eval/obs/monitoring/security companies shifted to multi-agent monitoring
- Industry and government consensus: system-level evaluation
Source Video
2025 is the Year of Evals! Just like 2024, and 2023, and …
John Dickerson • CEO, Mozilla AI
Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers the three converging forces framework, historical timeline 2012-2025, C-suite alignment analysis, market validation from leaked revenue data, and Q&A insights on domain expertise and LLM-as-a-judge limitations.
Key Concepts: Three converging forces, executive awareness, budget freeze timing, agentic systems, C-suite alignment, ROI metrics, business KPIs, risk management, multi-agent systems monitoring, expert validation ($50-200/hr), LLM-as-a-judge biases, dataset creation, environment creation, market validation, revenue growth