Production Reliability

Production Software Keeps Breaking and It Will Only Get Worse

Anish Agarwal from Traversal.ai explains why AI writes code faster but debugging gets harder. A crisis in AI development demanding a new approach combining causal ML, LLMs, and swarms of agents.

"I think if we just continue in the way we're going, most of our time I think is going to be spent doing on call for the vast majority of us."

— Anish Agarwal, CEO Traversal.ai • 2:05

~18 min

Duration

Comprehensive framework

40%

MTTR Reduction

DigitalOcean case study

~5 min

Resolution Time

Average incident fix

40-60

Engineers Impacted

Scaled incident response

The Crisis: AI Writes Code Faster, But Debugging Gets Harder

We're witnessing a paradox in AI development. Tools like Cursor, Windsurf, and GitHub Copilot have dramatically accelerated code generation, but they've also created a dangerous gap: humans now write less code, understand less context, and are increasingly unable to troubleshoot the complex systems they've built.

Anish Agarwal, CEO of Traversal.ai, frames software engineering in three categories: system design (creative architectural work), development (writing code and DevOps—being automated by AI), and troubleshooting (debugging production incidents, which is becoming exponentially harder).

The current workflow—what Anish calls "dashboard dumpster diving"—doesn't scale. Engineers search through thousands of dashboards (Grafana, Datadog, Splunk, Elastic, Sentry) looking for anomalous patterns. When that fails, they stare at codebases waiting for inspiration. This loop continues until resolved, often with 30-100 people in Slack channels.

The Grim Reality of On-Call Work

"And that would be kind of a sad existence for ourselves if that's what happens, right?" — Anish Agarwal warns that without automated troubleshooting, most engineers will spend most of their time on on-call duty instead of creative work. The bottleneck has shifted from writing code to understanding why it broke.

The Problem: Dashboard Dumpster Diving

Current incident response workflow is broken. Engineers cycle through two stages, neither of which scales with AI-written code complexity.

Dashboards and Alerts

Engineers stare at Grafana, Datadog, New Relic, and thousands of other dashboards—looking for anomalies, spikes, patterns that explain the incident

Code and Logs

When dashboards fail, engineers dive into codebases, grep logs, trace through distributed systems—waiting for inspiration to strike

Human Context Loss

The scaling problem: With AI-written code, no human has full context of how the system works. Incident channels balloon to 30-100 engineers searching in parallel. The loop continues until someone finds the root cause—often hours later.

Why Current Solutions Fail

Three popular approaches to automated troubleshooting all fail for fundamental reasons.

AIOps: Too Many False Positives

• Traditional anomaly detection generates thousands of alerts

• Signal buried in noise—operators learn to ignore warnings

• Can't distinguish between correlated symptoms and root causes

LLMs: Can't Handle Scale

• Petabytes of logs don't fit in context windows

• Data doesn't fit in memory—or even in entire clusters

• RAG helps but still limited by retrieval quality

React-Style Agents: Runbooks Deprecated

• Static playbooks brittle to changing systems

• Runbooks deprecated by the time they're built

• Sequential tool calling too slow (need 2-5 min resolution)

The Fundamental Gap

None of these solutions address the core problem: understanding causal relationships in complex systems, not just correlations. You need correlation filtering (statistics), semantic understanding (LLMs), and massive parallelism (swarms)—all three working together.

The Solution: Three-Part Framework

Traversal.ai combines three approaches into a unified system for autonomous troubleshooting.

Statistics: Correlation vs Causation

Causal ML techniques identify root causes, not just correlations. Distinguish between symptoms (correlated failures) and underlying issues.

"Correlation doesn't imply causation—but in troubleshooting, you need both"

Semantics: Understanding Logs and Code

LLMs provide semantic understanding of log messages, error patterns, and code context. Translate technical jargon into human-readable insights.

"LLMs bridge the gap between raw data and actionable insights"

Swarms: Parallel Tool Execution

Thousands of agents running in parallel, each making independent tool calls. Massive parallelism for rapid hypothesis testing.

"Swarm intelligence emerges from coordinated parallel exploration"

How It Works Together

1. Statistics identifies potential root causes by filtering correlated failures

2. Semantics understands context and generates testable hypotheses

3. Swarms test hypotheses in parallel at massive scale (thousands of agents)

4. Human operator validates findings and implements fixes

DigitalOcean Case Study: Real-World Impact

DigitalOcean

Cloud computing platform • 40-60 engineers involved

Before Traversal vs After Traversal

MTTR

Hours of manual searching → 40% reduction

Process

Dashboard diving + code staring → Automated causal analysis

Resolution

Slow, context loss across team → ~5 minutes average

Outcome

40-60 people in incident channels → Faster fixes, less toil

MTTR Reduction

40%

Mean time to resolve incidents reduced by nearly half

Avg Resolution

~5 min

Automated analysis and hypothesis testing

Engineers

40-60

Empowered with AI-driven troubleshooting

Hypotheses

1000s

Tested in parallel by swarms of agents

"We've measured that about 40% reduction in the amount of time that it takes to find and resolve production incidents."

— Matt, First Employee at Traversal.ai

DigitalOcean measured the impact of Traversal's autonomous troubleshooting platform

13:44

Top 12 Quotes from the Talk

Direct quotes from the YouTube video with timestamped links for verification

Production Reality

"I think if we just continue in the way we're going, it's actually going to look the opposite is that most of our time I think is going to be spent doing on call for the vast majority of us."

Anish Agarwal

CEO, Traversal.ai