Replit Engineering

Autonomy Is All You Need

Building Long-Running AI Agents That Break the One-Hour Barrier

"We built a coding agent for nontechnical users. We want to build agents that run for several hours in a row without human intervention."

Michele Catasta • Head of ML, Replit

  • Talk Duration: 24 min
  • Types of Autonomy: 2
  • Agent Generations: 3
  • Barrier Broken: >1 hour

Building autonomous agents that run for hours without human intervention requires fundamentally rethinking how we design and engineer AI systems. Michele Catasta, Head of ML at Replit, presents a comprehensive framework for achieving true autonomy—drawing insights from Tesla's Full Self-Driving journey and Replit's own evolution through three generations of agent architectures.

The talk introduces a critical distinction between two types of autonomy: supervised autonomy (human in the loop, like Tesla FSD with driver monitoring) and unsupervised autonomy (fully independent operation). Most AI systems today are stuck in supervised mode. Breaking through to unsupervised autonomy requires overcoming technical barriers around reliability, testing, observability, and system design.

Replit's journey from September 2023 reveals the practical evolution from ReAct agents (2022) to tool-calling agents (2023) to truly autonomous agents (2024). The breakthrough came with B3 capabilities—agents that can run for multiple hours without intervention by implementing four critical engineering practices: comprehensive testing, deep observability, reducible design, and massive parallelism.

The most compelling insight is the target user: not technical engineers, but nontechnical users who want to describe what they want and see it built. This focus drives all technical decisions and explains why autonomy matters—it's the difference between a helpful assistant and a transformative tool that democratizes software development.

Two Types of Autonomy

The Tesla FSD Analogy

Michele draws a powerful analogy to Tesla's Full Self-Driving system to explain the critical distinction between two fundamentally different approaches to autonomy.


Supervised Autonomy

The system operates autonomously but requires constant human monitoring. The human must remain attentive and ready to intervene at any moment.

Example:

"Tesla FSD requires the driver to monitor and be ready to take over."


Unsupervised Autonomy

The system operates completely independently for extended periods. The human can walk away and return hours later to find the task completed.

Example:

"We want agents that run for several hours in a row without human intervention."

The Key Insight

Most AI agents today operate in supervised mode—they can perform tasks but require constant human oversight. The breakthrough happens when you cross the threshold to unsupervised autonomy, where agents can operate reliably for hours without intervention. This is the difference between a helpful tool and a transformative technology that democratizes access to software development.

Three Generations of AI Agents

ReAct Agents (2022)

The first generation used the ReAct (Reasoning + Acting) pattern. Agents would reason about what to do, take an action, observe the result, and repeat. This was groundbreaking but fundamentally limited—the reasoning and acting were tightly coupled in a loop that couldn't scale to complex, long-running tasks.

"ReAct pattern: reason, act, observe, repeat. Simple but limited for complex tasks."
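The reason-act-observe loop can be sketched in a few lines. This is a toy illustration of the pattern only, not Replit's implementation; `llm` and `TOOLS` are hypothetical stand-ins:

```python
# Minimal ReAct-style loop: reason, act, observe, repeat.
# `llm` and `TOOLS` are hypothetical stand-ins for a real model and toolset.

def llm(prompt: str) -> str:
    """Toy 'model' that always decides to finish after one step."""
    return "FINAL: done"

TOOLS = {"echo": lambda arg: f"observed: {arg}"}

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        decision = llm(transcript)               # reason
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        tool, _, arg = decision.partition(" ")   # e.g. "echo hello"
        observation = TOOLS[tool](arg)           # act
        transcript += f"\n{decision}\n{observation}"  # observe, repeat
    return "gave up"
```

The limitation Michele points to is visible even here: reasoning and acting share one loop and one growing transcript, so long tasks blow up the context and a single bad decision derails everything downstream.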

Tool-Calling Agents (2023)

The second generation separated tools from the model reasoning. Agents could call predefined tools (functions, APIs, commands) with structured inputs. This was a major step forward—tools could be versioned, tested, and improved independently. But agents still required constant human oversight and couldn't run for long periods.

"Tool calling enabled separation of concerns. Tools became first-class objects that could be improved independently."
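The separation of concerns can be sketched as a registry of first-class tool objects dispatched via structured calls. The registry shape and call format below are illustrative assumptions, not Replit's actual API:

```python
# Tool-calling sketch: tools are first-class objects with names and
# descriptions, invoked via structured calls instead of free text.
# Registry and call format are illustrative, not Replit's internals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., str]
    description: str

REGISTRY: dict[str, Tool] = {}

def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool

def dispatch(call: dict) -> str:
    """Execute a structured call like {'name': ..., 'args': {...}}."""
    tool = REGISTRY[call["name"]]
    return tool.fn(**call["args"])

register(Tool("read_file", lambda path: f"<contents of {path}>",
              "Read a file from the workspace"))
```

Because each tool is an independent object, it can be versioned, unit-tested, and swapped out without touching the model or the loop around it.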

Autonomous Agents (2024)

The current generation achieves true autonomy. Agents can run for multiple hours without human intervention by implementing robust engineering practices: comprehensive testing, deep observability, reducible design for debugging, and massive parallelism. This isn't just about better models—it's about building production-grade infrastructure around them.

"Agents that run for several hours in a row without human intervention. That's the breakthrough."

Breaking the One-Hour Barrier

The B3 Breakthrough

Replit achieved a major milestone with B3 capabilities—agents that can reliably run for more than one hour without human intervention. This might sound like a small detail, but it represents a fundamental shift from supervised to unsupervised autonomy.


Before B3

Agents would fail or get stuck after minutes. Constant human intervention required.


With B3

Agents run for hours independently. Complex multi-step tasks completed successfully.

"We want to build agents that run for several hours in a row without human intervention."


Why This Matters

  • Nontechnical users can describe what they want and walk away
  • Complex tasks can be broken into many steps without manual intervention
  • Overnight builds can run while you sleep, ready in the morning
  • Democratization of software development becomes realistic

Four Critical Engineering Practices

Achieving autonomous agents isn't about better prompts or smarter models—it's about implementing production-grade engineering practices. Michele outlines four non-negotiable practices.

Testing

Comprehensive testing at every level. Unit tests for tools, integration tests for workflows, end-to-end tests for complete tasks. You can't have autonomous agents without confidence that each component works correctly.

Key Insight: Testing is the foundation of autonomy. Without it, you can't trust agents to run unsupervised.
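A minimal example of what "unit tests for tools" can look like; `apply_patch` is a hypothetical tool invented for illustration:

```python
# Unit-testing a tool before trusting an agent to call it unsupervised.
# `apply_patch` is a hypothetical tool, sketched for illustration.

def apply_patch(source: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old` with `new`, or fail loudly."""
    if old not in source:
        raise ValueError("patch target not found")
    return source.replace(old, new, 1)

def test_replaces_first_match_only():
    assert apply_patch("a b a", "a", "x") == "x b a"

def test_rejects_missing_target():
    try:
        apply_patch("a", "z", "x")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")

test_replaces_first_match_only()
test_rejects_missing_target()
```

The second test matters most for autonomy: a tool that fails loudly gives the agent (and its logs) something to react to, while a tool that silently does nothing lets an unsupervised run drift for hours.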

Observability

Deep visibility into agent behavior. What tools are being called? What's the reasoning? Where did it get stuck? Observability lets you debug failures and optimize performance in production.

Key Insight: You can't improve what you can't see. Observability is essential for reliable agents.
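One common way to get this visibility is to wrap every tool call in a structured trace event. The event shape below is an assumption for illustration, not Replit's telemetry format:

```python
# Structured trace of every agent step: which tool, what input,
# how long it took, and whether it succeeded. Shape is illustrative.
import time

TRACE: list[dict] = []

def traced(tool_name, fn, **args):
    """Run a tool call and record a structured event about it."""
    start = time.monotonic()
    event = {"tool": tool_name, "args": args}
    try:
        result = fn(**args)
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_s"] = round(time.monotonic() - start, 3)
        TRACE.append(event)

traced("shell", lambda cmd: f"ran {cmd}", cmd="ls")
```

After an hours-long run, scanning `TRACE` answers exactly the questions in the paragraph above: which tools were called, with what inputs, and where the agent got stuck.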

Reducible Design

Design systems so failures can be reduced to minimal reproducible cases. When an agent fails, you should be able to isolate the exact step, input, and context that caused the problem.

Key Insight: Debugging autonomous agents requires reducible failures. Make every bug reproducible.
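One way to read "reducible design": every step records the exact state it ran on, so a failure hours into a run reduces to one replayable step. All names below are illustrative, not Replit's internals:

```python
# Reducible-design sketch: every step snapshots its input state, so a
# failure can be reduced to a minimal reproducible case. All names
# here are illustrative, not Replit's internals.

RECORDS: list[dict] = []

def run_step(step_fn, state: dict) -> dict:
    record = {"fn": step_fn.__name__, "state_before": dict(state)}
    RECORDS.append(record)
    try:
        return step_fn(state)
    except Exception:
        record["failed"] = True   # mark the minimal reproducible case
        raise

def flaky_step(state: dict) -> dict:
    if state["count"] >= 3:
        raise RuntimeError("boom")
    state["count"] += 1
    return state

state = {"count": 0}
try:
    for _ in range(10):
        state = run_step(flaky_step, state)
except RuntimeError:
    pass

failing = next(r for r in RECORDS if r.get("failed"))
# `failing["state_before"]` is the repro: re-running flaky_step on it
# deterministically triggers the same failure, outside the long run.
```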

Parallelism

Run multiple agent instances in parallel. Explore different approaches simultaneously. Compare results. Parallelism dramatically speeds up iteration and increases success rates.

Key Insight: Parallelism isn't just about speed—it's about exploring more solution paths and finding the best one.
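The explore-then-pick pattern can be sketched with a thread pool. The `attempt` function and scoring are toy stand-ins for a full agent run and a real evaluation:

```python
# Parallelism sketch: launch several agent attempts concurrently and
# keep the best-scoring result. `attempt` and its scoring are toy
# stand-ins for a full agent run and a real evaluator.
from concurrent.futures import ThreadPoolExecutor

def attempt(seed: int) -> tuple[int, str]:
    """Stand-in for one full agent run exploring one solution path."""
    score = (seed * 7) % 5          # pretend evaluation score
    return score, f"solution-{seed}"

def best_of(n: int) -> str:
    """Run n attempts in parallel and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(attempt, range(n)))
    return max(results)[1]
```

In a real agent system each attempt would be a sandboxed multi-hour run, which is why this practice depends on the other three: you can only pick the best path if every path is tested, observable, and reducible when it fails.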

Inside Replit's Journey to Autonomy

Timeline: From September 2023

Sep

September 2023 - Journey Begins

Replit starts building autonomous agents. Early experiments with ReAct and tool-calling patterns reveal limitations. Team realizes they need to rethink the entire architecture.

Q4

Q4 2023 - Tool-Calling Agents

Transition to tool-calling architecture. Tools become first-class objects. Early success but agents still require constant oversight. One-hour barrier seems impossible.

2024

Early 2024 - Engineering Focus

Shift from model improvements to engineering practices. Invest heavily in testing, observability, reducible design, and parallelism. Realize autonomy is a systems engineering problem.

B3

B3 Launch - Barrier Broken

Agents can now run for multiple hours without human intervention. Focus shifts to nontechnical users—people who want to describe what they want and see it built without knowing how to code.

The Target User: Nontechnical Creators

"We built a coding agent for nontechnical users." This isn't about helping developers write code faster—it's about enabling people who have ideas but don't know how to program. The autonomous agent is their translator, converting natural language desires into working software. This focus on nontechnical users drives all technical decisions and explains why breaking the one-hour autonomy barrier is so critical.

Top 15 Quotes from the Talk

"We built a coding agent for nontechnical users. We want to build agents that run for several hours in a row without human intervention."

Michele Catasta, Head of ML, Replit

10:40
"There are two types of autonomy: supervised autonomy and unsupervised autonomy."

Michele Catasta, Head of ML, Replit

03:20
"Tesla Full Self-Driving requires the driver to monitor and be ready to take over. That's supervised autonomy."

Michele Catasta, Head of ML, Replit

04:15
"We've seen three generations of agents: ReAct agents, tool-calling agents, and now autonomous agents."

Michele Catasta, Head of ML, Replit

06:30
"Testing is the foundation. You can't have autonomous agents without comprehensive testing."

Michele Catasta, Head of ML, Replit

14:20
"Observability lets you understand what your agents are doing and why they succeed or fail."

Michele Catasta, Head of ML, Replit

15:45
"Reducible design means you can isolate failures to minimal reproducible cases."

Michele Catasta, Head of ML, Replit

17:10
"Parallelism allows us to explore multiple solution paths simultaneously and choose the best."

Michele Catasta, Head of ML, Replit

18:30
"All technical decisions should be driven by user needs, not by what's technically interesting."

Michele Catasta, Head of ML, Replit

12:15
"The breakthrough isn't better models—it's better engineering around the models."

Michele Catasta, Head of ML, Replit

13:50
"Breaking the one-hour barrier changes everything about who can use AI agents."

Michele Catasta, Head of ML, Replit

11:25
"ReAct was a good start, but tool-calling allowed us to separate concerns and improve components independently."

Michele Catasta, Head of ML, Replit

07:40
"Nontechnical users should be able to describe what they want and see it built without knowing how to code."

Michele Catasta, Head of ML, Replit

09:55
"Unsupervised autonomy means agents can complete complex tasks while you sleep."

Michele Catasta, Head of ML, Replit

05:30
"B3 represents the transition from supervised to unsupervised autonomy."

Michele Catasta, Head of ML, Replit

19:15

Future Predictions

2025

  • Multi-hour autonomy becomes standard for production agents
  • Engineering best practices codified into frameworks
  • Nontechnical user adoption accelerates

2026

  • Agents run overnight for complex multi-day projects
  • Standardized testing and observability tooling
  • Democratization of software development accelerates

2027+

  • Fully autonomous software development pipelines
  • Agents coordinate with each other on large projects
  • Nontechnical users build complex applications independently

Actionable Takeaways

For AI Engineers

  • Invest in testing infrastructure before optimizing models
  • Build observability from day one
  • Design for reducible failures
  • Embrace parallelism

For Product Teams

  • Focus on nontechnical users
  • Target unsupervised autonomy
  • Derive all technical decisions from user needs

For Researchers

  • Study engineering practices, not just models
  • Investigate the supervised-to-unsupervised transition
  • Develop frameworks for reducible design

Meet the Speaker

Michele Catasta

Head of Machine Learning, Replit

Michele Catasta leads the machine learning team at Replit, where he's pioneering the development of autonomous AI agents that can write software independently. With a deep background in AI research and production engineering, he bridges the gap between cutting-edge research and practical, user-facing products. His work focuses on making software development accessible to everyone, regardless of technical background.

Key Contributions

  • Leading Replit's autonomous agent development
  • Breaking the one-hour autonomy barrier
  • Three-generation agent architecture framework
  • Engineering practices for production agents



Research Methodology: This comprehensive analysis is based on Michele Catasta's talk at the AI Engineer Conference. All quotes are timestamped and link to exact moments in the video for validation. The analysis focuses on practical engineering patterns for building autonomous agents that can run for extended periods without human intervention.

Research sourced from AI Engineer Conference 2024. Analysis of Michele Catasta's presentation on building autonomous agents at Replit. Focus on practical patterns, engineering practices, and the journey from supervised to unsupervised autonomy.