The AI Developer Experience Doesn't Have to Suck
Why and how Modal rebuilt cloud infrastructure from scratch to bring GPU cold starts down to roughly 1.5 seconds
"Cloud has been a phenomenal invention... but it's arguably a step backwards in terms of developer experience. Solving container cold start in a distributed system is a very, very deep rabbit hole."
— Erik Bernhardsson, CEO of Modal
1.5s
Cold Start Time
20-60x faster than traditional containers (30-90s)
Seconds
H100 Access
Instant GPU allocation vs weeks of capacity planning
$50K
Startup Credits
Up to $50,000 in free compute for qualified startups
5-10x
Cost Reduction
Through GPU pooling and optimization vs traditional cloud
The Cloud Bottleneck Nobody Talks About
For years, AI developers accepted a painful truth: cloud infrastructure makes you slower. While modern AI models train in minutes, the surrounding infrastructure moves at a glacial pace. You spin up containers, wait for cold starts, configure Kubernetes clusters, and spend hours on tasks that have nothing to do with your actual AI work. This infrastructure tax eats into development time, kills momentum, and creates friction between having an idea and testing it.
The problem runs deeper than just inconvenience. When every experiment requires provisioning infrastructure, you automatically become conservative. You iterate less because starting up takes too long. You postpone tests because setup costs are too high. The container paradigm—revolutionary for web services—becomes a ball and chain for AI development, where rapid experimentation is everything.
Modal took a different approach. Instead of optimizing containers, they deleted them entirely. "We've built custom infrastructure from scratch. Not containers. Our own runtime." This fundamental rethinking all but eliminates cold starts: by snapshotting memory state and using content-addressed storage, Modal can restore a running Python environment in approximately 1.5 seconds—not minutes. This breakthrough transforms the developer experience from "plan carefully before executing" to "experiment freely and iterate fast."
Key Insights
8 major insights from the keynote
1. The Container Model Is Broken for AI Development
"We've built custom infrastructure from scratch. Not containers. Our own runtime."
Containers were designed for long-running web services, not bursty AI workloads. For AI development, where you need to run hundreds of short experiments, container overhead becomes crushing. Modal's decision to abandon containers entirely and build a custom runtime represents a fundamental architectural shift—treating infrastructure as a first-class citizen for AI workloads, not an afterthought inherited from web development.
2. Memory Snapshotting Changes Everything
"It's not like you just snapshot your code. You snapshot your entire state. All your memory state. All the pages in memory. The CPU state. The register state. The file descriptor table. Everything."
Traditional containers restart from scratch every time, pulling dependencies and initializing environments. Modal's snapshotting captures the entire running state—memory, CPU, file descriptors—and stores it as an immutable artifact. This means your Python environment with PyTorch, CUDA, and all dependencies loads in ~1.5 seconds, not the 30-90 seconds typical of container cold starts. For iterative AI development, this 20-60x speedup is transformative.
3. Content-Addressed Storage Enables Global Scale
"The cool thing about content-addressed storage is that it's inherently deduplicated."
Content-addressed storage means files are identified by their cryptographic hash, not their location. If two users need the same Python package, Modal stores it once globally. This creates massive efficiency gains: popular dependencies like PyTorch or CUDA libraries exist in a single global cache, accessible from any region. The system scales automatically because storage costs decrease as more users join—a rare example of a system that gets more efficient with scale.
4. GPU Pooling Through Statistical Multiplexing
"We don't need to reserve GPUs for you. We can take advantage of the fact that we know when your container is running, when it's idle. We can pack workloads together much more tightly."
Traditional cloud GPU allocation reserves entire GPUs for specific customers, leading to massive waste—estimated at 80-90% underutilization. Because Modal controls the entire runtime, it knows exactly when containers are active versus idle. This enables "statistical multiplexing"—running multiple workloads on the same GPU at different times, dramatically improving utilization and reducing costs by 3-5x compared to traditional cloud providers.
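A toy calculation makes the point. The usage windows below are hypothetical (not real workload data), but they show why a pooled fleet only needs to cover peak concurrency rather than one dedicated GPU per user:
from itertools import chain

# Hypothetical active windows (start_hour, end_hour) for four users in one day
active_windows = {
    "alice": [(9, 10), (14, 15)],
    "bob":   [(10, 11)],
    "carol": [(9.5, 10.5), (16, 17)],
    "dave":  [(13, 14)],
}

# Dedicated allocation: one reserved GPU per user, idle most of the day
reserved_gpus = len(active_windows)

# Pooled allocation: count the peak number of concurrently active jobs
events = sorted(chain.from_iterable(
    [(start, +1), (end, -1)]
    for windows in active_windows.values()
    for start, end in windows
))
peak = current = 0
for _, delta in events:
    current += delta
    peak = max(peak, current)

print(f"reserved: {reserved_gpus} GPUs, pooled: {peak} GPUs")  # 4 vs 2 here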
5. Multi-Cloud Capacity Is a Hidden Complexity Nightmare
"We have to fetch across clouds. We have GPU capacity in GCP, in Azure, in AWS. But your snapshot might live somewhere else."
When you operate at cloud scale, you face a reality that most developers never see: GPU shortages are regional and constant. AWS might have no H100s in us-east-1 today, while Azure has surplus capacity in a comparable region. Modal's content-addressed storage allows transparent multi-cloud fetching—your snapshot might live in AWS, but your GPU is in Azure. The system handles this complexity invisibly, ensuring you always get capacity regardless of which cloud provider actually has it.
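A minimal sketch of what location-transparent fetching could look like. The cache endpoints and the fetch_blob helper are illustrative assumptions, not Modal's internals:
import hashlib
import urllib.request

# Hypothetical regional blob caches across clouds (illustrative URLs only)
REGIONAL_CACHES = [
    "https://blobs.aws-us-east.example.com",
    "https://blobs.azure-eastus.example.com",
    "https://blobs.gcp-us-central.example.com",
]

def fetch_blob(sha256_hex: str) -> bytes:
    # Because the address is the content hash, a copy in any cloud is equally
    # valid: try caches in order and verify integrity on arrival.
    for base in REGIONAL_CACHES:
        try:
            with urllib.request.urlopen(f"{base}/{sha256_hex}") as resp:
                data = resp.read()
        except OSError:
            continue  # cache unreachable or blob missing here; try the next one
        if hashlib.sha256(data).hexdigest() == sha256_hex:
            return data
    raise FileNotFoundError(f"blob {sha256_hex} not found in any cache")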
6. The Feedback Loop Is the Productivity Metric
"If I can do something in five seconds, I'll try it. If it takes 15 minutes, I'll think twice."
This simple observation explains why infrastructure friction matters so much. Short feedback loops encourage experimentation; long feedback loops encourage planning and caution. For AI development, where progress comes from rapid iteration, the difference between a 5-second test and a 15-minute test is the difference between trying 50 ideas per day versus 5. Modal's 1.5-second startup time creates a fundamentally different development mindset—one optimized for discovery rather than caution.
7. Simplicity Wins: No YAML, No Infrastructure Code
"You just write Python functions. You don't write YAML. You don't write infrastructure code."
Every line of infrastructure code is a line not spent on your actual problem. Infrastructure-as-code tools like Terraform and Helm reduce some complexity but introduce new layers of abstraction. Modal's approach—just write Python functions—eliminates the entire category of infrastructure code. Your business logic and your deployment logic become the same thing, reducing cognitive load and eliminating the "it works on my machine but not in production" class of bugs entirely.
8. 1.5-Second Cold Starts Are Achievable
"We can get it down to about 1.5 seconds for cold start. That's pretty fast. That includes pulling your snapshot, starting the container, everything."
For comparison, traditional container cold starts typically take 30-90 seconds for AI environments. A 20-60x speedup transforms the development experience from "plan carefully" to "experiment freely." This breakthrough comes from Modal's custom infrastructure—content-addressed storage eliminates data transfer overhead, memory snapshotting skips initialization entirely, and aggressive prefetching anticipates file access patterns.
Technical Deep Dive
How Modal solved the container cold start problem
Content-Addressed Storage: The Efficiency Engine
What It Is: Content-addressed storage is a paradigm where files are stored and retrieved based on their cryptographic hash (typically SHA-256), not their filename or location. When you store a file, the system calculates its hash and uses that as the address. When you retrieve it, you provide the hash, and the system returns the file.
Why It Matters for AI Workloads:
- Automatic Deduplication: If 100 users need the same PyTorch installation, it's stored exactly once
- Global Availability: Your snapshot is accessible from any region without re-download
- Integrity Verification: The hash serves as both address and integrity check
- Cache-Friendly: Popular dependencies propagate to edge locations automatically
Modal built a custom content-addressed storage system optimized for AI workloads. Rather than treating each file independently, they understand the relationship between files—Python packages, CUDA libraries, model weights—and can optimize storage and retrieval patterns accordingly.
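To make the paradigm concrete, here is a toy content-addressed store in plain Python. It illustrates the idea described above; the class and on-disk layout are our own simplification, not Modal's implementation:
import hashlib
from pathlib import Path

class ContentAddressedStore:
    # Toy store: bytes are keyed by their SHA-256 hash, so identical content
    # is written exactly once no matter how many users upload it.
    def __init__(self, root: str = "./cas"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # same content, same address: store once
            path.write_bytes(data)
        return digest  # the hash is the address

    def get(self, digest: str) -> bytes:
        data = (self.root / digest).read_bytes()
        assert hashlib.sha256(data).hexdigest() == digest, "corrupted blob"
        return data

store = ContentAddressedStore()
a = store.put(b"torch wheel bytes ...")
b = store.put(b"torch wheel bytes ...")  # a second user uploads identical content
assert a == b  # deduplicated automatically: one stored copy, one address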
Memory Snapshotting with gVisor: The Breakthrough Innovation
The Challenge: Snapshotting a running process is notoriously difficult. You need to capture all memory pages, CPU register state, file descriptors, network connections, and internal kernel state. Standard Linux tools like CRIU exist but are complex and fragile, especially for GPU workloads where CUDA context must also be preserved.
Modal's Solution: Modal extended gVisor—a user-space kernel implementation originally developed by Google—to add comprehensive checkpointing capabilities. gVisor provides a clean abstraction layer between the application and the Linux kernel, making it possible to capture and restore process state reliably.
How It Works:
1. Capture Phase: When a container finishes executing, Modal captures all user-space memory, CPU state, file descriptors, network sockets, and GPU memory/CUDA context
2. Storage Phase: The captured state is serialized and stored in content-addressed storage. Because it's immutable, it can be cached and replicated globally
3. Restore Phase: When a new container needs to run, the system locates the snapshot, streams it to the target host, and restores the process state in ~1.5 seconds
Why It's Revolutionary: Traditional container startup is "pull image → extract files → install dependencies → initialize → run." Modal's startup is "locate snapshot → restore → run." By skipping all the initialization work, Modal achieves 20-60x faster cold starts. For iterative AI development, where you might start hundreds of containers per day, this is the difference between infrastructure being a minor annoyance versus a major bottleneck.
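The contrast between the two startup paths can be sketched as a toy walkthrough. Every helper below is a stub, and the timings are illustrative numbers chosen to echo the 30-90 second and ~1.5 second figures quoted above, not measurements of Modal's system:
def step(name: str, seconds: float) -> None:
    print(f"  {name:<32} ~{seconds:>5.1f}s")

def traditional_cold_start() -> None:
    print("pull image -> extract -> install -> initialize -> run")
    step("pull image layers", 20.0)
    step("extract filesystem", 10.0)
    step("create container", 2.0)
    step("import torch, init CUDA", 30.0)  # initialization dominates the wait

def snapshot_restore_start() -> None:
    print("locate snapshot -> restore -> run")
    step("locate snapshot by content hash", 0.1)
    step("stream memory pages to host", 1.0)
    step("restore CPU/fd/CUDA state", 0.4)  # execution resumes mid-process

traditional_cold_start()   # totals roughly a minute before user code runs
snapshot_restore_start()   # totals about 1.5 seconds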
GPU Pooling Economics: The Cost Optimization Engine
Traditional Cloud GPU Allocation
- You reserve a GPU (e.g., 1x H100 for $3/hour)
- You pay 24/7, regardless of usage
- Typical dev workload: 2 hours active/day
- Effective cost: $36/hour of actual compute
- Utilization: ~8%
Modal's GPU Pooling
- You pay only when your code runs
- Multiple users share the same GPU
- Typical dev workload: 2 hours active/day
- Effective cost: $3-6/hour of actual compute
- Utilization: 60-80%
The Math: For the same workload, Modal costs 5-10x less because you're not paying for idle time. Traditional cloud providers reserve GPUs because they lack visibility into workload patterns. Modal's custom runtime gives perfect visibility, enabling safe sharing and dramatic cost reductions.
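The arithmetic behind that comparison is simple enough to check directly, using the same figures quoted above:
# Back-of-envelope check of the reserved-GPU column above
reserved_rate = 3.00    # $/hour for a dedicated H100
active_hours = 2        # typical dev workload per day

daily_cost = reserved_rate * 24             # $72: billed whether or not code runs
effective_cost = daily_cost / active_hours  # $36 per hour of actual compute
utilization = active_hours / 24             # roughly 8%

print(f"${effective_cost:.0f}/active-hour at {utilization:.0%} utilization")
# Paying only for active time lands in the $3-6/active-hour range above,
# which is where the 5-10x difference comes from.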
Real Quotes from the Keynote
Direct insights from Erik Bernhardsson
"We've built custom infrastructure from scratch. Not containers. Our own runtime."
— Erik Bernhardsson
"Cloud has been a phenomenal invention... but it's arguably a step backwards in terms of developer experience."
— Erik Bernhardsson
"The best developer experience is one where you're not even thinking about infrastructure at all."
— Erik Bernhardsson
"You just write Python functions. You don't write YAML. You don't write infrastructure code."
— Erik Bernhardsson
"If I can do something in five seconds, I'll try it. If it takes 15 minutes, I'll think twice."
— Erik Bernhardsson
How Modal Works
Simple Python code that runs in the cloud
Basic Function Deployment
import modal

app = modal.App("example-app")

@app.function()
def hello_world():
    return "Hello from the cloud!"

if __name__ == "__main__":
    with app.run():
        result = hello_world.remote()
        print(result)  # "Hello from the cloud!"
The @app.function() decorator tells Modal to run this function in their cloud. When you call .remote(), Modal handles everything.
GPU Access with One Line
@app.function(gpu="h100")
def train_model():
    import torch
    # This runs on an H100 with CUDA automatically available
    device = torch.device("cuda")
    model = LargeModel().to(device)
    # Your training code here
    return training_results
The gpu="h100" parameter reserves an H100 GPU. Modal handles provisioning, CUDA drivers, and environment setup automatically.
Parallel Execution at Scale
@app.function()
def process_batch(batch_id):
    # Process a single batch of data
    return results

# Process 1000 batches in parallel
batch_ids = list(range(1000))
with app.run():
    # Map-style parallel execution (.map returns a generator, so materialize it)
    results = list(process_batch.map(batch_ids))
The .map() method spawns 1000 containers in parallel. What would normally require a Kubernetes cluster is now one method call.
Serverless Economics
How Modal reduces costs through better GPU utilization
Free Tier
$30/month
Indefinitely for everyone
- No credit card required
- Full platform access
- Recurring monthly credits
Startup Program
Up to $50,000
For qualified startups
- Pre-Series A eligibility
- Priority support
- 6-month compute credits
Usage-Based Pricing
5-10x Lower
vs traditional cloud
- Pay only for active time
- No capacity planning
- GPU pooling savings
Get Started with Modal
Stop managing infrastructure. Start building AI. The future of AI infrastructure is serverless, containerless, and invisible.
pip install modal
Related Talks
More AI engineering insights
Devin 2.0: Moore's Law for AI Agents
Scott Wu from Cognition on the evolution of autonomous AI agents
Poolside's Path to AGI
Jason Warner and Eiso Kant on reinforcement learning and vertical integration
Good Design Hasn't Changed With AI
John Pham from SF Compute on design principles in the AI era
Research Methodology: This analysis is based on the full transcript from the AI Engineer Summit keynote.
All quotes are verbatim from the talk. For more details, watch the full video on YouTube.