The AI Developer Experience Doesn't Have to Suck
Why and how Modal rebuilt cloud infrastructure from scratch to bring GPU cold starts down to roughly 1.5 seconds
"Cloud has been a phenomenal invention... but it's arguably a step backwards in terms of developer experience. Solving container cold start in a distributed system is a very, very deep rabbit hole."
— Erik Bernhardsson, CEO of Modal
1.5s
Cold Start Time
20-60x faster than traditional containers (30-90s)
Seconds
H100 Access
Instant GPU allocation vs weeks of capacity planning
$50K
Startup Credits
Up to $50,000 in free compute for qualified startups
5-10x
Cost Reduction
Through GPU pooling and optimization vs traditional cloud
The Cloud Bottleneck Nobody Talks About
For years, AI developers accepted a painful truth: cloud infrastructure makes you slower. While modern AI models train in minutes, the surrounding infrastructure moves at a glacial pace. You spin up containers, wait for cold starts, configure Kubernetes clusters, and spend hours on tasks that have nothing to do with your actual AI work. This infrastructure tax eats into development time, kills momentum, and creates friction between having an idea and testing it.
The problem runs deeper than just inconvenience. When every experiment requires provisioning infrastructure, you automatically become conservative. You iterate less because starting up takes too long. You postpone tests because setup costs are too high. The container paradigm—revolutionary for web services—becomes a ball and chain for AI development, where rapid experimentation is everything.
Modal took a different approach. Instead of optimizing containers, they deleted them entirely. "We've built custom infrastructure from scratch. Not containers. Our own runtime." This fundamental rethinking all but eliminates cold starts: by snapshotting memory state and using content-addressed storage, Modal can restore a running Python environment in approximately 1.5 seconds—not minutes. This breakthrough transforms the developer experience from "plan carefully before executing" to "experiment freely and iterate fast."
Key Insights
8 major insights from the keynote
1. The Container Model Is Broken for AI Development
"We've built custom infrastructure from scratch. Not containers. Our own runtime."
Containers were designed for long-running web services, not bursty AI workloads. For AI development, where you need to run hundreds of short experiments, container overhead becomes crushing. Modal's decision to abandon containers entirely and build a custom runtime represents a fundamental architectural shift—treating infrastructure as a first-class citizen for AI workloads, not an afterthought inherited from web development.
2. Memory Snapshotting Changes Everything
"It's not like you just snapshot your code. You snapshot your entire state. All your memory state. All the pages in memory. The CPU state. The register state. The file descriptor table. Everything."
Traditional containers restart from scratch every time, pulling dependencies and initializing environments. Modal's snapshotting captures the entire running state—memory, CPU, file descriptors—and stores it as an immutable artifact. This means your Python environment with PyTorch, CUDA, and all dependencies loads in ~1.5 seconds, not the 30-90 seconds typical of container cold starts. For iterative AI development, this 20-60x speedup is transformative.
3. Content-Addressed Storage Enables Global Scale
"The cool thing about content-addressed storage is that it's inherently deduplicated."
Content-addressed storage means files are identified by their cryptographic hash, not their location. If two users need the same Python package, Modal stores it once globally. This creates massive efficiency gains: popular dependencies like PyTorch or CUDA libraries exist in a single global cache, accessible from any region. The system scales automatically because storage costs decrease as more users join—a rare example of a system that gets more efficient with scale.
4. GPU Pooling Through Statistical Multiplexing
"We don't need to reserve GPUs for you. We can take advantage of the fact that we know when your container is running, when it's idle. We can pack workloads together much more tightly."
Traditional cloud GPU allocation reserves entire GPUs for specific customers, leading to massive waste—estimated at 80-90% underutilization. Because Modal controls the entire runtime, it knows exactly when containers are active versus idle. This enables "statistical multiplexing"—running multiple workloads on the same GPU at different times, dramatically improving utilization and reducing costs by 3-5x compared to traditional cloud providers.
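A toy calculation makes the point. The usage windows below are hypothetical (not real workload data), but they show why a pooled fleet only needs to cover peak concurrency rather than one dedicated GPU per user:
from itertools import chain

# Hypothetical active windows (start_hour, end_hour) for four users in one day
active_windows = {
    "alice": [(9, 10), (14, 15)],
    "bob":   [(10, 11)],
    "carol": [(9.5, 10.5), (16, 17)],
    "dave":  [(13, 14)],
}

# Dedicated allocation: one reserved GPU per user, idle most of the day
reserved_gpus = len(active_windows)

# Pooled allocation: count the peak number of concurrently active jobs
events = sorted(chain.from_iterable(
    [(start, +1), (end, -1)]
    for windows in active_windows.values()
    for start, end in windows
))
peak = current = 0
for _, delta in events:
    current += delta
    peak = max(peak, current)

print(f"reserved: {reserved_gpus} GPUs, pooled: {peak} GPUs")  # 4 vs 2 here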
5. Multi-Cloud Capacity Is a Hidden Complexity Nightmare
"We have to fetch across clouds. We have GPU capacity in GCP, in Azure, in AWS. But your snapshot might live somewhere else."
When you operate at cloud scale, you face a reality that most developers never see: GPU shortages are regional and constant. AWS might have no H100s in us-east-1 today, while Azure has surplus capacity in a comparable region. Modal's content-addressed storage allows transparent multi-cloud fetching—your snapshot might live in AWS, but your GPU is in Azure. The system handles this complexity invisibly, ensuring you always get capacity regardless of which cloud provider actually has it.
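A minimal sketch of what location-transparent fetching could look like. The cache endpoints and the fetch_blob helper are illustrative assumptions, not Modal's internals:
import hashlib
import urllib.request

# Hypothetical regional blob caches across clouds (illustrative URLs only)
REGIONAL_CACHES = [
    "https://blobs.aws-us-east.example.com",
    "https://blobs.azure-eastus.example.com",
    "https://blobs.gcp-us-central.example.com",
]

def fetch_blob(sha256_hex: str) -> bytes:
    # Because the address is the content hash, a copy in any cloud is equally
    # valid: try caches in order and verify integrity on arrival.
    for base in REGIONAL_CACHES:
        try:
            with urllib.request.urlopen(f"{base}/{sha256_hex}") as resp:
                data = resp.read()
        except OSError:
            continue  # cache unreachable or blob missing here; try the next one
        if hashlib.sha256(data).hexdigest() == sha256_hex:
            return data
    raise FileNotFoundError(f"blob {sha256_hex} not found in any cache")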
6. The Feedback Loop Is the Productivity Metric
"If I can do something in five seconds, I'll try it. If it takes 15 minutes, I'll think twice."
This simple observation explains why infrastructure friction matters so much. Short feedback loops encourage experimentation; long feedback loops encourage planning and caution. For AI development, where progress comes from rapid iteration, the difference between a 5-second test and a 15-minute test is the difference between trying 50 ideas per day versus 5. Modal's 1.5-second startup time creates a fundamentally different development mindset—one optimized for discovery rather than caution.
7. Simplicity Wins: No YAML, No Infrastructure Code
"You just write Python functions. You don't write YAML. You don't write infrastructure code."
Every line of infrastructure code is a line not spent on your actual problem. Infrastructure-as-code tools like Terraform and Helm reduce some complexity but introduce new layers of abstraction. Modal's approach—just write Python functions—eliminates the entire category of infrastructure code. Your business logic and your deployment logic become the same thing, reducing cognitive load and eliminating the "it works on my machine but not in production" class of bugs entirely.
8. 1.5-Second Cold Starts Are Achievable
"We can get it down to about 1.5 seconds for cold start. That's pretty fast. That includes pulling your snapshot, starting the container, everything."
For comparison, traditional container cold starts typically take 30-90 seconds for AI environments. A 20-60x speedup transforms the development experience from "plan carefully" to "experiment freely." This breakthrough comes from Modal's custom infrastructure—content-addressed storage eliminates data transfer overhead, memory snapshotting skips initialization entirely, and aggressive prefetching anticipates file access patterns.
Technical Deep Dive
How Modal solved the container cold start problem
Content-Addressed Storage: The Efficiency Engine
What It Is: Content-addressed storage is a paradigm where files are stored and retrieved based on their cryptographic hash (typically SHA-256), not their filename or location. When you store a file, the system calculates its hash and uses that as the address. When you retrieve it, you provide the hash, and the system returns the file.
Why It Matters for AI Workloads:
- Automatic Deduplication: If 100 users need the same PyTorch installation, it's stored exactly once
- Global Availability: Your snapshot is accessible from any region without re-download
- Integrity Verification: The hash serves as both address and integrity check
- Cache-Friendly: Popular dependencies propagate to edge locations automatically
Modal built a custom content-addressed storage system optimized for AI workloads. Rather than treating each file independently, they understand the relationship between files—Python packages, CUDA libraries, model weights—and can optimize storage and retrieval patterns accordingly.
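To make the paradigm concrete, here is a toy content-addressed store in plain Python. It illustrates the idea described above; the class and on-disk layout are our own simplification, not Modal's implementation:
import hashlib
from pathlib import Path

class ContentAddressedStore:
    # Toy store: bytes are keyed by their SHA-256 hash, so identical content
    # is written exactly once no matter how many users upload it.
    def __init__(self, root: str = "./cas"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # same content, same address: store once
            path.write_bytes(data)
        return digest  # the hash is the address

    def get(self, digest: str) -> bytes:
        data = (self.root / digest).read_bytes()
        assert hashlib.sha256(data).hexdigest() == digest, "corrupted blob"
        return data

store = ContentAddressedStore()
a = store.put(b"torch wheel bytes ...")
b = store.put(b"torch wheel bytes ...")  # a second user uploads identical content
assert a == b  # deduplicated automatically: one stored copy, one address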
Memory Snapshotting with gVisor: The Breakthrough Innovation
The Challenge: Snapshotting a running process is notoriously difficult. You need to capture all memory pages, CPU register state, file descriptors, network connections, and internal kernel state. Standard Linux tools like CRIU exist but are complex and fragile, especially for GPU workloads where CUDA context must also be preserved.
Modal's Solution: Modal extended gVisor—a user-space kernel implementation originally developed by Google—to add comprehensive checkpointing capabilities. gVisor provides a clean abstraction layer between the application and the Linux kernel, making it possible to capture and restore process state reliably.
How It Works:
1. Capture Phase: When a container finishes executing, Modal captures all user-space memory, CPU state, file descriptors, network sockets, and GPU memory/CUDA context
2. Storage Phase: The captured state is serialized and stored in content-addressed storage. Because it's immutable, it can be cached and replicated globally
3. Restore Phase: When a new container needs to run, the system locates the snapshot, streams it to the target host, and restores the process state in ~1.5 seconds
Why It's Revolutionary: Traditional container startup is "pull image → extract files → install dependencies → initialize → run." Modal's startup is "locate snapshot → restore → run." By skipping all the initialization work, Modal achieves 20-60x faster cold starts. For iterative AI development, where you might start hundreds of containers per day, this is the difference between infrastructure being a minor annoyance versus a major bottleneck.
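The contrast between the two startup paths can be sketched as a toy walkthrough. Every helper below is a stub, and the timings are illustrative numbers chosen to echo the 30-90 second and ~1.5 second figures quoted above, not measurements of Modal's system:
def step(name: str, seconds: float) -> None:
    print(f"  {name:<32} ~{seconds:>5.1f}s")

def traditional_cold_start() -> None:
    print("pull image -> extract -> install -> initialize -> run")
    step("pull image layers", 20.0)
    step("extract filesystem", 10.0)
    step("create container", 2.0)
    step("import torch, init CUDA", 30.0)  # initialization dominates the wait

def snapshot_restore_start() -> None:
    print("locate snapshot -> restore -> run")
    step("locate snapshot by content hash", 0.1)
    step("stream memory pages to host", 1.0)
    step("restore CPU/fd/CUDA state", 0.4)  # execution resumes mid-process

traditional_cold_start()   # totals roughly a minute before user code runs
snapshot_restore_start()   # totals about 1.5 seconds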
GPU Pooling Economics: The Cost Optimization Engine
Traditional Cloud GPU Allocation
- You reserve a GPU (e.g., 1x H100 for $3/hour)
- You pay 24/7, regardless of usage
- Typical dev workload: 2 hours active/day
- Effective cost: $36/hour of actual compute
- Utilization: ~8%
Modal's GPU Pooling
- You pay only when your code runs
- Multiple users share the same GPU
- Typical dev workload: 2 hours active/day
- Effective cost: $3-6/hour of actual compute
- Utilization: 60-80%
The Math: For the same workload, Modal costs 5-10x less because you're not paying for idle time. Traditional cloud providers reserve GPUs because they lack visibility into workload patterns. Modal's custom runtime gives perfect visibility, enabling safe sharing and dramatic cost reductions.
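The arithmetic behind that comparison is simple enough to check directly, using the same figures quoted above:
# Back-of-envelope check of the reserved-GPU column above
reserved_rate = 3.00    # $/hour for a dedicated H100
active_hours = 2        # typical dev workload per day

daily_cost = reserved_rate * 24             # $72: billed whether or not code runs
effective_cost = daily_cost / active_hours  # $36 per hour of actual compute
utilization = active_hours / 24             # roughly 8%

print(f"${effective_cost:.0f}/active-hour at {utilization:.0%} utilization")
# Paying only for active time lands in the $3-6/active-hour range above,
# which is where the 5-10x difference comes from.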
Real Quotes from the Keynote
Direct insights from Erik Bernhardsson
"We've built custom infrastructure from scratch. Not containers. Our own runtime."
— Erik Bernhardsson
"Cloud has been a phenomenal invention... but it's arguably a step backwards in terms of developer experience."
— Erik Bernhardsson
"The best developer experience is one where you're not even thinking about infrastructure at all."
— Erik Bernhardsson
"You just write Python functions. You don't write YAML. You don't write infrastructure code."
— Erik Bernhardsson
"If I can do something in five seconds, I'll try it. If it takes 15 minutes, I'll think twice."
— Erik Bernhardsson
How Modal Works
Simple Python code that runs in the cloud
Basic Function Deployment
import modal

app = modal.App("example-app")

@app.function()
def hello_world():
    return "Hello from the cloud!"

if __name__ == "__main__":
    with app.run():
        result = hello_world.remote()
        print(result)  # "Hello from the cloud!"
The @app.function() decorator tells Modal to run this function in their cloud. When you call .remote(), Modal handles everything.
GPU Access with One Line
@app.function(gpu="h100")
def train_model():
    import torch
    # This runs on an H100 with CUDA automatically available
    device = torch.device("cuda")
    model = LargeModel().to(device)
    # Your training code here
    return training_results
The gpu="h100" parameter reserves an H100 GPU. Modal handles provisioning, CUDA drivers, and environment setup automatically.
Parallel Execution at Scale
@app.function()
def process_batch(batch_id):
    # Process a single batch of data
    return results

# Process 1000 batches in parallel
batch_ids = list(range(1000))
with app.run():
    # Map-style parallel execution (.map returns a generator, so materialize it)
    results = list(process_batch.map(batch_ids))
The .map() method spawns 1000 containers in parallel. What would normally require a Kubernetes cluster is now one method call.
Serverless Economics
How Modal reduces costs through better GPU utilization
Free Tier
$30/month
Indefinitely for everyone
- No credit card required
- Full platform access
- Recurring monthly credits
Startup Program
Up to $50,000
For qualified startups
- Pre-Series A eligibility
- Priority support
- 6-month compute credits
Usage-Based Pricing
5-10x Lower
vs traditional cloud
- Pay only for active time
- No capacity planning
- GPU pooling savings
Get Started with Modal
Stop managing infrastructure. Start building AI. The future of AI infrastructure is serverless, containerless, and invisible.
pip install modal
Related Talks
More AI engineering insights
Devin 2.0: Moore's Law for AI Agents
Scott Wu from Cognition on the evolution of autonomous AI agents
Poolside's Path to AGI
Jason Warner and Eiso Kant on reinforcement learning and vertical integration
Good Design Hasn't Changed With AI
John Pham from SF Compute on design principles in the AI era
Research Methodology: This analysis is based on the full transcript from the AI Engineer Summit keynote.
All quotes are verbatim from the talk. For more details, watch the full video on YouTube.