Efficient Reinforcement Learning
Asynchronous Pipeline RL, GPU Utilization Optimization, and Enterprise ML Systems
Rhythm Garg & Linden Li
Co-founders
"We need our runs to be fast so that we can train a model and deliver it to a customer very quickly on the order of days. They have to be cheap so that our unit costs work and we're able to scale the business sustainably."
Rhythm Garg explains the core business challenge
Rhythm Garg
Sample Timing
99% in 40s
Last 1% takes 80s more
GPU Idle Time
Significant
In synchronous RL during straggler waits
Batch Size
Dynamic
Limited by KV cache memory
Staleness
Trade-off
Speed vs stability balance
Executive Summary
Applied Compute's approach to efficient RL systems for enterprise applications
This talk presents Applied Compute's approach to building efficient Reinforcement Learning systems for enterprise applications. Founded by former OpenAI researchers, the company helps enterprises build specialized AI systems that deliver ROI through real automations rather than just productivity improvements.
The core challenge is the efficiency gap between lab RL (which runs for weeks) and enterprise RL (which needs to run in days with reliable performance). Applied Compute has developed both algorithmic innovations and systems optimizations to create efficient RL stacks that scale sustainably.
Key Insight
Enterprise RL Requirements
Different constraints for business-critical machine learning
"At Applied Compute, we're not really helping enterprises solve math problems, but this is kind of the mechanism by which we're able to teach the models to get really, really good at tasks that they care about."
Rhythm Garg on RL as a mechanism for task mastery
Rhythm Garg
"The research problem for us that is very business critical is can we build an RL stack that is so efficient so that in conjunction with our agent building platform we are really able to scale up this use case specific training motion."
Rhythm Garg on the business-critical nature of efficiency
Rhythm Garg
"We think a lot about how do we push AI beyond productivity into real automations that deliver ROI. That's quantitative for the company."
Rhythm Garg on moving beyond productivity to ROI
Rhythm Garg
Enterprise ML Constraints
Enterprise RL requires fundamentally different approaches than academic lab settings. Runs must complete in days (not weeks) with sustainable unit costs. The focus shifts from pure research results to ROI-driven automation that delivers measurable business impact.
The Synchronous RL Problem
Lock-step sampling and training creates GPU inefficiency
"In synchronous RL, sampling and training happen in lock step. So there's some simplifications here, but let's say that we want to train on batches of eight samples. That means we're going to wait for all eight samples to finish and basically finish completion before we start training."
Rhythm Garg explains synchronous RL lock-step constraint
Rhythm Garg
"It turns out that 99% of the samples completed in about 40 seconds. Took another 80 seconds to get that last percent of samples to complete. It really has a long tail."
Rhythm Garg on the straggler problem
Rhythm Garg
"The technical term we use at applied compute is the GPUs are slacking. Um, so synchronous RL is not an efficient way to to use these GPUs."
Rhythm Garg on GPU idle time
Rhythm Garg
The Straggler Problem
In synchronous RL, 99% of samples complete in ~40 seconds, but the last 1% takes another 80 seconds. During this waiting period, GPUs sit idle—the technical term used at Applied Compute is that the GPUs are "slacking." This fundamental inefficiency makes synchronous RL impractical for enterprise workloads.
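A minimal sketch of why stragglers hurt: given a long-tailed distribution of per-sample completion times, the idle fraction of sampler GPUs in a lock-step batch can be estimated directly. The lognormal completion-time model and batch settings below are illustrative assumptions, not figures from the talk (beyond the rough 40-second typical completion time).

```python
# Minimal sketch: estimating GPU idle time in synchronous RL when a batch
# waits on straggler samples. The lognormal completion-time model and the
# parameters are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

def synchronous_idle_fraction(num_samples: int, num_batches: int = 1000) -> float:
    """Fraction of sampler GPU time spent idle while waiting for stragglers."""
    idle, busy = 0.0, 0.0
    for _ in range(num_batches):
        # Long-tailed completion times: most samples finish around 40s,
        # with a right tail that stretches well past that.
        times = rng.lognormal(mean=np.log(40), sigma=0.35, size=num_samples)
        batch_time = times.max()      # training can't start until the slowest sample
        busy += times.sum()           # time GPUs spend actually generating tokens
        idle += batch_time * num_samples - times.sum()
    return idle / (idle + busy)

print(f"approx. idle fraction: {synchronous_idle_fraction(num_samples=8):.1%}")
```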
Asynchronous Pipeline RL
Breaking lock-step for better GPU utilization
"In order to solve this problem, we need to break the condition that sampling and training need to happen in lock step. In other words, we need to allow training while we're sampling. This is called asynchronous RL."
Rhythm Garg introduces asynchronous RL
Rhythm Garg
"The concrete trade-off is we want a lot of staleness for fast RL runs, but a lot of staleness makes learning unstable, which then requires innovating on the algorithm and the science."
Rhythm Garg on the staleness trade-off
Rhythm Garg
"One of the primary research problems that we focus on here at Applied Compute. And as I was talking about earlier, it directly flows back into our core business."
Rhythm Garg on algorithmic innovation for staleness
Rhythm Garg
The Staleness Trade-off
Asynchronous RL eliminates GPU idle time by allowing training while sampling continues. However, this introduces staleness—samples generated with older model versions. Higher staleness enables faster training but makes learning unstable. This requires algorithmic innovation to manage importance ratio variance and maintain training stability.
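The correction that makes training on stale samples possible is an importance ratio between the current policy and the older policy that generated the sample. The sketch below shows a generic PPO-style clipped importance-weighted loss; the talk does not describe Applied Compute's actual algorithmic fixes, so treat this purely as an illustration of the mechanism being managed.

```python
# Sketch of the generic correction that asynchronous RL relies on: because
# samples were generated by an older (stale) policy, the policy-gradient loss
# reweights them with an importance ratio pi_new / pi_old. Clipping the ratio
# is one standard (PPO-style) way to keep its variance in check; this is an
# illustration, not Applied Compute's algorithm.
import torch

def stale_sample_loss(new_logprobs: torch.Tensor,
                      old_logprobs: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped importance-weighted policy-gradient loss over a batch of tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the min keeps the update conservative when the ratio drifts,
    # which is exactly the failure mode that high staleness amplifies.
    return -torch.min(unclipped, clipped).mean()
```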
System-Level Optimization
First-principles modeling for end-to-end performance
"We posed this as a modeling problem of our end-to-end system which you know admittedly is a little bit complicated at first but we did find that we can get surprisingly far with some first principle systems modeling."
Linden Li on system-level modeling
Linden Li
"With any modeling problem let's figure out the cast of characters that describe the system and then we'll think about how they all fit together to model it."
Linden Li on identifying system components
Linden Li
"The synchronous setup might not be the most principled choice, as Rhythm showed, is an asynchronous setup. But it's not just as easy as just sort of provisioning the compute between training and inference."
Linden Li on provisioning complexity
Linden Li
GPU Allocation Optimization
Finding the optimal split between sampling and training GPUs requires system-level modeling:
1. Too many sampling GPUs: training can't keep up, samples queue up and grow stale
2. Too many training GPUs: sampling can't keep up, training GPUs starve with low utilization
3. Optimal balance: maximizes throughput while managing staleness
The solution involves modeling queue dynamics, staleness accumulation, and GPU utilization curves to find the allocation that maximizes overall system performance.
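A toy version of that model, with made-up per-GPU sampling and training rates, shows the shape of the optimization: steady-state throughput is capped by whichever side is the bottleneck, and tilting the split toward samplers buys throughput only at the cost of staleness.

```python
# Toy first-principles model of the sampler/trainer GPU split. The per-GPU
# rates are invented constants for illustration; the structure, not the
# numbers, is the point.
def evaluate_split(total_gpus: int,
                   sampler_gpus: int,
                   samples_per_gpu_per_min: float = 1.5,        # assumed sampling rate
                   train_samples_per_gpu_per_min: float = 4.0):  # assumed training rate
    trainer_gpus = total_gpus - sampler_gpus
    sample_rate = sampler_gpus * samples_per_gpu_per_min
    train_rate = trainer_gpus * train_samples_per_gpu_per_min
    throughput = min(sample_rate, train_rate)        # samples/min the whole pipeline sustains
    # Rough staleness proxy: how far sample production outruns consumption.
    staleness = sample_rate / max(train_rate, 1e-9)
    return throughput, staleness

best = max(range(1, 64), key=lambda s: evaluate_split(64, s)[0])
print("best sampler count:", best, evaluate_split(64, best))
```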
GPU Utilization Extremes
Understanding provisioning trade-offs
"If we provision way too many inference GPUs and not that many samplers... we have the same problem of low GPU utilization in the synchronous case as shown earlier."
Linden Li on GPU provisioning extremes
Linden Li
"Here in each yellow square, which is the staleness count of each sample, goes up. And as time moves on, we get more and more stale. And so the samples get more and more kind of less more and more transparent as a result."
Linden Li on staleness visualization
Linden Li
"In steady state, the batch size is relatively consistent compared to the synchronous setup where it kind of goes down over time."
Linden Li on batch size consistency
Linden Li
Steady State Consistency
In asynchronous RL, batch sizes remain relatively consistent in steady state, unlike synchronous setups where batch size degrades over time as stragglers complete. This consistency enables better resource utilization and more predictable training performance.
Understanding Workload Characteristics
Response length distribution and batch dynamics
"We need to know the response length distribution to figure out how our training workload's going to work and also how long the sampling's going to take."
Linden Li on response length distribution
Linden Li
"The batch size begins very high, but it slowly goes down over time as it eventually goes to zero and all the samples complete."
Linden Li on batch size dynamics
Linden Li
Workload Modeling Requirements
- Response length distribution - Determines sampling time and training workload
- Batch size dynamics - Impacts GPU memory and throughput
- Completion time variance - Affects queue buildup and staleness
- KV cache constraints - Limits maximum batch size for sampling (see the sizing sketch after this list)
- Latency curves - Memory bound vs compute bound regimes
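As one concrete example of the KV cache constraint, the sketch below estimates how many concurrent sequences fit on a single sampling GPU. The model shape and memory figures are assumptions chosen for illustration, not numbers from the talk.

```python
# Rough KV-cache sizing sketch: how many concurrent sequences fit on one
# sampling GPU. The 80 GB GPU and the ~8B-parameter model shape below are
# assumptions for illustration only.
def max_sampling_batch(gpu_mem_gb: float = 80,
                       weight_gb: float = 16,      # fp16 weights of an ~8B model
                       layers: int = 32,
                       kv_heads: int = 8,
                       head_dim: int = 128,
                       max_seq_len: int = 8192,
                       bytes_per_elem: int = 2) -> int:
    # Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * seq_len bytes.
    kv_bytes_per_seq = 2 * layers * kv_heads * head_dim * max_seq_len * bytes_per_elem
    free_bytes = (gpu_mem_gb - weight_gb) * 1024**3 * 0.9   # leave ~10% headroom
    return int(free_bytes // kv_bytes_per_seq)

print("max concurrent sequences:", max_sampling_batch())
```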
Technical Concepts Deep Dive
Core RL and systems concepts explained
Synchronous RL
Sampling and training happen in lock step. All samples must complete before training begins, causing GPU idle time during straggler waits.
Asynchronous RL
Training happens while sampling is ongoing. More efficient but introduces staleness - samples generated with older model versions.
Pipeline RL
Dedicated GPUs for sampling and training. Workers run continuously with in-flight weight updates propagating model changes (see the sketch after these concepts).
Staleness
Difference in model versions between sampling and training. Higher staleness = faster training but unstable learning.
Importance Ratios
Mathematical corrections in policy gradient methods to adjust for sampling from different policy versions.
GPU Memory Constraints
Model weights, activations, and KV cache compete for GPU memory. Batch size limited by KV cache in sampling.
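A bare-bones sketch of the pipeline-RL pattern described above: dedicated sampler workers run continuously, every sample is tagged with the policy version that produced it, and the trainer measures staleness against the current version. The threads, queue, and staleness cap below are stand-ins assumed for illustration, not Applied Compute's implementation.

```python
# Minimal pipeline-RL skeleton: threads and a queue stand in for sampler and
# trainer GPUs. Samples carry the policy version that generated them, so the
# trainer can measure staleness and drop anything too old. All names and the
# staleness cap are illustrative assumptions.
import queue
import threading
import time

sample_queue: "queue.Queue[tuple[int, str]]" = queue.Queue(maxsize=64)
policy_version = 0
MAX_STALENESS = 4   # drop samples more than 4 updates behind the current policy

def sampler_worker():
    while True:
        version = policy_version                       # snapshot of in-flight weights
        time.sleep(0.05)                               # stand-in for token generation
        sample_queue.put((version, "rollout-tokens"))

def trainer_loop(num_updates: int = 100, batch_size: int = 8):
    global policy_version
    for _ in range(num_updates):
        batch = []
        while len(batch) < batch_size:
            version, rollout = sample_queue.get()
            if policy_version - version <= MAX_STALENESS:
                batch.append(rollout)                  # fresh enough to train on
        time.sleep(0.1)                                # stand-in for an optimizer step
        policy_version += 1                            # "in-flight" weight update

for _ in range(4):
    threading.Thread(target=sampler_worker, daemon=True).start()
trainer_loop()
```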
Key Takeaways
Actionable insights from Applied Compute's approach
8 Key Insights for Efficient RL Systems
Asynchronous Pipeline RL
Break the lock-step condition between sampling and training. Allow training while sampling continues, dramatically improving GPU utilization.
GPU Utilization is Critical
In synchronous RL, GPUs 'slack' while waiting for straggler samples. 99% of samples complete in ~40s, but the last 1% takes another 80s.
Staleness Trade-off
Higher staleness enables faster training but makes learning unstable. Requires algorithmic innovation to manage importance ratio variance.
System-Level Modeling
First-principles modeling of end-to-end system performance can guide optimal GPU allocation between sampling and training.
Enterprise RL Requirements
Enterprise RL needs to run in days (not weeks) with sustainable unit costs. Different constraints than academic lab environments.
Data Flywheel Approach
Build systems that improve over time through usage. Create in-house expertise that stays at the forefront of the field.
Response Length Distribution
Understanding workload characteristics including response length distribution is essential for proper system tuning and GPU allocation.
Steady State Consistency
Asynchronous RL achieves consistent batch sizes in steady state, unlike synchronous where batch size degrades over time.
Key Timestamps
Navigate to important moments in the talk
Enterprise RL Approach
RL as mechanism for task mastery in enterprise settings
Business Constraints
Fast (days) and cheap runs for sustainable scaling
Synchronous RL Problem
Lock-step sampling and training inefficiency
Straggler Problem
99% complete in 40s, last 1% takes 80s
Asynchronous Solution
Breaking lock-step for better GPU utilization
Staleness Trade-off
Fast runs vs learning stability
System Modeling
First-principles end-to-end optimization
GPU Provisioning
Balancing sampling and training GPUs
Staleness Visualization
How staleness accumulates over time
Workload Characterization
Response length distribution and batch dynamics
Source Video
Efficient Reinforcement Learning – Rhythm Garg & Linden Li, Applied Compute
Research Methodology: This analysis is based on full transcript analysis with verified timestamps. Key quotes have been extracted with exact YouTube timestamps for reference. The technical insights are derived from the speakers' presentations at AI Engineer Summit.
Related Highlights
Explore more insights on AI engineering and ML infrastructure
Building Cursor Composer
RL infrastructure challenges for coding agents, parallel tool calling, and 4x token efficiency
AI Kernel Generation
Custom kernels, compiler optimization, and hardware acceleration for ML workloads
Devin 2.0 and Moore's Law
AI agents, software engineering automation, and the future of development