Efficient Reinforcement Learning
Asynchronous Pipeline RL, GPU Utilization Optimization, and Enterprise ML Systems
Rhythm Garg & Linden Li
Co-founders
"We need our runs to be fast so that we can train a model and deliver it to a customer very quickly on the order of days. They have to be cheap so that our unit costs work and we're able to scale the business sustainably."
Rhythm Garg explains the core business challenge
Rhythm Garg
Sample Timing
99% in 40s
Last 1% takes 80s more
GPU Idle Time
Significant
In synchronous RL during straggler waits
Batch Size
Dynamic
Limited by KV cache memory
Staleness
Trade-off
Speed vs stability balance
Executive Summary
Applied Compute's approach to efficient RL systems for enterprise applications
This talk presents Applied Compute's approach to building efficient Reinforcement Learning systems for enterprise applications. Founded by former OpenAI researchers, the company helps enterprises build specialized AI systems that deliver ROI through real automations rather than just productivity improvements.
The core challenge is the efficiency gap between lab RL (which runs for weeks) and enterprise RL (which needs to run in days with reliable performance). Applied Compute has developed both algorithmic innovations and systems optimizations to create efficient RL stacks that scale sustainably.
Key Insight
Enterprise RL Requirements
Different constraints for business-critical machine learning
"At Applied Compute, we're not really helping enterprises solve math problems, but this is kind of the mechanism by which we're able to teach the models to get really, really good at tasks that they care about."
Rhythm Garg on RL as a mechanism for task mastery
Rhythm Garg
"The research problem for us that is very business critical is can we build an RL stack that is so efficient so that in conjunction with our agent building platform we are really able to scale up this use case specific training motion."
Rhythm Garg on the business-critical nature of efficiency
Rhythm Garg
"We think a lot about how do we push AI beyond productivity into real automations that deliver ROI. That's quantitative for the company."
Rhythm Garg on moving beyond productivity to ROI
Rhythm Garg
Enterprise ML Constraints
Enterprise RL requires fundamentally different approaches than academic lab settings. Runs must complete in days (not weeks) with sustainable unit costs. The focus shifts from pure research results to ROI-driven automation that delivers measurable business impact.
The Synchronous RL Problem
Lock-step sampling and training creates GPU inefficiency
"In synchronous RL, sampling and training happen in lock step. So there's some simplifications here, but let's say that we want to train on batches of eight samples. That means we're going to wait for all eight samples to finish and basically finish completion before we start training."
Rhythm Garg explains synchronous RL lock-step constraint
Rhythm Garg
"It turns out that 99% of the samples completed in about 40 seconds. Took another 80 seconds to get that last percent of samples to complete. It really has a long tail."
Rhythm Garg on the straggler problem
Rhythm Garg
"The technical term we use at applied compute is the GPUs are slacking. Um, so synchronous RL is not an efficient way to to use these GPUs."
Rhythm Garg on GPU idle time
Rhythm Garg
The Straggler Problem
In synchronous RL, 99% of samples complete in ~40 seconds, but the last 1% takes another 80 seconds. During this waiting period, GPUs sit idle—the technical term used at Applied Compute is that the GPUs are "slacking." This fundamental inefficiency makes synchronous RL impractical for enterprise workloads.
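A minimal sketch of why stragglers hurt: given a long-tailed distribution of per-sample completion times, the idle fraction of sampler GPUs in a lock-step batch can be estimated directly. The lognormal completion-time model and batch settings below are illustrative assumptions, not figures from the talk (beyond the rough 40-second typical completion time).

```python
# Minimal sketch: estimating GPU idle time in synchronous RL when a batch
# waits on straggler samples. The lognormal completion-time model and the
# parameters are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

def synchronous_idle_fraction(num_samples: int, num_batches: int = 1000) -> float:
    """Fraction of sampler GPU time spent idle while waiting for stragglers."""
    idle, busy = 0.0, 0.0
    for _ in range(num_batches):
        # Long-tailed completion times: most samples finish around 40s,
        # with a right tail that stretches well past that.
        times = rng.lognormal(mean=np.log(40), sigma=0.35, size=num_samples)
        batch_time = times.max()      # training can't start until the slowest sample
        busy += times.sum()           # time GPUs spend actually generating tokens
        idle += batch_time * num_samples - times.sum()
    return idle / (idle + busy)

print(f"approx. idle fraction: {synchronous_idle_fraction(num_samples=8):.1%}")
```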
Asynchronous Pipeline RL
Breaking lock-step for better GPU utilization
"In order to solve this problem, we need to break the condition that sampling and training need to happen in lock step. In other words, we need to allow training while we're sampling. This is called asynchronous RL."
Rhythm Garg introduces asynchronous RL
Rhythm Garg
"The concrete trade-off is we want a lot of staleness for fast RL runs, but a lot of staleness makes learning unstable, which then requires innovating on the algorithm and the science."
Rhythm Garg on the staleness trade-off
Rhythm Garg
"One of the primary research problems that we focus on here at Applied Compute. And as I was talking about earlier, it directly flows back into our core business."
Rhythm Garg on algorithmic innovation for staleness
Rhythm Garg
The Staleness Trade-off
Asynchronous RL eliminates GPU idle time by allowing training while sampling continues. However, this introduces staleness—samples generated with older model versions. Higher staleness enables faster training but makes learning unstable. This requires algorithmic innovation to manage importance ratio variance and maintain training stability.
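The correction that makes training on stale samples possible is an importance ratio between the current policy and the older policy that generated the sample. The sketch below shows a generic PPO-style clipped importance-weighted loss; the talk does not describe Applied Compute's actual algorithmic fixes, so treat this purely as an illustration of the mechanism being managed.

```python
# Sketch of the generic correction that asynchronous RL relies on: because
# samples were generated by an older (stale) policy, the policy-gradient loss
# reweights them with an importance ratio pi_new / pi_old. Clipping the ratio
# is one standard (PPO-style) way to keep its variance in check; this is an
# illustration, not Applied Compute's algorithm.
import torch

def stale_sample_loss(new_logprobs: torch.Tensor,
                      old_logprobs: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped importance-weighted policy-gradient loss over a batch of tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the min keeps the update conservative when the ratio drifts,
    # which is exactly the failure mode that high staleness amplifies.
    return -torch.min(unclipped, clipped).mean()
```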
System-Level Optimization
First-principles modeling for end-to-end performance
"We posed this as a modeling problem of our end-to-end system which you know admittedly is a little bit complicated at first but we did find that we can get surprisingly far with some first principle systems modeling."
Linden Li on system-level modeling
Linden Li
"With any modeling problem let's figure out the cast of characters that describe the system and then we'll think about how they all fit together to model it."
Linden Li on identifying system components
Linden Li
"The synchronous setup might not be the most principled choice, as Rhythm showed, is an asynchronous setup. But it's not just as easy as just sort of provisioning the compute between training and inference."
Linden Li on provisioning complexity
Linden Li
GPU Allocation Optimization
Finding the optimal split between sampling and training GPUs requires system-level modeling:
1. Too many sampling GPUs: training can't keep up, samples queue up and grow stale
2. Too many training GPUs: sampling can't keep up, training GPUs starve with low utilization
3. Optimal balance: maximizes throughput while managing staleness
The solution involves modeling queue dynamics, staleness accumulation, and GPU utilization curves to find the allocation that maximizes overall system performance.
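A toy version of that model, with made-up per-GPU sampling and training rates, shows the shape of the optimization: steady-state throughput is capped by whichever side is the bottleneck, and tilting the split toward samplers buys throughput only at the cost of staleness.

```python
# Toy first-principles model of the sampler/trainer GPU split. The per-GPU
# rates are invented constants for illustration; the structure, not the
# numbers, is the point.
def evaluate_split(total_gpus: int,
                   sampler_gpus: int,
                   samples_per_gpu_per_min: float = 1.5,        # assumed sampling rate
                   train_samples_per_gpu_per_min: float = 4.0):  # assumed training rate
    trainer_gpus = total_gpus - sampler_gpus
    sample_rate = sampler_gpus * samples_per_gpu_per_min
    train_rate = trainer_gpus * train_samples_per_gpu_per_min
    throughput = min(sample_rate, train_rate)        # samples/min the whole pipeline sustains
    # Rough staleness proxy: how far sample production outruns consumption.
    staleness = sample_rate / max(train_rate, 1e-9)
    return throughput, staleness

best = max(range(1, 64), key=lambda s: evaluate_split(64, s)[0])
print("best sampler count:", best, evaluate_split(64, best))
```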
GPU Utilization Extremes
Understanding provisioning trade-offs
"If we provision way too many inference GPUs and not that many samplers... we have the same problem of low GPU utilization in the synchronous case as shown earlier."
Linden Li on GPU provisioning extremes
Linden Li
"Here in each yellow square, which is the staleness count of each sample, goes up. And as time moves on, we get more and more stale. And so the samples get more and more kind of less more and more transparent as a result."
Linden Li on staleness visualization
Linden Li
"In steady state, the batch size is relatively consistent compared to the synchronous setup where it kind of goes down over time."
Linden Li on batch size consistency
Linden Li
Steady State Consistency
In asynchronous RL, batch sizes remain relatively consistent in steady state, unlike synchronous setups where batch size degrades over time as stragglers complete. This consistency enables better resource utilization and more predictable training performance.
Understanding Workload Characteristics
Response length distribution and batch dynamics
"We need to know the response length distribution to figure out how our training workload's going to work and also how long the sampling's going to take."
Linden Li on response length distribution
Linden Li
"The batch size begins very high, but it slowly goes down over time as it eventually goes to zero and all the samples complete."
Linden Li on batch size dynamics
Linden Li
Workload Modeling Requirements
- Response length distribution - Determines sampling time and training workload
- Batch size dynamics - Impacts GPU memory and throughput
- Completion time variance - Affects queue buildup and staleness
- KV cache constraints - Limits maximum batch size for sampling (see the sizing sketch after this list)
- Latency curves - Memory bound vs compute bound regimes
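As one concrete example of the KV cache constraint, the sketch below estimates how many concurrent sequences fit on a single sampling GPU. The model shape and memory figures are assumptions chosen for illustration, not numbers from the talk.

```python
# Rough KV-cache sizing sketch: how many concurrent sequences fit on one
# sampling GPU. The 80 GB GPU and the ~8B-parameter model shape below are
# assumptions for illustration only.
def max_sampling_batch(gpu_mem_gb: float = 80,
                       weight_gb: float = 16,      # fp16 weights of an ~8B model
                       layers: int = 32,
                       kv_heads: int = 8,
                       head_dim: int = 128,
                       max_seq_len: int = 8192,
                       bytes_per_elem: int = 2) -> int:
    # Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * seq_len bytes.
    kv_bytes_per_seq = 2 * layers * kv_heads * head_dim * max_seq_len * bytes_per_elem
    free_bytes = (gpu_mem_gb - weight_gb) * 1024**3 * 0.9   # leave ~10% headroom
    return int(free_bytes // kv_bytes_per_seq)

print("max concurrent sequences:", max_sampling_batch())
```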
Technical Concepts Deep Dive
Core RL and systems concepts explained
Synchronous RL
Sampling and training happen in lock step. All samples must complete before training begins, causing GPU idle time during straggler waits.
Asynchronous RL
Training happens while sampling is ongoing. More efficient but introduces staleness - samples generated with older model versions.
Pipeline RL
Dedicated GPUs for sampling and training. Workers run continuously with in-flight weight updates propagating model changes (see the sketch after these concepts).
Staleness
Difference in model versions between sampling and training. Higher staleness = faster training but unstable learning.
Importance Ratios
Mathematical corrections in policy gradient methods to adjust for sampling from different policy versions.
GPU Memory Constraints
Model weights, activations, and KV cache compete for GPU memory. Batch size limited by KV cache in sampling.
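A bare-bones sketch of the pipeline-RL pattern described above: dedicated sampler workers run continuously, every sample is tagged with the policy version that produced it, and the trainer measures staleness against the current version. The threads, queue, and staleness cap below are stand-ins assumed for illustration, not Applied Compute's implementation.

```python
# Minimal pipeline-RL skeleton: threads and a queue stand in for sampler and
# trainer GPUs. Samples carry the policy version that generated them, so the
# trainer can measure staleness and drop anything too old. All names and the
# staleness cap are illustrative assumptions.
import queue
import threading
import time

sample_queue: "queue.Queue[tuple[int, str]]" = queue.Queue(maxsize=64)
policy_version = 0
MAX_STALENESS = 4   # drop samples more than 4 updates behind the current policy

def sampler_worker():
    while True:
        version = policy_version                       # snapshot of in-flight weights
        time.sleep(0.05)                               # stand-in for token generation
        sample_queue.put((version, "rollout-tokens"))

def trainer_loop(num_updates: int = 100, batch_size: int = 8):
    global policy_version
    for _ in range(num_updates):
        batch = []
        while len(batch) < batch_size:
            version, rollout = sample_queue.get()
            if policy_version - version <= MAX_STALENESS:
                batch.append(rollout)                  # fresh enough to train on
        time.sleep(0.1)                                # stand-in for an optimizer step
        policy_version += 1                            # "in-flight" weight update

for _ in range(4):
    threading.Thread(target=sampler_worker, daemon=True).start()
trainer_loop()
```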
Key Takeaways
Actionable insights from Applied Compute's approach
8 Key Insights for Efficient RL Systems
Asynchronous Pipeline RL
Break the lock-step condition between sampling and training. Allow training while sampling continues, dramatically improving GPU utilization.
GPU Utilization is Critical
In synchronous RL, GPUs 'slack' while waiting for straggler samples. 99% of samples complete in ~40s, but the last 1% takes another 80s.
Staleness Trade-off
Higher staleness enables faster training but makes learning unstable. Requires algorithmic innovation to manage importance ratio variance.
System-Level Modeling
First-principles modeling of end-to-end system performance can guide optimal GPU allocation between sampling and training.
Enterprise RL Requirements
Enterprise RL needs to run in days (not weeks) with sustainable unit costs. Different constraints than academic lab environments.
Data Flywheel Approach
Build systems that improve over time through usage. Create in-house expertise that stays at the forefront of the field.
Response Length Distribution
Understanding workload characteristics including response length distribution is essential for proper system tuning and GPU allocation.
Steady State Consistency
Asynchronous RL achieves consistent batch sizes in steady state, unlike synchronous where batch size degrades over time.
Key Timestamps
Navigate to important moments in the talk
Enterprise RL Approach
RL as mechanism for task mastery in enterprise settings
Business Constraints
Fast (days) and cheap runs for sustainable scaling
Synchronous RL Problem
Lock-step sampling and training inefficiency
Straggler Problem
99% complete in 40s, last 1% takes 80s
Asynchronous Solution
Breaking lock-step for better GPU utilization
Staleness Trade-off
Fast runs vs learning stability
System Modeling
First-principles end-to-end optimization
GPU Provisioning
Balancing sampling and training GPUs
Staleness Visualization
How staleness accumulates over time
Workload Characterization
Response length distribution and batch dynamics
Source Video
Efficient Reinforcement Learning – Rhythm Garg & Linden Li, Applied Compute
Research Methodology: This analysis is based on full transcript analysis with verified timestamps. Key quotes have been extracted with exact YouTube timestamps for reference. The technical insights are derived from the speakers' presentations at AI Engineer Summit.
Related Highlights
Explore more insights on AI engineering and ML infrastructure
Building Cursor Composer
RL infrastructure challenges for coding agents, parallel tool calling, and 4x token efficiency
AI Kernel Generation
Custom kernels, compiler optimization, and hardware acceleration for ML workloads
Devin 2.0 and Moore's Law
AI agents, software engineering automation, and the future of development