GPU Optimization

AI Kernel Generation: What's Working, What's Not, What's Next

Natalie Serrino from Gimlet Labs presents a comprehensive overview of using AI agents to automatically generate optimized GPU kernels. Discover real performance gains (22-70% speedups), the challenges of correctness verification, and realistic expectations for this promising technology.

"This is not a silver bullet but it is a promising new tool in the toolbox. The best applications that we see are things like searching across many bags of tricks."
Natalie Serrino, Gimlet Labs (00:16:48)
  • 22-70%: speedup achieved
  • ~20 min: optimization time
  • 6: custom kernels in one example
  • L1-L3: kernel benchmark tiers

What Are GPU Kernels?

Clarifying terminology and the problem space

Before diving into the technical details, it's important to clarify what we mean by "kernels" in this context. This talk is NOT about AI generating operating systems like the Linux kernel—it's about optimizing the individual functions that perform massive parallel computations on GPUs.

"What do I mean by kernels? I do not mean AI generating operating systems like the Linux kernel. What I mean is kernels at the sense of like transformer architecture like the individual like functions that perform like massive parallel computations leveraging like all the crazy amounts of threads that GPUs have."

Clarifying that this is about GPU kernels, not OS kernels

Watch (00:01:25)
"We're building an agentic inference cloud focused on performance and efficiency. And the thing that we've seen with all these talks so far is with agents, they're not just one chat model. They're complex pipelines of multiple models, multiple stages, tool calls, and the compute backing these is inherently heterogeneous."

Explaining the need for heterogeneous compute optimization

Watch (00:00:35)

The Core Problem

Agentic workloads are complex pipelines of multiple models, stages, and tool calls. The compute backing these workloads is inherently heterogeneous, spanning different vendors and hardware sizes. A model optimized for one hardware target may not perform well on another. AI kernel generation aims to automatically port and optimize segments of agentic workloads across this diverse hardware landscape.

What's Working

Real performance gains and successful applications

AI-driven kernel generation is delivering measurable results across multiple hardware platforms. From Apple's M4 to NVIDIA's RTX 6000 Blackwell, agents are automatically finding optimizations that rival or exceed human experts.

  • 22% faster: candidate optimization found in 20 minutes
  • 40% speedup: Apple M4 with kernel fusion
  • 2x performance: vision transformer model
  • 70% faster: audio encoder with 6 custom kernels

"The system had explored a bunch of candidate optimizations. It's comparing to eager mode and torch compile and it found one candidate that was 22% faster than the torch compile baseline. So this was a real case. It just sped up because it actually took about 20 minutes."

Real-world result: 22% speedup in 20 minutes

Watch (00:05:17)
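To make that comparison concrete, here is a minimal sketch of how an eager-mode and a torch.compile baseline might be set up for an agent-generated candidate to be measured against. The toy model, tensor sizes, and variable names are illustrative assumptions, not details from the talk.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a workload segment handed to the optimization agent.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()
x = torch.randn(64, 1024, device="cuda")

eager_fn = model                    # eager-mode baseline
compiled_fn = torch.compile(model)  # torch.compile baseline

with torch.no_grad():
    y_eager = eager_fn(x)
    y_compiled = compiled_fn(x)

# An agent-generated candidate would be timed against both baselines (see the
# benchmarking sketch later in this report) and accepted only if its output
# stays numerically close to the eager result.
```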
"What the agent did was it took four of those ops and instead of running individual functions for those it made a mega function that compacted them all together. So kernel fusion isn't new. It's something that torch compile already does quite well, but it's a common way that we found agents can speed up these workloads because you can really customize it to the specific use case."

Kernel fusion as an AI-driven optimization technique

Watch (00:08:32)
"This result achieved a 40% speed up over the baseline on the M4."

Apple M4 benchmark results with custom kernel fusion

Watch (00:08:53)
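As a rough illustration of the kind of hand-fused "mega function" the previous two quotes describe, the sketch below folds a bias add, a scale, and a GELU-style activation into a single Triton kernel launch instead of one launch per op. The kernel name, the sigmoid-based GELU approximation, and the block size are assumptions for illustration, not the agent's actual output.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_scale_gelu(x_ptr, bias_ptr, out_ptr, n_elements, scale, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    b = tl.load(bias_ptr + offs, mask=mask)
    y = (x + b) * scale
    # Sigmoid-based GELU approximation, computed in-register so all the
    # elementwise steps share a single kernel instead of one launch each.
    y = y * tl.sigmoid(1.702 * y)
    tl.store(out_ptr + offs, y, mask=mask)

def fused(x, bias, scale=1.0, BLOCK=1024):
    # Assumes bias has already been expanded to x's shape (flat 1:1 indexing).
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, BLOCK),)
    fused_bias_scale_gelu[grid](x, bias, out, n, scale, BLOCK=BLOCK)
    return out
```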
"Using torch compile, ours was twice as fast. So this was like a holy sh*t moment."

Vision transformer model achieved 2x speedup

Watch (00:15:27)
"It was about 70% faster. Both implementations using torch compile."

Audio encoder model with 6 custom kernels on RTX 6000 Blackwell

Watch (00:16:17)
"We know that fusion works. We know that tiling works and we can run lots of experiments really quickly this way by launching them with agents and see what actually performs the best on the workload."

AI's strength: rapid experimentation across known techniques

Watch (00:16:59)
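A minimal sketch of what that experiment loop might look like: each candidate implementation is checked for numerical correctness against a reference, and only the correct ones are timed and ranked. The helper names, tolerances, and iteration counts are assumptions, not Gimlet's actual harness.

```python
import time
import torch

def time_fn(fn, x, iters=20, warmup=3):
    # Simple GPU timing with warm-up and synchronization; see the benchmarking
    # pitfalls discussed later in this report for why both matter.
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

def pick_best(candidates, reference_fn, x, rtol=1e-3, atol=1e-3):
    """Hypothetical search loop: drop broken or incorrect candidates, keep the fastest."""
    reference = reference_fn(x)
    scored = []
    for name, fn in candidates:           # candidates: list of (name, callable) pairs
        try:
            out = fn(x)
        except Exception:
            continue                      # candidate failed to run at all
        if not torch.allclose(out, reference, rtol=rtol, atol=atol):
            continue                      # candidate is numerically wrong: discard
        scored.append((time_fn(fn, x), name))
    return min(scored) if scored else None
```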

Gimlet's Architecture

Agentic swarm approach to kernel optimization

Gimlet Labs has built a sophisticated multi-agent system for automatic kernel optimization. The architecture separates concerns between coordination, idea generation, and rigorous verification.

Supervisor Agent

Takes input code, target hardware, and human prompting. Coordinates the entire optimization workflow.

Orchestration

Synthesis Swarm

Collective of agents that generate optimization ideas. The "idea factory" exploring techniques like fusion and tiling.

Idea Generation

Verification Agent

Hardware-in-the-loop system that runs candidates on actual hardware. Extremely strict about correctness.

Validation
"We have a supervisor agent which takes in input code target hardware and then also human prompting because humans still can really guide the best path for optimization. That supervisor is in charge of managing the work. It deploys the synthesis agentic swarm which collectively work together to come up with ideas for optimizations and they are basically the idea factory coming up with new techniques."

Supervisor agent coordinating optimization workflow

Watch (00:14:25)
"Those ideas get sent to the verification agent which is running them in on actual hardware in a hardware in the loop system to see how they do."

Hardware-in-the-loop verification system

Watch (00:14:53)
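The control flow described in this section might be sketched roughly as follows. The class and method names here (propose, run_on_hardware, and the result fields) are hypothetical stand-ins for illustration; they are not Gimlet's actual interfaces.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    idea: str           # e.g. "fuse the elementwise ops after the matmul"
    kernel_source: str  # generated kernel code

@dataclass
class Result:
    candidate: Candidate
    correct: bool
    latency_ms: float

def supervise(input_code: str, target_hw: str, human_hints: str,
              synthesis_swarm, verifier) -> Optional[Result]:
    """Hypothetical supervisor loop: the swarm proposes, the verifier runs on real hardware."""
    best: Optional[Result] = None
    for agent in synthesis_swarm:                             # the "idea factory"
        for candidate in agent.propose(input_code, target_hw, human_hints):
            result = verifier.run_on_hardware(candidate)      # hardware-in-the-loop check
            if not result.correct:                            # strict correctness gate
                continue
            if best is None or result.latency_ms < best.latency_ms:
                best = result
    return best
```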

What's Not Working (Yet)

Challenges, limitations, and realistic expectations

AI kernel generation is powerful but not magic. Understanding the current limitations and challenges is crucial for setting realistic expectations and avoiding disappointment.

Key Challenges

Expert Shortage

Very few GPU kernel optimization experts exist. Those who do are overwhelmed with work. This is the problem AI aims to solve, but it also means limited training data.

Correctness Verification

Defining "correct" for floating-point computations is difficult. Naive timing measurements are often wrong. Verification agents must be extremely strict.

Benchmark Gaps

Limited examples of low-level kernels across different hardware platforms. Input data scarcity and benchmarking challenges.

Measurement Complexity

Cache effects, warm-up periods, launch time vs execution time. Many gotchas that require careful handling.

"There's just not enough experts to be able to solve every problem. Everyone on Twitter is whining about how it's impossible to find these people and the people that exist are like really overt taxed with so much to do, so much work."

The shortage of GPU kernel optimization experts

Watch (00:02:27)
"So there's some challenges though with measuring these agents at kernel synthesis. So um like first of all you have to figure out what your definition of correct is when you're dealing with floating point. This is always a question."

Floating-point correctness verification challenges

Watch (00:05:32)
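One common way to pin down "correct" for floating-point kernels is elementwise closeness within tolerances rather than bitwise equality, since reordered reductions and fused math legitimately perturb the low-order bits. The tolerances below are illustrative assumptions, not values from the talk.

```python
import torch

def numerically_close(candidate, reference, rtol=1e-3, atol=1e-3):
    """Tolerance-based correctness check for a candidate kernel's output."""
    if candidate.shape != reference.shape or candidate.dtype != reference.dtype:
        return False
    # Compare in float32 so the check itself is not dominated by low-precision rounding.
    return torch.allclose(candidate.float(), reference.float(), rtol=rtol, atol=atol)
```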
"if you just do a naive timer start on your implementation, it's probably going to be wrong. And there was a great blog that had a diagram for this because you're basically measuring the launch time, not the execution time. So there's a bunch of kind of gotchas like that that when you're building an agentic system like this, you have to be really careful about catching doing things like warm-ups and cache clearing"

Benchmarking complexities and gotchas

Watch (00:06:00)
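A hedged sketch of the kind of timing harness that avoids the naive-timer trap: GPU kernel launches are asynchronous, so a wall-clock timer around the call mostly measures launch overhead. CUDA events (or an explicit synchronize) measure the work on the device itself, after a warm-up pass. The iteration counts are illustrative.

```python
import torch

def gpu_time_ms(fn, *args, iters=100, warmup=10):
    """Average per-call device time in milliseconds, measured with CUDA events."""
    for _ in range(warmup):            # warm-up: compilation, caches, clock ramp-up
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()           # drain any previously queued work
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()           # wait until the last kernel actually finishes
    return start.elapsed_time(end) / iters

# For cold-cache measurements, input buffers can be rotated between iterations
# instead of reusing a single tensor that stays resident in cache.
```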
"You also need great benchmarks for this. I think that someone said earlier that there's not a ton of examples of low-level kernels across all these different hardware platforms. And so the input data is a challenge and also benchmarking it is a challenge."

Lack of comprehensive benchmarks across hardware platforms

Watch (00:06:37)
"That verification agent needs to be extremely strict about making sure that no funny business is happening. And that's a major part of the challenge."

Verification challenges in ensuring correctness

Watch (00:15:00)

Realistic Limitations

"That is a valid optimization, but I wouldn't necessarily call it rocket science. So we consider that to be a trivial case study where if you're not using a more optimized attention module, maybe you haven't actually optimized your workload that much yet."

AI finding obvious optimizations (swapping attention modules)

Watch (00:15:45)
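The "trivial case" described here amounts to something like the swap below: replacing a manual attention implementation with PyTorch's fused scaled_dot_product_attention, which can dispatch to FlashAttention-style kernels on supported hardware. The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Manual attention: materializes the full score matrix in memory.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def swapped_attention(q, k, v):
    # The "not rocket science" optimization: use the fused attention op instead.
    return F.scaled_dot_product_attention(q, k, v)
```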
"In terms of the worst applications, we're still not at the point where they're writing the N plus 1 for flash attention, coming up with like those genius algorithmic advances. And they're not currently outperforming a human expert who bang their head on this problem for months. And we shouldn't expect them to be."

AI not yet replacing human experts for novel algorithmic discoveries

Watch (00:17:39)

When to Use AI Kernel Generation

Best applications and current limitations

Best Applications

Searching Optimization Space

Rapidly test fusion, tiling, and other known techniques. Launch many experiments in parallel to find what performs best.

Porting to New Hardware

Take insights from existing implementations and specialize them to new hardware features. Automatically adapt optimizations.

Adopting New Optimizations

Quickly adapt to changing requirements like different quantization schemes. Guide optimizations using known patterns.

Force Multiplier for Experts

Let human experts focus on novel algorithmic advances. AI handles the routine optimization work at scale.

Not Ready For

Novel Algorithm Discovery

Not inventing Flash Attention N+1 or breakthrough algorithms. Human experts still superior for deep algorithmic research.

Replacing Human Expertise

Cannot match experts who spend months on specific problems. AI is a tool to augment, not replace, human creativity.

Silver Bullet Expectations

Not magic. Works best as part of comprehensive optimization strategy. Still requires human guidance and oversight.

What's Next

Future directions and open research questions

The field of AI-driven kernel generation is rapidly evolving. Gimlet Labs and other researchers are exploring several promising directions that could dramatically expand the capabilities of automated optimization.

Hardware Abstraction

Building abstract models of different machines to help agents further specialize code to individual hardware characteristics.

PTX Assembly Generation

Generating NVIDIA assembly (PTX) directly. AI may outperform humans at this cumbersome low-level programming task.

Formal Verification

Exploring academic formal verification methods for correctness. Proving optimized kernels are mathematically sound.

"We want to build abstract models of different machines to help the agents further specialize code to individual hardware."

Future work: Hardware abstraction models

Watch (00:18:09)
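Such an abstract machine model might look something like the sketch below: a structured description of hardware characteristics that can be fed into an agent's prompt or used to pick tile sizes. The fields and placeholder numbers are assumptions, not Gimlet's actual schema.

```python
from dataclasses import dataclass

@dataclass
class HardwareProfile:
    """Abstract description of a target machine for kernel specialization."""
    name: str
    memory_bandwidth_gb_s: float     # rough main-memory bandwidth
    shared_memory_per_block_kb: int  # on-chip scratchpad available per block
    max_threads_per_block: int
    supports_bf16: bool

# Placeholder values purely for illustration.
example_gpu = HardwareProfile(
    name="example-datacenter-gpu",
    memory_bandwidth_gb_s=1000.0,
    shared_memory_per_block_kb=48,
    max_threads_per_block=1024,
    supports_bf16=True,
)
```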
"We're also interested in generating basically what is like NVIDIA assembly such as PTX. You can see an example here because the thought is that we can basically do that better with AI than humans because it's so cumbersome."

Future work: PTX assembly generation

Watch (00:18:18)
"And then also looking at academic formal verification methods for correctness."

Future work: Formal verification methods

Watch (00:18:32)

Key Takeaways

Practical insights for engineers and researchers

1. Promising Tool, Not Silver Bullet

Realistic Expectations

  • AI kernel generation delivers 22-70% speedups in real scenarios
  • Works best for searching known optimization spaces, not inventing novel algorithms
  • Force multiplier for human experts, not replacement
  • Expect continued improvement but don't expect magic

2. Best for Rapid Experimentation

Strong Use Cases

  • Quickly testing fusion, tiling, and other known techniques
  • Porting optimizations to new hardware platforms
  • Adapting to changing requirements (quantization, etc.)
  • Scaling optimization work that humans can't manually handle

3. Verification is Critical

Major Challenges

  • Floating-point correctness is difficult to define and verify
  • Benchmarking has many gotchas: warm-ups, cache effects, launch time
  • Verification agents must be extremely strict
  • Hardware-in-the-loop testing is essential

4. Future Looks Promising

What's Next

  • Hardware abstraction models for better specialization
  • PTX assembly generation for low-level optimization
  • Formal verification methods for correctness proofs
  • Continued progress toward novel algorithm discovery

Source Video

AI Kernel Generation: What's working, what's not, what's next

Natalie Serrino • Gimlet Labs

Video ID: 6guQG_tGt0o • Duration: ~19 minutes • AI Engineer Conference
Watch on YouTube

Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers AI-driven kernel generation techniques, real performance benchmarks (22-70% speedups), challenges in correctness verification, and future research directions including PTX generation and formal verification methods.

Key Concepts: GPU kernels, kernel fusion, agentic swarm, hardware-in-the-loop verification, heterogeneous compute, torch.compile, PyTorch, transformer architecture, Apple M4, RTX 6000 Blackwell, PTX assembly, formal verification

Research sourced from AI Engineer Conference transcript. Analysis covers AI-driven GPU kernel optimization with real performance benchmarks and architectural insights from Gimlet Labs. All quotes verified against original transcript with exact timestamps.