AI Kernel Generation: What's Working, What's Not, What's Next
Natalie Serrino from Gimlet Labs presents a comprehensive overview of using AI agents to automatically generate optimized GPU kernels. Discover real performance gains (speedups from 22% to 2x), the challenges of correctness verification, and realistic expectations for this promising technology.
"This is not a silver bullet, but it is a promising new tool in the toolbox. The best applications that we see are things like searching across many bags of tricks."
Natalie Serrino, Gimlet Labs (00:16:48)
What Are GPU Kernels?
Clarifying terminology and the problem space
Before diving into the technical details, it's important to clarify what we mean by "kernels" in this context. This talk is NOT about AI generating operating systems like the Linux kernel—it's about optimizing the individual functions that perform massive parallel computations on GPUs.
"What do I mean by kernels? I do not mean AI generating operating systems like the Linux kernel. What I mean is kernels in the sense of the transformer architecture: the individual functions that perform massive parallel computations, leveraging the crazy number of threads that GPUs have."
Clarifying that this is about GPU kernels, not OS kernels
Watch (00:01:25)
"We're building an agentic inference cloud focused on performance and efficiency. And the thing that we've seen with all these talks so far is with agents, they're not just one chat model. They're complex pipelines of multiple models, multiple stages, tool calls, and the compute backing these is inherently heterogeneous."
Explaining the need for heterogeneous compute optimization
Watch (00:00:35)
The Core Problem
Agentic workloads are complex pipelines of multiple models, stages, and tool calls. The compute backing these workloads is inherently heterogeneous, spanning different vendors and hardware sizes. Kernels optimized for one piece of hardware may not perform well on another. AI kernel generation aims to automatically port and optimize segments of agentic workloads across this diverse hardware landscape.
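As a rough illustration of the problem shape, here is a minimal Python sketch of a heterogeneous agentic pipeline. The stage names, model names, and device labels are invented for illustration; this is not Gimlet's actual system or API.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    model: str
    device: str  # illustrative labels, e.g. "apple-m4", "nvidia-rtx6000"

# One agentic workload, multiple stages, each potentially best served
# by different hardware.
pipeline = [
    Stage("plan", "small-llm", "apple-m4"),
    Stage("retrieve", "embedding-model", "nvidia-rtx6000"),
    Stage("generate", "large-llm", "nvidia-rtx6000"),
]

# A kernel tuned for one device may be slow on another, so each
# (stage, device) pair becomes a separate optimization target.
targets = [(s.name, s.device) for s in pipeline]
```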
What's Working
Real performance gains and successful applications
AI-driven kernel generation is delivering measurable results across multiple hardware platforms. From Apple's M4 to NVIDIA's RTX 6000 Blackwell, agents are automatically finding optimizations that rival or exceed human experts.
22% Faster - Candidate optimization in 20 minutes
40% Speedup - Apple M4 with kernel fusion
2x Performance - Vision transformer model
70% Faster - Audio encoder with 6 kernels
"The system had explored a bunch of candidate optimizations. It's comparing to eager mode and torch compile, and it found one candidate that was 22% faster than the torch compile baseline. So this was a real case. And it wasn't just the speedup: it actually took about 20 minutes."
Real-world result: 22% speedup in 20 minutes
Watch (00:05:17)
"What the agent did was it took four of those ops and instead of running individual functions for those it made a mega function that compacted them all together. So kernel fusion isn't new. It's something that torch compile already does quite well, but it's a common way that we found agents can speed up these workloads because you can really customize it to the specific use case."
Kernel fusion as an AI-driven optimization technique
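The talk's example fused four PyTorch ops into one "mega function." The same idea can be sketched in plain Python: the unfused version makes four passes over the data, materializing an intermediate each time, while the fused version reads and writes each element once. The specific ops below (scale, bias, ReLU, GELU) are illustrative, not the ones from the talk.

```python
import math

def gelu(x):
    # tanh approximation of GELU, standing in for one of the fused ops
    return 0.5 * x * (1.0 + math.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

def unfused(xs, scale, bias):
    # Four separate passes, each producing an intermediate "tensor".
    a = [x * scale for x in xs]
    b = [x + bias for x in a]
    c = [max(x, 0.0) for x in b]
    return [gelu(x) for x in c]

def fused(xs, scale, bias):
    # One pass: the four ops compacted into a single function,
    # so each element is read and written exactly once.
    return [gelu(max(x * scale + bias, 0.0)) for x in xs]

xs = [-1.5, -0.2, 0.0, 0.7, 2.0]
assert unfused(xs, 2.0, 0.1) == fused(xs, 2.0, 0.1)
```

Because the fused version applies the operations in the same order, the results here are bit-identical; on real hardware, fusion's win is avoiding the memory traffic of the intermediates.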
Watch (00:08:32)
"This result achieved a 40% speedup over the baseline on the M4."
Apple M4 benchmark results with custom kernel fusion
Watch (00:08:53)
"Using torch compile, ours was twice as fast. So this was like a holy sh*t moment."
Vision transformer model achieved 2x speedup
Watch (00:15:27)
"It was about 70% faster. Both implementations using torch compile."
Audio encoder model with 6 custom kernels on RTX 6000 Blackwell
Watch (00:16:17)
"We know that fusion works. We know that tiling works. And we can run lots of experiments really quickly this way by launching them with agents and see what actually performs the best on the workload."
AI's strength: rapid experimentation across known techniques
Watch (00:16:59)
Gimlet's Architecture
Agentic swarm approach to kernel optimization
Gimlet Labs has built a sophisticated multi-agent system for automatic kernel optimization. The architecture separates concerns between coordination, idea generation, and rigorous verification.
Supervisor Agent
Takes input code, target hardware, and human prompting. Coordinates the entire optimization workflow.
Synthesis Swarm
Collective of agents that generate optimization ideas. The "idea factory" exploring techniques like fusion and tiling.
Verification Agent
Hardware-in-the-loop system that runs candidates on actual hardware. Extremely strict about correctness.
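A minimal sketch of how these three roles could fit together, with the hardware-dependent parts stubbed out. All function names and the candidate format are hypothetical; this is not Gimlet's actual implementation.

```python
import random

def synthesis_swarm(source, hints, n=4):
    """Idea factory (stubbed): propose candidate optimizations."""
    techniques = ["fuse", "tile", "vectorize", "reorder-loops"]
    return [f"{source}:{random.choice(techniques)}" for _ in range(n)]

def verification_agent(candidate):
    """Hardware-in-the-loop check (stubbed): correctness gate, then timing."""
    correct = True                      # in reality: run on hardware, compare outputs
    runtime = random.uniform(0.5, 2.0)  # in reality: measured with warm-ups
    return correct, runtime

def supervisor(source, hardware, human_hints):
    """Coordinates the workflow: dispatch the swarm, verify, keep the best."""
    results = []
    for cand in synthesis_swarm(source, human_hints):
        ok, t = verification_agent(cand)
        if ok:                          # incorrect candidates are discarded
            results.append((t, cand))
    return min(results)                 # fastest verified candidate

best_time, best_candidate = supervisor("matmul_kernel", "rtx6000", ["try fusion"])
```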
"We have a supervisor agent which takes in input code, target hardware, and then also human prompting, because humans still can really guide the best path for optimization. That supervisor is in charge of managing the work. It deploys the synthesis agentic swarm, which collectively work together to come up with ideas for optimizations, and they are basically the idea factory coming up with new techniques."
Supervisor agent coordinating optimization workflow
Watch (00:14:25)
"Those ideas get sent to the verification agent, which is running them on actual hardware in a hardware-in-the-loop system to see how they do."
Hardware-in-the-loop verification system
Watch (00:14:53)
What's Not Working (Yet)
Challenges, limitations, and realistic expectations
AI kernel generation is powerful but not magic. Understanding the current limitations and challenges is crucial for setting realistic expectations and avoiding disappointment.
Key Challenges
Expert Shortage
Very few GPU kernel optimization experts exist. Those who do are overwhelmed with work. This is the problem AI aims to solve, but it also means limited training data.
Correctness Verification
Defining "correct" for floating-point computations is difficult. Naive timing measurements are often wrong. Verification agents must be extremely strict.
Benchmark Gaps
Limited examples of low-level kernels across different hardware platforms. Input data scarcity and benchmarking challenges.
Measurement Complexity
Cache effects, warm-up periods, launch time vs execution time. Many gotchas that require careful handling.
"There's just not enough experts to be able to solve every problem. Everyone on Twitter is whining about how it's impossible to find these people, and the people that exist are really overtaxed with so much to do, so much work."
The shortage of GPU kernel optimization experts
Watch (00:02:27)
"So there's some challenges, though, with measuring these agents at kernel synthesis. First of all, you have to figure out what your definition of correct is when you're dealing with floating point. This is always a question."
Floating-point correctness verification challenges
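One common answer is a mixed relative/absolute tolerance check in the spirit of `numpy.allclose` / `torch.allclose`. Below is a small sketch, plus a reminder of why bit-exact comparison is the wrong bar: floating-point addition is not associative, so a fused or reordered kernel can be correct without being bit-identical.

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    # Mixed relative/absolute tolerance, in the spirit of numpy/torch:
    # |a - b| <= atol + rtol * |b|, elementwise.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Floating-point addition is not associative, so reordering (as a fused
# kernel might) can change the last bits of the result:
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c != a + (b + c)

ref = [0.1 + 0.2]
opt = [0.3]
assert ref != opt            # not bit-identical...
assert allclose(ref, opt)    # ...but equal under tolerance
```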
Watch (00:05:32)
"If you just do a naive timer start on your implementation, it's probably going to be wrong. And there was a great blog that had a diagram for this, because you're basically measuring the launch time, not the execution time. So there's a bunch of gotchas like that that, when you're building an agentic system like this, you have to be really careful about catching, doing things like warm-ups and cache clearing."
Benchmarking complexities and gotchas
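A minimal CPU-side sketch of a timing harness that handles two of the gotchas mentioned: warm-up runs and outlier-resistant aggregation. On a GPU you would additionally synchronize the device before reading the clock (e.g. `torch.cuda.synchronize()` in PyTorch), since otherwise you measure kernel launch time rather than execution time.

```python
import time

def bench(fn, *args, warmup=5, iters=20):
    """Median-of-iters wall-clock timing with warm-up runs.

    On CPU this suffices; on a GPU, also synchronize the device before
    stopping the clock, or you time the launch, not the execution.
    """
    for _ in range(warmup):          # warm caches, JITs, allocators
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]    # median is robust to outliers

t = bench(sum, range(10_000))
```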
Watch (00:06:00)
"You also need great benchmarks for this. I think that someone said earlier that there's not a ton of examples of low-level kernels across all these different hardware platforms. And so the input data is a challenge, and also benchmarking it is a challenge."
Lack of comprehensive benchmarks across hardware platforms
Watch (00:06:37)
"That verification agent needs to be extremely strict about making sure that no funny business is happening. And that's a major part of the challenge."
Verification challenges in ensuring correctness
Watch (00:15:00)
Realistic Limitations
"That is a valid optimization, but I wouldn't necessarily call it rocket science. So we consider that to be a trivial case study where if you're not using a more optimized attention module, maybe you haven't actually optimized your workload that much yet."
AI finding obvious optimizations (swapping attention modules)
Watch (00:15:45)
"In terms of the worst applications, we're still not at the point where they're writing the N+1 for Flash Attention, coming up with those genius algorithmic advances. And they're not currently outperforming a human expert who's banged their head on this problem for months. And we shouldn't expect them to be."
AI not yet replacing human experts for novel algorithmic discoveries
Watch (00:17:39)
When to Use AI Kernel Generation
Best applications and current limitations
Best Applications
Searching Optimization Space
Rapidly test fusion, tiling, and other known techniques. Launch many experiments in parallel to find what performs best.
Porting to New Hardware
Take insights from existing implementations and specialize them to new hardware features. Automatically adapt optimizations.
Adopting New Optimizations
Quickly adapt to changing requirements like different quantization schemes. Guide optimizations using known patterns.
Force Multiplier for Experts
Let human experts focus on novel algorithmic advances. AI handles the routine optimization work at scale.
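The "searching across many bags of tricks" workflow above can be sketched as a verify-then-time loop over interchangeable implementations of the same op. The variants here are toy Python functions; the shape of the loop (correctness gate first, then timing, keep the fastest) is the point.

```python
import time

def v_naive(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

def v_comprehension(xs):
    return [x * x for x in xs]

def v_map(xs):
    return list(map(lambda x: x * x, xs))

def search(variants, xs, reference):
    """Verify every candidate against a reference, then keep the fastest."""
    expected = reference(xs)
    best = None
    for fn in variants:
        assert fn(xs) == expected          # correctness gate comes first
        t0 = time.perf_counter()
        for _ in range(50):                # crude timing; see harness above
            fn(xs)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[0]:
            best = (elapsed, fn.__name__)
    return best

elapsed, winner = search([v_naive, v_comprehension, v_map],
                         list(range(1000)), v_naive)
```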
Not Ready For
Novel Algorithm Discovery
Not inventing Flash Attention N+1 or breakthrough algorithms. Human experts still superior for deep algorithmic research.
Replacing Human Expertise
Cannot match experts who spend months on specific problems. AI is a tool to augment, not replace, human creativity.
Silver Bullet Expectations
Not magic. Works best as part of comprehensive optimization strategy. Still requires human guidance and oversight.
What's Next
Future directions and open research questions
The field of AI-driven kernel generation is rapidly evolving. Gimlet Labs and other researchers are exploring several promising directions that could dramatically expand the capabilities of automated optimization.
Hardware Abstraction
Building abstract models of different machines to help agents further specialize code to individual hardware characteristics.
PTX Assembly Generation
Generating PTX, NVIDIA's low-level virtual instruction set, directly. AI may outperform humans at this cumbersome low-level programming task.
Formal Verification
Exploring academic formal verification methods for correctness. Proving optimized kernels are mathematically sound.
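A hardware abstraction model might look like a simple descriptor the agent queries when specializing a kernel, e.g. deriving a tile size from shared-memory capacity for a blocked matrix multiply. Everything here (the class, its fields, and the numbers) is a hypothetical sketch, not Gimlet's design.

```python
from dataclasses import dataclass

@dataclass
class HardwareModel:
    name: str
    shared_mem_bytes: int   # fast on-chip memory per block / threadgroup
    dtype_bytes: int = 4    # fp32

    def max_square_tile(self):
        # Largest T such that two T x T input tiles fit in shared memory,
        # as in a blocked matmul that stages both operands on-chip.
        t = 1
        while 2 * ((t + 1) ** 2) * self.dtype_bytes <= self.shared_mem_bytes:
            t += 1
        return t

# Illustrative capacity only (48 KiB is a common per-block figure).
gpu = HardwareModel("example-gpu", shared_mem_bytes=48 * 1024)
tile = gpu.max_square_tile()
```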
"We want to build abstract models of different machines to help the agents further specialize code to individual hardware."
Future work: Hardware abstraction models
Watch (00:18:09)
"We're also interested in generating basically what is like NVIDIA assembly, such as PTX. You can see an example here, because the thought is that we can basically do that better with AI than humans because it's so cumbersome."
Future work: PTX assembly generation
Watch (00:18:18)
"And then also looking at academic formal verification methods for correctness."
Future work: Formal verification methods
Watch (00:18:32)
Key Takeaways
Practical insights for engineers and researchers
1. Promising Tool, Not Silver Bullet
Realistic Expectations
- AI kernel generation delivers real speedups, from 22% to 2x, in practice
- Works best for searching known optimization spaces, not inventing novel algorithms
- Force multiplier for human experts, not a replacement
- Expect continued improvement, but don't expect magic
2. Best for Rapid Experimentation
Strong Use Cases
- Quickly testing fusion, tiling, and other known techniques
- Porting optimizations to new hardware platforms
- Adapting to changing requirements (quantization, etc.)
- Scaling optimization work that humans can't manually handle
3. Verification is Critical
Major Challenges
- Floating-point correctness is difficult to define and verify
- Benchmarking has many gotchas: warm-ups, cache effects, launch time
- Verification agents must be extremely strict
- Hardware-in-the-loop testing is essential
4. Future Looks Promising
What's Next
- Hardware abstraction models for better specialization
- PTX assembly generation for low-level optimization
- Formal verification methods for correctness proofs
- Continued progress toward novel algorithm discovery
Source Video
AI Kernel Generation: What's working, what's not, what's next
Natalie Serrino • Gimlet Labs
Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation. This analysis covers AI-driven kernel generation techniques, real performance benchmarks (speedups from 22% to 2x), challenges in correctness verification, and future research directions including PTX generation and formal verification methods.
Key Concepts: GPU kernels, kernel fusion, agentic swarm, hardware-in-the-loop verification, heterogeneous compute, torch.compile, PyTorch, transformer architecture, Apple M4, RTX 6000 Blackwell, PTX assembly, formal verification