Compilers in the Age of LLMs
"You've got an open client in your codebase. You've got a few Hugging Face tabs open. You've got three different repos with the word playground in them. And you've got at least one agentic workflow that's really just stringing together a bunch of HTTP calls."
Yusuf Olokoba
Muna • 00:00:00
The AI Engineering Crisis
Why current infrastructure is broken
Complexity That Only Grows Over Time
"What developers actually want is something way simpler. Just give me an OpenAI-style client that just works. Let me point it to any model at all. It doesn't matter if it's running locally, if it's running remotely, if it's Llama.cpp or TensorRT. I just want something that works with minimal code changes."
— Yusuf Olokoba (00:01:22)
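As a rough sketch of that experience, the snippet below points the standard OpenAI Python client at different OpenAI-compatible backends, so switching between a local server and a hosted model is a configuration change rather than new infrastructure. The base URLs, API keys, and model name are placeholders, not values from the talk.

```python
# Sketch of the "one client, any model" experience: the stock OpenAI client
# pointed at OpenAI-compatible backends. URLs, keys, and model names are
# placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")   # e.g. a llama.cpp or vLLM server
cloud = OpenAI(base_url="https://api.example.com/v1", api_key="sk-...")  # hosted endpoint

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Same call shape regardless of where the model actually runs."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching models or backends is a one-line change, not a rebuild.
print(ask(local, "llama-3.1-8b", "Summarize this release note."))
```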
❌ Docker Container Hell
Every new model means writing Dockerfiles, spinning up containers, and managing infrastructure
❌ Tool Context Overflow
Each model becomes another tool in the agent's context, bloating prompts
❌ Infrastructure Rebuild
"How do I use more models in more places without having to rebuild or extend my infrastructure every single time?"
❌ Hardware Lock-in
Models are tied to specific hardware and can't run anywhere else
The Future: Hybrid Inference
"We expect that in the future we will see smaller models typically much closer to users either locally on their devices or in edge locations working with cloud AI models that are much larger and have bigger reasoning abilities."
— Yusuf Olokoba (00:03:31)
🔵 Edge/Local Models
Smaller, faster, closer to users
🟣 Cloud Models
Larger, more reasoning power
The Implication: "This means that developers have to move away from the cages of Python code and Docker containers into something that is a lot more low-level, closer to the hardware, and a lot more responsive."
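One way to picture that hybrid split is a small routing layer: quick requests stay on a local model, and anything that needs deeper reasoning goes to a larger cloud model. The heuristic and the stand-in callables below are assumptions for illustration, not a prescription from the talk.

```python
# Hypothetical routing policy for hybrid inference: small model at the edge
# for quick requests, large cloud model when more reasoning is needed.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    generate: Callable[[str], str]  # any callable that maps prompt -> text

def make_router(edge: Route, cloud: Route, max_edge_tokens: int = 256):
    def route(prompt: str, needs_deep_reasoning: bool = False) -> str:
        # Crude heuristic: long prompts or an explicit flag go to the cloud.
        if needs_deep_reasoning or len(prompt.split()) > max_edge_tokens:
            return cloud.generate(prompt)
        return edge.generate(prompt)
    return route

# Usage with stand-in callables (real ones would wrap local/remote clients).
router = make_router(
    edge=Route("local-small", lambda p: f"[edge] {p[:40]}..."),
    cloud=Route("cloud-large", lambda p: f"[cloud] {p[:40]}..."),
)
print(router("Classify this short support ticket."))
print(router("Plan a multi-step data migration...", needs_deep_reasoning=True))
```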
The Solution: A Python Compiler
Write simple Python, get self-contained binaries that run anywhere
"Simple Plain Python → Tiny Self-Contained Binary"
"I'll walk you through how we decided to build a compiler for Python that enables developers to write simple plain Python code and then convert that into a tiny self-contained binary that can then run anywhere at all."
— Yusuf Olokoba (00:01:43)
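The developer-facing half of that workflow looks like ordinary Python. The compile step in the comments below is a hypothetical illustration of the "Python in, self-contained binary out" idea, not Muna's documented interface.

```python
# Plain Python business logic: no serving framework, no Dockerfile.
def sentiment(text: str) -> str:
    """Nothing here is container- or hardware-specific."""
    return "positive" if "great" in text.lower() else "negative"

# Imagined compiler entry point (assumption, not a real API):
#   compile_to_binary(sentiment, target="aarch64-apple-darwin")
# would trace the function, lower it to C++/Rust, and emit one
# dependency-free artifact. The pipeline sections below break that
# process down step by step.
print(sentiment("This talk was great"))
```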
Universal Portability
"Simply because every piece of technology that you've ever touched has a C or C++ compiler. This is what gives us the ability to take high-level Python code and convert it into a form that is self-contained and that can now run anywhere at all."
— Yusuf Olokoba (00:14:25)
Cloud, Apple silicon, Linux, Windows, edge devices — if it has a C/C++ compiler, your model can run there
Write Python
Simple, familiar Python code — no low-level expertise needed
Get Binary
Self-contained executable with zero dependencies
Run Anywhere
Cloud, edge, local — any hardware with a C compiler
Minimal Changes
Switch models without rebuilding infrastructure
How It Works: The Compiler Pipeline
From Python code to universal binary in four steps
Tracing
Analyze Python code using its Abstract Syntax Tree (AST) to build an internal representation
Challenge: PyTorch FX tracing failed because dynamic Python is hard to trace
Solution: Custom AST tracer with internal heuristics
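A minimal sketch of AST-based tracing with Python's standard `ast` module: parse a function's source and record the calls and operations it performs. A production tracer would layer many more heuristics on top of this.

```python
# Toy AST tracer: collect the calls and binary operations a function uses,
# as a stand-in for the internal representation a compiler front end builds.
import ast
import inspect
import textwrap

class OpCollector(ast.NodeVisitor):
    def __init__(self):
        self.calls: list[str] = []
        self.binops: list[str] = []

    def visit_Call(self, node: ast.Call):
        self.calls.append(ast.unparse(node.func))
        self.generic_visit(node)

    def visit_BinOp(self, node: ast.BinOp):
        self.binops.append(type(node.op).__name__)
        self.generic_visit(node)

def trace(fn):
    source = textwrap.dedent(inspect.getsource(fn))
    collector = OpCollector()
    collector.visit(ast.parse(source))
    return collector

def normalize(x: list[float]) -> list[float]:
    total = sum(x)
    return [v / total for v in x]

ops = trace(normalize)
print(ops.calls)   # ['sum']
print(ops.binops)  # ['Div']
```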
Type Propagation
Bridge Python's dynamic typing with the static type systems of C++ and Rust
"Python is a very dynamic language. So one variable X could be assigned to an integer and then immediately after assigned to say a string. Whereas in lower level languages like C++ and Rust if you declare a variable you must give it a type and that type can never change."
— Yusuf Olokoba (00:08:41)
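The toy pass below shows the flavor of type propagation: start from annotated parameter types, walk the AST to assign a static type to each expression, and reject rebinding a variable to a different type. The mapping to native types and the promotion rule are simplifying assumptions.

```python
# Toy type-propagation pass: infer static types for simple assignments from
# parameter annotations, and reject type-changing rebinds.
import ast

PY_TO_NATIVE = {"int": "int64_t", "float": "double", "str": "std::string"}

def propagate_types(source: str) -> dict[str, str]:
    fn = ast.parse(source).body[0]
    assert isinstance(fn, ast.FunctionDef)
    env: dict[str, str] = {arg.arg: arg.annotation.id for arg in fn.args.args}

    def infer(node: ast.expr) -> str:
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.Constant):
            return type(node.value).__name__
        if isinstance(node, ast.BinOp):
            left, right = infer(node.left), infer(node.right)
            return "float" if "float" in (left, right) else left  # naive promotion
        raise NotImplementedError(ast.dump(node))

    for stmt in fn.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            name, inferred = stmt.targets[0].id, infer(stmt.value)
            if name in env and env[name] != inferred:
                raise TypeError(f"{name}: {env[name]} cannot become {inferred}")
            env[name] = inferred
    return {name: PY_TO_NATIVE.get(t, t) for name, t in env.items()}

src = """
def scale(x: int, factor: float):
    y = x * factor
    z = y + 1
    return z
"""
print(propagate_types(src))
# {'x': 'int64_t', 'factor': 'double', 'y': 'double', 'z': 'double'}
```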
LLM Code Generation
Mass-produce native code translations using LLMs
"We can simply have LLMs generate all the code that we need that translates a function from Python right into C++ and Rust. And so this gives us the ability to basically mass-produce a lot of the operations that we would otherwise have had to manually rewrite ourselves in native code."
— Yusuf Olokoba (00:13:13)
Key Insight: Variety in code comes from combinations, not unique operations
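A sketch of the mass-production idea: for each operation, prompt an LLM to emit a native translation, then keep it only if it passes verification. The prompt text, the model name, and the use of the OpenAI chat API are illustrative assumptions about how such a pipeline could be wired, not Muna's actual tooling.

```python
# Illustrative codegen loop: ask an LLM to translate one Python operation
# into C++, then gate the result behind a verification check before it is
# allowed into the compiler's operation library.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any code LLM would do

PROMPT = """Translate this Python function into a standalone C++17 function
with the same name and semantics. Return only the C++ code.

{python_source}
"""

def generate_native_op(python_source: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(python_source=python_source)}],
    )
    return response.choices[0].message.content

def accept(cpp_source: str) -> bool:
    # Placeholder for the verification stage: compile the candidate and run
    # it against the Python reference on sample inputs before accepting it.
    return "TODO" not in cpp_source

op = "def relu(x: float) -> float:\n    return x if x > 0 else 0.0"
cpp = generate_native_op(op)
if accept(cpp):
    print(cpp)
```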
Compilation
Generate self-contained binaries via C/C++ compilers
Result: Single binary file with no Python dependencies
Portability: Runs on any hardware with a C compiler
Invocation: FFI + OpenAI-style client wrapper
"And with this entire system in place, we have just recreated the official OpenAI client, but given it access to any open-source model that we can get into a Python function."
— Yusuf Olokoba (00:16:08)
The Implication:
Universal model compatibility — use any model (local, remote, any format) through a familiar OpenAI-style client interface, without any infrastructure changes
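The snippet below sketches that invocation step, assuming the compiler emits a shared library that exports a C function: `ctypes` loads it, and a thin wrapper gives it a familiar client-style call shape. The file name `model.so` and the symbol `predict` are hypothetical placeholders.

```python
# Hypothetical FFI bridge: load a compiled shared library and expose its
# exported C function behind a small client-style wrapper.
import ctypes

lib = ctypes.CDLL("./model.so")              # placeholder artifact name
lib.predict.argtypes = [ctypes.c_char_p]     # placeholder exported symbol
lib.predict.restype = ctypes.c_char_p

class CompiledModelClient:
    """Minimal client facade over the compiled binary."""

    def create(self, prompt: str) -> str:
        result = lib.predict(prompt.encode("utf-8"))
        return result.decode("utf-8")

client = CompiledModelClient()
print(client.create("Hello from plain Python"))
```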
The LLM Innovation
Using LLMs to mass-produce compiler code — meta!
The Problem: Manual Translation
Converting Python operations to C++/Rust manually is tedious and error-prone. The variety of code combinations is massive.
Challenge: "The variety you'll ever see in source code in the wild is not because there's such a giant volume of these operations... It's actually because you can combine operations in so many different ways."
— Yusuf Olokoba (00:12:07)
The Solution: LLM Mass Production
Use LLMs to automatically generate all the native code translations needed.
"We can simply have LLMs generate all the code that we need that translates a function from Python right into C++ and Rust."
— Yusuf Olokoba (00:13:13)
Why This Approach Works
Operations Are Finite
Core operations in Python are limited — the variety comes from how you combine them
Combinations Are Manageable
LLMs can handle the combinatorial explosion of operation combinations
Verification Ensures Correctness
Generated code is verified and tested before compilation, as sketched below
Mass Production Becomes Possible
What would have taken months of manual work can now be done automatically at scale
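The differential-testing sketch referenced above compares the Python reference implementation with the compiled translation on randomized inputs. `run_compiled` is a hypothetical hook standing in for the FFI call into the generated native code.

```python
# Differential-testing sketch for the verification stage: the original Python
# function and the compiled artifact must agree on the same inputs.
import math
import random

def relu(x: float) -> float:          # Python reference implementation
    return x if x > 0 else 0.0

def run_compiled(name: str, x: float) -> float:
    # Placeholder: a real pipeline would call the compiled C++/Rust
    # translation via FFI here. This stub just mirrors the reference.
    return relu(x)

def verify(reference, compiled_name: str, trials: int = 1000) -> bool:
    for _ in range(trials):
        x = random.uniform(-1e6, 1e6)
        expected = reference(x)
        actual = run_compiled(compiled_name, x)
        if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
            return False
    return True

assert verify(relu, "relu")
```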
Actionable Takeaways
How to apply these insights today
For AI Engineers
Infrastructure strategy
- Escape Docker hell: Compilers provide universal portability without container overhead
- Support any model: Single codebase works with local, remote, any format
- Enable hybrid inference: Edge + cloud models in the same application
- Reduce tool bloat: No more adding every model as an MCP tool
For Product Teams
Deployment flexibility
- Deploy anywhere: Cloud, edge, customer premises, offline
- Reduce infrastructure costs: No need for separate model servers
- Faster iteration: Switch models without rebuilding infrastructure
- Better performance: Native binaries avoid Python interpreter overhead
For Researchers
Novel applications
- LLMs for compiler code: Use LLMs to mass-produce translation logic
- AST-based analysis: Deep code understanding vs simple text replacement
- Type propagation: Bridge dynamic and static type systems automatically
- Verification frameworks: Ensure LLM-generated code is correct
Future Directions
What's coming next
- Hybrid inference patterns: Small local models + large cloud models working together
- Edge AI revolution: Move beyond server-side to run on any hardware
- Model portability: "Write once, run anywhere" for AI models
- Lower-level development: Move from Python cages to closer-to-hardware approaches
"Simply because every piece of technology that you've ever touched has a C or C++ compiler."
— Yusuf Olokoba, Muna (00:14:25)
This simple fact enables universal portability: take high-level Python code, convert it to self-contained binaries, and run it anywhere — from cloud servers to edge devices.
Video Reference
Compilers in the Age of LLMs
Yusuf Olokoba, Muna
Duration: ~17 min
Event: AI Engineer Conference
Video ID: q2nHsJVy4FE
Speaker: Yusuf Olokoba