Enterprise AI Engineering

AI Engineering at Jane Street

When off-the-shelf AI tools won't work for your OCaml codebase, you build your own. Workspace snapshotting, custom evaluation infrastructure, and a sidecar architecture that serves 1,000+ developers.

"The only thing moving faster than the progress of the models is kind of our creativity around how to employ them."

— John Crepezzi, AI Assistant Team • 0:48

More than public

OCaml Code

Internal > worldwide public

67%

Emacs Users

Of the firm

20s

Snapshot Interval

Workspace captures

50-100x

CES Speed

Faster than builds

Executive Summary

Jane Street faces a unique challenge: they use OCaml for everything — web apps, Vim plugins, even FPGA code. When they wanted to adopt AI tools, they hit a wall: off-the-shelf solutions don't work for an obscure functional language with more internal code than exists publicly worldwide.

John Crepezzi shares how they built custom AI infrastructure from scratch: workspace snapshotting to capture training data from real developer workflows, a Code Evaluation Service (CES) that runs 50-100x faster than actual builds for reinforcement learning, and a sidecar architecture (Aid) that unifies AI across VS Code, Emacs, and Neovim.

The result is a sophisticated AI engineering ecosystem that successfully applies LLMs to improve developer productivity in an environment where standard AI tools simply don't work. Their approach offers valuable lessons for any organization working with non-mainstream technologies.

The OCaml Problem

Models Aren't Good at OCaml

2:16

"The first and most important is that models themselves are just not very good at OCaml and this isn't the fault of the AI labs this is just kind of a byproduct of the amount of data that exists for training."

OCaml is incredibly obscure. It was built in France and is primarily used in theorem proving and formal verification.

Data Scarcity: Internal > Public

2:30

"There's a really good chance that the amount of OCaml code that we have inside of Jane Street it's just more than like the total combined amount of OCaml code that there exists in the world outside of our walls."

They can't rely on public datasets or pretrained models. They have to build everything from scratch.

🔧 Custom Tooling

  • OCaml for everything — Web apps, Vim plugins, FPGA code all written in OCaml
  • Custom build systems — Built their own distributed build environment
  • Custom code review — "Iron" system, not GitHub/GitLab
  • Mercurial monorepo — Not Git, not multiple repos

⌨️ Unconventional Choices

  • 67% Emacs — At last count, two-thirds of the firm uses Emacs instead of VS Code
  • Js_of_ocaml — OCaml to JavaScript transpiler for web apps
  • Viml — OCaml to Vimscript transpiler for plugins
  • HardCaml — OCaml library for FPGA code

The Vision: AI Across Development Flow

"We want the ability to kind of take llms and apply them to different parts of our development flow and light up different parts so maybe we want to use large language models to resolve merge conflict or build better feature descriptions or figure out who reviewers for features be and we don't want to be hampered by the boundaries between different systems when we do that."

John Crepezzi

On why they built custom infrastructure

3:15

Core Goal: Generate Diffs from Prompts

User writes description → Model suggests multi-file diff

5:32

"We wanted to be able to generate diffs given a prompt so what that means is we wanted a user inside of an Editor to be able to write a description of what they wanted to happen and then have the model suggest a potentially multifile diff."

Success criteria:

  • Applies cleanly (no merge conflicts)
  • High likelihood of typechecking (OCaml is statically typed)
  • Target 100 lines or less (ideal LLM capability zone)

The Workflow

User writes description
Model generates diff
User reviews & accepts
Applied to codebase
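
To make the target concrete, here is a minimal OCaml sketch of the shapes involved in this workflow: a suggestion carries the user's prompt and a potentially multi-file diff, and the three success criteria collapse into a single predicate. All names are hypothetical; `applies_cleanly` and `typechecks` stand in for real patch and build checks.

```ocaml
(* Hypothetical sketch of the "prompt -> multi-file diff" shape described
   above; type and function names are illustrative, not Jane Street's API. *)

type file_edit = {
  path : string;           (* file touched by the suggested change *)
  patch : string;          (* unified-diff hunk for that file *)
}

type suggestion = {
  prompt : string;         (* what the user asked for in the editor *)
  edits : file_edit list;  (* the potentially multi-file diff *)
}

(* The three success criteria from the talk as a single predicate.
   [applies_cleanly] and [typechecks] stand in for real patch/build checks. *)
let acceptable ~applies_cleanly ~typechecks (s : suggestion) =
  let line_count =
    List.fold_left
      (fun acc e -> acc + List.length (String.split_on_char '\n' e.patch))
      0 s.edits
  in
  applies_cleanly s && typechecks s && line_count <= 100
```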

The Training Data Problem

Why Existing Sources Don't Work

4:20

Features (PRs in Iron): Too large (500-1000 lines), written differently than editor prompts

Commits: Used as checkpoints, no descriptions, not isolated changes

Jane Street uses commits differently than most companies — they're not meaningful units of work.

The Training Reality

5:14

"It turns out that's just not how it works... in order to get good outcomes you have to have the model see a bunch of examples that are in the shape of the type of question that you want to ask the model."

Initial naivety: Take a model, show it Jane Street code, get back a model that knows their libraries. Reality: Much harder.

The Solution: Workspace Snapshotting

Capture Developer Workflows Every 20 Seconds

Identify Green → Red → Green patterns to extract training data

8:21

"The way that works is we take snapshots of developer workstations throughout the workday so you can think like every 20 seconds we just take a snapshot of what the developer doing."

What they capture:

  • • File state at each moment
  • • Build status (green/red)
  • • Error messages when build fails
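
A minimal sketch, assuming a simple record per snapshot, of what each 20-second capture might hold; the field names are illustrative, not the real schema.

```ocaml
(* Minimal sketch of what one ~20-second workspace snapshot might hold.
   Field names are assumptions for illustration, not the real schema. *)

type build_status =
  | Green                        (* build passing *)
  | Red of string list           (* build failing, with its error messages *)

type snapshot = {
  timestamp : float;                (* when the snapshot was taken *)
  files : (string * string) list;  (* path -> file contents *)
  build : build_status;            (* build state at that moment *)
}
```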

The Green → Red → Green Pattern

Green (build passing)
Red (change breaks the build)
Green (developer fixes it)

"Often corresponds to a place where a developer has made an isolated change right you start writing some code you break the build and then you get it back to green and that's how you make a change."

8:44

Training Data Triple

(Context, Prompt, Diff) from each cycle

9:06

"So if we capture the build error at the Red State and then the diff from red to Green we can use that as training data to help the model be able to recover from mistakes."

Context

Developer state before change (files open, errors, etc.)

Prompt

LLM-generated detailed description, filtered to human-like level

Diff

Code changes that fixed the issue
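
Putting the pieces together, here is an illustrative OCaml sketch of how green-to-red-to-green windows in the snapshot stream could be turned into (context, prompt, diff) triples. `diff_between` and `describe_change` are hypothetical stand-ins for a real diff tool and the LLM that writes the filtered description; the real pipeline is more involved.

```ocaml
(* Illustrative sketch of mining (context, prompt, diff) triples from the
   snapshot stream. [diff_between] stands in for a real diff tool and
   [describe_change] for the LLM that writes, then filters, a human-level
   description of the change. Real extraction would tolerate runs of red
   snapshots; this keeps the simplest Green -> Red -> Green window. *)

type build_status = Green | Red of string list

type snapshot = {
  files : (string * string) list;  (* path -> contents *)
  build : build_status;
}

type training_example = {
  context : snapshot;  (* state at the red step: files plus build errors *)
  prompt : string;     (* LLM-written description, filtered to human level *)
  diff : string;       (* the red -> green change *)
}

let extract ~diff_between ~describe_change (snaps : snapshot list) =
  let rec go acc = function
    | { build = Green; _ }
      :: ({ build = Red _; _ } as red)
      :: ({ build = Green; _ } as fixed)
      :: rest ->
        let diff = diff_between red fixed in
        let example = { context = red; prompt = describe_change diff; diff } in
        (* The trailing green snapshot may start the next cycle. *)
        go (example :: acc) (fixed :: rest)
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snaps
```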

Code Evaluation Service (CES)

How do you train models to write code that actually compiles? Jane Street built CES — a fast build-like service for model training.

50-100x Faster Than Real Builds

Pre-warmed build environment for rapid evaluation

10:20

How it works:

  • Pre-warmed build at a green state
  • Workers apply model diffs all day
  • Report back build status (red/green)
  • Used for reinforcement learning and evaluation

Over months, this significantly improved model performance — models gradually learned to produce valid OCaml code.

Reinforcement Learning Loop

Model generates diff
CES applies & checks
Pass? → Reward
Fail? → Penalize
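
A hedged sketch of this loop, with `apply_diff` and `typecheck` as stand-ins for CES's real patch application and fast type-checking step; the reward values are an assumption, since the talk only describes pass/fail feedback.

```ocaml
(* Hedged sketch of the CES loop: apply a candidate diff on a pre-warmed
   green workspace, report red/green, and turn the verdict into a reward
   signal. [apply_diff] and [typecheck] are stand-ins for the real patch
   application and the fast (no full build) type-checking step. *)

type verdict = Green | Red of string  (* error output when the check fails *)

let evaluate ~apply_diff ~typecheck ~workspace (diff : string) : verdict =
  match apply_diff workspace diff with
  | Error msg -> Red ("does not apply cleanly: " ^ msg)
  | Ok patched -> typecheck patched  (* Green if the patched code typechecks *)

(* Reward shaping is an assumption; the talk only says pass/fail results
   feed back into reinforcement learning. *)
let reward = function
  | Green -> 1.0
  | Red _ -> 0.0
```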

😄 Funny Story: Why Evaluation Matters

11:45

"We put a bunch of data into it we worked on it for months we're real excited and we put our first code in for code review through the automated agent it spun for a bit and it came back with something along the lines of um I'll do it tomorrow."

"Of course it did that because it's trained on a bunch of human examples and humans write things like I'll do things or I'll do this tomorrow."

12:16

"Having evaluations that are meaningful is kind of a cornerstone of making sure that models don't go off the rails like this and you don't waste a bunch of your time and money."

Good Code in OCaml = Typechecks

Statically typed languages give you fast evaluation

10:01

"Good code in OCaml because it's statically typed is code that typechecks so we want to have good code be code that when it is applied on top of a base revision can go through the type Checker and the type Checker agrees that the code is valid."

In a statically typed language, the type checker is a fast, reliable evaluator: code that typechecks is far more likely to be correct than code that merely parses.
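
As a minimal sketch of that idea, a candidate workspace can be judged by whether the build tool's type-checking pass succeeds. Using dune and its `@check` alias here is an assumption for illustration; Jane Street's own build system plays this role inside CES.

```ocaml
(* Minimal sketch of "the type checker as the evaluator": run the build tool
   over a workspace that already has the candidate diff applied, and treat a
   zero exit code as "typechecks". Using dune here is an assumption; Jane
   Street's own build system plays this role internally, and CES keeps the
   workspace pre-warmed so the check is far cheaper than a full build. *)

let typechecks ~workspace_dir =
  let cmd =
    Printf.sprintf "dune build @check --root %s 2>/dev/null"
      (Filename.quote workspace_dir)
  in
  Sys.command cmd = 0
```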

The Aid Sidecar Architecture

Jane Street needed AI that works across VS Code, Emacs, and Neovim. Their solution: Aid — a sidecar application that separates AI logic from editor integration.

🎯 Three Key Principles

  • 1. Don't repeat code — Write once, run across Neovim, VS Code, Emacs
  • 2. Maintain flexibility — Swap models, prompting strategies, context builders easily
  • 3. Collect metrics — Track latency, diff acceptance rates, real-world usage

🏗️ Architecture Benefits

  • Update Aid without restarting editors
  • A/B test models (50% users get A, 50% get B)
  • Measure acceptance rates to determine better approaches
  • Integration points for domain-specific tools

"What's really neat about this is that Aid sits as a sidecar application on the Developers machine which means that we when we want to make changes to Aid we don't have to make changes to the individual editors and hope that people restart their editors we can just restart the Aid Service on all of the boxes so we restart Aid and then everyone gets the most recent copy."

John Crepezzi

On the sidecar architecture benefit

13:55

The Architecture

Editors (VS Code, Emacs, Neovim)
→ Aid (sidecar): prompt construction, context building, build status
→ LLM
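
A hypothetical sketch of the editor-to-sidecar boundary implied by this diagram: editors stay thin and only ship a small request to the local Aid process, which owns context building, prompt construction, and model choice. All names are illustrative.

```ocaml
(* Illustrative sketch of the editor <-> sidecar boundary. Each editor
   integration stays thin: it sends a request to the local Aid process and
   renders whatever diff comes back. Everything model-specific lives behind
   the sidecar, so it can be swapped or restarted without touching editors.
   All names here are hypothetical. *)

type request = {
  editor : string;           (* "vscode" | "emacs" | "neovim" *)
  prompt : string;           (* what the user typed *)
  open_files : string list;  (* context the editor already has on hand *)
}

type response = {
  diff : string;             (* suggested multi-file diff *)
  model : string;            (* which model produced it, for metrics/A-B *)
}

(* The sidecar owns context building, prompt construction and model choice. *)
let handle ~build_context ~choose_model ~call_model (req : request) : response =
  let context = build_context ~open_files:req.open_files in
  let model = choose_model ~editor:req.editor in
  { diff = call_model ~model ~context ~prompt:req.prompt; model }
```

Because the editor never sees model details, swapping models or restarting Aid requires no editor changes, which is exactly the benefit quoted above.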

📊 VS Code Integration

Sidebar similar to Copilot

  • Visual interface
  • Accepts multifile diffs
  • Familiar UX for VS Code users

⌨️ Emacs Integration

Markdown buffer experience

  • Emacs users prefer text buffers
  • Keybindings to append content
  • Traditional Emacs workflow

Acceptance Rate: Long-term Investment

Every model change available everywhere instantly

15:20

"Acceptance rate is kind of a an investment that pays off over time every time something changes in large language models we're able to change it in one place Downstream of the editors and then have it available everywhere."

A/B testing: Send 50% of the company to one model, 50% to another, then determine which one gets the higher acceptance rate.
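
A hedged sketch of that experiment: assign each user to a model with a stable hash and compare per-model acceptance rates. Hash-based assignment and the event format are assumptions; the talk only says half the company gets each model.

```ocaml
(* Hedged sketch of the A/B test described above: split users 50/50 by a
   stable hash of their id, then compare acceptance rates per model.
   Hash-based assignment is an assumption made for illustration. *)

let assign_model ~user_id =
  if Hashtbl.hash user_id mod 2 = 0 then `Model_a else `Model_b

(* events: (model, accepted) pairs logged by the Aid sidecar *)
let acceptance_rate events model =
  let total, accepted =
    List.fold_left
      (fun (t, a) (m, ok) ->
        if m = model then (t + 1, (if ok then a + 1 else a)) else (t, a))
      (0, 0) events
  in
  if total = 0 then 0.0 else float_of_int accepted /. float_of_int total
```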

Future Directions

"The approach is the same through all of these we keep things pluggable we lay a strong Foundation to build on top of and we build the ways for the rest of the company to add to our experience by adding more domain specific tooling on top of it."

John Crepezzi

On their overall philosophy

16:16

🚀 Expansion Areas

  • New ways to apply RAG inside editors
  • Multi-agent workflows at large scale
  • Working with reasoning models more and more
  • Similar approaches applied across different areas

🏗️ Core Philosophy

  • Keep things pluggable — Swap components easily
  • Lay strong foundation — Build for the long term
  • Enable extension — Teams add domain-specific tools
  • Iterate continuously — CES runs for months, improving gradually

Key Takeaways for AI Engineers

Practical insights from Jane Street's AI engineering journey

  • Off-the-shelf AI tools don't work for everyone — sometimes you have to build custom solutions for your unique stack
  • Data scarcity is real — Jane Street has more OCaml code internally than exists publicly worldwide
  • Workspace snapshotting is brilliant — capture natural developer workflows to create realistic training data
  • Build evaluation infrastructure early — CES prevents models from learning undesirable behaviors (like procrastination)
  • Type systems are your friend — in statically typed languages, the type checker is a fast, reliable evaluator
  • Sidecar architecture enables rapid iteration — update AI logic without touching editor integrations
  • A/B test everything — send 50% of users to model A, 50% to model B, measure acceptance rates
  • Keep things pluggable — design for flexibility (swap models, prompting strategies, context builders)
  • Domain-specific tools matter — enable teams to add their own tools on top of your foundation
  • Training data quality over quantity — better to have fewer, high-quality examples from real workflows

About the Speaker

John Crepezzi

AI Assistant Team, Jane Street

John Crepezzi works on the AI Assistant team at Jane Street, a quantitative trading firm known for its heavy use of OCaml and unconventional technology choices. He has spent his entire career in dev tools, including a long tenure at GitHub before joining Jane Street.

His team is responsible for maximizing the value Jane Street can get from large language models across their development workflow, from resolving merge conflicts to building better feature descriptions and figuring out who should review code.

Background: Previously at GitHub for a long time, and before that at a variety of other dev tools companies.

Source Video

AI Engineering at Jane Street

John Crepezzi, AI Assistant Team • AI Engineer Summit 2024

Video ID: 0ML7ZLMdcl4 · Duration: 16:38
Tags: Jane Street · OCaml · AI Engineering · Workspace Snapshotting · CES

Research Note: All quotes in this report are timestamped and link to exact moments in the video for validation, and were verified against the original VTT transcript from AI Engineer Summit 2024. The analysis was conducted using a multi-agent workflow, with dedicated agents for transcript analysis, highlight extraction, technical research, and content strategy.