Building systems that improve themselves: How AI creates exponential rather than linear value in software engineering
Core Definition
"In traditional engineering, each feature makes the next feature harder to build. In compounding engineering, each feature makes the next feature easier to build."
— Dan Shipper, CEO of Every (Timestamp: ~26:00)
Traditional software engineering suffers from diminishing returns: each feature adds complexity, making future features harder to build. AI-native "compounding engineering" reverses this through systematic knowledge capture and reuse.
Real impact: an organization where 100% of engineers use AI sees roughly 10x the productivity of one at 90% adoption. The difference isn't incremental; it's exponential.
Dan Shipper's framework for building compounding engineering systems (Timestamp: ~28:00):
Create detailed plans when working with agents—capture the "why" not just the "what"
Tell the agent to execute—trust but verify through systematic review
Evaluate through tests, manual review, agent self-evaluation, code review
Compound everything learned back into prompts, sub-agents, and slash commands; this is where the magic happens (see the sketch after this list)
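A minimal sketch of the loop in Python. The `agent` interface and `PromptStore` here are hypothetical stand-ins, not Every's actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class PromptStore:
    """Accumulated learnings: prompts, sub-agents, slash commands."""
    lessons: list[str] = field(default_factory=list)

    def compound(self, lesson: str) -> None:
        # Step 4: every lesson is fed back into all future plans.
        self.lessons.append(lesson)

def build_feature(agent, feature: str, store: PromptStore):
    plan = agent.plan(feature, context=store.lessons)  # 1. capture the "why"
    result = agent.execute(plan)                       # 2. trust...
    lesson = agent.evaluate(result)                    # 3. ...but verify
    store.compound(lesson)                             # 4. compound
    return result
```

Because `store.lessons` grows with every feature, each new plan starts from more context than the last; that is the compounding.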
In AI-driven development, 50% of generated code gets thrown away—but the learning compounds. The shift from "writing code" to "orchestrating agents" changes what engineers optimize for.
The insight: Jane Street snapshots developer workstations every 20 seconds, creating training data from how engineers actually work, not how we wish they worked.
Timestamp: ~08:30 in "AI Engineering at Jane Street"
Specifications are becoming the universal artifact that aligns humans AND machines. Unlike code, specs compose, are testable, and can ship as modules.
"A written specification effectively aligns humans and it is the artifact that you use to communicate and to discuss and debate and refer to and synchronize on"
— Sean Grove, OpenAI (Timestamp: Mid-talk in "The New Code")
The evolution from synchronous chat (1:1 human:AI) to ambient agents (1:many) creates exponential scaling. Ambient agents run in the background, triggered by events, with unlimited concurrency.
Example: An email agent that listens to all incoming messages, processes them in parallel, but asks for human approval on outgoing responses. One engineer managing hundreds of simultaneous workflows.
Timestamp: ~17:00 in "3 ingredients for building reliable enterprise agents" - Harrison Chase
1. Use AI for rapid prototyping without guardrails. Focus on learning capabilities, not production quality.
2. Create custom instructions/modes encoding those learnings. Build reusable prompts and templates.
3. Continuously refine instructions based on new patterns. Use eval systems to measure quality at scale.
From "Stateful Agents" (Charles Packer): Implement core memory + archival memory architecture. Core memory holds top-of-mind context; archival memory stores everything searchable. This prevents agents from "derailing" in long conversations.
Critical insight: generic semantic search fails. Use domain-specific schemas (FinancialGoal, Debt, IncomeSource), not arbitrary facts. "Melody" (dog's name) ≠ "favorite melodies" (music).
Timestamp: "Stop Using RAG as Memory" - Daniel Chalef, Zep
Glean's approach (Chau Tran): Ship agents → users accomplish tasks → save successful workflows as "golden" → use as training data. Every successful task becomes a reusable pattern.
The loop: Production use → Workflow discovery → Pattern capture → Agent improvement → Better production use. This is the compounding flywheel in action.
Timestamp: "How to build Enterprise Aware Agents" - Chau Tran, Glean
Advanced teams run 3,000+ evals daily (vs. 13 for average teams). The future is automated: "Loop agents" that automatically optimize prompts, datasets, and scorers without manual intervention.
Why it matters: Evals drive the flywheel. Better measurements → better iterations → better performance → better measurements. This IS the compounding engine.
Timestamp: "The Future of Evals" - Ankur Goyal, Braintrust
"Autonomous when people hear autonomous they think the cost of this thing doing something bad is really high because I'm not going to be able to oversee it."
Timestamp: ~03:00 in "2026: The Year The IDE Died"
"Ambient does not mean fully autonomous. There are human-in-the-loop patterns: approve, reject, edit tool calls, ask questions, time travel."
Timestamp: ~17:00 in "3 ingredients for building reliable enterprise agents"
Reconciliation: It's not binary—it's a spectrum based on compounding trust. As usage increases, trust compounds (Gene Kim's data), enabling more autonomy over time.
"Vibe coding" = letting AI write all code without examination. Enables rapid prototyping and learning. The "Three Stages" model: YOLO → Structured → Spec-driven.
Timestamp: "How to Improve your Vibe Coding" + "Vibe Coding at Scale"
"When the software is going down at two in the morning, vibes aren't going to fix the bug. Professional software engineers are the last people I see adopting AI."
Timestamp: "Vibes won't cut it" - Chris Kelly, Augment Code
The middle path: "Structured vibe coding" balances YOLO creativity with guardrails. Commit often, pause to inspect, keep workable code. Quality gates prevent negative compounding.
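One way to encode "commit often, keep workable code" as a quality gate, assuming pytest and git are the project's tools; a sketch, not a prescribed workflow:

```python
import subprocess

def commit_if_green(message: str) -> bool:
    # Quality gate: only keep AI-generated changes that pass the tests.
    if subprocess.run(["pytest", "-q"]).returncode != 0:
        return False  # pause and inspect instead of compounding broken code
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    return True
```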
AI agents can report thousands of "bugs" with 97% false positive rates. This creates negative compounding—noise drowning signal.
The hard truth: "The highest-scoring popular agent outside of us scored 7% on SM100. That means the most-used agents in the world are the worst at finding and fixing complex bugs."
Timestamp: ~05:30 in "Agents reported thousands of bugs, how many were real?"
Solution: Human-in-the-loop becomes force multiplier, not friction. Use domain experts to create golden datasets. Focus evals on what actually matters.
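A minimal example of scoring an agent's bug reports against an expert-labeled golden set; names and data are illustrative:

```python
def precision(reported: set[str], golden_real_bugs: set[str]) -> float:
    # Domain experts label the golden set; noisy agents are scored against it.
    return len(reported & golden_real_bugs) / len(reported) if reported else 0.0

golden = {"race in cache eviction", "off-by-one in pager"}
reported = {"race in cache eviction", "style nit", "unused import"}
print(f"precision: {precision(reported, golden):.0%}")  # precision: 33%
```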
Ray Myers' warning: "The least fun part of the job has just become our whole job—you're just reading pull requests from these AIs slinging them at you."
Evidence: the Uplevel study found developers with AI had a "significantly higher bug rate and not even having better throughput." Coding assistance made us feel more productive but ultimately just exploded tech debt.
swyx's Law: "The amount of taste needed to fight slop is an order of magnitude bigger than that needed to produce it." Solution: Use AI to fight slop—computer use, code maps, sub-agents.
Jake Nations (Netflix): "AI has destroyed the balance between code generation speed and human comprehension". Every time we skip thinking to keep up with generation speed, we lose our ability to recognize problems.
The fix: a three-phase approach, Research → Planning → Implementation. Compress understanding into artifacts that can be reviewed at generation speed. Use tests as compounding artifacts, as in the sketch below.
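Tests as compounding artifacts, in miniature: a one-time review finding becomes a permanent, machine-checked constraint. A hypothetical example:

```python
# Captured once during review of AI-generated pagination code; from now on,
# every regeneration is checked against the same understanding, at no cost.
def paginate(items: list, page: int, per_page: int) -> list:
    start = (page - 1) * per_page
    return items[start:start + per_page]

def test_last_partial_page_is_returned():
    assert paginate([1, 2, 3, 4, 5], page=2, per_page=3) == [4, 5]

test_last_partial_page_is_returned()
```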
Companies with centralized architectures see 4x gains from AI adoption vs. 2x global average. Highly distributed architectures see "essentially no correlation" or even negative correlation.
Source: "What Data from 20m Pull Requests Reveal" - Nick Arcolano, Jellyfish
"There is a 10x difference between an org where 90% of engineers use AI versus an org where 100% use AI." If even 10% use traditional methods, you lean back into that world.
Source: Dan Shipper, Every (Timestamp: ~26:00)
"Agents will make high assurance code 100 times cheaper than typical software is produced today." The key: separate prompts for LLM testing vs. writing code (independent verification).
Source: "Vision: Zero Bugs" - Johann Schleier-Smith, Temporal
Windsurf evolved from human-heavy workflows to "80 to 90% agent, 10 to 20% human". Future target: 99% agent, 1% human (final approval only).
Source: "Windsurf everywhere" - Kevin Hou, Windsurf
| Speaker | Timestamp |
| --- | --- |
| Dan Shipper, Every | ~26:00 |
| Ahmad Awais, CommandCode | ~21:00 |
| Scott Wu, Cognition | ~05:00 |
| Ian Butler | Multiple |
| Chau Tran, Glean | Mid-talk |
| Eno Reyes, Factory | Mid-talk |
| Kyle Penfound, Jeremy Adams, Dagger | Multiple |
| John Crepezzi | ~08:30 |
| Steve Yegge & Gene Kim | ~02:30, ~55:00 |
| Harrison Chase, LangChain | ~17:00 |
| Samuel Colvin, Pydantic | Mid-talk |
| Ian Butler, Nick Gregory | ~05:30 |
| Tom Moor, Linear | Mid-talk |
| Mark Bain, AIUS | Mid-talk |
| Sylendran Arunagiri, NVIDIA | Mid-talk |