December 28, 2025

AI Infrastructure Trends & The Cost of Intelligence

The economic reality of deploying AI at scale: Inference costs have collapsed 99.7% in two years, but infrastructure complexity has exploded. Understanding the new economics of AI deployment.

The New Economic Reality

"GPT-4 went from $30 per million tokens to $2 in about 18 months. The distilled versions are now 10 cents. Last year's model is a commodity."

— Sarah Guo, Conviction (Timestamp: ~15:00 in "State of Startups and AI 2025")

The Cost Collapse

99.7% Cost Reduction in 24 Months

From 2022 to 2024, cost per token dropped by 99.7%. GPT-4 fell from $30 to $2 per million tokens. Distilled versions now cost just 10 cents per million tokens.

Business impact: Retool charges $3/hour for its cheapest agent. Orbital scaled to 20 billion tokens/month while reaching multiple-seven-figure ARR. Inference cost is no longer the primary constraint for most applications.

Source: Multiple sources including Sarah Guo (Conviction) and Donald Hruska (Retool)
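To make the scale of the drop concrete, here is a back-of-the-envelope cost sketch. The per-million-token prices are the ones quoted above; the request size and monthly volume are illustrative assumptions, and input/output pricing is deliberately collapsed into a single number.

```python
# Rough monthly cost at the three price points quoted above.
# Request size (2,000 tokens in + 500 out) and volume are illustrative assumptions.
PRICES_PER_M_TOKENS = {
    "GPT-4 (2023)": 30.00,
    "GPT-4-class (today)": 2.00,
    "distilled model": 0.10,
}

TOKENS_PER_REQUEST = 2_000 + 500   # prompt + completion
REQUESTS_PER_MONTH = 1_000_000

for model, price in PRICES_PER_M_TOKENS.items():
    monthly = TOKENS_PER_REQUEST * REQUESTS_PER_MONTH / 1_000_000 * price
    print(f"{model:>20}: ${monthly:,.0f}/month")
```

At these assumed volumes the same workload falls from roughly $75,000/month to $250/month, which is why inference cost stops being the binding constraint.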

Custom Silicon: 60% Price-Performance Improvement

AWS Trainium and Inferentia offer ~60% better price-performance than Nvidia GPUs. Amazon announced 40% price reduction on P4 and P5 instances. Cloud providers are heavily investing in custom chips.

Trade-off: Less HBM, but significantly cheaper. Requires the Neuron SDK (similar to XLA for TPUs). "Two H100s outperform four A100s at the same cost" - Humza Iqbal, Snorkel AI.

Timestamp: ~25:00 in "POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments" - Randall Hunt, Caylent

Disaggregated Inference: 50% Cost Reduction

Separating the prefill (compute-bound) and decode (memory-bound) phases enables specialized worker allocation. With 16 H100s running disaggregated versus aggregated at a fixed latency target, you get 2x tokens per second per GPU, which means paying 50% less per token.

Critical insight: This only benefits interactive applications (20-200 tokens per second) in the middle of the latency/throughput spectrum. Short input sequence lengths (ISL) or extreme high-throughput batch scenarios see minimal benefit. Analyze your workload patterns first.

Timestamp: ~15:00 in "Hacking the Inference Pareto Frontier" - Kyle Kranen, NVIDIA
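A rough sketch of what "2x tokens per second per GPU at fixed latency" means for unit economics. The 2x factor is from the talk; the GPU-hour price and baseline throughput are illustrative assumptions.

```python
# Effective cost per million output tokens, aggregated vs disaggregated.
# The 2x throughput-per-GPU factor at fixed latency is from the talk;
# the $/GPU-hour and baseline tokens/sec figures are assumptions.
GPU_HOUR_COST = 3.00                              # assumed blended H100 $/GPU-hour
AGG_TOKENS_PER_SEC = 50                           # assumed aggregated throughput per GPU
DISAGG_TOKENS_PER_SEC = AGG_TOKENS_PER_SEC * 2    # same latency target

def cost_per_million_tokens(tokens_per_sec: float) -> float:
    tokens_per_gpu_hour = tokens_per_sec * 3600
    return GPU_HOUR_COST / tokens_per_gpu_hour * 1_000_000

print(f"aggregated:    ${cost_per_million_tokens(AGG_TOKENS_PER_SEC):.2f}/M tokens")
print(f"disaggregated: ${cost_per_million_tokens(DISAGG_TOKENS_PER_SEC):.2f}/M tokens")  # half the cost
```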

Paradigm Shifts

Good Data Beats Great Models

"Good data consistently beats great models... GPT-4 is 60 times more expensive and order of magnitude slower compared to GPT-4o mini." Companies achieving millions in ARR by investing in data quality rather than chasing the latest models.

Jan Siml's insight: Simply adding more triggers to alert users and digging deeper into their needs outperformed model upgrades. Orbital achieved seven-figure ARR by focusing on data, not model sophistication.

Timestamp: ~17:00 in "Stop Ordering AI Takeout" - Jan Siml

Edge Deployment Now Faster Than Cloud for Voice AI

"Running our models on edge are about five times faster than if you were to roundtrip... we've broken that." Network latency + cloud speed is now slower than edge deployment for real-time voice applications.

The breakthrough: Cartesia's edge models achieve 40ms latency (2.5x faster than cloud). State Space Models (SSMs) enable O(1) per-token generation, versus the O(n²) attention cost of transformers, which is critical for real-time applications.

Timestamp: ~35:00 in "Serving Voice AI at Scale" - Arjun Desai, Cartesia

Inference Clusters Becoming Training Clusters at Night

"Inference clusters at night are now also training clusters - they run reinforcement learning with verifiable rewards, generating trajectories and keeping good tokens when there's low utilization."

Dylan Patel's insight: The line between inference and training infrastructure is blurring. GPU utilization can be dramatically improved by using inference clusters for RL during off-peak hours.

Timestamp: ~12:30 in "The Geopolitics of AI Infrastructure" - Dylan Patel, SemiAnalysis

The "Prompt Tax" Crisis

"You have over 1,000 domain-specific prompts. More prompts equals more prompt tax. When new models drop, you face massive migration effort." Orbital scaled from 0 to 20 billion tokens/month, but prompt management became exponential burden.

Andrew Thompson's solution: Use System 2 (reasoning) models to help migrate prompts. Consider feature flags for progressive rollout. Bet on models getting smarter over time rather than on perfecting prompts for the current one.

Timestamp: ~18:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson, Orbital
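A minimal sketch of the "use a stronger model to migrate prompts" idea, assuming an OpenAI-compatible client. The model names and the prompt library are placeholders, not Orbital's actual pipeline.

```python
# Hypothetical sketch: ask a stronger "System 2" model to adapt each prompt
# in an existing library to a newly released target model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MIGRATION_INSTRUCTIONS = (
    "Rewrite the following production prompt so it works well on the new target "
    "model. Preserve the output format and all domain-specific constraints."
)

def migrate_prompt(old_prompt: str, target_model: str) -> str:
    response = client.chat.completions.create(
        model="o3",  # placeholder for whichever reasoning model you trust
        messages=[
            {"role": "system", "content": MIGRATION_INSTRUCTIONS},
            {"role": "user", "content": f"Target model: {target_model}\n\n{old_prompt}"},
        ],
    )
    return response.choices[0].message.content

# migrated = {name: migrate_prompt(p, "new-model") for name, p in prompt_library.items()}
```

The migrated prompts still need evals before rollout; the sketch only automates the first draft of each rewrite.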

Expert Debates

Build vs. Buy: When to Build In-House?

Build In-House When...

"Buy SaaS for exploration, build in-house once the workflow is yours. Internal systems can be deeply optimized for one specific job, ship same-day tweaks, and run on already-paid infra."

— Jan Siml (Orbital)

Timestamp: "Stop Ordering AI Takeout"

Buy When...

"Buying makes sense for cross-industry best practices and vendor integrations. Building requires evaluating what's actually needed versus chasing every Twitter trend."

— Randall Hunt (Caylent)

Timestamp: "POC to PROD"

Resolution criteria: Build when you own the data and workflow. Buy for vendor integrations and cross-industry best practices. The key is understanding what's core vs. commodity.

Fine-Tuning vs. Prompt Engineering: Dead or Alive?

Fine-Tuning Skeptics

"As models improved from Claude 3.5 to 4, prompt engineering proved 'unreasonably effective.' Fine-tuning may no longer be necessary for many use cases."

— Randall Hunt (Caylent)

Fine-Tuning Proponents

"Fine-tuning essential for niche domains. OCaml has more training data inside Jane Street than exists publicly."

— Jane Street engineers

The middle path: Use prompt engineering first. Fine-tune only when you have domain-specific data that doesn't exist in public training sets. Most companies over-invest in fine-tuning too early.

MCP: Production-Ready or Not Good Yet?

"MCP is All You Need"

"MCP is industry standard. We built MCP gateway for all internal integrations. 'MCP is just JSON streams.'"

— Samuel Colvin (Pydantic)

Timestamp: "MCP is all you need"

"MCP Is Not Good Yet"

"It's in beta, clients break constantly. 'You cannot just proxy OpenAPI—must design for agent context.'"

— David Cramer (Sentry)

Timestamp: "MCP Is Not Good Yet"

Reality: MCP is the right direction but still maturing. Use it with caution in production. Build gateway patterns for centralized authentication and observability.

Warning Signs

The US Power Crisis: 63GW Shortfall

"The US has a 63 gigawatt shortfall of power for planned data centers. Utility companies are regulated monopolies that get to do whatever they want." 100GW needed, only 44GW available.

The constraint: AI infrastructure buildout is limited by power availability, not GPUs or capital. The Middle East is building massive capacity: G42 (5GW campus), Data Volt (2GW). For context, xAI's entire infrastructure is 200MW.

Timestamp: ~34:00 in "The Geopolitics of AI Infrastructure" - Dylan Patel

Mitigation: Consider building data centers in regions with a power surplus. Advocate for regulatory reform. Plan for power as the primary constraint, not compute.

Redis Vector Search: Fast but Extremely Expensive

"Redis vector search is extremely fast. The bad news is that it is extremely expensive because it has to sit in RAM. Be prepared to blow up your RAM to store vector indexes."

The trap: The performance vs. cost tradeoff is not always obvious. Infrastructure bills can explode when using in-memory vector stores at scale.

Timestamp: ~29:00 in "POC to PROD" - Randall Hunt

Alternative: Use PostgreSQL (pgvector) or OpenSearch for vector search with HNSW indexes on disk, which keeps RAM requirements manageable at scale.
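A minimal sketch of the disk-backed alternative using pgvector's HNSW index. Connection details, the table, and the tiny 3-dimensional vectors are illustrative assumptions; real embeddings would use the model's dimension (e.g. 1536).

```python
# Disk-backed vector search with pgvector: the HNSW index lives on disk,
# so only the working set needs RAM (unlike a fully in-memory store).
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(3)
        );
    """)
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_cosine_ops);"
    )
    # Nearest neighbours by cosine distance (<=> is pgvector's cosine operator).
    cur.execute(
        "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;",
        ("[0.1, 0.2, 0.3]",),
    )
    print(cur.fetchall())
```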

Long-Running Agents Break Traditional Infrastructure

"LM applications are really built on shoddy foundations. In the previous era of infrastructure, if you had a request that took a couple of seconds, your on call was getting paged."

The problem: Web 2.0 infrastructure designed for millisecond responses breaks under minute- or hour-long agent workflows. Most serverless platforms time out after 5 minutes and don't support streaming.

Timestamp: ~08:00 in "How agents broke app-level infrastructure" - Evan Boyle

Solution: Separate API and compute layers. Use Redis streams for resumable intermediate status. Plan for workflows running 60+ minutes with error boundaries and retry logic.

Over-Engineering Agent Tooling

"The number of times I see people defining a tool called get current date is infuriating." Following "Twitter recipes" for AI leads to giant evals, multi-agent systems, RL models—costs a fortune and delays launches.

The warning: Don't build for millions of users when you should build for your specific use case. Stop over-engineering agent systems; focus on model quality and evals, not complex orchestration.

Source: Multiple including Randall Hunt (Caylent) and Jan Siml (Orbital)

What Actually Works

1. Prompt Caching with Context Optimization

"Claude 3.7 to 4 was a drop-in replacement with zero regressions. We optimize by putting variable information at the bottom of prompts so caching remains effective."

The technique: Place dynamic content at the end of prompts; variable information near the top invalidates the cached prefix. Batch inference on Bedrock provides a 50% discount across all models.

Timestamp: ~45:00 in "POC to PROD" - Randall Hunt, Caylent
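A small sketch of the layout rule: keep static instructions and examples as a stable prefix and append per-request data at the end. The prompt contents here are illustrative; the point is that any change near the top of the prompt changes the cached prefix on every request.

```python
# Cache-friendly prompt assembly: everything static (instructions, schema,
# few-shot examples) goes first so the provider can cache the shared prefix;
# per-request data goes last. Contents are illustrative placeholders.
STATIC_PREFIX = """You are a contract-review assistant.
Follow the output schema exactly.

<examples>
...long, unchanging few-shot examples...
</examples>
"""

def build_prompt(document_text: str, user_question: str) -> str:
    # Variable information sits at the bottom; inserting it at the top
    # would invalidate the cached prefix on every request.
    return (
        STATIC_PREFIX
        + "\n<document>\n" + document_text + "\n</document>\n"
        + "\nQuestion: " + user_question
    )
```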

2. Progressive Model Rollout with Feature Flags

Use feature flags to progressively roll out new models to limited users, gather feedback, fix issues, then expand to 100% based on feedback volume.

The benefit: Mitigates risk from new model deployments while staying at the frontier of AI capabilities. Catch issues before they affect all users.

Timestamp: ~28:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson, Orbital
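A minimal sketch of percentage-based rollout keyed on a stable user ID, so the same users consistently see the new model while the percentage ramps up. A production system would normally use a feature-flag service; the rollout table here is an assumption.

```python
# Deterministic percentage rollout: hash the user ID into 100 buckets and
# route the first N% of buckets to the new model. Ramp N up as feedback
# stays positive (10 -> 50 -> 100).
import hashlib

ROLLOUT_PERCENT_NEW_MODEL = 10
NEW_MODEL = "new-model"        # placeholder model identifiers
CURRENT_MODEL = "current-model"

def pick_model(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return NEW_MODEL if bucket < ROLLOUT_PERCENT_NEW_MODEL else CURRENT_MODEL

print(pick_model("user-123"))  # stable per user across requests
```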

3. Separate API and Compute Layers with Redis Streams

"Our architecture keeps API layer separate from compute, using Redis streams for all communication. Enables resumability and user navigation."

The result: Users can refresh pages, navigate away, and have errors handled transparently without losing work. The API and compute layers scale independently.

Timestamp: ~32:00 in "How agents broke app-level infrastructure" - Evan Boyle
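A minimal sketch of the pattern with redis-py streams: workers append intermediate status to a per-run stream, and the API layer replays it from the client's last seen ID, so a refreshed page resumes instead of losing the run. Stream and field names are illustrative.

```python
# API/compute separation over Redis streams: the stream is the durable
# record of a long-running agent run.
import json
import redis

r = redis.Redis()

def publish_step(run_id: str, step: str, payload: dict) -> None:
    # Worker side: append progress as the run executes.
    r.xadd(f"run:{run_id}", {"step": step, "payload": json.dumps(payload)})

def read_progress(run_id: str, last_id: str = "0-0"):
    # API side: replay everything after last_id, so a reconnecting client
    # picks up exactly where it left off.
    entries = r.xread({f"run:{run_id}": last_id}, block=5000, count=100)
    for _stream, messages in entries or []:
        for message_id, fields in messages:
            yield message_id, fields
```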

4. Vector Search Quantization (32x Density)

"Single-bit quantization achieves 32x vector density with ~95% precision retention. Use reranking to regain lost precision. Can store original precision data alongside compressed version for 'oversampling'."

Business impact: Dramatically reduces infrastructure costs for large-scale RAG systems. Store the compressed version for search; keep the original for reranking.

Source: Azure AI Search team (via Randall Hunt)
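A NumPy-only sketch of single-bit quantization with full-precision reranking. The random corpus is purely illustrative, and a real system would put an ANN index on top, but the compress-then-rerank ("oversampling") flow is the same. The 32x figure comes from packing 32-bit floats into 1 bit per dimension.

```python
# Single-bit quantization + rerank: search the 1-bit index by Hamming
# distance, then rescore the shortlist against the original float vectors.
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    # 1 bit per dimension (the sign), packed into bytes: 32x smaller than float32.
    return np.packbits(vectors > 0, axis=-1)

def hamming_search(query_bits, index_bits, top_k=100):
    # Cheap first pass over the compressed index.
    distances = np.unpackbits(query_bits ^ index_bits, axis=-1).sum(axis=-1)
    return np.argsort(distances)[:top_k]

def rerank(query, candidates, full_vectors, top_k=10):
    # "Oversampling": rescore the shortlist at original precision.
    scores = full_vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:top_k]]

corpus = np.random.randn(10_000, 256).astype(np.float32)   # illustrative data
query = np.random.randn(256).astype(np.float32)
corpus_bits, query_bits = binarize(corpus), binarize(query)
shortlist = hamming_search(query_bits, corpus_bits)
top = rerank(query, shortlist, corpus)
```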

5. Lakehouse Architecture for AI Data

"Store all AI data (text, embeddings, images, video, audio) in one place on object store, enabling search, analytics, and training."

The benefit: A single source of truth with compute/storage separation. 3-4 billion vectors indexed in under 3 hours with GPU indexing, at a fraction of the cost.

Timestamp: ~38:00 in "Scaling Enterprise-Grade RAG" - Chang She, LanceDB
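A minimal sketch of the single-store idea using LanceDB's Python client (method names as in recent versions of the library; the schema, path, and data are illustrative assumptions).

```python
# Lakehouse-style storage: embeddings and raw fields live together in one
# table on object-store-friendly Lance files, serving search and bulk reads.
import lancedb

db = lancedb.connect("./lance_data")   # could also point at an s3:// URI
table = db.create_table(
    "documents",
    data=[
        {"id": 1, "text": "lease agreement", "vector": [0.1] * 8},
        {"id": 2, "text": "purchase contract", "vector": [0.2] * 8},
    ],
)

# The same table serves vector search, filtered analytics, and training reads.
hits = table.search([0.15] * 8).limit(1).to_list()
print(hits[0]["text"])
```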

Real-World Outcomes

99.7% Cost Reduction

GPT-4: $30 → $2 per million tokens in 18 months. Inference costs collapsed from 2022-2024.

Source: Multiple including Sarah Guo (Conviction) and Donald Hruska (Retool)

20B Tokens/Month

Orbital scaled from zero to 20 billion tokens monthly in 18 months, achieving multiple seven-figure ARR.

Timestamp: ~12:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson

50% Cost Reduction

Disaggregated inference (16 H100s) achieves 2x tokens per second per GPU at fixed latency.

Timestamp: ~15:00 in "Hacking the Inference Pareto Frontier" - Kyle Kranen, NVIDIA

5x Faster: Edge vs Cloud

Cartesia's edge models are 5x faster than cloud due to eliminated network latency for voice AI.

Timestamp: ~22:00 in "Serving Voice AI at Scale" - Arjun Desai

60% Price-Performance

AWS custom silicon (Trainium/Inferentia) offers 60% improvement over Nvidia GPUs.

Timestamp: ~25:00 in "POC to PROD" - Randall Hunt

$0.7M → $20M ARR in 60 Days

Bolt.new scaled from $0.7M to $20M ARR in 60 days with 15 people, using AI-native development.

Timestamp: "Bolt.new: How we scaled $0-20m ARR in 60 days" - Eric Simons

$5,000 → $500 Analysis

Prompt caching reduced 10,000 sales call analysis costs from $5,000 to $500 (90% reduction).

Timestamp: ~5:00 in "Analyzing 10,000 Sales Calls With AI In 2 Weeks" - Charlie Guo

32x Vector Density

Single-bit quantization achieves 32x storage reduction with ~95% precision retention for RAG systems.

Source: Azure AI Search team