The economic reality of deploying AI at scale: Inference costs have collapsed 99.7% in two years, but infrastructure complexity has exploded. Understanding the new economics of AI deployment.
The New Economic Reality
"GPT-4 went from $30 per million tokens to $2 in about 18 months. The distilled versions are now 10 cents. Last year's model is a commodity."
— Sarah Guo, Conviction (Timestamp: ~15:00 in "State of Startups and AI 2025")
From 2022 to 2024, the cost of GPT-4-class capability dropped by roughly 99.7% per token: GPT-4 itself fell from $30 to $2 per million tokens, and distilled versions now cost just 10 cents per million tokens ($30 down to $0.10 is the 99.7% figure).
Business impact: Retool charges $3/hour for their cheapest agent. Orbital scaled to 20 billion tokens/month with multiple seven-figure ARR. Inference cost is no longer the primary constraint for most applications.
Source: Multiple sources including Sarah Guo (Conviction) and Donald Hruska (Retool)
AWS Trainium and Inferentia offer ~60% better price-performance than Nvidia GPUs. Amazon announced 40% price reduction on P4 and P5 instances. Cloud providers are heavily investing in custom chips.
Trade-off: less high-bandwidth memory (HBM) per chip, but significantly cheaper. Requires the Neuron SDK (similar to XLA for TPUs). "Two H100s outperform four A100s at the same cost" - Humza Iqbal, Snorkel AI.
Timestamp: ~25:00 in "POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments" - Randall Hunt, Caylent
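For a sense of what adopting the Neuron SDK involves in practice, here is a minimal sketch of ahead-of-time compilation with torch-neuronx for an Inferentia2 instance; the model name, sequence length, and file name are illustrative placeholders, not anything from the talk.

```python
# Sketch: compile a Hugging Face model for AWS Inferentia2 with the Neuron SDK
# (torch-neuronx). Assumes an inf2 instance with the Neuron SDK installed; the
# model and input shape below are placeholders.
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

# Neuron compiles ahead of time against fixed shapes, so pad to a known length
# (dynamic shapes generally mean bucketing several compiled variants).
example = tokenizer("The price-performance trade-off looks good.",
                    padding="max_length", max_length=128, return_tensors="pt")
neuron_model = torch_neuronx.trace(model, (example["input_ids"], example["attention_mask"]))

torch.jit.save(neuron_model, "model_neuron.pt")  # reload later with torch.jit.load
```

The ahead-of-time compile step is the main workflow difference from CUDA-based serving; once traced, the artifact loads like any other TorchScript module.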
Separating the prefill phase (compute-bound) from the decode phase (memory-bound) enables specialized worker allocation. With 16 H100s, running disaggregated versus aggregated at fixed latency yields 2x tokens per second per GPU, which roughly halves the cost per token.
Critical insight: disaggregation only benefits interactive applications (20-200 TPS) in the middle of the latency/throughput spectrum. Workloads with short input sequence lengths (low ISL) or extreme high-throughput requirements see minimal benefit. Analyze your workload patterns first.
Timestamp: ~15:00 in "Hacking the Inference Pareto Frontier" - Kyle Kranen, NVIDIA
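As a toy illustration of the disaggregation idea (not NVIDIA's implementation, which moves KV caches between GPUs over NVLink/RDMA), the sketch below routes requests through separate prefill and decode worker pools so each pool can be sized and batched for its own bottleneck; all names are invented.

```python
# Toy sketch of disaggregated serving: prefill workers (compute-bound) build a
# KV cache in one large forward pass; decode workers (memory-bound) then stream
# tokens from it. Purely illustrative; queues stand in for GPU-to-GPU transfer.
import time
from dataclasses import dataclass, field
from queue import Queue
from threading import Thread

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: str = ""
    tokens: list = field(default_factory=list)

prefill_q: Queue = Queue()
decode_q: Queue = Queue()

def prefill_worker():
    while True:
        req = prefill_q.get()
        req.kv_cache = f"kv({len(req.prompt)} chars)"  # pretend: one big forward pass
        decode_q.put(req)                              # hand off to the decode pool

def decode_worker():
    while True:
        req = decode_q.get()
        for i in range(req.max_new_tokens):            # pretend: small autoregressive steps
            req.tokens.append(f"tok{i}")
        print(f"done: {len(req.tokens)} tokens using {req.kv_cache}")

# The two pools are sized independently, which is the whole point of disaggregation.
for _ in range(2):
    Thread(target=prefill_worker, daemon=True).start()
for _ in range(6):
    Thread(target=decode_worker, daemon=True).start()

prefill_q.put(Request(prompt="Summarize this lease agreement ...", max_new_tokens=5))
time.sleep(0.5)  # let the daemon workers drain the queues before exiting
```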
"Good data consistently beats great models... GPT-4 is 60 times more expensive and order of magnitude slower compared to GPT-4o mini." Companies achieving millions in ARR by investing in data quality rather than chasing the latest models.
Jan Siml's insight: simply adding more triggers to alert users and digging deeper into user needs outperformed model upgrades. Orbital achieved seven-figure ARR by focusing on data, not model sophistication.
Timestamp: ~17:00 in "Stop Ordering AI Takeout" - Jan Siml
"Running our models on edge are about five times faster than if you were to roundtrip... we've broken that." Network latency + cloud speed is now slower than edge deployment for real-time voice applications.
The breakthrough: Cartesia's edge models achieve 40ms latency (2.5x faster than cloud). State Space Models (SSMs) generate each token in constant time from a fixed-size state, versus the O(n²) attention cost transformers pay over a growing context - critical for real-time applications.
Timestamp: ~35:00 in "Serving Voice AI at Scale" - Arjun Desai, Cartesia
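A toy numpy sketch of why the per-token cost differs, with arbitrary dimensions and random weights: a transformer decode step attends over every cached token, while an SSM-style recurrence only updates a fixed-size state.

```python
# Toy illustration (not Cartesia's models): per-token work for transformer
# attention grows with context length; an SSM recurrence stays constant.
import numpy as np

d = 64                                          # hidden size (arbitrary)
rng = np.random.default_rng(0)

def transformer_decode_step(keys, values, x):
    """Attend over all n cached keys/values: O(n) work for the next token."""
    scores = keys @ x
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def ssm_decode_step(state, x, A, B, C):
    """Update a fixed-size recurrent state: O(1) work per token."""
    state = A @ state + B @ x
    return state, C @ state

A = rng.normal(size=(d, d)) * 0.01
B, C = np.eye(d), np.eye(d)
keys = rng.normal(size=(10_000, d))             # 10k tokens of accumulated context
values = rng.normal(size=(10_000, d))
x, state = rng.normal(size=d), np.zeros(d)

_ = transformer_decode_step(keys, values, x)    # cost scales with the 10k cached rows
state, _ = ssm_decode_step(state, x, A, B, C)   # cost independent of context length
```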
"Inference clusters at night are now also training clusters - they run reinforcement learning with verifiable rewards, generating trajectories and keeping good tokens when there's low utilization."
Dylan Patel's insight: The line between inference and training infrastructure is blurring. GPU utilization can be dramatically improved by using inference clusters for RL during off-peak hours.
Timestamp: ~12:30 in "The Geopolitics of AI Infrastructure" - Dylan Patel, SemiAnalysis
"You have over 1,000 domain-specific prompts. More prompts equals more prompt tax. When new models drop, you face massive migration effort." Orbital scaled from 0 to 20 billion tokens/month, but prompt management became exponential burden.
Andrew Thompson's solution: use slower "System 2" reasoning models to help migrate prompts (a sketch follows below). Use feature flags for progressive rollout. Bet on models getting smarter over time rather than on perfect prompt optimization.
Timestamp: ~18:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson, Orbital
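A minimal sketch of that migration idea, using the OpenAI Python client purely as an example; the model name, instructions, and file path are placeholders, not Orbital's setup.

```python
# Sketch: use a stronger reasoning ("System 2") model to rewrite an existing
# production prompt for a new target model, then review the result before any
# rollout. Model name and prompt path are placeholders.
from openai import OpenAI

client = OpenAI()

MIGRATION_INSTRUCTIONS = """You are migrating production prompts to a newer model.
Rewrite the prompt below so it keeps the same task, output format, and constraints,
but drops workarounds that were only needed for the old model. Return only the
rewritten prompt."""

def migrate_prompt(old_prompt: str, reasoning_model: str = "o3") -> str:
    resp = client.chat.completions.create(
        model=reasoning_model,
        messages=[{"role": "user",
                   "content": MIGRATION_INSTRUCTIONS + "\n\n---\n\n" + old_prompt}],
    )
    return resp.choices[0].message.content

# Run over the prompt library and write candidates next to the originals for review:
# new_prompt = migrate_prompt(open("prompts/title_check.txt").read())
```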
"Buy SaaS for exploration, build in-house once the workflow is yours. Internal systems can be deeply optimized for one specific job, ship same-day tweaks, and run on already-paid infra."
— Jan Siml (Orbital)
Timestamp: "Stop Ordering AI Takeout"
"Buying makes sense for cross-industry best practices and vendor integrations. Building requires evaluating what's actually needed versus chasing every Twitter trend."
— Randall Hunt (Caylent)
Timestamp: "POC to PROD"
Resolution criteria: Build when you own the data and workflow. Buy for vendor integrations and cross-industry best practices. The key is understanding what's core vs. commodity.
"As models improved from Claude 3.5 to 4, prompt engineering proved 'unreasonably effective.' Fine-tuning may no longer be necessary for many use cases."
— Randall Hunt (Caylent)
"Fine-tuning essential for niche domains. OCaml has more training data inside Jane Street than exists publicly."
— Jane Street engineers
The middle path: Use prompt engineering first. Fine-tune only when you have domain-specific data that doesn't exist in public training sets. Most companies over-invest in fine-tuning too early.
"MCP is industry standard. We built MCP gateway for all internal integrations. 'MCP is just JSON streams.'"
— Samuel Colvin (Pydantic)
Timestamp: "MCP is all you need"
"It's in beta, clients break constantly. 'You cannot just proxy OpenAPI—must design for agent context.'"
— David Cramer (Sentry)
Timestamp: "MCP Is Not Good Yet"
Reality: MCP is the right direction but still maturing. Use it with caution in production. Build gateway patterns for centralized authentication and observability.
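To make the gateway pattern concrete: MCP messages are JSON-RPC 2.0, so a thin proxy can check auth and log every call before forwarding to internal MCP servers. The sketch below illustrates that pattern only; it is not Pydantic's gateway, and the routes, header, and upstream registry are assumptions.

```python
# Sketch of an MCP gateway: since MCP traffic is JSON-RPC 2.0, a thin proxy can
# centralize auth and observability before forwarding to internal MCP servers.
# Illustrative only; paths, headers, and the upstream registry are assumptions.
import json
import logging
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAMS = {"sentry": "http://localhost:9001/mcp"}   # internal MCP servers (assumed)
log = logging.getLogger("mcp-gateway")

class Gateway(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("Authorization") != "Bearer internal-token":  # central auth
            self.send_response(401); self.end_headers(); return

        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        rpc = json.loads(body)
        log.info("method=%s id=%s", rpc.get("method"), rpc.get("id"))     # observability

        upstream = UPSTREAMS[self.path.strip("/")]                        # e.g. /sentry
        req = urllib.request.Request(upstream, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            out = resp.read()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(out)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    HTTPServer(("0.0.0.0", 8080), Gateway).serve_forever()
```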
"The US has a 63 gigawatt shortfall of power for planned data centers. Utility companies are regulated monopolies that get to do whatever they want." 100GW needed, only 44GW available.
The constraint: the AI infrastructure buildout is limited by power availability, not GPUs or capital. The Middle East is building massive capacity: G42 (5GW campus), Data Volt (2GW). For context, xAI's entire infrastructure is 200MW.
Timestamp: ~34:00 in "The Geopolitics of AI Infrastructure" - Dylan Patel
Mitigation: Consider building data centers in regions with power surplus. Advocate for regulatory reform. Plan for power as primary constraint, not compute.
"Redis vector search is extremely fast. The bad news is that it is extremely expensive because it has to sit in RAM. Be prepared to blow up your RAM to store vector indexes."
The trap: Performance vs. cost tradeoff not always obvious. Infrastructure bills can explode when using in-memory vector stores at scale.
Timestamp: ~29:00 in "POC to PROD" - Randall Hunt
Alternative: use PostgreSQL or OpenSearch for vector search with HNSW indexes that live on disk, which keeps RAM requirements manageable at scale.
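A minimal sketch of that alternative using pgvector's HNSW index through psycopg; the connection string, table, and dimensions are placeholders.

```python
# Sketch: disk-backed vector search with PostgreSQL + pgvector (HNSW index),
# instead of keeping the whole index in Redis RAM. Names are placeholders.
import psycopg  # pip install "psycopg[binary]"; requires the pgvector extension

conn = psycopg.connect("postgresql://localhost/ragdb", autocommit=True)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(1536)
        )""")
    # The HNSW index lives on disk and is paged in as needed; only the hot parts
    # need RAM, unlike a fully in-memory index.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)""")

def top_k(query_embedding: list[float], k: int = 5):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_embedding), k),
        )
        return cur.fetchall()
```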
"LM applications are really built on shoddy foundations. In the previous era of infrastructure, if you had a request that took a couple of seconds, your on call was getting paged."
The problem: Web 2.0 infrastructure designed for millisecond responses breaks with minute- or hour-long agent workflows. Most serverless platforms time out after 5 minutes and don't support streaming.
Timestamp: ~08:00 in "How agents broke app-level infrastructure" - Evan Boyle
Solution: Separate API and compute layers. Use Redis streams for resumable intermediate status. Plan for workflows running 60+ minutes with error boundaries and retry logic.
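A toy sketch of those error boundaries and retries, assuming nothing about the speaker's stack: each step is retried with backoff and checkpointed, so an hour-long workflow resumes where it stopped instead of starting over.

```python
# Toy sketch: run a long agent workflow as independent steps, each wrapped in a
# retry "error boundary", checkpointing after every step so a crash or timeout
# resumes where it left off instead of restarting an hour of work. Illustrative.
import json
import time
from pathlib import Path

CHECKPOINT = Path("workflow_state.json")

def with_retries(fn, attempts=3, backoff=2.0):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))            # exponential backoff between retries

def run_workflow(steps):
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"done": []}
    for name, fn in steps:
        if name in state["done"]:
            continue                                  # finished in a previous run
        with_retries(fn)
        state["done"].append(name)
        CHECKPOINT.write_text(json.dumps(state))      # durable progress marker

# Steps would be (name, callable) pairs: fetch documents, run extraction, draft report...
run_workflow([("fetch", lambda: None), ("extract", lambda: None), ("draft", lambda: None)])
```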
"The number of times I see people defining a tool called get current date is infuriating." Following "Twitter recipes" for AI leads to giant evals, multi-agent systems, RL models—costs a fortune and delays launches.
The warning: Don't build for millions when you should build for your specific use case. Stop over-engineering agent systems. Focus on model quality and evals, not complex orchestration.
Source: Multiple including Randall Hunt (Caylent) and Jan Siml (Orbital)
"Claude 3.7 to 4 was a drop-in replacement with zero regressions. We optimize by putting variable information at the bottom of prompts so caching remains effective."
The technique: place dynamic content at the end of prompts; variable information near the top invalidates the cached prefix. Batch inference on Bedrock provides a 50% discount across all models.
Timestamp: ~45:00 in "POC to PROD" - Randall Hunt, Caylent
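A minimal sketch of that prompt layout, with invented content: the long, stable instructions stay at the top so the provider's cached prefix remains valid, and per-request data goes last.

```python
# Sketch: order prompt content so the long, stable prefix (instructions, schema,
# few-shot examples) comes first and per-request data comes last, letting
# provider-side prompt caching reuse the prefix. Content here is illustrative.
STATIC_SYSTEM = """You are a contract-review assistant.
Always answer with JSON: {"risk": "...", "clauses": [...]}.
<several thousand tokens of fixed instructions and few-shot examples>"""

def build_messages(document_text: str, user_question: str):
    return [
        # Identical on every call, so the cached prefix stays valid.
        {"role": "system", "content": STATIC_SYSTEM},
        # Changes every request, so it goes last instead of breaking the prefix.
        {"role": "user", "content": (
            "Question: " + user_question + "\n\n"
            "Document:\n" + document_text
        )},
    ]
```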
Use feature flags to progressively roll out new models to limited users, gather feedback, fix issues, then expand to 100% based on feedback volume.
The benefit: Mitigated risk from new model deployments while staying at the frontier of AI capabilities. Catch issues before they affect all users.
Timestamp: ~28:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson, Orbital
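A minimal sketch of such a flag, with illustrative model names: a stable hash of the user id assigns each user a bucket, so raising the rollout percentage only adds users to the new model and never flips existing ones back and forth.

```python
# Sketch: progressive model rollout behind a feature flag. A stable hash of the
# user id puts each user in a fixed bucket; raising "percent" only adds users.
# Model names and the 10% starting point are illustrative.
import hashlib

ROLLOUT = {"new_model": "claude-sonnet-4", "old_model": "claude-3-7-sonnet", "percent": 10}

def model_for(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return ROLLOUT["new_model"] if bucket < ROLLOUT["percent"] else ROLLOUT["old_model"]

# Gather feedback from the initial cohort, fix regressions, then raise "percent"
# toward 100 as feedback volume stays healthy.
print(model_for("user-42"))
```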
"Our architecture keeps API layer separate from compute, using Redis streams for all communication. Enables resumability and user navigation."
The result: users can refresh pages or navigate away, and errors are handled transparently without losing work. The API and compute layers scale independently.
Timestamp: ~32:00 in "How agents broke app-level infrastructure" - Evan Boyle
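A minimal sketch of that pattern with redis-py streams; stream names and payloads are placeholders, not the talk's actual schema.

```python
# Sketch: compute workers publish intermediate agent status to a Redis stream;
# the API layer replays from whatever id the client last saw, so refreshing the
# page or reconnecting resumes the run instead of losing it. Names illustrative.
import json
import redis

r = redis.Redis()

def publish_status(run_id: str, step: str, payload: dict) -> None:
    """Called from the compute side after each agent step."""
    r.xadd(f"run:{run_id}", {"step": step, "payload": json.dumps(payload)})

def replay_status(run_id: str, last_seen_id: str = "0-0", block_ms: int = 5000):
    """Called from the API side; yields every event after last_seen_id."""
    while True:
        resp = r.xread({f"run:{run_id}": last_seen_id}, block=block_ms, count=50)
        if not resp:
            break                                     # no new events within block_ms
        for _stream, events in resp:
            for event_id, fields in events:
                last_seen_id = event_id
                yield event_id, fields

# publish_status("abc", "search", {"query": "lease term"})
# for event_id, fields in replay_status("abc"): ...
```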
"Single-bit quantization achieves 32x vector density with ~95% precision retention. Use reranking to regain lost precision. Can store original precision data alongside compressed version for 'oversampling'."
Business impact: Dramatically reduces infrastructure costs for large-scale RAG systems. Store compressed version for search, keep original for reranking.
Source: Azure AI Search team (via Randall Hunt)
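A toy numpy sketch of the compressed-search-then-rerank idea, with invented sizes: one bit per dimension gives the 32x reduction over float32, Hamming distance retrieves an oversampled candidate set, and the original vectors rerank it.

```python
# Toy sketch of binary quantization with reranking: store 1 bit per dimension
# (32x smaller than float32), retrieve an oversampled candidate set by Hamming
# distance, then rerank those candidates with the original full-precision vectors.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)    # original vectors
docs_bits = np.packbits(docs > 0, axis=1)                   # 768 bits -> 96 bytes/vector

def search(query: np.ndarray, k: int = 10, oversample: int = 4):
    q_bits = np.packbits(query > 0)
    # Hamming distance over the compressed index (cheap, approximate).
    hamming = np.unpackbits(docs_bits ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[: k * oversample]
    # Rerank the small candidate set with full-precision cosine similarity.
    cand = docs[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:k]]

top = search(rng.normal(size=768).astype(np.float32))
```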
"Store all AI data (text, embeddings, images, video, audio) in one place on object store, enabling search, analytics, and training."
The benefit: a single source of truth with compute/storage separation. 3-4 billion vectors indexed in under 3 hours with GPU indexing, at a fraction of the cost.
Timestamp: ~38:00 in "Scaling Enterprise-Grade RAG" - Chang She, LanceDB
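A minimal sketch of that pattern with the LanceDB Python client; the bucket path, schema, and data are placeholders, and a local directory works the same way for experimentation.

```python
# Sketch: keep text, embeddings, and pointers to media together in one Lance
# table on object storage, then search it directly. All values are placeholders.
import lancedb

db = lancedb.connect("s3://my-ai-lake/tables")     # or "./lance_data" locally
table = db.create_table(
    "documents",
    data=[
        {"id": "doc-1", "text": "Lease agreement ...",
         "image_uri": "s3://my-ai-lake/raw/doc-1.png",
         "vector": [0.01] * 768},
    ],
)

# The same table serves search, analytics, and training-data export.
hits = table.search([0.01] * 768).limit(5).to_list()
```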
GPT-4: $30 → $2 per million tokens in 18 months. Inference costs collapsed from 2022-2024.
Source: Multiple including Sarah Guo (Conviction) and Donald Hruska (Retool)
Orbital scaled from zero to 20 billion tokens monthly in 18 months, achieving multiple seven-figure ARR.
Timestamp: ~12:00 in "Buy Now, Maybe Pay Later" - Andrew Thompson
Disaggregated inference (16 H100s) achieves 2x tokens per second per GPU at fixed latency.
Timestamp: ~15:00 in "Hacking the Inference Pareto Frontier" - Kyle Kranen, NVIDIA
Cartesia's edge models are 5x faster than cloud because they eliminate network round-trip latency for voice AI.
Timestamp: ~22:00 in "Serving Voice AI at Scale" - Arjun Desai
AWS custom silicon (Trainium/Inferentia) offers ~60% better price-performance than Nvidia GPUs.
Timestamp: ~25:00 in "POC to PROD" - Randall Hunt
Bolt.new scaled from $0.7M to $20M ARR in 60 days with 15 people, using AI-native development.
Timestamp: "Bolt.new: How we scaled $0-20m ARR in 60 days" - Eric Simons
Prompt caching reduced 10,000 sales call analysis costs from $5,000 to $500 (90% reduction).
Timestamp: ~5:00 in "Analyzing 10,000 Sales Calls With AI In 2 Weeks" - Charlie Guo
Single-bit quantization achieves 32x storage reduction with ~95% precision retention for RAG systems.
Source: Azure AI Search team
Speaker and timestamp reference:
Dylan Patel, SemiAnalysis: Multiple
Paul Gilbert, Arista Networks: Multiple
Randall Hunt, Caylent: Various
Jan Siml: Multiple
Arjun Desai (Cartesia) & Rohit Talluri (AWS): ~35:00
Kyle Kranen, NVIDIA: ~15:00
Evan Boyle: ~08:00
Philip Kiely & Yineng Zhang, Baseten: Various
Andrew Thompson, Orbital: ~18:00
Sarah Guo, Conviction: ~15:00
Calvin Qi (Harvey) & Chang She (Lance): ~38:00
Antje Barth, AWS: Multiple
Donald Hruska, Retool: Multiple
Nik Pash, Cline: Multiple
Mike Bursell: Multiple