AI Engineering Workshop

LLM Quality Optimization Bootcamp

47% Better Accuracy at 200x Lower Cost

"Fine-tuning is not just about improving accuracy—it's about dramatically reducing costs while maintaining quality. LoRA makes this accessible to everyone."

— Thierry Moreau (Co-founder, OctoAI)

  • 200x cost reduction: $30 → $0.15 per 1M tokens
  • 0.97 accuracy vs. a 0.68 baseline (43% improvement)
  • 47% better quality, proven in production

The Problem: Why GenAI Projects Stall

Common Stalling Points

  • High API Costs

    Relying on closed-source models like GPT-4 can cost $30+ per 1M tokens

  • Inconsistent Quality

    Base models lack domain-specific knowledge, leading to hallucinations and errors

  • Complex Fine-Tuning

    Full fine-tuning requires massive compute and ML expertise

The Solution: LoRA Fine-Tuning

Low-Rank Adaptation (LoRA) enables efficient fine-tuning by training only a small fraction of parameters. This dramatically reduces computational costs while achieving comparable or better quality than full fine-tuning.
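The parameter savings are easy to see with concrete numbers. The sketch below (illustrative sizes, not figures from the workshop) compares a full d×d weight update against a rank-r LoRA update:

```python
def full_update_params(d: int) -> int:
    # Full fine-tuning touches every entry of the d x d weight matrix.
    return d * d

def lora_update_params(d: int, r: int) -> int:
    # LoRA trains two low-rank factors: B (d x r) and A (r x d).
    return d * r + r * d

d, r = 4096, 16                    # hypothetical hidden size and LoRA rank
full = full_update_params(d)       # 16,777,216 trainable values
lora = lora_update_params(d, r)    # 131,072 trainable values
print(f"LoRA trains {lora / full:.2%} of the parameters")  # 0.78%
```

At rank 16 on a 4096-wide layer, LoRA trains under 1% of the weights, which is why a single GPU is typically enough.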

Crawl-Walk-Run Framework

1. Crawl: Establish Baseline

Start with simple prompting to establish a baseline and understand the problem space.

Key Actions:

  • Use closed-source models (GPT-4, Claude) for initial exploration
  • Collect a diverse dataset of examples
  • Define clear evaluation metrics
  • Document current performance and costs
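Defining evaluation metrics and documenting costs can be as simple as one scoring loop. The sketch below is a minimal illustration (not from the workshop); `call_model` is a hypothetical stand-in for whatever API client you use, returning the output text and tokens consumed:

```python
def evaluate_baseline(examples, call_model, cost_per_1m_tokens=30.0):
    """Score a prompting baseline: accuracy plus estimated API cost.

    `examples` is a list of (prompt, expected_output) pairs; `call_model`
    is a hypothetical stand-in returning (output_text, tokens_used).
    """
    correct, tokens = 0, 0
    for prompt, expected in examples:
        output, used = call_model(prompt)
        correct += int(output.strip() == expected.strip())
        tokens += used
    accuracy = correct / len(examples)
    cost = tokens / 1_000_000 * cost_per_1m_tokens
    return accuracy, cost

# Fake model for illustration: returns a canned answer and "uses" 100 tokens.
fake = lambda prompt: ("PARIS", 100)
acc, cost = evaluate_baseline([("Capital of France?", "PARIS")], fake)
print(acc, round(cost, 4))  # 1.0 0.003
```

Recording accuracy and cost together at the Crawl stage gives you the baseline that later stages are measured against.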
2. Walk: Optimize Prompting

Improve quality through better prompts before moving to fine-tuning.

Key Actions:

  • Experiment with few-shot examples
  • Refine system prompts
  • Implement retrieval-augmented generation (RAG)
  • Test on open-source models (Llama, Mistral)
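A few-shot prompt is just system instructions, worked examples, then the query. A plain-string sketch (illustrative only; chat APIs take a messages list instead):

```python
def build_few_shot_prompt(system: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: instructions, worked examples, then the query."""
    parts = [system, ""]
    for text, redacted in examples:
        parts += [f"Input: {text}", f"Output: {redacted}", ""]
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    system="Redact all PII with [REDACTED]. Preserve everything else.",
    examples=[("Call Jane Doe at 555-0100.", "Call [REDACTED] at [REDACTED].")],
    query="Email bob@example.com about the invoice.",
)
print(prompt)
```

Ending the prompt with a bare "Output:" cues the model to complete in the same format as the examples.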
3. Run: Fine-Tune with LoRA

Achieve production-ready performance with cost-effective fine-tuning.

Key Actions:

  • Prepare high-quality training dataset (100-1000 examples)
  • Use LoRA for efficient fine-tuning
  • Validate with held-out test set
  • Deploy with optimized serving infrastructure
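Preparing the dataset and holding out a test set can be sketched in a few lines (example data and sizes are invented for illustration):

```python
import json
import random

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle and split examples into train and held-out test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Each training example is one JSONL record: the document and its redacted form.
examples = [
    {"input": f"Contact person {i}", "output": "Contact [REDACTED]"}
    for i in range(500)
]
train, test = split_dataset(examples)
print(len(train), len(test))  # 400 100
jsonl_line = json.dumps(train[0])  # one line of the training file
```

A fixed seed makes the split reproducible, so accuracy numbers from different fine-tuning runs stay comparable.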

Case Study: PII Redaction

The Challenge

Automatically redact personally identifiable information (PII) from documents—names, emails, phone numbers, SSNs, addresses—while maintaining document readability and accuracy.
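A single annotated pair makes the task concrete. The record below is an invented illustration of the kind of annotation such a dataset needs:

```python
# One hypothetical training record for PII redaction (invented example data).
record = {
    "input": "Invoice for John Smith, SSN 123-45-6789, reachable at john@acme.com.",
    "output": "Invoice for [NAME], SSN [SSN], reachable at [EMAIL].",
}

# The redacted text must stay readable: same structure, placeholders for PII.
assert record["output"].count("[") == 3  # three PII spans replaced
print(record["output"])
```

Typed placeholders like `[NAME]` and `[SSN]` preserve document readability, which is part of the challenge the case study describes.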

Results with LoRA Fine-Tuning

Accuracy Score

0.68 → 0.97

43% improvement

Cost per 1M Tokens

$30 → $0.15

200x reduction

Key Insight: The fine-tuned model not only achieved higher accuracy but also dramatically reduced costs, making it viable for production deployment at scale.

Implementation Approach

  1. Data Preparation: Created a dataset of 500+ annotated examples of PII in context

  2. LoRA Fine-Tuning: Trained an open-source model with rank=16, alpha=32

  3. Validation & Testing: Evaluated on a held-out set with precision/recall metrics

  4. Deployment: Served through OctoAI infrastructure for low-latency inference
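Step 3's precision/recall evaluation can be sketched at the span level. This is a simplified illustration (exact-match strings rather than character offsets, and invented example spans):

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Span-level precision/recall for PII detection.

    precision = fraction of predicted spans that are real PII;
    recall    = fraction of real PII spans that were caught.
    """
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

gold = {"John Smith", "123-45-6789", "john@acme.com"}
predicted = {"John Smith", "john@acme.com", "Acme Corp"}  # one miss, one false hit
p, r = precision_recall(predicted, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

For redaction, recall usually matters most: a missed SSN is worse than an over-redacted company name.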

LoRA vs Full Fine-Tuning

Full Fine-Tuning

Requires updating all model parameters (billions)

Massive compute requirements (multiple GPUs)

High storage costs (multiple model copies)

Complex infrastructure and tooling

Requires deep ML expertise

Cost: $100K+ for training

LoRA Fine-Tuning

Trains only adapter layers (0.1-1% of params)

Single GPU sufficient for training

Minimal storage (MB vs GB)

Simple deployment with base model

Accessible to non-ML engineers

Cost: $100-500 for training

How LoRA Works: LoRA adds small trainable adapter matrices to each layer. During training, only these adapters are updated while the base weights stay frozen. At inference time, the adapter weights can be merged into the base weights, preserving the original model architecture while incorporating the learned behavior.
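The merge step can be verified numerically on a toy layer. The sketch below (toy dimensions; `alpha=32` matches the case study's hyperparameter) shows that the unmerged adapter path and the merged weight produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # toy dimensions; real layers are much larger
alpha = 32                       # LoRA scaling hyperparameter (as in the case study)
W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(r, d))      # trained adapter factor A (r x d)
B = rng.normal(size=(d, r))      # trained adapter factor B (d x r)
x = rng.normal(size=d)           # an input activation

scale = alpha / r
# Unmerged inference: base path plus scaled adapter path.
y_adapter = W @ x + scale * (B @ (A @ x))
# After merging: fold the adapter into a single weight matrix.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # identical outputs, original architecture
```

Because `W_merged` has the same shape as `W`, the merged model serves at exactly the base model's inference cost, with no extra adapter computation.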

Tools & Platforms

OctoAI

Inference & Serving Platform

Optimized infrastructure for serving fine-tuned models with low latency and high throughput.

Key Features:

  • Auto-scaling infrastructure
  • Support for LoRA adapters
  • Competitive pricing
  • Easy API integration

OpenPipe

Fine-Tuning Platform

End-to-end platform for training and deploying fine-tuned LLMs with minimal ML expertise.

Key Features:

  • Automated data preprocessing
  • LoRA training out of the box
  • Experiment tracking
  • One-click deployment

Key Takeaways

Follow Crawl-Walk-Run

Don't jump straight to fine-tuning. Start with simple prompting to establish baselines, optimize with better prompts and RAG, then fine-tune for production performance.

LoRA is Cost-Effective

LoRA fine-tuning can reduce costs by 200x while improving quality. The PII redaction case study showed $30 → $0.15 per 1M tokens with 43% better accuracy.

Data Quality Matters

The quality of your training dataset directly impacts model performance. Invest time in curating high-quality, diverse examples that represent your use case.

Use the Right Tools

Platforms like OctoAI and OpenPipe abstract away the complexity of fine-tuning and serving, making it accessible to engineers without deep ML expertise.

Meet the Speakers

Thierry Moreau

Co-founder, OctoAI

Expert in ML infrastructure and optimization. Leading the development of platforms that make fine-tuning accessible to all engineers.

Pedro Torruella

AI Engineer

Specialist in LLM fine-tuning and production deployment. Practical experience implementing LoRA for real-world applications.

Source Video

LLM Quality Optimization Bootcamp

Thierry Moreau (Co-founder, OctoAI) & Pedro Torruella • AI Engineer Conference

Date: June 26, 2024
Tags: fine-tuning, quality-optimization, cost-reduction, LoRA, OctoAI, OpenPipe

Research Note: This highlight is based on the "LLM Quality Optimization Bootcamp" workshop from the AI Engineer Conference. The content provides a practical framework for fine-tuning LLMs with LoRA, including real-world case studies and tool recommendations.

Research sourced from AI Engineer Conference workshop. Learn how to achieve 47% better accuracy at 200x lower cost through LoRA fine-tuning, with practical examples and tool recommendations.