
Llama 3 at 1,000 tokens/s on SambaNova AI Platform

Relle & Pedro • SambaNova

Comprehensive workshop on SambaNova's revolutionary Composition of Experts (CoE) architecture and custom RDU hardware, achieving unprecedented inference speeds of 1,000 tokens/second using just 16 chips while maintaining full precision. Includes hands-on RAG implementation with LlamaIndex.

"They run with 576 chips, we only do it with 16 chips basically and you actually conserve the full Precision, they do it with reduced precision. It's like weird to get a response that quickly... yeah, it feels like fake."
Watch (38:55)
  • 1,000 tokens/second throughput
  • 16 RDU chips
  • 0.09 s time to first token
  • 36x fewer chips than Groq

What is SambaNova?

Understanding the revolutionary AI infrastructure approach

SambaNova is an AI infrastructure company that has developed a fundamentally different approach to running large language models. Instead of traditional GPUs, they use custom-designed chips called RDUs (Reconfigurable Dataflow Units) optimized specifically for transformer workloads, combined with a novel "Composition of Experts" architecture that provides unprecedented flexibility and performance.

Composition of Experts (CoE)

Unlike monolithic models, SambaNova's CoE approach uses 92 different expert models that can be dynamically composed based on the task. This allows enterprises to:

  • Use preconfigured compositions for standard use cases (like Meta Llama 3)
  • Create custom compositions for specialized needs
  • Update individual experts without full model redeployment

Reconfigurable Dataflow Unit (RDU)

SambaNova's custom SN40L RDU chip features a three-tier memory architecture specifically designed for transformer inference:

  • On-chip memory: Fastest access for active computations
  • High Bandwidth Memory (HBM): Medium-speed for model parameters
  • DDR: Large capacity storage for up to 5 trillion parameters

Performance Breakthrough

Unprecedented inference speeds with full precision

The most striking aspect of SambaNova's approach is not just the raw speed, but the efficiency with which it achieves that speed. By using custom hardware designed specifically for transformers, they deliver performance that far exceeds traditional GPU-based systems while using significantly fewer resources.

  • Throughput - 1,000 tokens/s generated with full precision
  • First Token Latency - 0.09 seconds to generate the first token
  • Hardware Efficiency - 36x fewer chips than competitors (16 vs 576)

Direct Comparison: SambaNova vs Groq

"They run with 576 chips, we only do it with 16 chips basically and you actually conserve the full Precision, they do it with reduced precision. It's like weird to get a response that quickly... yeah, it feels like fake."
Watch (38:55)

Context: Pedro compares SambaNova's 16-chip configuration running at full precision with Groq's 576-chip setup using reduced precision. The speed is so fast that it "feels fake" to experienced developers.

Understanding Composition of Experts

How SambaNova's modular architecture revolutionizes AI deployment

The Composition of Experts (CoE) architecture represents a paradigm shift from monolithic LLMs to a modular, composable approach. This enables enterprises to be more agile and cost-effective in their AI deployments.

Key Advantages of CoE

  • Preconfigured Compositions - Ready-to-use expert combinations for common tasks like Meta Llama 3 Instruct, accessible through a single API endpoint by changing the x-expert parameter (see the sketch after this list)
  • Custom Compositions - Enterprises can create specialized expert combinations tailored to their specific use cases, fine-tuning individual components without retraining the entire model
  • Unified API Endpoint - All models accessible through a single CoE endpoint, simplifying integration and reducing architectural complexity compared to managing multiple model deployments
  • Continual Updates - Individual experts can be updated as new data or techniques become available without requiring full system redeployment, enabling faster iteration and improvement
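To make the unified-endpoint idea concrete, here is a minimal, hypothetical sketch. Only the x-expert header and the MetaLlama3-8B-Instruct expert name come from the workshop; the URL, payload shape, response field, and the second expert name are placeholders, so check SambaNova Studio's documentation for the real schema.

```python
# Minimal illustration of the unified CoE endpoint: one URL serves every expert,
# and the x-expert header picks which one answers.
# The URL, payload, and response field are placeholders, not SambaNova's documented API.
import requests

def ask(prompt: str, expert: str) -> str:
    resp = requests.post(
        "https://<your-sambanova-studio-endpoint>/generate",  # placeholder URL
        headers={"Authorization": "Bearer <API_KEY>", "x-expert": expert},
        json={"prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["completion"]  # assumed response field

# Switching experts is just a header change: no second endpoint, no redeployment.
print(ask("Summarize the SN40L memory hierarchy.", expert="MetaLlama3-8B-Instruct"))
print(ask("Classify this support ticket.", expert="<your-custom-expert>"))
```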

Hands-On: Building a RAG System

Practical implementation with LlamaIndex and ChromaDB

The workshop includes a complete hands-on demonstration of building a Retrieval-Augmented Generation (RAG) system using SambaNova's APIs. This practical example shows how to achieve production-ready performance with minimal code.

Why RAG Matters

Pedro explains the core value proposition of RAG:

"RAG is a technique that you can use to supplement LLMs with additional information from various sources to improve the model's response. It's very helpful if you want to use an off-the-shelf LLM to ask a question beyond its training data. Also, RAG can help reduce hallucinations in some contexts."
Watch (45:05)

Complete RAG Pipeline

  • Document Loading - Using the PyPDF2 and Unstructured libraries to ingest PDF documents (demonstrated with SambaNova's 15-page SN40L research paper)
  • Text Splitting - Recursive character text splitter with a chunk size of 1,200 characters and a 240-character overlap to maintain context
  • Vectorization - E5-large-v2 embedding model running on SambaNova hardware, embedding 89 chunks in just 4 seconds
  • Vector Storage - ChromaDB for efficient similarity search and retrieval
  • Query Processing - The user query is embedded and compared against stored chunks to find the top-k most relevant segments
  • Generation - Retrieved chunks are passed to Llama 3 along with the original query to generate accurate, context-aware responses (a condensed code sketch of this pipeline follows)
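The same flow can be expressed in a few dozen lines of LlamaIndex. The sketch below is not the workshop notebook: the PDF filename is a placeholder, a locally run Hugging Face copy of E5-large-v2 stands in for the SambaNova-hosted embedding model, SentenceSplitter stands in for the recursive character splitter, and generation uses whatever LLM Settings.llm points at (in the workshop, Llama 3 behind the CoE endpoint).

```python
# Minimal RAG sketch with LlamaIndex + ChromaDB (stand-ins noted above).
import chromadb
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# 1. Document loading (placeholder filename)
documents = SimpleDirectoryReader(input_files=["sn40l_paper.pdf"]).load_data()

# 2. Text splitting -- mirrors the workshop's 1200/240 setting,
#    though SentenceSplitter counts tokens rather than characters
Settings.node_parser = SentenceSplitter(chunk_size=1200, chunk_overlap=240)

# 3. Vectorization -- local E5-large-v2 as a stand-in for the hosted model
Settings.embed_model = HuggingFaceEmbedding(model_name="intfloat/e5-large-v2")

# 4. Vector storage in ChromaDB
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("sn40l")
storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection)
)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# 5-6. Query processing and generation with the top-k retrieved chunks
query_engine = index.as_query_engine(similarity_top_k=4)
print(query_engine.query("How does the SN40L's three-tier memory hierarchy work?"))
```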

Real-World Performance

Measured results from the live demonstration

The workshop includes live demonstrations showing actual performance metrics, not just theoretical claims. These numbers were achieved in real-time during the presentation.

Embedding Performance

"Yeah, so it took four seconds to embed the whole thing."
Watch (58:03)

89 chunks from a 15-page research paper embedded in just 4 seconds using the E5-large-v2 model on SambaNova hardware.

Generation Speed

"The response is instantaneous."
Watch (59:02)

Query responses generated so quickly that experienced developers describe it as feeling "fake" due to the unprecedented speed.

Hardware Architecture Deep Dive

"Our chip we call that an RDU or a reconfigurable data flow unit... we can store up to five trillion parameters on DDR... time to first token which is basically the input inference time of 0.09 seconds."
Watch (10:30)

Relle explains the three-tier memory architecture and the revolutionary time-to-first-token metric that enables sub-second response times for interactive applications.

Key Takeaways

Actionable insights for AI engineers and infrastructure teams

For AI Engineers

  • Unified API Access - SambaNova's CoE endpoint allows switching between different models by simply changing the x-expert parameter, enabling rapid experimentation and deployment
  • Prompt Formatting Matters - Open-source models like Llama 3 require special tokens (begin/end markers) for proper formatting, available from Meta's model card (see the template sketch after this list)
  • RAG is Production-Ready - The complete pipeline from document ingestion to query response can be implemented in under 100 lines of Python code with SambaNova's optimized infrastructure
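For reference, the Llama 3 Instruct template from Meta's model card looks like the sketch below. Whether SambaNova's endpoint expects this raw formatted string or a structured chat payload is something to confirm against their documentation.

```python
# Llama 3 Instruct chat template per Meta's model card.
# How the string is submitted (raw prompt vs. chat payload) depends on the serving API.
def format_llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama3_prompt(
    system="You answer questions about the SN40L paper concisely.",
    user="What is the time to first token?",
)
```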

For Infrastructure Teams

  • Hardware Efficiency - SambaNova achieves 1,000 tokens/s with 16 chips vs competitors requiring 576 chips, representing significant infrastructure cost savings
  • Full Precision Performance - Unlike competitors that use reduced precision, SambaNova maintains full precision (FP16/BF16) throughout inference, preserving accuracy
  • Enterprise Flexibility - The CoE architecture allows fine-tuning specific experts for specialized use cases without retraining entire models, enabling faster iteration and deployment

Enterprise AI Applications

Real-world use cases for high-performance inference

The workshop highlights several enterprise scenarios where SambaNova's approach delivers significant value beyond just raw speed.

Data Privacy and Security

Enterprises can deploy SambaNova's infrastructure on-premises or in private clouds, ensuring sensitive data never leaves their control while still accessing state-of-the-art models like Llama 3. This is critical for industries like healthcare, finance, and government where data sovereignty is mandatory.

Customization at Scale

The CoE architecture enables enterprises to maintain custom experts fine-tuned on their proprietary data while still benefiting from pre-built general-purpose experts. This hybrid approach balances specialization with cost-effectiveness.

Continual Learning

As new information becomes available or regulations change, individual experts can be updated without disrupting the entire system. This enables enterprises to keep their AI deployments current with minimal operational overhead.

Technical Implementation

Environment setup and code examples from the workshop

Environment Requirements

  • Python 3.10
  • Tesseract OCR (for document processing)
  • Poppler (for PDF handling)
  • SambaNova Studio API key
  • Dependencies: PyPDF2, unstructured, llama-index, chromadb

API Integration Pattern

  • Configure Endpoint - Set up SambaNova Studio API endpoint with your API key and base URL for both LLM and embedding services
  • Select Model - Specify the expert model using the x-expert header (e.g., MetaLlama3-8B-Instruct)
  • Invoke with LlamaIndex - Use LlamaIndex's high-level API for simple inference or RAG workflows, with automatic retry and error handling
  • Configure Embeddings - Choose between CPU or RDU hardware for embedding models, with configurable batch sizes (1 or 32) for optimal throughput (a hypothetical sketch of the endpoint and embedding calls follows)
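Steps 2 and 3 are sketched in earlier sections; the snippet below is a hypothetical sketch of steps 1 and 4, configuring the endpoint and batching embedding requests. The path, payload shape, and response field are assumptions for illustration, not SambaNova Studio's documented schema.

```python
# Hypothetical endpoint configuration and batched embedding calls.
# Paths, payloads, and response fields are assumptions; consult the SambaNova Studio docs.
import os

import requests

BASE_URL = os.environ["SAMBANOVA_BASE_URL"]  # your SambaNova Studio endpoint
API_KEY = os.environ["SAMBANOVA_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def embed(texts: list[str], batch_size: int = 32) -> list[list[float]]:
    """Embed texts with the hosted E5-large-v2 model, batching 1 or 32 at a time."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        resp = requests.post(
            f"{BASE_URL}/embeddings",                           # assumed path
            json={"inputs": texts[start:start + batch_size]},   # assumed payload shape
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])               # assumed response field
    return vectors

# batch_size=32 favors throughput; batch_size=1 minimizes latency to the first vector,
# mirroring the 1-or-32 options mentioned above.
vectors = embed(["first chunk of the SN40L paper...", "second chunk..."], batch_size=32)
```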

Why This Matters

The bigger picture for AI infrastructure

SambaNova's approach represents more than just faster inference—it's a rethinking of how AI models should be deployed in production environments. The combination of custom hardware, modular architecture, and unified APIs addresses the real pain points enterprises face when adopting AI at scale.

The Performance-Efficiency Paradox

Most infrastructure decisions require trading off between performance and cost. SambaNova's results challenge this assumption by delivering both superior performance (1,000 tokens/s) and better efficiency (16 chips vs 576) simultaneously. This is achieved through architectural innovations rather than simply throwing more hardware at the problem.

The Bottom Line

For teams building production AI systems, SambaNova offers a compelling alternative to traditional GPU-based inference. The ability to run state-of-the-art models like Llama 3 at unprecedented speeds with full precision—while using a fraction of the hardware—opens new possibilities for interactive AI applications that were previously impractical.

The workshop demonstrates that we're still in the early days of AI infrastructure optimization. As custom hardware and novel architectures mature, we can expect to see further improvements that make advanced AI capabilities more accessible to enterprises of all sizes.

Watch the Full Workshop

Access the complete video and resources

This highlight is based on the full workshop presentation from AI Engineer.world. Watch the complete video for detailed demonstrations, code walkthroughs, and Q&A.

Watch on YouTube