One Model to Rule Them All: How Netflix Built a Foundation Model for All Recommendations
Why build dozens of specialized recommendation models when one foundation model can handle them all? The inside story of how Netflix proved scaling laws apply to recommendation systems—achieving both quality improvements and infrastructure consolidation by borrowing LLM techniques like multi-token prediction and long-context training.
Can we centralize the learning of user representation in one place? The scaling law applies to recommendation systems just as it applies to LLMs.
— Yesu Feng, Netflix (00:04:18)
The Problem: Fragmentation at Scale
Dozens of Independent Models
Netflix had accumulated dozens of specialized recommendation models, built independently over the years. Each model had different objectives but massive overlap in feature and label engineering, leading to duplication and an unmanageable landscape.
The reality: "Naturally this led to duplication in our label engineering as well as feature engineering... many of those models were built independently over the years. They may have different objectives but have a lot of overlap as well."
Watch explanation (00:02:26)
Three-Level Diversity Challenge
Netflix's recommendation complexity spans three dimensions: diverse rows (genres, trending, originals), diverse items within rows, and diverse pages (homepage, search, kids).
Quote: "Diversity comes at at least three levels. The first level is the row: we have diverse rows... The second dimension is, of course, the items or entities... The third level is the page: we have the homepage, we have the search page, we have a kids homepage, which is tailored very differently toward kids' interests."
Watch (00:00:47)
Massive Duplication
Overlapping features and labels across dozens of independently built models
Unmanageable Scale
Can't spin up new models for each content type or business use case
Cold Start Problem
Titles the model hasn't seen during training can't be handled at inference time
The Foundation Model Hypothesis
Can we centralize the learning of user representation in one place? One hypothesis is that, through scaled-up semi-supervised learning, personalization can be improved. The scaling law applies to recommendation systems just as it applies to LLMs.
— Yesu Feng, Netflix (00:04:18)
Watch hypothesis explanation
The Core Thesis
Netflix bet that by building a single transformer-based foundation model and applying LLM techniques—multi-token prediction, multi-layer representation, and long-context training—they could achieve better quality while consolidating infrastructure.
Validation: Over 2.5 years of scaling from millions of profiles to billions of parameters, Netflix "constantly still see the gain"—confirming that LLM scaling laws apply to recommendation systems.
Tokenization: The Foundation That Matters Most
People who work with LLMs understand that tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the decisions you make there percolate through all the downstream layers and manifest as either a model quality problem or a model quality plus.
— Yesu Feng, Netflix (00:05:44)
Watch tokenization deep dive
Event Representation: When, Where, What
Netflix breaks user interactions into three components for rich context, as sketched in the code after this list:
Time Encoding
When the event happened (temporal patterns)
Location & Context
Physical location, device type, canvas (row/page)
The Action
Interaction type, duration, entity ID
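As a concrete sketch of this when/where/what decomposition, the hypothetical schema below shows how one interaction might be flattened into categorical fields before each field is embedded; all names and bucketing choices are assumptions, not Netflix's actual schema.

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    """One user interaction split into when / where / what facets.
    All field names are illustrative assumptions."""
    timestamp: int      # when: unix seconds
    device_type: str    # where: e.g. "tv", "mobile"
    canvas: str         # where: surface the item appeared on, e.g. a homepage row
    entity_id: int      # what: title ID, later mapped to a learned embedding
    action_type: str    # what: e.g. "play", "click", "add_to_list"
    duration_s: float   # what: engagement depth

def to_token_fields(ev: InteractionEvent) -> dict:
    """Flatten an event into categorical fields. Each field gets its own
    embedding table, and the per-field embeddings are combined (summed or
    concatenated) into a single input 'token' for the transformer."""
    return {
        "hour_of_day": (ev.timestamp // 3600) % 24,   # coarse temporal pattern
        "device_type": ev.device_type,
        "canvas": ev.canvas,
        "entity_id": ev.entity_id,
        "action_type": ev.action_type,
        # continuous duration is bucketed before embedding
        "duration_bucket": min(int(ev.duration_s // 60), 120),
    }
```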
The Cold-Start Problem, Solved
Pure ID embeddings fail on unseen content. Netflix combines ID embeddings with semantic content information to handle new titles at inference time.
Problem: "If you only have ID embeddings learned from scratch in the model, then you have a problem with cold start, meaning that for titles the model hasn't seen during training, it doesn't know how to deal with them at inference time."
Solution: "So we need to have semantic content information to be complementary to those ID embeddings."
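A minimal sketch of this hybrid idea in PyTorch, assuming the semantic vectors are precomputed by a separate content encoder over title metadata; the dimensions and the concatenate-then-project design are illustrative, not Netflix's actual architecture.

```python
import torch
import torch.nn as nn

class HybridItemEmbedding(nn.Module):
    """Combine a learned ID embedding with a fixed semantic content vector.
    For a title unseen during training the ID embedding carries no signal,
    but the semantic vector still does, which is what mitigates cold start."""

    def __init__(self, num_ids: int, id_dim: int, content_dim: int, out_dim: int):
        super().__init__()
        self.id_emb = nn.Embedding(num_ids, id_dim)
        self.proj = nn.Linear(id_dim + content_dim, out_dim)

    def forward(self, item_ids: torch.Tensor, content_vecs: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch,); content_vecs: (batch, content_dim), precomputed
        return self.proj(torch.cat([self.id_emb(item_ids), content_vecs], dim=-1))
```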
Richer Objectives Than LLMs
This is also very interesting in the sense that it's much richer than an LLM, because instead of one sequence we use multiple sequences to represent the output... We have many other facets of each event that can also be used as a target, for things like action type.
— Yesu Feng, Netflix (00:10:07)
Watch rich objectives explanation
Multi-Task Learning: Multiple Prediction Targets
Unlike LLMs that predict the next token in a single sequence, Netflix's foundation model predicts multiple facets simultaneously:
Entity ID
Which item (top-k softmax)
Action Type
View, click, save, share
Duration
Watch time, engagement depth
Device & Timing
Platform, temporal patterns
Key insight: Multi-task learning with diverse prediction targets creates more robust user representations that generalize better than single-objective models.
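A minimal sketch of how a shared transformer hidden state could feed these targets, with one head per facet; the head shapes and the duration-as-regression choice are assumptions made for illustration, and how Netflix weights the per-head losses is not stated in the talk.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """One shared hidden state feeds several prediction heads, one per facet."""

    def __init__(self, hidden: int, num_items: int, num_actions: int, num_devices: int):
        super().__init__()
        self.entity_head = nn.Linear(hidden, num_items)    # softmax over the catalog
        self.action_head = nn.Linear(hidden, num_actions)  # view / click / save / share
        self.duration_head = nn.Linear(hidden, 1)          # watch-time regression
        self.device_head = nn.Linear(hidden, num_devices)  # platform

    def forward(self, h):
        # h: (batch, hidden) transformer output at a given position
        return {
            "entity": self.entity_head(h),
            "action": self.action_head(h),
            "duration": self.duration_head(h).squeeze(-1),
            "device": self.device_head(h),
        }
```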
LLM Techniques Applied to Recommendations
1. Multi-Token Prediction for Long-Term Satisfaction
Instead of predicting just the next action, Netflix forces the model to predict multiple future actions—making it less myopic and more focused on long-term user satisfaction.
The goal: "Force the model to be less myopic and more robust to serving-time shift, because you have a time gap between training and serving, and also force the model to target long-term user satisfaction and long-term user behavior instead of just focusing on the next action."
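One common way to realize this is a next-k objective: each position is penalized for mispredicting the next `horizon` events, not just the next one. The sketch below assumes one output head per future offset, a layout the talk does not specify.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits_per_offset: list, targets: torch.Tensor) -> torch.Tensor:
    """logits_per_offset[k-1]: (batch, seq, vocab) logits predicting the event
    k steps ahead; targets: (batch, seq) event IDs. Assumes seq > horizon."""
    horizon = len(logits_per_offset)
    loss = 0.0
    for k, logits in enumerate(logits_per_offset, start=1):
        valid = targets.size(1) - k   # positions that have a label k steps ahead
        loss = loss + F.cross_entropy(
            logits[:, :valid].reshape(-1, logits.size(-1)),
            targets[:, k:k + valid].reshape(-1),
        )
    return loss / horizon
```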
2. Multi-Layer Representation Extraction
Netflix uses the foundation model as a feature extractor, pulling representations from multiple transformer layers rather than just the final output—similar to BERT embeddings in NLP.
Key insight: Intermediate layers often contain richer, more generalizable representations that transfer better to downstream tasks than the final output layer.
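A minimal sketch of multi-layer extraction using PyTorch's `nn.TransformerEncoder`; which layers to tap and how to pool them (mean over time here) are assumptions, since the talk only states that representations are pulled from multiple layers.

```python
import torch
import torch.nn as nn

def extract_multi_layer(encoder: nn.TransformerEncoder, x: torch.Tensor,
                        taps=(3, 7, -1)) -> torch.Tensor:
    """Run the layer stack manually, keep hidden states at several depths,
    and pool them into one user representation. x: (batch, seq, d_model),
    assuming the encoder layers were built with batch_first=True."""
    states, h = [], x
    for layer in encoder.layers:
        h = layer(h)
        states.append(h)
    # mean-pool over time at each tapped depth, then concatenate
    return torch.cat([states[i].mean(dim=1) for i in taps], dim=-1)

# usage sketch: a 12-layer encoder tapped at layers 4, 8, and 12
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
rep = extract_multi_layer(encoder, torch.randn(2, 100, 256))  # shape (2, 3 * 256)
```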
3. Long-Context User Histories
Netflix feeds extensive user interaction sequences into the model, evolving from truncated sliding windows to sparse attention to progressive training for longer sequences.
Evolution: Truncated sliding window → Sparse attention → Progressively training longer and longer sequences. Long context windows are critical for capturing user preferences and behavior patterns over time.
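One simple way to implement the final stage is a context-length curriculum keyed on the training step, as sketched below; the stage boundaries and lengths are invented for illustration and are not from the talk.

```python
def context_length(step: int) -> int:
    """Return the max history length for the current training step.
    The stage boundaries below are hypothetical."""
    schedule = [(0, 512), (50_000, 2_048), (150_000, 8_192)]
    length = schedule[0][1]
    for start_step, ctx in schedule:
        if step >= start_step:
            length = ctx
    return length

def truncate_history(events: list, step: int) -> list:
    # keep only the most recent events that fit the current context budget
    return events[-context_length(step):]
```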
Scaling Laws Confirmed
Does the scaling law apply? I think the answer is yes. Over the roughly two to two and a half years we were scaling up, we constantly still see the gains, from only on the order of 10 million profiles, or a few million profiles, to now on the order of 1 billion model parameters.
— Yesu Feng, Netflix (00:11:35)
Watch scaling law validation (00:11:35)
Scaling Duration
Roughly 2.5 years of scaling up, with continuous improvement at each step
Model Parameters
Scaled from training on a few million profiles to on the order of 1B model parameters, with consistent quality gains
The Scaling Law Applied to Recommendations
Netflix validated that the same scaling laws that apply to LLMs—more data, larger models, longer training—also apply to recommendation systems. This was a 2.5-year journey of continuous scaling with "constant gains" at each step.
Practical implication: Don't underestimate the value of scale in recommendation systems. Unlike smaller models that plateau, foundation models continue to improve with more parameters and training data—just like LLMs.
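For reference, the canonical power-law form of LLM scaling laws (Kaplan et al., 2020) is shown below; Netflix has not published constants or exponents for its recommendation model, so this only illustrates the shape of the relationship the talk claims carries over.

```latex
% Power-law scaling of loss with parameter count N and data size D
% (N_c, D_c, \alpha_N, \alpha_D are fitted constants):
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```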
Three Consumption Patterns
There are three main approaches, or consumption patterns. First, the foundation model can be integrated as a subgraph within the downstream model. Second, we can push out embeddings. Finally, users can extract the model and fine-tune it for specific applications.
— Yesu Feng, Netflix (00:14:41)
Watch consumption patterns (00:14:41)
Pattern 1: Subgraph Integration
Foundation model integrated as a subgraph within downstream models for direct feature extraction during inference
Pattern 2: Push Out Embeddings
Pre-compute and store user/content embeddings in a centralized feature store for fast retrieval at serving time (sketched in code after Pattern 3)
Pattern 3: Fine-Tune / Distill
Extract and fine-tune or distill the foundation model for specific applications with stringent latency requirements
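As a sketch of Pattern 2, the batch job below precomputes user embeddings and writes them to a feature store; `fm.encode_user` and `store.put` are hypothetical interfaces standing in for the foundation model's encoder and the store client.

```python
import torch

@torch.no_grad()
def push_out_embeddings(fm, user_batches, store) -> None:
    """Pattern 2: precompute user embeddings offline and publish them to a
    key-value feature store for fast retrieval at serving time."""
    fm.eval()  # assumes fm is an nn.Module wrapping the foundation model
    for user_ids, histories in user_batches:
        embs = fm.encode_user(histories)   # hypothetical encoder: (batch, dim)
        for uid, emb in zip(user_ids, embs):
            store.put(f"user_emb:{uid}", emb.cpu().numpy())  # hypothetical API
```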
Results: High Leverage Achieved
We indeed see high leverage of the FM in bringing about both A/B test wins and infrastructure consolidation. It is a scalable solution, in terms of both scaling up the model with improved quality and making the whole infrastructure consolidated and scalable.
— Yesu Feng, Netflix (00:16:36)
Watch results summary (00:16:36)
A/B Test Wins
Multiple A/B tests showing quality improvements across applications
Infrastructure Consolidation
Unified data and representation layer across all recommendation use cases
Scalable Quality
Continuous improvement by scaling up the model with more parameters
Simplified Operations
One model to maintain instead of dozens of fragmented systems
Future Directions: Prompt Tuning
Can we just train some soft tokens so that, at inference time, we can directly swap the soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction that we are getting into.
— Yesu Feng, Netflix (00:18:04)
Watch future directions (00:18:04)
Prompt Tuning for Rapid Adaptation
Netflix is exploring prompt tuning—training soft tokens that can be swapped at inference time to adapt the foundation model to different tasks without retraining. This enables rapid adaptation for new use cases while maintaining the benefits of a centralized foundation model.
The promise: Instead of fine-tuning entire models for each application, simply swap in learned soft tokens that "prompt" the foundation model to behave differently—dramatically reducing deployment time and complexity.
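A minimal soft-prompt sketch in PyTorch: a small bank of learned embeddings is prepended to the event sequence while the foundation model stays frozen, so swapping tasks means swapping this module. The size and prepend-only design are assumptions based on the talk's description.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Task-specific learned tokens prepended to the user's event embeddings.
    Only these parameters train; the foundation model is frozen. At inference,
    a different SoftPrompt can be swapped in per task without retraining."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, event_embs: torch.Tensor) -> torch.Tensor:
        # event_embs: (batch, seq, dim) -> (batch, num_tokens + seq, dim)
        prompt = self.tokens.unsqueeze(0).expand(event_embs.size(0), -1, -1)
        return torch.cat([prompt, event_embs], dim=1)
```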
Key Takeaways
- Invest in tokenization: "Tokenization decisions have a profound impact on model quality"; decisions at the bottom layer percolate through all downstream layers
- Use multi-token prediction: Predicting multiple future actions forces the model to optimize for long-term satisfaction, not just next-click
- Combine ID + semantic embeddings: Add semantic content features to handle cold start for unseen items
- Rich multi-task objectives: Predict entity ID, action type, duration, device—multiple facets create more robust representations
- Scaling laws apply to recommendations: Netflix validated that LLM scaling laws (more data, larger models) also apply to recommendation systems
- LLM techniques transfer effectively: Multi-token prediction, multi-layer representation, and long-context training all work for recommendations
- Richer than LLMs: Recommendation models can predict multiple output sequences and targets, richer than an LLM's single-sequence next-token prediction
- Centralize representation learning: One foundation model eliminates duplication and increases leverage across all recommendation use cases
- Infrastructure consolidation: Foundation models enable unified data and representation layers, simplifying operations
- Three integration patterns: Plan your strategy upfront—subgraph integration, embeddings, or fine-tuning based on latency requirements
- Prompt tuning enables rapid adaptation: Train soft tokens to adapt the foundation model without retraining—faster deployment for new use cases
Video Reference
Netflix's Big Bet: One Model to Rule Recommendations
Yesu Feng, Netflix Personalization Engineering
Duration: ~18 min
Event: AI Engineer Summit 2025
Video ID: AbZ4IYGbfpQ
Key Timestamps
Note: All timestamps in this analysis link to the exact moment in the video where the quote or insight appears. Click any "Watch" link to jump directly to that section.