Netflix AI Engineering

One Model to Rule Them All: How Netflix Built a Foundation Model for All Recommendations

Why build dozens of specialized recommendation models when one foundation model can handle them all? The inside story of how Netflix proved scaling laws apply to recommendation systems—achieving both quality improvements and infrastructure consolidation by borrowing LLM techniques like multi-token prediction and long-context training.

Can we centralize the learning of user representation in one place? The scaling law applies to recommendation systems just as it applies to LLMs.

— Yesu Feng, Netflix (00:04:18)

~18 min watch
Yesu Feng, Netflix Personalization
AI Engineer Summit 2025

The Problem: Fragmentation at Scale

Dozens of Independent Models

Netflix had grown dozens of specialized recommendation models built independently over years. Each model had different objectives but massive overlap in features and label engineering—leading to duplications and an unmanageable landscape.

The reality: "Naturally this led to duplications in our label engineering as well as feature engineering... Many of those models were built independently over the years. They may have different objectives but have a lot of overlaps as well."

Watch explanation (00:02:26)

Three-Level Diversity Challenge

Netflix's recommendation complexity spans three dimensions: diverse rows (genres, trending, originals), diverse items within rows, and diverse pages (homepage, search, kids).

Quote: "Diversity comes at at least three levels. The first level is the row: we have diverse rows... The second dimension is, of course, the items or entities... The third level is the page: we have the homepage, we have the search page, we have a kids homepage, which is tailored very differently toward kids' interests."

Watch (00:00:47)
🔄 Massive Duplication

Overlapping features and labels across dozens of independently built models

📈 Unmanageable Scale

Can't keep spinning up new models for every content type and business use case

🧊 Cold Start Problem

Titles the model hasn't seen during training can't be handled at inference time

The Foundation Model Hypothesis

Can we centralize the learning of user representation in one place? One hypothesis is that personalization can be improved through scaled-up semi-supervised learning. The scaling law applies to recommendation systems just as it applies to LLMs.

— Yesu Feng, Netflix (00:04:18)

Watch hypothesis explanation

The Core Thesis

Netflix bet that by building a single transformer-based foundation model and applying LLM techniques—multi-token prediction, multi-layer representation, and long-context training—they could achieve better quality while consolidating infrastructure.

Validation: Over roughly 2.5 years of scaling, from training on the order of a few million profiles to a model with roughly 1 billion parameters, Netflix "constantly still see[s] the gain", confirming that LLM scaling laws apply to recommendation systems.

Tokenization: The Foundation That Matters Most

People who work with LLMs understand that tokenization decisions have a profound impact on model quality. So although it's the bottom layer, the decisions you make there percolate through all the downstream layers and manifest as either a model quality problem or a model quality gain.

— Yesu Feng, Netflix (00:05:44)

Watch tokenization deep dive

Event Representation: When, Where, What

Netflix breaks user interactions into three components for rich context:

WHEN

Time Encoding

When the event happened (temporal patterns)

WHERE

Location & Context

Physical location, device type, canvas (row/page)

WHAT

The Action

Interaction type, duration, entity ID

Watch event representation explanation (00:07:40)
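
To make the when/where/what decomposition concrete, here is a minimal PyTorch sketch of how such an event tokenizer might embed each facet and sum them into a single token. All field names, vocabulary sizes, and the sum-fusion choice are illustrative assumptions, not Netflix's actual schema.

    import torch
    import torch.nn as nn

    class EventEmbedder(nn.Module):
        """Hypothetical event tokenizer: each interaction becomes one token
        built from WHEN (time bucket), WHERE (device, canvas), and WHAT
        (entity, action) facets. Vocab sizes are made-up placeholders."""

        def __init__(self, d_model=256, n_entities=100_000, n_actions=8,
                     n_devices=16, n_canvases=32, n_time_buckets=168):
            super().__init__()
            self.entity = nn.Embedding(n_entities, d_model)    # WHAT: entity ID
            self.action = nn.Embedding(n_actions, d_model)     # WHAT: interaction type
            self.device = nn.Embedding(n_devices, d_model)     # WHERE: device
            self.canvas = nn.Embedding(n_canvases, d_model)    # WHERE: row/page
            self.time = nn.Embedding(n_time_buckets, d_model)  # WHEN: e.g. hour-of-week

        def forward(self, ev):
            # ev: dict of [batch, seq_len] integer tensors, one per facet.
            # Facet embeddings are summed into a single token embedding.
            return (self.entity(ev["entity"]) + self.action(ev["action"])
                    + self.device(ev["device"]) + self.canvas(ev["canvas"])
                    + self.time(ev["time"]))

Summing facet embeddings keeps the sequence length equal to the number of events; concatenating facets or emitting separate tokens per facet would be equally plausible designs.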

The Cold-Start Problem, Solved

Pure ID embeddings fail on unseen content. Netflix combines ID embeddings with semantic content information to handle new titles at inference time.

Problem: "If you only have ID embeddings learned from scratch in the model, then you have a problem with cold start, meaning that titles the model hasn't seen during training, it doesn't know how to deal with at inference time."

Solution: "So we need to have semantic content information be complementary to those ID embeddings."

Watch (00:08:34)
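
Here is a minimal sketch of that combination, assuming a frozen semantic vector per title (say, from metadata or a content encoder) fused with a learned ID embedding; unseen titles fall back to a shared "unknown" ID slot. The fusion-by-sum and fallback design are assumptions for illustration, not Netflix's confirmed architecture.

    import torch
    import torch.nn as nn

    class TitleEmbedder(nn.Module):
        """Hypothetical ID + semantic embedding: known titles get a learned
        ID vector plus projected content semantics; unseen titles share one
        'unknown' ID slot, so semantics alone carry them at inference."""

        def __init__(self, n_titles=100_000, d_model=256, d_content=768):
            super().__init__()
            self.unknown_id = n_titles                  # reserved fallback slot
            self.id_emb = nn.Embedding(n_titles + 1, d_model)
            self.content_proj = nn.Linear(d_content, d_model)

        def forward(self, title_ids, content_vecs, is_known):
            # title_ids: [N] ints; content_vecs: [N, d_content]; is_known: [N] bool
            ids = torch.where(is_known, title_ids,
                              torch.full_like(title_ids, self.unknown_id))
            return self.id_emb(ids) + self.content_proj(content_vecs)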

Richer Objectives Than LLMs

This is also very interesting in the sense that it's much richer than an LLM: you can see, first, that we use not one sequence but multiple sequences to represent the output... We have many other facets of each event that can also be used as a target, for things like action type.

— Yesu Feng, Netflix (00:10:07)

Watch rich objectives explanation

Multi-Task Learning: Multiple Prediction Targets

Unlike LLMs that predict the next token in a single sequence, Netflix's foundation model predicts multiple facets simultaneously:

Entity ID

Which item (top-k softmax)

Action Type

View, click, save, share

Duration

Watch time, engagement depth

Device & Timing

Platform, temporal patterns

Key insight: Multi-task learning with diverse prediction targets creates more robust user representations that generalize better than single-objective models.
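
One plausible shape for these multi-task objectives is separate heads over the shared hidden state, with per-facet losses summed. The head choices and loss weights below are our assumptions for illustration, not Netflix's disclosed configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskHeads(nn.Module):
        """Hypothetical multi-task heads: next entity (softmax over the
        catalog), action type, and log-duration regression, all reading
        the same transformer hidden states."""

        def __init__(self, d_model=256, n_entities=100_000, n_actions=8):
            super().__init__()
            self.entity_head = nn.Linear(d_model, n_entities)
            self.action_head = nn.Linear(d_model, n_actions)
            self.duration_head = nn.Linear(d_model, 1)

        def loss(self, h, targets):
            # h: [batch, seq, d_model]; targets hold one label per position.
            l_entity = F.cross_entropy(self.entity_head(h).flatten(0, 1),
                                       targets["entity"].flatten())
            l_action = F.cross_entropy(self.action_head(h).flatten(0, 1),
                                       targets["action"].flatten())
            l_duration = F.mse_loss(self.duration_head(h).squeeze(-1),
                                    targets["log_duration"])
            # Loss weights are illustrative assumptions.
            return l_entity + 0.5 * l_action + 0.1 * l_duration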

LLM Techniques Applied to Recommendations

1. Multi-Token Prediction for Long-Term Satisfaction

Instead of predicting just the next action, Netflix forces the model to predict multiple future actions—making it less myopic and more focused on long-term user satisfaction.

The goal: "Force the model to be less myopic and more robust to serving-time shift, because you have a time gap between training and serving, and also force the model to target long-term user satisfaction and long-term user behavior instead of just focusing on the next action."

Watch multi-token prediction (00:12:48)
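
A sketch of one common way to implement this: one prediction head per future offset, so position t is scored against the entities at t+1 through t+H. The per-offset-head design mirrors LLM multi-token practice and is an assumption here, not Netflix's confirmed setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTokenHead(nn.Module):
        """Hypothetical multi-token prediction: each head k predicts the
        entity k steps ahead, pushing the model beyond next-action focus."""

        def __init__(self, d_model=256, n_entities=100_000, horizon=4):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Linear(d_model, n_entities) for _ in range(horizon))

        def loss(self, h, entity_seq):
            # h: [batch, seq, d_model]; entity_seq: [batch, seq] entity IDs.
            # The offset-k head is scored against the sequence shifted by k.
            total = 0.0
            for k, head in enumerate(self.heads, start=1):
                logits = head(h[:, :-k])          # positions that have a t+k target
                target = entity_seq[:, k:]
                total = total + F.cross_entropy(logits.flatten(0, 1),
                                                target.flatten())
            return total / len(self.heads)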

2. Multi-Layer Representation Extraction

Netflix uses the foundation model as a feature extractor, pulling representations from multiple transformer layers rather than just the final output—similar to BERT embeddings in NLP.

Key insight: Intermediate layers often contain richer, more generalizable representations that transfer better to downstream tasks than the final output layer.
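
A minimal sketch of multi-layer extraction using a stock PyTorch encoder: run the layers manually, keep each layer's output, and concatenate pooled states from a chosen subset. The layer subset and mean-pool readout are illustrative choices, not Netflix's exact recipe.

    import torch
    import torch.nn as nn

    def multilayer_user_repr(encoder, x, picks=(-1, -2, -4)):
        """Pool hidden states from several encoder layers (BERT-style),
        not just the last, and concatenate them into one user vector."""
        hidden, out = [], x
        for layer in encoder.layers:   # nn.TransformerEncoder exposes .layers
            out = layer(out)
            hidden.append(out)
        return torch.cat([hidden[i].mean(dim=1) for i in picks], dim=-1)

    # Usage sketch: 6-layer encoder, batch of 2 users, 128-event histories.
    enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
        num_layers=6)
    feats = multilayer_user_repr(enc, torch.randn(2, 128, 256))  # [2, 768]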

3. Long-Context User Histories

Netflix feeds extensive user interaction sequences into the model, evolving from truncated sliding windows to sparse attention to progressive training for longer sequences.

Evolution: Truncated sliding window → Sparse attention → Progressively training longer and longer sequences. Long context windows are critical for capturing user preferences and behavior patterns over time.
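
The progressive-lengthening stage might look like a simple curriculum over the training run, as in the sketch below; the step thresholds and lengths are made-up values for illustration.

    def context_length(step, stages=((0, 512), (50_000, 2_048), (150_000, 8_192))):
        """Hypothetical schedule: train on short truncated histories first,
        then progressively unlock longer and longer sequences."""
        length = stages[0][1]
        for start, max_len in stages:
            if step >= start:
                length = max_len
        return length

    def truncate_history(events, step):
        # Keep only the most recent events that fit the current budget.
        return events[-context_length(step):]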

Scaling Laws Confirmed

Does a scaling law apply? I think the answer is yes. Over roughly two to two and a half years of scaling up, we constantly still see the gain, from the order of ten million, or a few million, profiles to now the order of 1 billion model parameters.

— Yesu Feng, Netflix (00:11:35)

Watch scaling law validation (00:11:35)
2.5 Years

Scaling Duration

Continuous quality gains while scaling from millions to roughly a billion parameters

~1B

Model Parameters

Scaled from ~10M to ~1B parameters with consistent quality gains

The Scaling Law Applied to Recommendations

Netflix validated that the same scaling laws that apply to LLMs—more data, larger models, longer training—also apply to recommendation systems. This was a 2.5-year journey of continuous scaling, with consistent gains at each step.

Practical implication: Don't underestimate the value of scale in recommendation systems. Unlike smaller models that plateau, foundation models continue to improve with more parameters and training data—just like LLMs.

Three Consumption Patterns

There are three main approaches, or consumption patterns. First, the foundation model can be integrated as a subgraph within the downstream model. Second, we can push out embeddings. Finally, users can extract the model and fine-tune it for specific applications.

— Yesu Feng, Netflix (00:14:41)

Watch consumption patterns (00:14:41)

Pattern 1: Subgraph Integration

Foundation model integrated as a subgraph within downstream models for direct feature extraction during inference

Pattern 2: Push Out Embeddings

Pre-compute and store user/content embeddings in a centralized feature store for fast retrieval at serving time

Pattern 3: Fine-Tune / Distill

Extract and fine-tune or distill the foundation model for specific applications with stringent latency requirements
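
Pattern 2 might look like the batch job sketched below: periodically run the foundation model over each profile's recent history and publish a pooled user vector to a key-value feature store. The fm call signature, the pooling choice, and the dict-as-store are stand-in assumptions, not Netflix's actual system.

    import torch

    @torch.no_grad()
    def publish_user_embeddings(fm, histories, feature_store, batch_size=256):
        """Hypothetical batch job for pattern 2: precompute user embeddings
        offline so downstream models fetch them cheaply at serving time."""
        fm.eval()
        profile_ids = list(histories)
        for i in range(0, len(profile_ids), batch_size):
            batch_ids = profile_ids[i:i + batch_size]
            batch = torch.stack([histories[p] for p in batch_ids])  # [B, S] tokens
            user_vecs = fm(batch).mean(dim=1)      # assume fm returns [B, S, D]
            for pid, vec in zip(batch_ids, user_vecs):
                feature_store[pid] = vec.cpu()     # keyed by profile ID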

Results: High Leverage Achieved

We indeed see high leverage of the FM in bringing about both AB test wins as well as infrastructure consolidation. It is a scalable solution: it both scales up the model with improved quality and makes the whole infra consolidated and scalable.

— Yesu Feng, Netflix (00:16:36)

Watch results summary (00:16:36)

AB Test Wins

Multiple AB tests showing quality improvements across applications

Infrastructure Consolidation

Unified data and representation layer across all recommendation use cases

Scalable Quality

Continuous improvement by scaling up the model with more parameters

Simplified Operations

One model to maintain instead of dozens of fragmented systems

Future Directions: Prompt Tuning

Can we just train some soft tokens so that at inference time we can directly swap the soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction that we are getting into.

— Yesu Feng, Netflix (00:18:04)

Watch future directions (00:18:04)

Prompt Tuning for Rapid Adaptation

Netflix is exploring prompt tuning—training soft tokens that can be swapped at inference time to adapt the foundation model to different tasks without retraining. This enables rapid adaptation for new use cases while maintaining the benefits of a centralized foundation model.

The promise: Instead of fine-tuning entire models for each application, simply swap in learned soft tokens that "prompt" the foundation model to behave differently—dramatically reducing deployment time and complexity.
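
A minimal sketch of the soft-token idea: a small bank of learned prompt embeddings per task, prepended to the user's event embeddings while the foundation model stays frozen. The task names, token counts, and prepend design are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        """Hypothetical prompt tuning for the FM: only these soft tokens
        train; swapping the task name at inference swaps model behavior."""

        def __init__(self, d_model=256, n_tokens=8,
                     tasks=("homepage", "search", "kids")):
            super().__init__()
            self.prompts = nn.ParameterDict(
                {t: nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)
                 for t in tasks})

        def forward(self, event_embs, task):
            # event_embs: [batch, seq, d_model] from the frozen FM tokenizer.
            prompt = self.prompts[task].expand(event_embs.size(0), -1, -1)
            return torch.cat([prompt, event_embs], dim=1)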

Key Takeaways

For ML Engineers
  • Invest in tokenization: "tokenization decisions have a profound impact on model quality"; decisions at the bottom layer percolate through all downstream layers
  • Use multi-token prediction: Predicting multiple future actions forces the model to optimize for long-term satisfaction, not just next-click
  • Combine ID + semantic embeddings: Add semantic content features to handle cold start for unseen items
  • Rich multi-task objectives: Predict entity ID, action type, duration, device—multiple facets create more robust representations
For Researchers
  • Scaling laws apply to recommendations: Netflix validated that LLM scaling laws (more data, larger models) also apply to recommendation systems
  • LLM techniques transfer effectively: Multi-token prediction, multi-layer representation, and long-context training all work for recommendations
  • Richer than LLMs: Recommendation models can go beyond single-sequence LLM next-token prediction, with multiple output sequences and prediction targets
For Engineering Leaders
  • Centralize representation learning: One foundation model eliminates duplication and increases leverage across all recommendation use cases
  • Infrastructure consolidation: Foundation models enable unified data and representation layers, simplifying operations
  • Three integration patterns: Plan your strategy upfront—subgraph integration, embeddings, or fine-tuning based on latency requirements
  • Prompt tuning enables rapid adaptation: Train soft tokens to adapt the foundation model without retraining—faster deployment for new use cases

Video Reference

Netflix's Big Bet: One Model to Rule Recommendations

Yesu Feng, Netflix Personalization Engineering

Netflix
Foundation Model
Recommendation Systems
Scaling Laws
LLM Techniques
Watch Full Video

Duration: ~18 min
Event: AI Engineer Summit 2025
Video ID: AbZ4IYGbfpQ

Note: All timestamps in this analysis link to the exact moment in the video where the quote or insight appears. Click any "Watch" link to jump directly to that section.

Analysis based on Yesu Feng's talk at AI Engineer Summit 2025. All insights extracted and verified with YouTube timestamps.