LinkedIn AI Engineering

One Model to Rule Them All: How LinkedIn Replaced Dozens of Recommendation Systems with a Single LLM

Why build 50 specialized models when one foundation model can handle them all? The inside story of 360Brew—LinkedIn's revolutionary approach to personalized ranking that achieved 7x latency reduction and 30x throughput improvement.

What if we have only one model to solve all the tasks at the same time?

— Maziar, LinkedIn AI Team (00:01:37)

~15 min watch
Hamed & Maziar, LinkedIn AI
AI Engineer Summit 2025

The Problem: A Model Zoo That Couldn't Scale

One Model Per Task: A Broken Approach

LinkedIn's recommendation systems were disjoint, task-specific, and time-consuming. Each surface—the feed, job recommendations, search, ads—required its own specialized model trained from scratch and rolled out individually.

The reality: "These systems are usually trained on a specific task. So they are disjointly optimized, they usually don't leverage the most advanced architectures, and they are rolled out one by one, which is very time-consuming and unproductive."

Watch explanation (00:01:11)
🐌

Slow Deployment

Each new surface needs data collection, model building, and production rollout

🔄

No Knowledge Sharing

Models don't learn from each other—feed insights don't help job recommendations

❄️

Cold Start Problem

New users with no history get poor recommendations from day one

The 360Brew Vision: One Foundation Model for All Personalization

The mission that we started was to build a large foundation model, based on large language models, that has a holistic understanding of the user journey on the LinkedIn platform and can solve all the personalization tasks LinkedIn has with just one model.

— Maziar, LinkedIn AI Team (00:01:45)

Watch full vision statement

1. Zero-Shot Capability

When you have a new problem or new surface, instead of collecting data, building a new model, and putting it into production—a very time-consuming journey—you can leverage this model out of the box to solve your task.

Watch (00:02:09)

2. In-Context Learning for Cold Start

Leverage in-context learning as much as possible so that, for the cold-start user problem, we can leverage this model by giving it just a few examples or by explaining what the user might be interested in.

Watch (00:02:44)

3. Natural Language Instructions

Give our users and members the ability to tell the model what they're interested in. Imagine next time you go to the LinkedIn feed, you can tell the model: "These are my niche interests and these are the topics I'm interested in exploring"—and the model starts finding relevant information.

Watch (00:03:00)

Promptification: The Magic That Makes It Work

This is what we call the magic of promptification. So we take all the information we have about the user history and their profiles and a lot of interactions that they have had and we turn it into a prompt.

— Maziar, LinkedIn AI Team (00:03:53)

Watch explanation

Prompt Structure

1. System Instruction: What we want the model to do (rank recommendations)
2. Member Profile: User information and demographics
3. Past Interactions: History with content already shown (views, clicks, saves)
4. New Item: The candidate item to rank
5. Question: "What do you think the user is going to do with this new piece of information?"
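
To make this structure concrete, here is a minimal sketch of what promptification might look like in code. The template wording, field names, and the promptify helper are illustrative assumptions for this article, not LinkedIn's actual production format.

```python
# Hypothetical promptification sketch: the template and field names are
# illustrative, not LinkedIn's production format.
def promptify(profile: dict, interactions: list[dict], candidate: dict) -> str:
    """Assemble the five-part prompt structure described above."""
    history = "\n".join(
        f'- {i["action"]} a {i["kind"]}: "{i["title"]}"' for i in interactions
    )
    return (
        # 1. System instruction: what we want the model to do
        "You are a ranking assistant. Predict whether the member will "
        "engage with a new item. Answer 'yes' or 'no'.\n\n"
        # 2. Member profile
        f"Member profile: {profile['headline']}, based in {profile['location']}.\n\n"
        # 3. Past interactions with content already shown
        f"Past interactions:\n{history}\n\n"
        # 4. The new candidate item to rank
        f'New item: a {candidate["kind"]} titled "{candidate["title"]}".\n\n'
        # 5. The question
        "Question: what do you think the member is going to do with this item?"
    )

prompt = promptify(
    {"headline": "Software engineer", "location": "Dublin"},
    [
        {"action": "viewed", "kind": "post", "title": "Intro to Python typing"},
        {"action": "clicked", "kind": "job", "title": "ML Engineer, AI startup"},
    ],
    {"kind": "post", "title": "Optimizing LLM inference"},
)
```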

Why It Works: LLMs Are Trained on Language, Not Vectors

Traditional recommendation systems rely on dense vector embeddings and cosine similarity. But LLMs have been trained on massive amounts of natural language text—they understand context, intent, and world knowledge. By converting structured user data into natural language prompts, you tap into that pre-trained reasoning capability.

Key insight: The model doesn't just see user IDs and item IDs—it sees a story: "This user is a software engineer who recently viewed three Python tutorials and clicked on a job posting at an AI startup. They've now encountered a new post about LLM optimization. What will they do?"
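
A natural way to turn that "what will they do?" question into a ranking score, and a common pattern in LLM-based ranking (an assumption here, not a confirmed 360Brew detail), is to read off the probability the model assigns to a target answer token. A sketch using an open placeholder checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder open model, not 360Brew
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def engagement_score(prompt: str) -> float:
    """Score a candidate as the probability mass on 'yes' vs. 'no'."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    pair = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return pair[0].item()  # higher = more likely to engage
```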

Training: Three Levers for Performance

Taking an LLM off the shelf works "a little bit but it's not going to be perfect." To get production-ready results, LinkedIn identified three critical scaling factors and ran systematic experiments.

Lever 1: Data Scaling

"What if we have actually more and more data? And in this graph, as you see, as we increase the amount of data, the performance of the model actually improves."

Recommendation advantage: Unlike many NLP tasks, recommendation systems have rich historical data—6 months, a year, or more of user behavior. Feeding more context into the model directly improves predictions.

Watch data scaling results (00:06:54)

Lever 2: Model Size Scaling

"You can see if you go from 7B to 8x22B, the performance of the model actually increases and improves."

Architecture: Using a Mixture-of-Experts (MoE) model with 8 experts of 22B parameters each, scaling up from a 7B dense model yielded measurable quality gains.

Watch model size results (00:07:30)

Lever 3: Context Length Scaling

"The context length actually matters a lot for these kinds of applications with the recommendation systems. And the context length actually defines how much history from the user you can actually give to the model."

The surprise: "As you can see, towards the end of this graph the performance actually drops. We don't believe that this is because the context is less informative. The problem is that the models—at least the model we were using in this experiment—don't generalize that well to longer contexts."

Key learning: More context helps—to a point. Models need to be trained on longer contexts to handle them effectively.

Watch context length analysis (00:07:52)
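
One practical consequence: if the model only generalizes up to the context length it was trained on, user history has to be budgeted rather than dumped in wholesale. A minimal sketch, assuming a fixed token budget and a newest-first selection policy; the fit_history and count_tokens names are hypothetical:

```python
def fit_history(interactions: list[str], token_budget: int, count_tokens) -> list[str]:
    """Keep the most recent interactions whose total size fits the budget."""
    kept, used = [], 0
    for text in reversed(interactions):   # walk newest-first
        cost = count_tokens(text)
        if used + cost > token_budget:
            break                         # budget exhausted: drop older history
        kept.append(text)
        used += cost
    return kept[::-1]                     # restore chronological order

# e.g. with a tokenizer: fit_history(events, 4096, lambda t: len(tok.encode(t)))
```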

Promises Delivered: What 360Brew Actually Achieved

Result 1: Crushing Cold Start

"In this case we actually show the gap between our model and the production models on the users that have few interactions like for example less than five interactions, less than 100 interactions... And you can see the gap between the 360Brew model and the production model actually grows as the number of interactions decreases."

Why it matters: Traditional recommendation systems fail with new users. 360Brew's world knowledge from pre-training fills the gap, making better predictions with less data.

The graph showed: As user interactions decreased from 100+ to under 5, 360Brew's advantage over production models increased dramatically.

Watch cold start results (00:08:54)

Result 2: Zero-Shot Generalization

"Finally we were promising to give you some generalization to the new domains meaning that the problems that model has never seen inside its training. And in this graph as I show these are four different tasks and these tasks are completely out of domain. No information about that surface the model has seen during the training. But as you can see it can actually be on par or even beat some of the production models."

The significance: "And just to say, these production models are specific to that task; they have been trained on that task. So this is not a small feat. It's actually something significant."

The results: On 4 completely new surfaces (domains the model had never seen during training), 360Brew performed on par with or better than task-specific production models.

Watch zero-shot results (00:09:42)

Result 3: Faster Feature Rollout

"So as you can see this gives the people who are developing these developing these platforms to roll out features and roll out surfaces much more quickly because they can actually use these models to do recommendation for them."

The impact: Instead of months of data collection, model training, and deployment for each new surface, teams can now use 360Brew out of the box. This dramatically accelerates innovation and iteration.

Production Deployment: "Go Big Then Go Small"

Our recipe is that we need to go big and then go small. If you go with a smaller model initially it doesn't have enough capacity, it doesn't have enough reasoning power to solve the complicated task that we have.

— Hamed, LinkedIn AI Team (00:11:27)

Watch explanation

❌ Wrong Approach

Train small model (3B) from scratch

"Doesn't work that well"—insufficient capacity and reasoning power

✅ Correct Approach

Step-by-step distillation from large teacher model

"Much, much, much more effective"—preserves knowledge while reducing size

The Distillation Recipe: Step-by-Step Size Reduction

150B → 8B → 3B → 1B

"The recipe here is that we need to do the distillation step by step and that means that we go with a for example 8B model then 3B model and then 1B model. So we slowly decrease the size of the model and we distill over and over from the previous model."

Watch (00:12:47)
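
A minimal sketch of what one such stage could look like, assuming a standard temperature-scaled KL distillation objective (the talk doesn't spell out the exact loss, and the HF-style models and batches are placeholders). The key point is the chaining: each stage's student becomes the next stage's teacher.

```python
import torch
import torch.nn.functional as F

def distill_stage(teacher, student, batches, optimizer, T: float = 2.0):
    """One stage of step-by-step distillation: match softened teacher logits."""
    teacher.eval()
    for batch in batches:
        with torch.no_grad():
            t_logits = teacher(**batch).logits   # frozen teacher predictions
        s_logits = student(**batch).logits
        loss = F.kl_div(                         # KL(student || teacher), softened
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Chain the stages: 150B -> 8B, then 8B -> 3B, then 3B -> 1B,
# distilling each student from the previous stage's model.
```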

1. Gradual Pruning

Gradual pruning with distillation cycles between each step: the gradual approach achieved near-zero information loss, whereas aggressive one-shot pruning cost about 1% in quality.

Watch (00:14:00)
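
An illustrative reading of that recipe, reusing distill_stage from the sketch above: prune a small fraction of weights, run a distillation cycle to recover quality, and repeat. The L1 magnitude criterion and the 10%-per-round schedule are assumptions, not the disclosed recipe.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def gradual_prune(model, teacher, batches, optimizer,
                  rounds: int = 5, frac: float = 0.10):
    """Alternate small pruning steps with distillation recovery cycles."""
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # Zero out the lowest-magnitude `frac` of remaining weights
                prune.l1_unstructured(module, name="weight", amount=frac)
        distill_stage(teacher, model, batches, optimizer)  # recover quality
```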

2. Mixed Precision Quantization

Critical finding: "For ranking, recommendation, and overall prediction tasks, you want the prediction, the output probability of the model, to have very good precision. So the LM head at the end of the language model has to be in FP32."

"If you do it in FP16, BF16 or FP8, what happens is that the numbers collapse and you don't have a very good calibration on top of that and you cannot distinguish between different item recommended."

Recipe: FP8 for activations and parameters in middle layers, but FP32 for the LM head to maintain probability calibration.

Watch (00:15:10)
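
In PyTorch terms, that recipe might look like the sketch below, using bfloat16 as a stand-in for FP8 since FP8 support is hardware- and framework-specific. The .model / .lm_head split assumes a Hugging Face-style causal LM, and the checkpoint is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lm.model.to(torch.bfloat16)    # transformer body in low precision
lm.lm_head.to(torch.float32)   # LM head stays FP32 for calibrated probabilities

def calibrated_next_token_probs(input_ids: torch.LongTensor) -> torch.Tensor:
    hidden = lm.model(input_ids=input_ids).last_hidden_state  # low-precision body
    logits = lm.lm_head(hidden.to(torch.float32))             # upcast into FP32 head
    return torch.softmax(logits[:, -1], dim=-1)               # well-separated probs
```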

3. Attention Sparsification (Custom CUDA Kernels)

Multi-item scoring: "Can handle 50-500 items in one query efficiently" using special masked attention where "output items don't attend to each other"

Built custom CUDA kernels on top of the SGLang and vLLM frameworks to handle large-batch scoring efficiently.
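
An illustrative construction of that mask in plain PyTorch (in production this logic lives inside the custom kernels): every candidate attends to the shared prefix, profile plus history, and causally to itself, but never to another candidate. The segment layout and the True-means-may-attend convention are assumptions.

```python
import torch

def causal(n: int) -> torch.Tensor:
    """Standard lower-triangular causal mask for a segment of n tokens."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def multi_item_mask(prefix_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean mask: True means the row token may attend to the column token."""
    total = prefix_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:prefix_len, :prefix_len] = causal(prefix_len)  # shared history: causal
    start = prefix_len
    for n in item_lens:
        end = start + n
        mask[start:end, :prefix_len] = True              # item tokens see the prefix
        mask[start:end, start:end] = causal(n)           # causal within one item
        start = end                                      # no cross-item attention
    return mask

# Score 3 candidates of 4 tokens each against a 10-token shared prefix in one pass
mask = multi_item_mask(prefix_len=10, item_lens=[4, 4, 4])
```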

Production Results: The Numbers Speak

In four or five of our releases, one release after the other, we were able to reduce the latency by 7x and at the same time increase the throughput, which is basically the number of queries that we can handle on one GPU, by 30x.

— Hamed, LinkedIn AI Team (00:26:30)

Watch production results (00:26:30)
7x

Latency Reduction

Through systematic optimization across 4-5 production releases

30x

Throughput Improvement

Number of queries per GPU increased dramatically

Serving at Scale: The Challenge

"Recommendation systems have tens of thousands of the QPS and they also require more less than a second like a 500 400 millisecond latency at at best."

Environment: LinkedIn's recommendation systems serve tens of thousands of queries per second with sub-500ms latency requirements. This is the environment where 360Brew had to prove itself—not just in offline benchmarks, but in real production traffic.

Watch serving constraints (00:10:47)

Key Takeaways

For ML Engineers
  • Promptification is powerful: Convert structured user data into natural language prompts to leverage LLM reasoning and world knowledge
  • Context length matters: More user history improves recommendations—but models must be trained to handle long contexts effectively
  • Go big then small: Start with a large teacher model (150B), then gradually distill step-by-step (150B→8B→3B→1B)
  • Mixed precision is essential: Use FP8 for most layers, but keep FP32 for the LM head to maintain ranking probability calibration
For Product Teams
  • Zero-shot enables speed: New surfaces and features don't need new models—use the foundation model out of the box
  • Cold start is solvable: LLM world knowledge bridges the gap for new users with minimal interaction history
  • Generalization beats specialization: One unified model can handle many tasks, often matching task-specific models
  • Natural language interfaces: Let users explicitly tell the system their interests for better personalization
For Engineering Leaders
  • Unified architecture scales: One model is easier to maintain, optimize, and improve than dozens of task-specific models
  • Faster iteration: Deploy features in hours, not months—no need to train new models for each surface
  • Cross-domain learning: Insights from one surface (e.g., feed) improve all others (jobs, search, ads)
  • Production optimization pays off: Systematic distillation, pruning, and quantization achieved 7x latency + 30x throughput improvements

Video Reference

360Brew: LLM-based Personalized Ranking and Recommendation

Hamed and Maziar, LinkedIn AI Team

LinkedIn
LLM
Recommendation Systems
Production ML
360Brew
Watch Full Video

Duration: ~15 min
Event: AI Engineer Summit 2025
Video ID: U0S6CfzAY5c

Note: All timestamps in this analysis link to the exact moment in the video where the quote or insight appears. Click any "Watch" link to jump directly to that section.

Analysis based on Hamed and Maziar's talk at AI Engineer Summit 2025. All insights extracted and verified with YouTube timestamps.