Understanding DeepSeek-R1 paper: Beginner’s guide
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
So, by now you’ve heard of DeepSeek making huge waves in the Generative AI space. But more than the model, it’s the research paper that has caused the stir and even triggered a sharp sell-off in the US stock market.
In this long post, I will explain every detail of the DeepSeek-R1 paper in layman’s terms so that you don’t have FOMO.
This is a long post, so have your coffee ready
Before getting started, you need to know the following terms
What is Reinforcement Learning?
Reinforcement learning (RL) is a type of machine learning where an AI learns by taking actions and receiving rewards or penalties based on those actions. The goal is to maximize rewards over time.
Example: Imagine teaching a robot to play a game. The robot tries different moves, and for each good move (like scoring a point), it gets a reward (like +1). For bad moves (like losing a point), it gets a penalty (like -1). Over time, the robot learns which moves lead to the highest score and becomes better at playing the game.
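To make the robot example concrete, here is a tiny, self-contained sketch of the idea (this is a generic illustration, not anything from the paper): an agent tries two moves, receives +1 or -1, and gradually learns which move is better.

```python
import random

# A toy version of the "robot learning a game" idea: two possible moves, one tends to
# score a point (+1), the other tends to lose one (-1). The agent keeps a value estimate
# for each move and learns to prefer the better one.

value = {"move_A": 0.0, "move_B": 0.0}   # the agent's current guess of how good each move is
learning_rate = 0.1
epsilon = 0.2                             # how often the agent explores a random move

def play(move):
    """Simulated game: move_A usually scores (+1), move_B usually loses a point (-1)."""
    if move == "move_A":
        return 1 if random.random() < 0.8 else -1
    return 1 if random.random() < 0.3 else -1

for step in range(1000):
    # explore occasionally, otherwise pick the move the agent currently thinks is best
    if random.random() < epsilon:
        move = random.choice(list(value))
    else:
        move = max(value, key=value.get)

    reward = play(move)
    # nudge the value estimate toward the reward we just observed
    value[move] += learning_rate * (reward - value[move])

print(value)  # move_A should end up with the higher value
```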
What is SFT Fine-Tuning?
Fine-tuning a model is the process of taking a pre-trained AI model and making small adjustments to it so that it performs better on a specific task. Instead of training from scratch, the model is “tuned” with additional data to improve its performance for a particular use case.
SFT (Supervised Fine-Tuning) is a specific type of fine-tuning where the model is trained on a labeled dataset. This means the model is provided with examples that include both the input data (like images or text) and the correct answers (labels). The model learns to make predictions based on these labeled examples to improve its accuracy for a particular task.
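Here is a minimal sketch of what an SFT loop looks like, using a toy character-level model in PyTorch instead of a real pre-trained LLM. The shape of the loop is the same idea at a much smaller scale: labeled (prompt, answer) pairs and a next-token cross-entropy loss.

```python
import torch
import torch.nn as nn

# A minimal SFT sketch with a toy "language model". Real SFT starts from a pre-trained
# LLM and a tokenizer, but the training loop has the same shape.

labeled_data = [
    ("2+2=", "4"),
    ("capital of France is ", "Paris"),
]

vocab = sorted({ch for prompt, answer in labeled_data for ch in prompt + answer})
stoi = {ch: i for i, ch in enumerate(vocab)}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        h, _ = self.rnn(self.embed(idx))
        return self.head(h)

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    for prompt, answer in labeled_data:
        text = prompt + answer
        ids = torch.tensor([[stoi[ch] for ch in text]])
        inputs, targets = ids[:, :-1], ids[:, 1:]          # predict the next character
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```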
What is Knowledge Distillation?
Model distillation is a method where knowledge from a large, complex model (the teacher model) is transferred to a smaller, simpler model (the student model).
The aim is to develop a more compact model that maintains much of the performance of the larger model, but with improved efficiency in terms of computational power, memory usage, and speed during inference.
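A minimal sketch of the classic soft-label version of distillation, assuming we already have teacher and student models that produce logits over the same vocabulary. (DeepSeek’s own distillation, as we’ll see later, is done by fine-tuning student models on data generated by DeepSeek-R1 rather than by matching logits.)

```python
import torch
import torch.nn.functional as F

# Classic knowledge distillation: the student is trained to match the teacher's
# softened probability distribution via a KL-divergence loss.

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both distributions with a temperature, then minimise the KL divergence
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# toy usage with random logits standing in for real model outputs
teacher_logits = torch.randn(4, 100)                     # batch of 4, vocab of 100
student_logits = torch.randn(4, 100, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```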
Now you are ready to jump into the detailed discussion of the paper (section by section).
INTRODUCTION
Large Language Models (LLMs) have been improving rapidly, making them closer to Artificial General Intelligence (AGI) — the kind of AI that can think and reason like humans.
One of the biggest improvements in recent years is post-training — a step done after the initial model training. This helps LLMs:
Think better (improving reasoning skills).
Align with human values (reducing harmful outputs).
Personalize responses based on user preferences.
Do all this without using as much computing power as training from scratch.
A breakthrough came with OpenAI’s o1 models, which extended the reasoning process at inference time (when the model is generating responses). This means the model takes more time to think before answering, which significantly improves its performance on tasks like maths, coding, and scientific reasoning.
However, scaling this reasoning ability effectively during real-time use (test-time scaling) is still an open challenge.
Researchers have tried different methods to enhance reasoning, including:
Reward models (evaluating how good a response is).
Reinforcement learning (RL) (teaching the model through trial and error).
Search algorithms (Monte Carlo Tree Search, Beam Search, etc.).
So far, none of these methods have matched OpenAI’s o1 models in reasoning.
What This Paper Introduces
The paper explores a new way to improve reasoning using pure reinforcement learning (RL) — meaning no supervised data (human-labeled examples). Instead, the model learns by itself through an RL framework called GRPO (we will discuss this in some depth).
Using DeepSeek-V3-Base as the foundation, they trained a model called DeepSeek-R1-Zero. Over thousands of RL steps, the model:
Developed powerful reasoning skills.
Improved its AIME 2024 benchmark score from 15.6% → 71.0% (and even 86.7% with majority voting; see the small sketch after this list).
Matched the reasoning ability of OpenAI-o1–0912.
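For anyone unfamiliar with majority voting (also called consensus or self-consistency), here is a tiny illustration with made-up answers: sample several responses to the same question and pick the most frequent final answer.

```python
from collections import Counter

# Majority voting: sample several answers for the same question and keep the most
# common final answer. The answers below are invented, purely for illustration.

sampled_answers = ["42", "42", "17", "42", "36", "42", "17", "42"]
final_answer, votes = Counter(sampled_answers).most_common(1)[0]
print(final_answer, votes)  # "42" wins with 5 votes
```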
However, DeepSeek-R1-Zero had some problems:
Poor readability.
Language mixing (it struggled to keep responses in a single language).
To fix these issues, they introduced DeepSeek-R1, which combines:
Cold-start fine-tuning (training with a small amount of labeled data).
Reinforcement learning focused on reasoning.
Supervised fine-tuning (SFT) using high-quality human-labeled data.
After these steps, DeepSeek-R1 matched OpenAI-o1–1217 in reasoning.
Final Contribution: Model Distillation
They also distilled DeepSeek-R1 into smaller models (like Qwen2.5–32B), proving that:
Larger models learn better reasoning patterns.
Smaller models can inherit this knowledge without needing complex RL training.
Their 14B distilled model even outperformed the best open-source models, setting new benchmarks in reasoning for dense models.
Hence,
DeepSeek released 2 main models, DeepSeek-R1 and DeepSeek-R1-Zero
They also released some distilled versions of DeepSeek-R1, mainly for deployment purposes
The major discovery is using Reinforcement Learning directly for improving reasoning.
APPROACH
Before you jump ahead, try going through this post to understand a crucial reinforcement learning algorithm used in DeepSeek’s training, i.e. the GRPO (Group Relative Policy Optimization) algorithm.
The approach used to train DeepSeek involves a novel reinforcement learning (RL) framework that significantly enhances reasoning capabilities without relying heavily on supervised fine-tuning (SFT). The training process is divided into two main variants: DeepSeek-R1-Zero and DeepSeek-R1, followed by distillation into smaller models.
1. DeepSeek-R1-Zero: Pure Reinforcement Learning
- Objective: Train the base model using pure RL without any supervised fine-tuning (SFT) data.
- Algorithm: Uses Group Relative Policy Optimization (GRPO). For each question, GRPO samples a group of outputs, computes rewards, and optimizes the policy using a clipped objective with a KL divergence constraint to ensure stable updates.
Note: I’ve explained GRPO in some detail in the post linked above; you can check it out.
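To make the GRPO bullet a bit more concrete, here is a minimal, conceptual sketch of the loss: the rewards of a sampled group are standardised into advantages (so no separate critic network is needed), combined with a PPO-style clipped ratio and a KL penalty towards a frozen reference model. This is an illustration of the idea, not the paper’s actual implementation.

```python
import torch

# Conceptual GRPO loss: `rewards` are the scores of a group of sampled outputs for one
# question; `logprobs` / `old_logprobs` / `ref_logprobs` are per-output log-probabilities
# from the current, old, and reference (frozen) policies.

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, clip_eps=0.2, kl_coef=0.04):
    # group-relative advantage: standardise rewards within the sampled group
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped policy ratio
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)

    # KL penalty that keeps the policy close to the reference model
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1

    return -(policy_term - kl_coef * kl).mean()

# toy usage: a group of 4 sampled outputs for one question
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logprobs = torch.tensor([-1.2, -0.9, -1.5, -1.1], requires_grad=True)
old_logprobs = logprobs.detach().clone()
ref_logprobs = torch.tensor([-1.3, -1.0, -1.4, -1.2])
loss = grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards)
loss.backward()
```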
Reward System:
- Accuracy Rewards: Rule-based rewards for correct answers (e.g., math problems with deterministic results).
- Format Rewards: Ensures the model structures its reasoning process within <think> and </think> tags.
- Self-Evolution: The model autonomously improves its reasoning capabilities over time, demonstrating behaviours like reflection and alternative problem-solving strategies without explicit programming.
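Here is a minimal, hypothetical version of such rule-based rewards: a format check for the <think>...</think> reasoning block and an <answer>...</answer> block (the paper’s prompt template asks for both), plus an exact-match accuracy check for problems with deterministic answers. The paper does not release its reward code, so treat this purely as an illustration.

```python
import re

# Hypothetical rule-based rewards: format reward checks the tag structure,
# accuracy reward compares the extracted final answer with a known ground truth.

def format_reward(response: str) -> float:
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    match = re.search(r"<answer>(.+?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

response = "<think>2+2 means adding two and two.</think><answer>4</answer>"
print(format_reward(response), accuracy_reward(response, "4"))  # 1.0 1.0
```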
2. DeepSeek-R1: Reinforcement Learning with Cold Start
- Objective: Enhance reasoning performance and readability by incorporating a small amount of high-quality cold-start data.
Cold-start data refers to a small amount of high-quality, supervised data used to initialize or “kickstart” the training of a machine learning model, particularly in scenarios where the model is being trained from scratch or transitioning to a new task.
It acts as a “seed” to initialize the model, enabling it to start with a basic understanding of the task and ensuring a smoother and more effective reinforcement learning process.
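To make “cold-start data” concrete, here is a hypothetical example of what a single record might look like: a prompt paired with a long, readable chain of thought and a clean summary. The actual format and contents of DeepSeek’s cold-start data are not public; this is only illustrative.

```python
# A hypothetical cold-start record: readable reasoning plus a clean summary.
cold_start_example = {
    "prompt": "What is the sum of the first 10 positive integers?",
    "chain_of_thought": (
        "We need 1 + 2 + ... + 10. Using the formula n(n+1)/2 with n = 10 "
        "gives 10 * 11 / 2 = 55. Double-checking by pairing terms "
        "(1+10, 2+9, ...) gives 5 pairs of 11, which is also 55."
    ),
    "summary": "The sum of the first 10 positive integers is 55.",
}
```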
- Cold Start: Fine-tune the base model with thousands of long Chain-of-Thought (CoT) examples to improve readability and reasoning quality.
- RL Training: Apply GRPO to the fine-tuned model, focusing on reasoning-intensive tasks (e.g., math, coding, logic). A language consistency reward is introduced to reduce language mixing and improve readability.
- Rejection Sampling and SFT: After RL converges, collect high-quality reasoning and non-reasoning data (e.g., writing, role-playing) through rejection sampling and fine-tune the model for general-purpose tasks using this high-quality data.
Rejection sampling is a technique used to generate high-quality data by filtering out low-quality or incorrect outputs from a model. Here’s how it works:
Process:
1. For a given input (e.g., a reasoning question), the model generates multiple responses.
2. Each response is evaluated using a reward function or rule-based criteria (e.g., correctness, readability, or alignment with human preferences).
3. Only the best responses (e.g., those with the highest rewards or meeting specific criteria) are retained, while the rest are rejected.
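Here is a minimal sketch of that rejection-sampling loop, assuming hypothetical `generate` and `score` functions standing in for the model and the reward checks:

```python
import random

def rejection_sample(prompt, generate, score, num_samples=16, keep_top=2, min_score=0.5):
    """Generate many candidate responses, score them, keep only the best ones."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    scored = sorted(((score(prompt, r), r) for r in candidates), reverse=True)
    # keep the top responses that also clear a minimum quality threshold
    return [response for s, response in scored[:keep_top] if s >= min_score]

# toy usage: random "responses" scored by a made-up length-based score;
# a real score would check correctness, readability, etc.
kept = rejection_sample(
    "some reasoning question",
    generate=lambda p: "step " * random.randint(1, 6) + "answer",
    score=lambda p, r: min(len(r) / 30, 1.0),
)
print(kept)  # the kept (prompt, response) pairs would then be used as SFT data
```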
- 2nd RL Stage: Perform a second RL stage to align the model with human preferences, improving helpfulness and harmlessness while maintaining strong reasoning capabilities.
Why did a 2nd round of RL training happen for DeepSeek-R1?
The second round of Reinforcement Learning (RL) training for DeepSeek-R1 was conducted to further refine the model’s performance and align it with human preferences. Here’s why it was necessary:
After the initial RL training, the model was already strong in reasoning tasks (e.g., math, coding, logic). The first RL stage primarily used rule-based rewards (e.g., accuracy in math problems).
However, it needed to improve in general-purpose tasks like writing, role-playing, and factual question-answering.
The second RL stage focused on broadening the model’s capabilities beyond reasoning, making it more versatile and useful in diverse scenarios. In the second stage, reward models were introduced to capture human preferences in complex and nuanced scenarios (e.g., helpfulness, harmlessness, coherence).
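Conceptually, this stage can be pictured as routing each prompt to a different reward signal: rule-based rewards for reasoning prompts, and a learned preference reward model for general prompts where “correctness” is fuzzy. The sketch below uses hypothetical stand-in functions and is only meant to illustrate that idea.

```python
# Hypothetical routing between reward signals in the second RL stage.
# `rule_based_reward` and `reward_model` are stand-ins, not released components.

def combined_reward(prompt, response, prompt_type, rule_based_reward, reward_model):
    if prompt_type == "reasoning":
        # math/code/logic: deterministic, rule-based checks still apply
        return rule_based_reward(prompt, response)
    # writing, role-play, factual QA: a preference model scores helpfulness/harmlessness
    return reward_model(prompt, response)
```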
3. Distillation: Transferring Reasoning to Smaller Models
Objective: Distill the reasoning capabilities of DeepSeek-R1 into smaller, more efficient models.
Method: Fine-tune open-source models (e.g., Qwen, Llama) using the curated dataset from DeepSeek-R1.
Results: Significant improvements in the reasoning abilities of smaller models, demonstrating the effectiveness of distillation.
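Conceptually, the distillation recipe here is “generate with the teacher, filter, then SFT the student”: the paper fine-tunes the smaller models on roughly 800k samples curated with DeepSeek-R1, with no RL applied to the students. A sketch with hypothetical helper functions passed in as arguments:

```python
# Data distillation sketch: the teacher produces reasoning traces, low-quality ones are
# filtered out, and the student is fine-tuned on the curated data with plain SFT.
# All helpers are hypothetical stand-ins supplied by the caller.

def distill(prompts, teacher_generate, passes_filter, supervised_finetune, student):
    dataset = []
    for prompt in prompts:
        response = teacher_generate(prompt)          # reasoning trace + final answer
        if passes_filter(prompt, response):          # rejection-sampling style filter
            dataset.append({"prompt": prompt, "response": response})
    return supervised_finetune(student, dataset)     # standard SFT, as sketched earlier
```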
Metrics and Performance
The models, especially DeepSeek-R1, beat several SOTA LLMs on various benchmarks. Since we’ve already touched on the headline numbers, I will skip this part.
Discussions
This section of the paper talks about some unsuccessful attempts and why distillation is the way to go for scalable solutions:
- Distillation is superior to RL for smaller models: Distilling knowledge from larger models is more effective and economical than training smaller models with large-scale RL.
- RL is resource-intensive but may be necessary for breakthroughs: While distillation is effective, advancing beyond current limits may still require more powerful base models and large-scale RL.
- PRM and MCTS face significant challenges: Both PRM (Process Reward Model) and MCTS (Monte Carlo Tree Search) showed potential but were ultimately limited by scalability, computational overhead, and the complexity of token generation.
CONCLUSION
Enhanced Reasoning via RL:
- DeepSeek-R1-Zero uses pure RL (GRPO) without cold-start data, achieving strong performance.
- DeepSeek-R1 leverages cold-start data + RL fine-tuning, matching OpenAI-o1–1217’s performance.
Distillation Success:
- DeepSeek-R1’s reasoning capabilities were distilled into smaller models (e.g., Qwen-1.5B), outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks.
Future Research Directions:
- General Capability: Improve performance in function calling, multi-turn interactions, and complex tasks.
- Language Mixing: Address language mixing issues for non-English/Chinese queries.
- Prompt Engineering: Optimize for zero-shot prompts to enhance performance.
- Software Engineering: Apply rejection sampling or async evaluations to improve efficiency in software-related tasks.
Hope this summary is helpful