DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
3 Major Areas of Work
- Trains DeepSeek-R1-Zero using pure Reinforcement Learning applied directly to the base model (no SFT, no supervised data). The base model in this case is DeepSeek-V3-Base.
- Trains DeepSeek-R1 by applying RL on top of cold-start data (the base model is first fine-tuned on thousands of long chain-of-thought (CoT) examples).
- Distills the reasoning ability into small models like Qwen/Llama (using ~800k training samples of reasoning trajectories curated with DeepSeek-R1).
DeepSeek's Approach (Pure RL)
- RL has been applied to LLM reasoning tasks before, but those approaches relied heavily on supervised data.
- DeepSeek's approach aims to develop reasoning abilities without any supervised data.
- Uses an RL algorithm called GRPO (Group Relative Policy Optimization).
- Focuses on self-evolution through a pure reinforcement learning process.
- DeepSeek-R1-Zero uses purely rule-based rewards (answer accuracy + format correctness), which is an interesting design choice (see the sketch after this list).
- DeepSeek-R1 also uses RL, but with cold-start data, in order to address the limitations of DeepSeek-R1-Zero (e.g., language mixing and poor readability of the output).
- The DeepSeek-R1 checkpoint is used to collect ~600k reasoning samples + ~200k non-reasoning samples (~800k total).
- 800K samples collected using DeepSeek-R1 are used to distill small models (Qwen/Llama).
- Distillation only uses SFT (no RL) but the idea of using RL for distillation is kept open for future work.
- They also apply the R1-Zero RL recipe directly to the smaller Qwen-32B, but its performance falls short of Qwen-32B distilled from the DeepSeek-R1 teacher, suggesting that large-scale RL on smaller models is less efficient than distillation and may not reach the same performance.
- Future work involves building a more powerful base model coupled with large-scale Reinforcement Learning.
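As a rough sketch of what such a rule-based reward could look like, here is a minimal Python illustration. The <think>/<answer> template roughly follows the paper's prompt format, but the regexes, exact-match comparison, and 0/1 scoring are my assumptions, not the paper's actual implementation:

```python
import re

# Expected output template: reasoning in <think>...</think>, answer in <answer>...</answer>
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>/<answer> template, else 0.0
    (the 0/1 scoring is an assumption)."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response: str, gold_answer: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the gold answer
    after whitespace normalization, else 0.0 (exact match is a simplification)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Total reward = accuracy check + format check (equal weighting is an assumption)."""
    return accuracy_reward(response, gold_answer) + format_reward(response)

# A correct, well-formatted response earns the maximum reward of 2.0
demo = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(rule_based_reward(demo, "4"))  # -> 2.0
```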
RL and GRPO (Group Relative Policy Optimization)
In RL, the key lies in defining the reward function. From the reward function an objective is built and maximized: its policy gradient is computed, and the model's weights are updated by gradient ascent on this objective (equivalently, gradient descent on its negative).
- Group-based baseline: the DeepSeek team optimizes the policy efficiently via group-based sampling. (A baseline is the reference point against which rewards are compared.)
- Unlike traditional actor-critic policy-gradient algorithms, GRPO does not use a critic model, which makes it computationally efficient. A critic tells how good an outcome is; it is usually a separate neural network that estimates the value/advantage function, which makes training computationally expensive.
- Instead of using a critic model to estimate the advantage function, GRPO uses group-based sampling to compare the rewards of multiple responses generated by the model.
- GRPO is suited for large-scale reinforcement learning tasks (like training LLMs) because it eliminates the overhead of training a critic model.
- The policy is updated using a clipped objective for stability, and a KL-divergence penalty ensures the new policy does not diverge significantly from a reference policy. (Stable training.)
Working Mechanism of GRPO
Step 1: Sampling Group of Outputs
- For a given input question $q$, the algorithm samples a group of outputs $\{o_1, o_2, \dots, o_G\}$ from the current policy $\pi_{\theta_{old}}$ (the model's behavior before the update).
- Each output $o_i$ is a possible response generated by the model for the question $q$.
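A minimal sketch of this sampling step using Hugging Face transformers; the model name, group size G, temperature, and token budget are placeholders of my choosing, not the paper's settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model, not the one used in the paper
G = 8  # group size: number of outputs sampled per question (value is an assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

question = "What is 7 * 6? Put your reasoning in <think></think> and the answer in <answer></answer>."
inputs = tokenizer(question, return_tensors="pt")

# Sample a group of outputs {o_1, ..., o_G} from the current policy pi_theta_old
generated = model.generate(
    **inputs,
    do_sample=True,              # stochastic sampling so the G outputs differ
    temperature=1.0,
    max_new_tokens=256,
    num_return_sequences=G,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens; each decoded string is one candidate o_i
outputs = tokenizer.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```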
Step 2: Calculating Rewards
- Each output $o_i$ is evaluated with a reward function to obtain a reward $r_i$. The reward function measures how good the output is, based on criteria like correctness (e.g., for math problems) or adherence to a specific format (e.g., putting reasoning steps between <think> and </think> tags).
The reward function in DeepSeek-R1-Zero is strikingly simple: it checks just two things, answer accuracy and format/structure correctness. From these two signals alone, the model develops emergent reasoning ability.
- The rewards for the group of outputs are $\{r_1, r_2, \dots, r_G\}$.
The reward signals are used to update the model's weights by providing feedback on the quality of its responses.
Step 3: Calculating Advantages
- The advantage for each output is calculated relative to the group's average reward. This measures how much better or worse the output is compared to the average performance of the group, normalized by the group's standard deviation:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \dots, r_G\})}{\mathrm{std}(\{r_1, r_2, \dots, r_G\})}$$
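A tiny NumPy sketch of this group-relative advantage; the reward values are made up for illustration, and the mean/std normalization reflects my reading of the formula above:

```python
import numpy as np

# Example rewards for a group of G = 8 sampled outputs (values made up for illustration;
# each could come from the accuracy + format checks sketched earlier)
rewards = np.array([2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0])

# Group-relative advantage: how far each reward sits above or below the group mean,
# normalized by the group's standard deviation (1e-8 guards against a zero std)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # positive for above-average outputs, negative for below-average ones
```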
Step 4: Policy Optimization
The policy is updated to maximize the following objective function:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{\,q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) \right) \right]$$

Where the KL divergence is defined as:

$$\mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1$$
- $\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}$ is the importance weight, which measures how much the new policy $\pi_\theta$ differs from the old policy $\pi_{\theta_{old}}$.
- The clip function ensures the importance weight does not deviate too much from 1, preventing large updates that could destabilize training.
- $\mathbb{D}_{KL}(\pi_\theta \,\|\, \pi_{ref})$ is the KL divergence between the new policy and a reference policy (often the initial policy). This acts as a regularizer to prevent the new policy from deviating too much from the reference policy.
- $\varepsilon$ and $\beta$ are hyper-parameters that control the clipping range and the strength of the KL regularization, respectively.
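To make Step 4 concrete, here is a minimal PyTorch-style sketch of the clipped, KL-regularized objective as written above. It is a sequence-level simplification under my own assumptions (real implementations typically operate per token, and the eps/beta values here are arbitrary), not DeepSeek's actual training code:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Negative GRPO objective for one group of G outputs.

    logp_new:   log pi_theta(o_i|q)      under the policy being updated (requires grad)
    logp_old:   log pi_theta_old(o_i|q)  under the policy that sampled the outputs
    logp_ref:   log pi_ref(o_i|q)        under the frozen reference policy
    advantages: group-relative advantages A_i
    eps, beta:  clipping range and KL weight (values here are assumptions)
    """
    # Importance weight pi_theta / pi_theta_old
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the more pessimistic of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1  (always >= 0)
    log_ref_over_new = logp_ref - logp_new
    kl = torch.exp(log_ref_over_new) - log_ref_over_new - 1

    # Maximize the objective, i.e. minimize its negative
    return -(surrogate - beta * kl).mean()

# Toy usage with made-up log-probabilities for a group of G = 4 outputs
logp_old = torch.tensor([-12.0, -15.0, -11.0, -14.0])
logp_ref = torch.tensor([-12.5, -14.5, -11.5, -13.5])
logp_new = (logp_old + 0.1).clone().requires_grad_(True)
adv = torch.tensor([1.2, -0.8, 0.9, -1.3])

loss = grpo_loss(logp_new, logp_old, logp_ref, adv)
loss.backward()  # gradients flow into logp_new (and, in practice, the policy weights)
```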
An Aha Moment
Researchers observe that GRPO enables the model to develop complex reasoning behaviors, such as self-verification and reflection, without explicit supervision; all of these abilities emerge through pure RL. They name it "an aha moment", as the model itself has an aha moment in its reasoning process.
Here's the actual example from the paper (Table 3) — an intermediate version of DeepSeek-R1-Zero solving a math problem:
Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to
Response: <think> To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both... . Rearrange to isolate the inner square root term: ... Wait, wait. Wait. That's an aha moment I can flag here. Let's reevaluate this step-by-step to identify if the correct sum can be... We started with the equation: $\sqrt{a - \sqrt{a + x}} = x$. First, let's square both sides: $a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$. Next, I could square both sides again, treating the equation...
Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.
Unsuccessful Attempts
The DeepSeek team also tried other approaches but faced setbacks:
- Process Reward Model (PRM): it is hard to explicitly define fine-grained reasoning steps, hard to judge whether intermediate steps are correct, and the approach is prone to reward hacking.
- Monte Carlo Tree Search (as in AlphaGo and AlphaZero): the approach has some potential, but large-scale training would require a fine-grained critic/value model, which is inherently difficult to train.
Limitations of DeepSeek-R1 and Future Work
- Not better than DeepSeek-V3 in function calling, multi-turn conversation, JSON output, and role playing.
- R1 is optimized only for Chinese and English, and it often mixes languages when queried in other languages.
- Sensitive to prompts: few-shot prompting degrades performance. It is recommended to directly describe the problem and specify the output format in a zero-shot setting (see the example prompt after this list).
- Not better than DeepSeek-V3 in software engineering tasks. They plan to include rejection sampling on software engineering data to improve this ability in the future.
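As a purely illustrative example of the recommended zero-shot prompting style (the wording is mine, not from the paper):

```python
# Recommended: describe the problem directly and specify the output format, zero-shot.
prompt = (
    "Solve the following problem. Show your reasoning, then give the final answer "
    "on its own line as 'Answer: <value>'.\n\n"
    "Problem: A train travels 180 km in 2.5 hours. What is its average speed in km/h?"
)
# Avoid prepending worked examples (few-shot prompting), which the paper reports
# degrades DeepSeek-R1's performance.
```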