DeepSeek-R1
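A family of reasoning-focused large language models from DeepSeek-AI. The pure-RL variant, DeepSeek-R1-Zero, is trained from a base model with reinforcement learning alone (no supervised fine-tuning on reasoning traces) and learns chain-of-thought reasoning solely through rule-based, outcome-level reward signals. The released DeepSeek-R1 adds a small cold-start SFT set and multi-stage training on top of this recipe, mainly to improve readability of the reasoning.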
Key Breakthrough
Pure RL without demonstrations: Unlike prior work that relied on SFT over human-written reasoning traces, DeepSeek-R1-Zero shows that reasoning behaviors can emerge from RL training alone.
Emergent Behaviors
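When rewarded only on outcomes (final-answer correctness plus a simple output-format check), with no supervision of intermediate steps, the model develops:
- Extended thinking: The model produces progressively longer reasoning chains as training proceeds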
- Self-verification: “Let me check this calculation…”
- Backtracking: Recognizing and correcting errors mid-reasoning
- Multi-step decomposition: Breaking complex problems into parts
Why This Works
The reward signal creates selection pressure: sampled trajectories that end in correct answers are reinforced, and those that end in wrong answers are suppressed. Patterns such as checking work and reasoning step by step correlate with higher accuracy, so the policy shifts toward them without ever being shown an example of them.
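The selection pressure can be made concrete with the group-relative advantage that GRPO uses: several answers are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The following is a minimal sketch, not the authors' implementation; the function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Hypothetical helper: `eps` avoids division by zero when every sample
    in the group receives the same reward (then all advantages are 0).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; only the second and fourth are correct.
rewards = [0.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct samples get positive advantages, incorrect ones negative.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f} advantage={a:+.3f}")
```

Samples with a positive advantage have their token probabilities pushed up and the rest pushed down, so whatever reasoning habits the correct samples share become more likely in later rollouts.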
Training Details
| Component | Choice |
|---|---|
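| Base model | DeepSeek-V3-Base (pre-trained, no SFT) |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Reward | Rule-based: accuracy (correct final answer) plus an output-format check |
| Demonstrations | None for DeepSeek-R1-Zero; DeepSeek-R1 adds a small cold-start SFT set |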
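The paper describes its reward as rule-based, combining an accuracy check on the final answer with a format check that the reasoning is wrapped in thinking tags. The sketch below illustrates that idea under stated assumptions: the `<think>`/`<answer>` layout follows the paper's prompt template, but the exact-string comparison, the `format_weight` value, and the function name are hypothetical choices, not details from the source.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, then the final
# answer inside <answer>...</answer>, with nothing else around them.
PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def rule_based_reward(completion: str, reference: str,
                      format_weight: float = 0.1) -> float:
    """Hypothetical reward: accuracy (0 or 1) plus a small format bonus."""
    match = PATTERN.match(completion)
    if match is None:
        return 0.0  # malformed output earns nothing in this sketch
    answer = match.group(2).strip()
    # Exact string match is an assumption; real checkers would normalize
    # math expressions or run test cases for code.
    accuracy = 1.0 if answer == reference.strip() else 0.0
    return accuracy + format_weight


good = rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4")
wrong = rule_based_reward("<think>2 + 2 = 5</think><answer>5</answer>", "4")
malformed = rule_based_reward("The answer is 4", "4")
```

Because the check is a deterministic rule rather than a learned reward model, it is cheap to evaluate at scale and leaves less room for the policy to exploit scoring errors.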
Significance
- Scalable: No need to collect expensive reasoning demonstrations
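- Generalizable: The same recipe improves results across math, code, and science reasoning benchmarks
- Foundation for agentic systems: Search-R1 applies the same outcome-reward RL recipe to models that interleave reasoning with retrieval calls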
Connection to System 2 Thinking
DeepSeek-R1's reasoning traces resemble “System 2” cognition in the dual-process sense:
- Slow, deliberate processing
- Explicit reasoning steps
- Self-monitoring and correction
This contrasts with the “System 1” style (fast, pattern-matching) typical of standard LLM inference, where the model answers directly without an extended reasoning trace.
Appears In
- IR-L13 - RL for Reasoning and Search
- DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” (2025)