SEARCH-R1

A retrieval-augmented reasoning system that uses Reinforcement Learning to train LLMs to interleave reasoning and search. The model learns when to search and what to search for through outcome-based rewards, without requiring demonstration data.

Core Innovation

Extends DeepSeek-R1’s pure RL approach to retrieval-augmented generation:

Model generates reasoning traces with search queries embedded
Search engine retrieves documents at each query
Retrieved content is inserted into the generation
Only final answer correctness provides reward
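The four steps above can be sketched as a rollout loop. This is a minimal illustration, not the authors' implementation: `rollout`, `generate`, `search`, `max_turns`, and the toy stubs are hypothetical names, and `generate` is assumed to return one model segment that ends at a search request or at a final answer.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def rollout(prompt: str,
            generate: Callable[[str], str],
            search: Callable[[str, int], List[str]],
            top_k: int = 3,
            max_turns: int = 4) -> str:
    """Interleave generation and retrieval until an answer appears
    or the turn budget (an assumed safeguard) runs out."""
    trajectory = ""
    for _ in range(max_turns):
        # Assumption: generate() returns the next segment of model text.
        segment = generate(prompt + trajectory)
        trajectory += segment
        if "</answer>" in segment:
            break  # final answer produced, stop the rollout
        match = SEARCH_RE.search(segment)
        if match is None:
            break  # neither a search request nor an answer
        docs = search(match.group(1).strip(), top_k)
        # Retrieved text is inserted by the system, not generated by the model.
        trajectory += "<information>" + "\n".join(docs) + "</information>"
    return trajectory

# Toy stand-ins for the policy model and the search engine.
def toy_generate(context: str) -> str:
    if "<information>" not in context:
        return "<think>I need the capital.<search>capital of France</search></think>"
    return "<think>The documents say Paris.</think><answer>Paris</answer>"

def toy_search(query: str, k: int) -> List[str]:
    return ["Paris is the capital of France."][:k]

traj = rollout("Question: What is the capital of France?\n", toy_generate, toy_search)
print(traj)
```

The number of searches is not fixed in advance: the loop runs until the model emits an answer, which is what makes retrieval adaptive.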

Trajectory Format

<think>
[Reasoning about the problem]
<search>query to search engine</search>
</think>
<information>
[Retrieved documents inserted by system]
</information>
<think>
[Further reasoning with retrieved info]
</think>
<answer>
[Final answer]
</answer>
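
A trajectory in this format can be parsed with a few regular expressions. The sketch below is illustrative only: `parse_trajectory` and the example text are hypothetical, and stripping `<information>` blocks before extracting queries is an assumed precaution so that retrieved text cannot be mistaken for a model query.

```python
import re
from typing import List, Optional, Tuple

def parse_trajectory(text: str) -> Tuple[List[str], Optional[str]]:
    """Return (search queries issued by the model, final answer or None)."""
    # Drop system-inserted retrieval blocks before looking for model tags.
    model_text = re.sub(r"<information>.*?</information>", "", text, flags=re.DOTALL)
    queries = [q.strip() for q in
               re.findall(r"<search>(.*?)</search>", model_text, re.DOTALL)]
    answer = re.search(r"<answer>(.*?)</answer>", model_text, re.DOTALL)
    return queries, (answer.group(1).strip() if answer else None)

example = """<think>
Who directed the film?
<search>director of Inception</search>
</think>
<information>
Inception was directed by Christopher Nolan.
</information>
<think>
The director is Christopher Nolan.
</think>
<answer>
Christopher Nolan
</answer>"""

queries, answer = parse_trajectory(example)
print(queries, answer)
```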

Training Objective

Masked RL Loss

$L(\theta) = -\sum_{t \in T_{model}} \hat{A}_{t} \log \pi_{\theta}(y_{t} \mid y_{<t}, x)$

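where $\hat{A}_{t}$ is the advantage estimate at token $t$ and $T_{model}$ is the set of model-generated token positions, which excludes tokens inside <information> tags (retrieved content).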

Why masking matters:

Retrieved content is not model-generated
Prevents memorization of corpus
Ensures gradients reflect reasoning quality, not retrieval
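
A minimal sketch of the masking, in plain Python. It assumes each tag is a single token (real tokenizers usually split tags into several subword tokens) and that the <information> tags themselves are masked along with their contents, since the system inserts them. `model_token_mask` and `masked_pg_loss` are hypothetical names, and the log-probabilities and advantages are made-up numbers.

```python
import math
from typing import List

def model_token_mask(tokens: List[str]) -> List[int]:
    """1 for model-generated tokens, 0 for the retrieved span
    (assumption: the <information> tags are masked as well)."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0 if inside else 1)
        if tok == "</information>":
            inside = False
    return mask

def masked_pg_loss(logprobs: List[float],
                   advantages: List[float],
                   mask: List[int]) -> float:
    """L = -sum over unmasked t of A_t * log pi(y_t | y_<t, x)."""
    return -sum(m * a * lp for lp, a, m in zip(logprobs, advantages, mask))

tokens = ["<think>", "find", "<search>", "q", "</search>", "</think>",
          "<information>", "doc", "</information>",
          "<answer>", "Paris", "</answer>"]
mask = model_token_mask(tokens)

# Illustrative values: every token has probability 0.5 and advantage 1.
logprobs = [math.log(0.5)] * len(tokens)
advantages = [1.0] * len(tokens)
loss = masked_pg_loss(logprobs, advantages, mask)
print(mask, loss)
```

Only the nine unmasked positions contribute to the loss here; the three positions in the retrieved span receive no gradient.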

Key Design Choices

Choice	Recommendation	Rationale
Algorithm	PPO over GRPO	Value function absorbs retrieval noise; GRPO suffers reward collapse
Starting model	Base (not instruct)	More malleable to RL
Documents per search	3	Balance coverage vs noise
Reward	Outcome only (final-answer correctness)	Sparse but clean signal
Loss masking	Mask retrieved content	Critical for generalization
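
The outcome part of the reward can be sketched as an exact-match check on the text inside the <answer> tags. The normalization below (lowercasing, dropping punctuation and articles) is a common question-answering convention and is an assumption here, as are the names `normalize` and `outcome_reward`.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def outcome_reward(trajectory: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer means no reward
    return 1.0 if normalize(match.group(1)) == normalize(gold) else 0.0

print(outcome_reward("<think>...</think><answer>The Paris.</answer>", "paris"))
print(outcome_reward("<think>still thinking</think>", "paris"))
```

The reward is assigned once per trajectory, so every reasoning and search step is credited only through the final answer.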

Comparison with Static RAG

Aspect	Static RAG	SEARCH-R1
Retrieval timing	Before generation	During reasoning
Query source	User query	Model-generated
Number of retrievals	Fixed	Adaptive
Training	SFT	RL
Multi-hop	Limited	Natural

Results

SEARCH-R1 significantly outperforms static RAG on multi-hop reasoning benchmarks (HotpotQA, 2WikiMultiHop, MuSiQue), showing the value of iterative search.

Connections

Builds on DeepSeek-R1 (pure RL for reasoning)
Improves upon Retrieval-Augmented Generation
Trained with policy-gradient RL: PPO (recommended) or GRPO
Related to ReAct and Self-RAG approaches

Appears In

IR-L13 - RL for Reasoning and Search
Jin et al., “Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning” (2025)
