SEARCH-R1

A retrieval-augmented reasoning system that uses reinforcement learning to train LLMs to interleave reasoning and search. The model learns when to search and what to search for from outcome-based rewards alone, without requiring demonstration data.

Core Innovation

Extends DeepSeek-R1’s pure RL approach to retrieval-augmented generation (a rollout sketch follows this list):

  • The model generates reasoning traces with search queries embedded
  • A search engine retrieves documents for each query
  • Retrieved content is inserted into the ongoing generation
  • Only final-answer correctness provides the reward
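
A minimal sketch of this rollout loop. `model.generate_until` and `search_engine` are assumed interfaces, not APIs from the paper; the turn limit and stop-tag handling are illustrative:

```python
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def rollout(model, search_engine, question, max_turns=4):
    """Interleave generation and retrieval until the model answers.
    `model.generate_until` and `search_engine` are assumed interfaces."""
    trajectory = question
    for _ in range(max_turns):
        # Generate until the model either issues a search call or commits to an answer.
        segment = model.generate_until(trajectory, stop=["</search>", "</answer>"])
        trajectory += segment
        if segment.rstrip().endswith("</answer>"):
            break  # final answer emitted; the outcome reward is computed on this
        match = SEARCH_RE.search(segment)
        if match is None:
            break  # malformed output; the format reward penalizes this
        docs = search_engine(match.group(1).strip())  # e.g., top-3 documents
        # The system inserts retrieved text; these tokens are masked in the loss.
        trajectory += f"\n<information>\n{docs}\n</information>\n"
    return trajectory
```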

Trajectory Format

<think>
[Reasoning about the problem]
</think>
<search>query to search engine</search>
<information>
[Retrieved documents inserted by system]
</information>
<think>
[Further reasoning with retrieved info]
</think>
<answer>
[Final answer]
</answer>

Training Objective

Masked RL Loss

$$
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_t I(y_t)} \sum_t I(y_t)\, \min\Big( r_t(\theta)\, A_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \Big) \right]
$$

where $r_t(\theta) = \pi_\theta(y_t \mid x, y_{<t}) / \pi_{\theta_\mathrm{old}}(y_t \mid x, y_{<t})$ is the token-level importance ratio, $A_t$ the advantage, and the mask $I(y_t)$ excludes tokens inside <information> tags (retrieved content).

Why masking matters (a masking sketch follows this list):

  • Retrieved content is not model-generated
  • Prevents memorization of the retrieval corpus
  • Ensures gradients reflect reasoning quality, not the retrieved text
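
A minimal sketch of how the mask might be built and applied, assuming `<information>` and `</information>` each tokenize to a single special token ID; the helper names and the clipped objective are illustrative, not the paper's exact implementation:

```python
import torch

def retrieval_mask(token_ids, info_start_id, info_end_id):
    """Return 1 for model-generated tokens, 0 for retrieved content.
    Assumes <information> and </information> each map to one special token ID."""
    mask = torch.ones_like(token_ids, dtype=torch.float)
    inside = False
    for i, tok in enumerate(token_ids.tolist()):
        if tok == info_start_id:
            inside = True
        if inside:
            mask[i] = 0.0  # tag tokens and retrieved text get no gradient
        if tok == info_end_id:
            inside = False
    return mask

def masked_ppo_loss(logprobs, old_logprobs, advantages, mask, eps=0.2):
    """Clipped PPO objective averaged over unmasked (model-generated) tokens only."""
    ratio = (logprobs - old_logprobs).exp()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```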

Key Design Choices

| Choice | Recommendation | Rationale |
| --- | --- | --- |
| Algorithm | PPO over GRPO | Value function absorbs retrieval noise; GRPO suffers reward collapse |
| Starting model | Base (not instruct) | More malleable to RL |
| Documents per search | 3 | Balances coverage vs. noise |
| Reward | Outcome + format (sketched below) | Sparse but clean signal |
| Loss masking | Mask retrieved content | Critical for generalization |
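
A sketch of an outcome-plus-format reward under these choices: a malformed trajectory earns nothing, and a well-formed one is scored by exact match against the gold answer. The regex and the specific values are assumptions, not the paper's exact scheme:

```python
import re

def outcome_format_reward(trajectory: str, gold_answer: str) -> float:
    """Outcome + format reward: 0 for a malformed trajectory,
    1 for an exact match, 0 otherwise. Values are illustrative."""
    match = re.search(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    if match is None:
        return 0.0  # format violation: no well-formed <answer> block
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```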

Comparison with Static RAG

| Aspect | Static RAG | SEARCH-R1 |
| --- | --- | --- |
| Retrieval timing | Before generation | During reasoning |
| Query source | User query | Model-generated |
| Number of retrievals | Fixed | Adaptive |
| Training | SFT | RL |
| Multi-hop | Limited | Natural |

Results

SEARCH-R1 significantly outperforms static RAG on multi-hop reasoning benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), demonstrating the value of iterative, model-driven search.
