IR-L07 - Learned Sparse Retrieval

Overview

Learned Sparse Retrieval (LSR) represents a hybrid approach in Information Retrieval that bridges the gap between traditional sparse retrieval (like BM25) and modern neural dense retrieval (like DPR). LSR projects queries and documents into high-dimensional sparse vectors over a fixed vocabulary (usually the one from a Pretrained Language Model like BERT).

Key advantages:

Efficiency: Compatible with existing inverted index infrastructures.
Effectiveness: Utilizes neural architectures (Transformers) to learn importance weights and expand text with semantically relevant terms.
Interpretability: Unlike dense vectors, LSR dimensions often map directly to vocabulary tokens.

1. Motivation: Why Sparse, why Learned, why Now?

The Traditional vs. Neural Gap

Traditional Sparse (BM25): Fast and scalable thanks to inverted indexes, but suffers from the vocabulary mismatch problem: it cannot match relevant documents that express the query's concepts with different words (e.g. synonyms).
Dense Retrieval (Neural): Highly effective because it captures semantics in a continuous latent space, but requires approximate nearest neighbor (ANN) indexes and significant computational resources.

Bridging the Gap

LSR aims to achieve neural-level effectiveness while maintaining the efficiency of sparse indices.

Learned Sparse Retrieval (LSR)

LSR uses a query encoder $E_{Q}$ and a document encoder $E_{D}$ to project queries $q$ and documents $d$ into sparse vectors over a fixed vocabulary $V$ .

The relevance score is computed as a sparse dot product: $\text{score}(q, d) = \sum_{j = 1}^{|V|} E_{Q}(q)_{j} \cdot E_{D}(d)_{j} = E_{Q}(q)^{T} E_{D}(d)$

where:

$E_{Q} (q)_{j}$ — learned weight of the $j$ -th term in the query.

$E_{D} (d)_{j}$ — learned weight of the $j$ -th term in the document.
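
A minimal sketch of the sparse dot product, assuming each vector is stored as a `{term: weight}` dictionary that holds only its non-zero entries. The function name and the example weights are hypothetical and only stand in for the outputs of $E_{Q}$ and $E_{D}$.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller vector; terms absent from the other contribute 0.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


# Hypothetical weights, as if produced by E_Q and E_D.
query = {"cheap": 1.2, "flights": 2.0, "airfare": 0.7}
doc = {"airfare": 1.5, "flights": 1.0, "booking": 0.9}

# Only the shared terms "flights" and "airfare" contribute to the score.
score = sparse_dot(query, doc)
print(round(score, 2))
```

Only terms with non-zero weight on both sides contribute, which is what lets an inverted index skip most of the vocabulary.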

2. A First Attempt: SNRM

Standalone Neural Ranking Model (SNRM) (2018) was the first neural retrieval model to learn high-dimensional sparse representations for inverted indexing.

Non-grounded: Dimensions were latent (did not map to actual words).
Architecture: Used a sliding window over token embeddings followed by average pooling.
Sparsity: Enforced through $L_{1}$ regularization.
Legacy: Proved that neural models could be used for first-stage retrieval without a lexical ranker.

3. Grounding and Expansion

Modern LSR models are grounded, meaning each activated dimension corresponds to a real token in the vocabulary.

3.1 Document Expansion (doc2query)

To solve the vocabulary mismatch problem, documents are expanded with terms they don’t explicitly contain but are semantically relevant.

doc2query / docTTTTTquery: Uses a sequence-to-sequence model (like T5) to predict potential queries a document could answer. These predicted queries are appended to the document before standard indexing.
Effect: Increases recall by providing “more entry points” for queries.
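
The expansion step itself is simple once the predicted queries exist. The sketch below assumes the queries have already been generated by a sequence-to-sequence model. The example strings and the `expand_document` helper are hypothetical, and no model is called here.

```python
def expand_document(text: str, predicted_queries: list[str]) -> str:
    """Append predicted queries to the document text before indexing."""
    return " ".join([text, *predicted_queries])


doc = "The Eiffel Tower is 330 metres tall."
# Hypothetical model output; a real system would sample these from a seq2seq model.
predicted = ["how high is the eiffel tower", "eiffel tower height"]

expanded = expand_document(doc, predicted)

# Terms that only exist in the index because of the expansion.
new_terms = set(expanded.lower().split()) - set(doc.lower().split())
print(sorted(new_terms))
```

In this toy case a query containing "height" or "high" can now match the document, although neither word appears in the original text.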

3.2 Term Importance Prediction

Standard TF-IDF assumes importance based on frequency. LSR learns to predict “Term Impact” directly.

DeepCT: Uses BERT to predict term importance (contextual term weighting). It maps BERT outputs to a value that replaces the traditional “Term Frequency” (TF) in the BM25 formula.
DeepImpact: Learns impact scores directly using a neural model, storing them in the inverted index.
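
To make the DeepCT idea concrete, here is a rough sketch of a BM25 term score in which the raw term frequency is replaced by a learned weight. It assumes the model predicts a weight in roughly [0, 1] that is scaled and rounded to an integer TF-like value. The scale of 100, the `k1` and `b` values, and all function names are illustrative assumptions rather than a reference implementation.

```python
import math


def bm25_term_score(tf: float, df: int, n_docs: int, doc_len: float,
                    avg_doc_len: float, k1: float = 0.9, b: float = 0.4) -> float:
    """BM25 contribution of one term; tf may be a raw count or a learned value."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


def learned_tf(predicted_weight: float, scale: int = 100) -> int:
    """Map a predicted importance to an integer TF-like value (assumed scaling)."""
    return round(max(0.0, predicted_weight) * scale)


stats = dict(df=50, n_docs=10_000, doc_len=120, avg_doc_len=100)

# The term occurs once, but the (hypothetical) model judges it central to the passage.
raw = bm25_term_score(tf=1, **stats)
weighted = bm25_term_score(tf=learned_tf(0.42), **stats)
print(raw < weighted)
```

A term predicted as unimportant gets a learned TF of 0 and drops out of the score entirely, even if it occurs in the text.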

3.3 uniCOIL

uniCOIL combines term weighting and expansion. It uses a BERT-based encoder to weight tokens and relies on doc2query for expansion. It simplifies the scoring to a sum of weighted overlaps: $\text{score}(q, d) = \sum_{t \in q \cap d} w_{q, t} \cdot w_{d, t}$
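
The sum over overlapping terms maps directly onto an inverted index with stored term weights. The following is a minimal term-at-a-time sketch, assuming document weights are already available as dictionaries. The index layout and function names are hypothetical and much simpler than what a real engine such as Lucene does.

```python
from collections import defaultdict


def build_index(doc_weights: dict[str, dict[str, float]]) -> dict[str, list[tuple[str, float]]]:
    """Build posting lists: term -> [(doc_id, learned weight), ...]."""
    index = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, w in weights.items():
            index[term].append((doc_id, w))
    return dict(index)


def search(index, query_weights: dict[str, float], k: int = 10):
    """Accumulate w_q * w_d over the posting lists of the query terms."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc_id, wd in index.get(term, []):
            scores[doc_id] += wq * wd
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical learned weights for two documents.
docs = {
    "d1": {"sparse": 1.0, "retrieval": 2.0},
    "d2": {"dense": 1.5, "retrieval": 1.0},
}
index = build_index(docs)
results = search(index, {"retrieval": 1.0, "sparse": 0.5})
print(results)
```

Only the posting lists of the query terms are visited, so documents sharing no term with the query are never scored.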

4. Sparse Representation Learning: SPLADE

SPLADE (SParse Lexical AnD Expansion model) is a state-of-the-art LSR family.

4.1 Architecture

SPLADE leverages the Masked Language Modeling (MLM) head of BERT.

For each token $i$ in the input, get the vocabulary-wide distribution from the MLM head.
For each term $j$ in the vocabulary $V$ , compute the importance $w_{ij}$ .
Aggregate these across all input tokens (usually using a max operation) to get the final vector $s$ : $s_{j} = \max_{i \in \text{input}} \log (1 + \text{ReLU}(w_{ij}))$
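
The aggregation step can be sketched in a few lines, assuming the MLM logits are already available as a tokens-by-vocabulary matrix. The logits below are made up for illustration, `splade_pool` is a hypothetical helper name, and no BERT model is involved.

```python
import math


def splade_pool(logits: list[list[float]]) -> list[float]:
    """logits[i][j] is the MLM logit w_ij for input token i and vocabulary term j."""
    vocab_size = len(logits[0])
    return [
        # ReLU, then log(1 + x), then max over input tokens.
        max(math.log1p(max(0.0, row[j])) for row in logits)
        for j in range(vocab_size)
    ]


# Toy example: 2 input tokens, vocabulary of 4 terms (hypothetical logits).
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, -0.2],
]
s = splade_pool(logits)
print([round(x, 3) for x in s])
```

Term 1 has negative logits at every position, so the ReLU zeroes it out; this is where sparsity in the final vector comes from. The log dampens large logits so that no single term dominates the score.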

Why does MLM work for expansion?

MLM is pre-trained to predict missing tokens based on context. In SPLADE, we don’t mask anything; we just use the MLM head to ask: “Given this context, what other words are semantically plausible here?” This naturally leads to expansion terms.

4.2 Training Objectives

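Ranking Loss: Contrastive loss (e.g., InfoNCE) to ensure relevant documents score higher than non-relevant ones.
Distillation: Often trained with a Cross-Encoder (Teacher) guiding the LSR model (Student), e.g. via MarginMSE, which matches the student's score margin between a positive and a negative document to the teacher's margin.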
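
As a rough sketch of the two losses named above, the functions below compute InfoNCE for one query with one positive and several negatives, and MarginMSE for one (positive, negative) pair with teacher scores. They operate on plain floats for readability; real training code would use batched tensors, and the function names are hypothetical.

```python
import math


def info_nce(pos_score: float, neg_scores: list[float]) -> float:
    """Negative log-probability of the positive under a softmax over all candidates."""
    all_scores = [pos_score] + neg_scores
    m = max(all_scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return log_z - pos_score


def margin_mse(student_pos: float, student_neg: float,
               teacher_pos: float, teacher_neg: float) -> float:
    """Squared difference between the student's and the teacher's score margins."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2


# The loss is small when the positive clearly outscores the negatives.
good = info_nce(5.0, [1.0, 0.0])
bad = info_nce(1.0, [5.0, 0.0])
print(good < bad)

# Same margin as the teacher, so there is nothing to correct.
print(margin_mse(3.0, 1.0, 4.0, 2.0))
```

MarginMSE only constrains the difference between the two scores, so the student is free to use a different score scale than the teacher.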

4.3 Sparsity via FLOPs Regularization

To ensure the vectors are actually sparse (and thus efficient), a regularization term is added to the loss: $L = L_{\text{ranking}} + \lambda L_{\text{reg}}$

Instead of simple $L_{1}$ , SPLADE often uses FLOPs regularization, which minimizes the expected number of operations during retrieval: $L_{\text{FLOPs}} = \sum_{j \in V} \bar{a}_{j}^{2}$ where $\bar{a}_{j}$ is the average activation of term $j$ in a batch.
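
A small sketch of the FLOPs term next to a plain $L_{1}$ penalty, assuming a batch is given as a list of dense activation vectors over a tiny vocabulary. The two toy batches are constructed to have the same $L_{1}$ value, and the function names are hypothetical.

```python
def flops_loss(batch: list[list[float]]) -> float:
    """Sum over vocabulary terms of the squared mean activation in the batch."""
    n = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for j in range(vocab_size):
        mean_j = sum(vec[j] for vec in batch) / n
        total += mean_j ** 2
    return total


def l1_loss(batch: list[list[float]]) -> float:
    """Average L1 norm per vector, for comparison."""
    return sum(sum(abs(x) for x in vec) for vec in batch) / len(batch)


# Both documents activate the same term: one long posting list.
concentrated = [[2.0, 0.0], [2.0, 0.0]]
# Each document activates a different term: two short posting lists.
spread = [[2.0, 0.0], [0.0, 2.0]]

print(l1_loss(concentrated), l1_loss(spread))
print(flops_loss(concentrated), flops_loss(spread))
```

Both batches look identical to $L_{1}$, but the FLOPs term is higher when many documents activate the same term. Such terms produce long posting lists, which is why the penalty pushes activations to spread evenly across the vocabulary.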

5. “Wacky” Weights and Interpretability

Recent research (Mackenzie et al., 2021) shows that LSR weights can be counter-intuitive:

“Wacky” expansion: Models might give high weights to stopwords (like the, is) or punctuation (,).
Reason: Neural training exploits any signal that correlates with relevance. If definitional documents always contain a comma after the term, the model learns to weight the comma.
Takeaway: LSR is optimized for effectiveness, not necessarily for human-readable semantic meaning.

6. Comparison: Sparse vs. Dense vs. Hybrid

Feature	BM25	Dense (e.g. DPR)	LSR (e.g. SPLADE)
Representation	Term counts	Continuous latent	Learned weights + expansion
Storage	Inverted Index (Small)	Vector Index (Large)	Inverted Index (Medium)
Latency	Very Low	High (requires ANN)	Low
Vocabulary Gap	High	Low	Low
Interpretability	High	Low	Medium

Hybrid Retrieval

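Many production systems use Hybrid Search, combining the ranked lists produced by BM25 and Dense/LSR retrievers through Reciprocal Rank Fusion (RRF) to get the “best of both worlds.” RRF works on ranks rather than raw scores, so the retrievers' scores do not need to be on a comparable scale.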
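
A minimal sketch of Reciprocal Rank Fusion, in which each document receives $1 / (k + \text{rank})$ from every ranking it appears in. The constant `k = 60` is a commonly used default and is an assumption here, as are the document IDs and the two example rankings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs; ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical top-3 results from two retrievers.
bm25_ranking = ["d1", "d2", "d3"]
lsr_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, lsr_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear high in both lists ("d1", "d3") end up ahead of documents that only one retriever found.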

Key Takeaways

LSR projects text into sparse, vocabulary-aligned neural representations.
SPLADE is the dominant architecture, utilizing BERT’s MLM head for both weighting and expansion.
Regularization (like FLOPs) is essential to keep the index efficient.
Efficiency is the main selling point: we get neural performance using the same “search engine” tech (Lucene/Elasticsearch) we’ve used for decades.
