IR-PTR Chapter 5: Dense Retrieval and Learned Sparse Retrieval

5.1 Overview: The Dense Retrieval Paradigm

This chapter covers the shift from sparse, exact-match retrieval to dense retrieval, in which queries and documents are mapped into a shared continuous vector space of comparatively low dimension.

Dense Retrieval

A retrieval paradigm where both the query $q$ and document $d$ are represented as dense vectors $h_{q}, h_{d} \in \mathbb{R}^{D}$ (typically $D = 768$ for BERT-base). Relevance is defined by a similarity function $ϕ (h_{q}, h_{d})$, usually the inner product or cosine similarity.

The Bi-Encoder vs. Cross-Encoder Tradeoff

Unlike the Cross-Encoder (which allows all-to-all attention between query and document terms), the Bi-Encoder (or dual-encoder) processes the query and document independently.

Pros: Document representations can be precomputed and indexed; supports sub-linear time retrieval via Approximate Nearest Neighbor (ANN) search.
Cons: Loses the fine-grained interaction between $q$ and $d$ terms, typically leading to lower effectiveness (the “interaction gap”).

5.2 Bi-Encoder Architecture

A standard bi-encoder uses two encoders, which often share weights (a Siamese network):

$h_{q} = enc_{q} (q)$
$h_{d} = enc_{d} (d)$

Similarity Score

The ranking score is computed as: $s (q, d) = ϕ (enc_{q} (q), enc_{d} (d)) = h_{q}^{\top} h_{d}$

Commonly, the final-layer representation of the [CLS] token from a Transformer encoder such as BERT is used as the aggregate vector for the whole query or document.
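The scoring step can be sketched in a few lines of NumPy. This is a minimal illustration, not a real encoder: the random arrays stand in for the outputs of $enc_{q}$ and $enc_{d}$, and the names `doc_vecs`, `query_vec`, `inner_product_scores`, and `cosine_scores` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # embedding width, as for BERT-base

# Stand-ins for encoder outputs (assumption: a real system would get these
# from enc_d offline and enc_q at query time).
doc_vecs = rng.standard_normal((5, D)).astype(np.float32)
query_vec = rng.standard_normal(D).astype(np.float32)

def inner_product_scores(h_q, H_d):
    """s(q, d) = h_q^T h_d for every row of H_d."""
    return H_d @ h_q

def cosine_scores(h_q, H_d):
    """Inner product after L2-normalising the query and each document."""
    h_q = h_q / np.linalg.norm(h_q)
    H_d = H_d / np.linalg.norm(H_d, axis=1, keepdims=True)
    return H_d @ h_q

ip = inner_product_scores(query_vec, doc_vecs)
cos = cosine_scores(query_vec, doc_vecs)
ranking = np.argsort(-ip)  # document indices, best first
```

Because the document matrix does not depend on the query, it can be computed once and reused for every query, which is the property that makes indexing possible.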

5.3 Training Strategies for Dense Retrieval

5.3.1 Contrastive Learning and Loss Functions

Bi-encoders are trained to maximize the similarity of positive pairs $(q, d^{+})$ and minimize the similarity of negative pairs $(q, d^{-})$ .

InfoNCE / Contrastive Loss

For a query $q_{i}$, a positive document $d_{i}^{+}$, and a set of negatives $\{d_{i, j}^{-}\}$, the loss is: $L_{i} = - \log \frac{\exp (s (q_{i}, d_{i}^{+}))}{\exp (s (q_{i}, d_{i}^{+})) + \sum_{j} \exp (s (q_{i}, d_{i, j}^{-}))}$
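As a rough sketch, the loss for one query can be computed from its positive score and its negative scores. The function name `info_nce` and the example scores are illustrative assumptions; the log-sum-exp shift is only there for numerical stability and does not change the value.

```python
import numpy as np

def info_nce(pos_score, neg_scores):
    """-log( exp(s+) / (exp(s+) + sum_j exp(s-_j)) ), computed stably."""
    logits = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return log_denominator - pos_score

# Positive far above the negatives: small loss.
loss_easy = info_nce(5.0, [0.0, -1.0])
# Negatives scored about as high as the positive: larger loss.
loss_hard = info_nce(1.0, [0.9, 1.1])
```

The loss shrinks toward zero as the positive score pulls away from every negative score, which is the behaviour the training objective rewards.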

5.3.2 Selecting Negative Examples

Success in dense retrieval is highly dependent on the “quality” of negative samples:

In-batch Negatives: Used in DPR (Dense Passage Retrieval). For a batch of $B$ queries, the $B - 1$ positive documents for other queries in the same batch serve as negatives for the current query.
Hard Negative Mining: Selecting negatives that the current model evaluates as highly relevant (high score) but are actually non-relevant.
ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation): Periodically re-encodes the corpus and refreshes the ANN index during training, so that the most informative “global” hard negatives can be sampled from the whole collection using the current model.
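The first two strategies above can be sketched with plain matrix operations. This is a simplified illustration under stated assumptions: the vectors are random stand-ins for encoder outputs, the names `in_batch_loss` and `mine_hard_negatives` are hypothetical, and the mining step uses exhaustive scoring where an ANCE-style setup would query a periodically refreshed ANN index.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8
Q = rng.standard_normal((B, D))  # query vectors, one per row
P = rng.standard_normal((B, D))  # P[i] is the positive passage for query i

def in_batch_loss(Q, P):
    """Contrastive loss where the other B - 1 positives act as negatives."""
    scores = Q @ P.T  # (B, B); the diagonal holds the positive pairs
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def mine_hard_negatives(q, corpus, relevant_ids, k):
    """Top-k corpus rows by current model score that are not labelled relevant."""
    order = np.argsort(-(corpus @ q))
    return [int(i) for i in order if int(i) not in relevant_ids][:k]

loss = in_batch_loss(Q, P)
corpus = rng.standard_normal((20, D))
hard = mine_hard_negatives(Q[0], corpus, relevant_ids={3}, k=2)
```

In-batch negatives cost nothing extra because the score matrix reuses vectors already computed for the batch; mined negatives are harder but require scoring against the corpus.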

5.3.3 Knowledge Distillation

Distilling a powerful Cross-Encoder (Teacher) into a Bi-Encoder (Student) often yields better results than training on labels alone.

Margin-MSE: Distils the score margins to preserve the ranking order.

Margin-MSE Loss

$L = MSE (s_{student} (q, d^{+}) - s_{student} (q, d^{-}), s_{teacher} (q, d^{+}) - s_{teacher} (q, d^{-}))$
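A minimal sketch of this loss over a batch of (query, positive, negative) triples follows. The function name `margin_mse` and the example scores are illustrative assumptions.

```python
import numpy as np

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    student_margin = np.asarray(student_pos, dtype=float) - np.asarray(student_neg, dtype=float)
    teacher_margin = np.asarray(teacher_pos, dtype=float) - np.asarray(teacher_neg, dtype=float)
    return float(np.mean((student_margin - teacher_margin) ** 2))

# Same margins on a different absolute scale: zero loss.
same_margin = margin_mse([2.0, 1.0], [1.0, 0.0], [12.0, 11.0], [11.0, 10.0])
# Student margin 1, teacher margin 3: squared difference of 4.
different_margin = margin_mse([1.0], [0.0], [3.0], [0.0])
```

Only the difference between positive and negative scores is matched, so the student is free to use a different score range from the teacher.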

5.4 Late Interaction: ColBERT

ColBERT (Contextualized Late Interaction over BERT) bridges the gap between bi-encoders and cross-encoders. It stores multiple embeddings per document (one per token).

Intuition

Instead of compressing a document into one vector, ColBERT delays the interaction until the very end using a MaxSim operator, allowing query terms to match the “best” document term.

MaxSim Operator

$s_{q, d} = \sum_{i \in η (q)} \max_{j \in η (d)} η (q)_{i} \cdot η (d)_{j}$
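The operator amounts to a row-wise maximum over a token-by-token similarity matrix, followed by a sum. The sketch below uses tiny hand-written token vectors as stand-ins for the per-token embeddings; the name `maxsim` is hypothetical.

```python
import numpy as np

def maxsim(Q_tok, D_tok):
    """Sum over query tokens of the best-matching document token similarity."""
    sim = Q_tok @ D_tok.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

Q_tok = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
D_tok = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 2.0]])

# Similarity rows are [1, 0.5, 0] and [0, 0.5, 2]; the row maxima are 1 and 2.
score = maxsim(Q_tok, D_tok)
```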

Pros: High effectiveness, competitive with Cross-Encoders.
Warning: High storage requirement. Indexing millions of passages can require hundreds of GBs of RAM/Disk to store per-token vectors.
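A back-of-the-envelope estimate shows where that storage goes. Every number below is an illustrative assumption (corpus size, average tokens per passage, vector width, bytes per value), not a measurement of any particular system.

```python
def index_size_gb(num_passages, vectors_per_passage, dim, bytes_per_value):
    """Raw vector storage in gigabytes (10^9 bytes), ignoring index overhead."""
    return num_passages * vectors_per_passage * dim * bytes_per_value / 1e9

# Assumed multi-vector setup: 10M passages, 80 token vectors each,
# 128 dimensions, 2 bytes per value.
multi_vector_gb = index_size_gb(10_000_000, 80, 128, 2)

# Assumed single-vector setup: one 768-dimensional float32 vector per passage.
single_vector_gb = index_size_gb(10_000_000, 1, 768, 4)
```

Under these assumptions the multi-vector index needs about 205 GB against about 31 GB for the single-vector one; the cost grows linearly with the number of stored token vectors.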

5.5 ANN Search: Indexing and Retrieval

To avoid $O (N)$ brute-force search over the corpus, dense retrieval relies on:

Approximate Nearest Neighbor (ANN): Graph-based algorithms such as HNSW (Hierarchical Navigable Small World), typically accessed through libraries such as FAISS.
Product Quantization (PQ): Compressing vectors into short codes to save memory and speed up distance calculations.
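LSH (Locality Sensitive Hashing): Hashes vectors so that similar vectors fall into the same bucket with high probability, which restricts the search to a few buckets.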
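Product Quantization can be sketched in NumPy as follows. This is a toy version under stated assumptions: a small hand-rolled k-means, random data in place of real embeddings, and hypothetical names (`pq_train`, `pq_encode`, `pq_decode`, `adc_distances`). Production libraries implement the same idea far more efficiently.

```python
import numpy as np

def kmeans(X, k, iters, rng):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = X[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_train(X, m, k, iters=10, seed=0):
    """Learn one codebook of k centroids for each of the m sub-spaces."""
    rng = np.random.default_rng(seed)
    return [kmeans(sub, k, iters, rng) for sub in np.split(X, m, axis=1)]

def pq_encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest centroid (k <= 256)."""
    subs = np.split(X, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

def adc_distances(q, codes, codebooks):
    """Squared distance from q to every coded vector via per-sub-space tables."""
    q_subs = np.split(q, len(codebooks))
    tables = [((cb - qs) ** 2).sum(axis=1) for qs, cb in zip(q_subs, codebooks)]
    return sum(tables[j][codes[:, j]] for j in range(len(codebooks)))

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 16)).astype(np.float32)
codebooks = pq_train(X, m=4, k=16)
codes = pq_encode(X, codebooks)  # 4 bytes per vector instead of 64
q = rng.standard_normal(16).astype(np.float32)
approx = adc_distances(q, codes, codebooks)
```

The distance tables are computed once per query, so each candidate costs only `m` table lookups and additions rather than a full `D`-dimensional distance computation.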

5.6 Learned Sparse Retrieval

Methods like SPLADE and uniCOIL use transformer encoders but project the output back into the vocabulary space ($|V| \approx 30{,}000$).

Hybrid Approach

These methods produce “sparse” vectors where most dimensions are zero, allowing them to use traditional Inverted Index structures while benefiting from neural term expansion and weighting.
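To make the inverted-index connection concrete, here is a minimal sketch that stores learned term weights in posting lists and scores a query by summing weight products. The term weights and the names `build_inverted_index` and `search` are made up for illustration; a learned sparse model would supply the weights.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) for non-zero weights."""
    index = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for term, w in weights.items():
            if w > 0:
                index[term].append((doc_id, w))
    return index

def search(index, query_weights):
    """Score = sum over shared terms of query weight times document weight."""
    scores = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, []):
            scores[doc_id] += q_w * d_w
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical learned weights; only non-zero dimensions are stored.
docs = {
    "d1": {"dense": 1.2, "retrieval": 0.8, "vector": 0.5},
    "d2": {"sparse": 1.0, "retrieval": 0.6, "index": 0.9},
}
index = build_inverted_index(docs)
results = search(index, {"retrieval": 1.0, "vector": 0.5, "embedding": 0.3})
```

Only the posting lists of the query's non-zero terms are visited, which is why sparsity translates directly into retrieval speed.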

SPLADE: Applies a log-saturation function to term weights and a sparsity regularizer (FLOPS or $L_{1}$) to learn which terms to expand and how strongly to weight them.

SPLADE regularization

$L = L_{ranking} + λ_{1} \| w_{q} \|_{1} + λ_{2} \| w_{d} \|_{1}$
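The two ingredients can be sketched as follows. The log-saturation is written here as $\log (1 + ReLU (x))$ with max pooling over tokens, which is one common formulation and an assumption on my part; the function names and input values are hypothetical.

```python
import numpy as np

def splade_weights(token_logits):
    """(num_tokens, |V|) vocabulary logits -> (|V|,) non-negative term weights.

    Assumed form: log(1 + ReLU(logit)), max-pooled over the input tokens.
    """
    saturated = np.log1p(np.maximum(token_logits, 0.0))
    return saturated.max(axis=0)

def regularised_loss(ranking_loss, w_q, w_d, lambda_1, lambda_2):
    """L = L_ranking + lambda_1 * ||w_q||_1 + lambda_2 * ||w_d||_1."""
    return ranking_loss + lambda_1 * np.abs(w_q).sum() + lambda_2 * np.abs(w_d).sum()

# Two input tokens over a toy vocabulary of three terms.
logits = np.array([[-1.0, 0.0, np.e - 1.0],
                   [ 2.0, -3.0, 0.0]])
weights = splade_weights(logits)  # [log(3), 0, 1]

total = regularised_loss(1.0, [1.0, 0.0, 2.0], [0.0, 0.5, 0.0], 0.1, 0.2)
```

Negative logits map to a weight of exactly zero, so the $L_{1}$ penalty pushes the model toward vectors with few active terms.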

5.7 Comparison and Summary

Feature	Sparse (BM25)	Dense (Bi-Encoder)	Late Interaction (ColBERT)
Matching	Exact term match	Semantic/Latent	Token-level semantic
Index	Inverted Index	ANN Index	Multi-vector ANN
Efficiency	Very High	High	Medium
Effectiveness	Baseline	High	Very High

Key Takeaways

Bi-encoders provide a massive speedup by decoupling query and document processing.
The main effectiveness bottleneck of bi-encoders is often the interaction gap; ColBERT's late interaction is a leading way to narrow it.
Training Matters: The choice of negatives (In-batch vs. ANCE) and distillation (Margin-MSE) is critical.
Learned Sparse Retrieval (SPLADE) offers a middle ground, providing neural expansion within efficient inverted indexes.
