IR-PTR Ch3 - Multi-Stage Architectures for Reranking
Overview
The core formulation of text ranking in the transformer era is relevance classification. This involves training a classifier to estimate the probability that a text belongs to the “relevant” class and sorting documents by these probabilities at inference time. This is a direct realization of the Probability Ranking Principle.
Relevance Classification
Sorting texts based on the estimated probability of relevance, $P(\text{Relevant} = 1 \mid d_i, q)$, where $d_i$ is a document and $q$ is a query.
Retrieve-and-Rerank Architecture
To handle scalability, most systems use a multi-stage approach:
- Candidate Generation (First-stage): Uses an inverted index with efficient scoring like BM25 to retrieve a candidate list (e.g., the top $k = 1000$ hits).
- Reranking (Second-stage): Uses a complex model like MonoBERT to rerank the candidates.
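The two-stage shape can be sketched in a few lines. This is a toy illustration only: `bm25_score` and `rerank_score` below are hypothetical stand-ins (crude term-overlap heuristics) for a real inverted-index scorer and a neural cross-encoder; only the pipeline structure matches the text.

```python
import math

def bm25_score(query, doc):
    # Crude term-overlap proxy standing in for BM25 (illustration only).
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    return sum(1.0 for t in d_terms if t in q_terms) / math.sqrt(len(d_terms) + 1)

def rerank_score(query, doc):
    # Stand-in for an expensive reranker like monoBERT's P(Relevant=1 | d, q).
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms | d_terms)

def retrieve_and_rerank(query, corpus, k=3):
    # Stage 1: cheap scoring over the whole corpus, keep top-k candidates.
    candidates = sorted(corpus, key=lambda d: bm25_score(query, d), reverse=True)[:k]
    # Stage 2: expensive scoring over only the k candidates.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```

The point of the design is that the expensive model only ever sees $k$ documents, regardless of corpus size.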
3.1 BERT Basics for IR
Transformers-based contextual embeddings (like BERT) capture syntax, semantics, and polysemy better than static embeddings (word2vec, GloVe).
Architecture & Components
- Input Template: `[CLS]`, segment A tokens, `[SEP]`, segment B tokens, `[SEP]`
- Embeddings: Token + Segment (A/B) + Position embeddings (summed element-wise).
- Contextual Output: the final hidden state of the `[CLS]` token is typically used as an aggregate representation for classification.
Pretraining Objectives
- Masked Language Model (MLM): Predicting “masked” tokens using bidirectional context.
- Next Sentence Prediction (NSP): Predicting if segment B follows segment A.
Pretrain-then-Fine-tune
The standard recipe: pretrain on massive corpora using self-supervision (MLM), then fine-tune on task-specific labeled data (e.g., MS MARCO).
3.2 MonoBERT: The Baseline Reranker
The first adaptation of BERT for ranking treats query and document as two segments in a Cross-Encoder setup.
monoBERT Scoring
The relevance score $s_i$ is computed by a single-layer, fully-connected network over the final `[CLS]` token representation: $s_i = \text{softmax}(T_{\text{[CLS]}} W + b)_1$, where $T_{\text{[CLS]}} \in \mathbb{R}^D$ is the `[CLS]` output and $W \in \mathbb{R}^{D \times 2}$ projects it onto the two classes (relevant / non-relevant).
Training
Trained using pointwise Cross-Entropy Loss: $L = -\sum_{j \in J_{\text{pos}}} \log(s_j) - \sum_{j \in J_{\text{neg}}} \log(1 - s_j)$, where $J_{\text{pos}}$ and $J_{\text{neg}}$ index the relevant and non-relevant training examples.
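The pointwise loss above is easy to compute directly. A minimal sketch, assuming `scores` holds each example's predicted probability of relevance and `labels` the binary relevance judgments:

```python
import math

def pointwise_ce_loss(scores, labels):
    # scores: model's P(Relevant=1 | d, q) per example; labels: 1 if relevant, else 0.
    # Relevant examples contribute -log(s); non-relevant contribute -log(1 - s).
    loss = 0.0
    for s, y in zip(scores, labels):
        loss += -math.log(s) if y == 1 else -math.log(1.0 - s)
    return loss
```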
Key Findings
- Data Hunger: monoBERT requires significant training data (many thousands of relevance-labeled query-document pairs) to beat BM25.
- Position matters: Removing position embeddings significantly degrades performance.
- No “help” needed: Interpolating with BM25 scores often doesn’t help monoBERT on MS MARCO once candidates are selected.
3.3 Passage to Document Ranking (Handling Long Texts)
BERT’s 512-token limit creates challenges for “full-length” documents (news articles, papers).
3.3.1 Birch (Sentence-level)
- Approach: Rerank individual sentences and aggregate scores.
- Aggregation: Documents are scored by combining the original document score with the top-$k$ sentence scores: $s_f = a \cdot s_{\text{doc}} + (1 - a) \cdot \sum_{i=1}^{k} w_i \cdot s_i$, where $s_i$ is the score of the $i$-th highest-scoring sentence.
- Intuition: The highest-scoring sentence is a good proxy for document relevance. Supports zero-shot cross-domain transfer (e.g., training on tweets, testing on news).
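Birch's interpolation is simple to sketch. The function below is a toy rendering of the aggregation formula above; the weights `weights` correspond to the $w_i$ (in the paper these are tuned per collection; the values here are hypothetical):

```python
def birch_score(doc_score, sentence_scores, weights, a=0.5):
    # s_f = a * s_doc + (1 - a) * sum_i w_i * s_i, where s_1..s_k are the
    # top-k sentence scores (k = number of weights supplied).
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    return a * doc_score + (1 - a) * sum(w * s for w, s in zip(weights, top))
```

For example, with document score 0.4, sentence scores [0.9, 0.2, 0.7], weights [1.0, 0.5], and a = 0.5, the two best sentences (0.9 and 0.7) drive most of the final score.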
3.3.2 BERT-MaxP (Passage-level)
- Approach: Segment documents into overlapping passages (e.g., 150 words, stride 75).
- Aggregation: $s_{\text{doc}} = \max_i s_{p_i}$, the maximum passage score (MaxP). FirstP (score of the first passage only) and SumP (sum of passage scores) are alternatives.
- Query Representation: Sentence-long natural language “descriptions” outperform keyword “titles” because BERT exploits non-content words (prepositions, etc.) for deeper context.
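Segmentation plus MaxP can be sketched as follows. This is a toy version with the window/stride defaults from the text (150 words, stride 75); the function names are illustrative, not from the paper:

```python
def segment(words, length=150, stride=75):
    # Slide an overlapping window over the word list: each passage is
    # `length` words, and consecutive passages start `stride` words apart.
    passages = []
    for start in range(0, max(len(words) - stride, 1), stride):
        passages.append(words[start:start + length])
    return passages

def maxp(passage_scores):
    # MaxP: the document's score is its best passage's score.
    return max(passage_scores)
```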
3.3.3 CEDR (Contextualized Embeddings)
- Approach: Uses all contextual token embeddings, not just the `[CLS]` vector.
- Design: Feeds BERT embeddings into pre-BERT interaction models like KNRM or PACRR to build similarity matrices.
- Aggregation: Representation aggregation across document chunks — long documents are split into chunks whose contextual representations are combined before scoring.
3.3.4 PARADE (Representation Aggregation)
- Approach: Aggregates representations (the `[CLS]` vectors) from passages using a hierarchical transformer or CNN.
- Model: the per-passage `[CLS]` vectors are fed to an aggregation module that produces a single document representation, which a final layer scores.
- Advantage: End-to-end differentiable, unifies training/inference, more effective than simple score aggregation.
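The difference from score aggregation can be shown with the simplest PARADE-style variant: pool the passage vectors element-wise, then score the pooled document vector. This is a hypothetical sketch (plain Python, max pooling plus a linear layer), not the paper's transformer aggregator:

```python
def parade_maxpool(passage_reps, w, b=0.0):
    # passage_reps: list of per-passage [CLS] vectors (lists of floats).
    # Element-wise max pooling builds one document vector; a linear layer
    # (weights w, bias b) then scores it. In real PARADE the pooling is
    # replaced by a small transformer or CNN over the passage vectors.
    doc_rep = [max(dims) for dims in zip(*passage_reps)]
    return sum(wi * xi for wi, xi in zip(w, doc_rep)) + b
```

Because the aggregation happens on representations, gradients flow through it, which is what makes the model end-to-end differentiable.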
3.4 Multi-Stage Reranking & Pipelines
3.4.1 Pairwise Reranking (duoT5 / duoBERT)
Instead of a pointwise probability, estimate $P(d_i \succ d_j \mid d_i, d_j, q)$: the probability that document $d_i$ is more relevant than document $d_j$.
- Cost: Complexity $O(k^2)$ in the candidate-list size $k$, requiring a multi-stage approach where monoBERT filters to a smaller set (e.g., 50) before duoBERT processes it.
- Loss: Pairwise hinge/logistic loss.
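The pairwise probabilities still have to be turned into a single ranking. One of the simplest aggregations used with duoBERT sums each document's "wins" over all other candidates; a toy sketch (the dict-of-pairs interface here is illustrative):

```python
def duo_sum_aggregate(pair_probs, docs):
    # pair_probs[(i, j)] = estimated P(d_i more relevant than d_j).
    # SUM aggregation: s_i = sum over j != i of p_{i,j}; rank by s_i.
    scores = {}
    for i in docs:
        scores[i] = sum(pair_probs[(i, j)] for j in docs if j != i)
    return sorted(docs, key=lambda i: scores[i], reverse=True)
```

With $k$ candidates this consumes all $k(k-1)$ pairwise estimates, which is exactly where the $O(k^2)$ cost comes from.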
3.4.2 Cascade Transformers
Treats transformer layers as a pipeline with early exits.
- Intuition: Discard clear non-relevant candidates after a few layers (e.g., 4 or 6) rather than running all 12 layers on 1000 documents.
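The cascade idea reduces to "score cheaply, prune, score more expensively on the survivors." A minimal sketch, assuming each stage exposes a scorer (in the real model these would be predictions read off at intermediate transformer layers; here they are arbitrary callables):

```python
def cascade_rerank(candidates, stage_scorers, keep_fracs):
    # Run progressively deeper (more expensive) scorers, pruning the
    # candidate list between stages so only survivors pay the full cost.
    for scorer, frac in zip(stage_scorers, keep_fracs):
        candidates = sorted(candidates, key=scorer, reverse=True)
        candidates = candidates[:max(1, int(len(candidates) * frac))]
    return candidates
```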
3.5 Beyond BERT
3.5.1 Knowledge Distillation
Distilling a large “Teacher” into a “Student” (e.g., TinyBERT, DistilBERT).
- Objective: Minimize MSE between teacher and student logits or hidden states.
- Efficiency: Can achieve up to 9x speedup with minimal loss in effectiveness.
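The logit-matching objective mentioned above is just a mean squared error between the teacher's and student's outputs. A toy sketch (real distillation would run over batches of tensors, and often mixes in the hard-label loss):

```python
def distill_mse(teacher_logits, student_logits):
    # MSE between teacher and student logits: the student is trained to
    # reproduce the teacher's raw outputs, not just its argmax labels.
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```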
3.5.2 Local Architectures (TK, TKL)
- Transformer Kernel (TK): Small, from-scratch transformers with local attention to avoid the quadratic cost of full self-attention, combined with interaction kernels (KNRM).
- Conformer Kernel (CK): Mixes convolutions and attention for higher efficiency.
3.5.3 Ranking with Sequence-to-Sequence (monoT5)
- Concept: “Text-to-text” paradigm for everything.
- Encoding: the input is the template `Query: q Document: d Relevant:`.
- Decoding: the model generates the token "true" or "false".
- Score Calculation: apply a softmax to the logits of the "true" and "false" tokens only; the relevance score is the probability assigned to "true".
- Benefit: Highly effective and extremely data-efficient (great for few-shot).
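The score calculation reduces to a two-way softmax over the "true"/"false" logits, which can be written directly (a sketch; in practice the logits would come from the T5 decoder's first output position):

```python
import math

def monot5_score(true_logit, false_logit):
    # Softmax restricted to the "true" and "false" token logits;
    # the relevance score is the probability mass on "true".
    # Subtracting the max keeps exp() numerically stable.
    m = max(true_logit, false_logit)
    e_t = math.exp(true_logit - m)
    e_f = math.exp(false_logit - m)
    return e_t / (e_t + e_f)
```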
3.5.4 Generative Query Likelihood
Ranking by query likelihood, $P(q \mid d)$, using models like BART or GPT-2 — the "reverse" of standard relevance classification, which estimates the probability of relevance given the query.
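Under this formulation a document's score is the (log) probability the generative model assigns to the query tokens conditioned on the document. A toy sketch, where `doc_token_probs` is a hypothetical lookup standing in for the model's per-token conditional probabilities:

```python
import math

def query_likelihood(query_tokens, doc_token_probs):
    # log P(q | d) = sum over query tokens t of log P(t | d, preceding tokens).
    # Here doc_token_probs maps each token to its (hypothetical) probability
    # under the document-conditioned language model.
    return sum(math.log(doc_token_probs[t]) for t in query_tokens)
```

Documents are then ranked by this log-likelihood, highest first.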
Summary Takeaways
- Relevance Classification is the dominant paradigm (Cross-Encoders).
- Aggregation Matters: Moving from passage scores to passage representations (PARADE) improves document ranking.
- Efficiency Tradeoffs: Pairwise models (duoT5) add quality but require a second stage to manage latency.
- Generative Future: monoT5 and seq2seq models are currently state-of-the-art for many tasks.