IR-PTR Ch3 - Multi-Stage Architectures for Reranking
Overview
The core formulation of text ranking in the transformer era is relevance classification. This involves training a classifier to estimate the probability that a text belongs to the “relevant” class and sorting documents by these probabilities at inference time. This is a direct realization of the Probability Ranking Principle.
Relevance Classification
Sorting texts based on the estimated probability of relevance, $P(\text{Relevant} = 1 \mid d_i, q)$, where $d_i$ is a document and $q$ is a query.
Retrieve-and-Rerank Architecture
To handle scalability, most systems use a multi-stage approach:
- Candidate Generation (First-stage): Uses an inverted index with efficient scoring like BM25 to retrieve a candidate list (e.g., the top $k = 1000$ hits).
- Reranking (Second-stage): Uses a complex model like MonoBERT to rerank the candidates.
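The two-stage shape can be sketched in a few lines. This is a toy illustration only: `bm25_score` and `rerank_score` below are hypothetical stand-ins (crude term-overlap heuristics) for a real inverted-index scorer and a neural cross-encoder; only the pipeline structure matches the text.

```python
import math

def bm25_score(query, doc):
    # Crude term-overlap proxy standing in for BM25 (illustration only).
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    return sum(1.0 for t in d_terms if t in q_terms) / math.sqrt(len(d_terms) + 1)

def rerank_score(query, doc):
    # Stand-in for an expensive reranker like monoBERT's P(Relevant=1 | d, q).
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms | d_terms)

def retrieve_and_rerank(query, corpus, k=3):
    # Stage 1: cheap scoring over the whole corpus, keep top-k candidates.
    candidates = sorted(corpus, key=lambda d: bm25_score(query, d), reverse=True)[:k]
    # Stage 2: expensive scoring over only the k candidates.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```

The point of the design is that the expensive model only ever sees $k$ documents, regardless of corpus size.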
3.1 BERT Basics for IR
Transformers-based contextual embeddings (like BERT) capture syntax, semantics, and polysemy better than static embeddings (word2vec, GloVe).
Architecture & Components
- Input Template: `[CLS]`, segment A tokens, `[SEP]`, segment B tokens, `[SEP]`
- Embeddings: Token + Segment (A/B) + Position embeddings (summed element-wise).
- Contextual Output: the final hidden state of the `[CLS]` token is typically used as an aggregate representation for classification.
Pretraining Objectives
- Masked Language Model (MLM): Predicting “masked” tokens using bidirectional context.
- Next Sentence Prediction (NSP): Predicting if segment B follows segment A.
Pretrain-then-Fine-tune
The standard recipe: pretrain on massive corpora using self-supervision (MLM), then fine-tune on task-specific labeled data (e.g., MS MARCO).
3.2 MonoBERT: The Baseline Reranker
The first adaptation of BERT for ranking treats query and document as two segments in a Cross-Encoder setup.
monoBERT Scoring
The relevance score $s_i$ is computed by a single-layer, fully-connected network over the final `[CLS]` token representation: $s_i = \text{softmax}(T_{\text{[CLS]}} W + b)_1$, where $T_{\text{[CLS]}} \in \mathbb{R}^D$ is the `[CLS]` output and $W \in \mathbb{R}^{D \times 2}$ projects it onto the two classes (relevant / non-relevant).
Training
Trained using pointwise Cross-Entropy Loss: $L = -\sum_{j \in J_{\text{pos}}} \log(s_j) - \sum_{j \in J_{\text{neg}}} \log(1 - s_j)$, where $J_{\text{pos}}$ and $J_{\text{neg}}$ index the relevant and non-relevant training examples.
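The pointwise loss above is easy to compute directly. A minimal sketch, assuming `scores` holds each example's predicted probability of relevance and `labels` the binary relevance judgments:

```python
import math

def pointwise_ce_loss(scores, labels):
    # scores: model's P(Relevant=1 | d, q) per example; labels: 1 if relevant, else 0.
    # Relevant examples contribute -log(s); non-relevant contribute -log(1 - s).
    loss = 0.0
    for s, y in zip(scores, labels):
        loss += -math.log(s) if y == 1 else -math.log(1.0 - s)
    return loss
```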
Key Findings
- Data Hunger: monoBERT requires significant training data (many thousands of relevance-labeled query-document pairs) to beat BM25.
- Position matters: Removing position embeddings significantly degrades performance.
- No “help” needed: Interpolating with BM25 scores often doesn’t help monoBERT on MS MARCO once candidates are selected.
3.3 Passage to Document Ranking (Handling Long Texts)
BERT’s 512-token limit creates challenges for “full-length” documents (news articles, papers).
3.3.1 Birch (Sentence-level)
- Approach: Rerank individual sentences and aggregate scores.
- Aggregation: Documents are scored by combining the original document score with the top-$k$ sentence scores: $s_f = a \cdot s_{\text{doc}} + (1 - a) \cdot \sum_{i=1}^{k} w_i \cdot s_i$, where $s_i$ is the score of the $i$-th highest-scoring sentence.
- Intuition: The highest-scoring sentence is a good proxy for document relevance. Supports zero-shot cross-domain transfer (e.g., training on tweets, testing on news).
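Birch's interpolation is simple to sketch. The function below is a toy rendering of the aggregation formula above; the weights `weights` correspond to the $w_i$ (in the paper these are tuned per collection; the values here are hypothetical):

```python
def birch_score(doc_score, sentence_scores, weights, a=0.5):
    # s_f = a * s_doc + (1 - a) * sum_i w_i * s_i, where s_1..s_k are the
    # top-k sentence scores (k = number of weights supplied).
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    return a * doc_score + (1 - a) * sum(w * s for w, s in zip(weights, top))
```

For example, with document score 0.4, sentence scores [0.9, 0.2, 0.7], weights [1.0, 0.5], and a = 0.5, the two best sentences (0.9 and 0.7) drive most of the final score.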
3.3.2 BERT-MaxP (Passage-level)
- Approach: Segment documents into overlapping passages (e.g., 150 words, stride 75).
- Aggregation: $s_{\text{doc}} = \max_i s_{p_i}$, the maximum passage score (MaxP). FirstP (score of the first passage only) and SumP (sum of passage scores) are alternatives.
- Query Representation: Sentence-long natural language “descriptions” outperform keyword “titles” because BERT exploits non-content words (prepositions, etc.) for deeper context.
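Segmentation plus MaxP can be sketched as follows. This is a toy version with the window/stride defaults from the text (150 words, stride 75); the function names are illustrative, not from the paper:

```python
def segment(words, length=150, stride=75):
    # Slide an overlapping window over the word list: each passage is
    # `length` words, and consecutive passages start `stride` words apart.
    passages = []
    for start in range(0, max(len(words) - stride, 1), stride):
        passages.append(words[start:start + length])
    return passages

def maxp(passage_scores):
    # MaxP: the document's score is its best passage's score.
    return max(passage_scores)
```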
3.3.3 CEDR (Contextualized Embeddings)
- Approach: Uses all contextual token embeddings, not just the `[CLS]` vector.
- Design: Feeds BERT embeddings into pre-BERT interaction models like KNRM or PACRR to build similarity matrices.
- Aggregation: Representation aggregation across document chunks — long documents are split into chunks whose contextual representations are combined before scoring.
3.3.4 PARADE (Representation Aggregation)
- Approach: Aggregates representations (the `[CLS]` vectors) from passages using a hierarchical transformer or CNN.
- Model: the per-passage `[CLS]` vectors are fed to an aggregation module that produces a single document representation, which a final layer scores.
- Advantage: End-to-end differentiable, unifies training/inference, more effective than simple score aggregation.
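The difference from score aggregation can be shown with the simplest PARADE-style variant: pool the passage vectors element-wise, then score the pooled document vector. This is a hypothetical sketch (plain Python, max pooling plus a linear layer), not the paper's transformer aggregator:

```python
def parade_maxpool(passage_reps, w, b=0.0):
    # passage_reps: list of per-passage [CLS] vectors (lists of floats).
    # Element-wise max pooling builds one document vector; a linear layer
    # (weights w, bias b) then scores it. In real PARADE the pooling is
    # replaced by a small transformer or CNN over the passage vectors.
    doc_rep = [max(dims) for dims in zip(*passage_reps)]
    return sum(wi * xi for wi, xi in zip(w, doc_rep)) + b
```

Because the aggregation happens on representations, gradients flow through it, which is what makes the model end-to-end differentiable.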
3.4 Multi-Stage Reranking & Pipelines
3.4.1 Pairwise Reranking (duoT5 / duoBERT)
Instead of a pointwise probability, estimate $P(d_i \succ d_j \mid d_i, d_j, q)$: the probability that document $d_i$ is more relevant than document $d_j$.
- Cost: Complexity $O(k^2)$ in the candidate-list size $k$, requiring a multi-stage approach where monoBERT filters to a smaller set (e.g., 50) before duoBERT processes it.
- Loss: Pairwise hinge/logistic loss.
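The pairwise probabilities still have to be turned into a single ranking. One of the simplest aggregations used with duoBERT sums each document's "wins" over all other candidates; a toy sketch (the dict-of-pairs interface here is illustrative):

```python
def duo_sum_aggregate(pair_probs, docs):
    # pair_probs[(i, j)] = estimated P(d_i more relevant than d_j).
    # SUM aggregation: s_i = sum over j != i of p_{i,j}; rank by s_i.
    scores = {}
    for i in docs:
        scores[i] = sum(pair_probs[(i, j)] for j in docs if j != i)
    return sorted(docs, key=lambda i: scores[i], reverse=True)
```

With $k$ candidates this consumes all $k(k-1)$ pairwise estimates, which is exactly where the $O(k^2)$ cost comes from.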
3.4.2 Cascade Transformers
Treats transformer layers as a pipeline with early exits.
- Intuition: Discard clear non-relevant candidates after a few layers (e.g., 4 or 6) rather than running all 12 layers on 1000 documents.
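The cascade idea reduces to "score cheaply, prune, score more expensively on the survivors." A minimal sketch, assuming each stage exposes a scorer (in the real model these would be predictions read off at intermediate transformer layers; here they are arbitrary callables):

```python
def cascade_rerank(candidates, stage_scorers, keep_fracs):
    # Run progressively deeper (more expensive) scorers, pruning the
    # candidate list between stages so only survivors pay the full cost.
    for scorer, frac in zip(stage_scorers, keep_fracs):
        candidates = sorted(candidates, key=scorer, reverse=True)
        candidates = candidates[:max(1, int(len(candidates) * frac))]
    return candidates
```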
3.5 Beyond BERT
3.5.1 Knowledge Distillation
Distilling a large “Teacher” into a “Student” (e.g., TinyBERT, DistilBERT).
- Objective: Minimize MSE between teacher and student logits or hidden states.
- Efficiency: Can achieve up to 9x speedup with minimal loss in effectiveness.
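The logit-matching objective mentioned above is just a mean squared error between the teacher's and student's outputs. A toy sketch (real distillation would run over batches of tensors, and often mixes in the hard-label loss):

```python
def distill_mse(teacher_logits, student_logits):
    # MSE between teacher and student logits: the student is trained to
    # reproduce the teacher's raw outputs, not just its argmax labels.
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```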
3.5.2 Local Architectures (TK, TKL)
- Transformer Kernel (TK): Small, from-scratch transformers with local attention to avoid the quadratic cost of full self-attention, combined with interaction kernels (KNRM).
- Conformer Kernel (CK): Mixes convolutions and attention for higher efficiency.
3.5.3 Ranking with Sequence-to-Sequence (monoT5)
- Concept: “Text-to-text” paradigm for everything.
- Encoding: the input is the template `Query: q Document: d Relevant:`.
- Decoding: the model generates the token "true" or "false".
- Score Calculation: apply a softmax to the logits of the "true" and "false" tokens only; the relevance score is the probability assigned to "true".
- Benefit: Highly effective and extremely data-efficient (great for few-shot).
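The score calculation reduces to a two-way softmax over the "true"/"false" logits, which can be written directly (a sketch; in practice the logits would come from the T5 decoder's first output position):

```python
import math

def monot5_score(true_logit, false_logit):
    # Softmax restricted to the "true" and "false" token logits;
    # the relevance score is the probability mass on "true".
    # Subtracting the max keeps exp() numerically stable.
    m = max(true_logit, false_logit)
    e_t = math.exp(true_logit - m)
    e_f = math.exp(false_logit - m)
    return e_t / (e_t + e_f)
```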
3.5.4 Generative Query Likelihood
Ranking by query likelihood, $P(q \mid d)$, using models like BART or GPT-2 — the "reverse" of standard relevance classification, which estimates the probability of relevance given the query.
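Under this formulation a document's score is the (log) probability the generative model assigns to the query tokens conditioned on the document. A toy sketch, where `doc_token_probs` is a hypothetical lookup standing in for the model's per-token conditional probabilities:

```python
import math

def query_likelihood(query_tokens, doc_token_probs):
    # log P(q | d) = sum over query tokens t of log P(t | d, preceding tokens).
    # Here doc_token_probs maps each token to its (hypothetical) probability
    # under the document-conditioned language model.
    return sum(math.log(doc_token_probs[t]) for t in query_tokens)
```

Documents are then ranked by this log-likelihood, highest first.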
Summary Takeaways
- Relevance Classification is the dominant paradigm (Cross-Encoders).
- Aggregation Matters: Moving from passage scores to passage representations (PARADE) improves document ranking.
- Efficiency Tradeoffs: Pairwise models (duoT5) add quality but require a second stage to manage latency.
- Generative Future: monoT5 and seq2seq models are currently state-of-the-art for many tasks.