IR-PTR Ch3 - Multi-Stage Architectures for Reranking

Overview

The core formulation of text ranking in the transformer era is relevance classification. This involves training a classifier to estimate the probability that a text belongs to the “relevant” class and sorting documents by these probabilities at inference time. This is a direct realization of the Probability Ranking Principle.

Relevance Classification

Sorting texts based on the estimated probability P(Relevant = 1 | d_i, q), where d_i is a document and q is the query.

Retrieve-and-Rerank Architecture

To handle scalability, most systems use a multi-stage approach:

  1. Candidate Generation (First-stage): Uses an inverted index with efficient scoring like BM25 to retrieve k candidates (e.g., k = 1000).
  2. Reranking (Second-stage): Uses a complex model like monoBERT to rerank the candidates.
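The two stages above can be sketched in a few lines. Here `first_stage` stands in for BM25 over an inverted index (naive term overlap for illustration), and `score_fn` stands in for an expensive reranker like monoBERT; both are hypothetical placeholders, not the real models.

```python
# Minimal retrieve-and-rerank sketch with stand-in scoring functions.
def first_stage(query, corpus, k=1000):
    """Candidate generation: cheap scoring (term overlap), keep top-k doc ids."""
    q_terms = set(query.split())
    scored = sorted(corpus.items(),
                    key=lambda item: len(q_terms & set(item[1].split())),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

def rerank(query, candidates, corpus, score_fn):
    """Reranking: apply the expensive model only to the small candidate set."""
    return sorted(candidates, key=lambda d: score_fn(query, corpus[d]), reverse=True)
```

The key point is that `score_fn` is only ever called k times per query, no matter how large the corpus is.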

3.1 BERT Basics for IR

Transformer-based contextual embeddings (like BERT's) capture syntax, semantics, and polysemy better than static embeddings (word2vec, GloVe).

Architecture & Components

  • Input Template: [CLS], segment A tokens, [SEP], segment B tokens, [SEP] (for ranking: segment A = query, segment B = document).
  • Embeddings: Token + Segment (A/B) + Position embeddings (summed element-wise).
  • Contextual Output: the final hidden state of the [CLS] token (T_[CLS]) is typically used as an aggregate representation for classification.
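The input template above can be made concrete with a small helper that assembles the token sequence along with the segment and position ids whose embeddings BERT sums element-wise (this is an illustrative sketch, not a real tokenizer):

```python
def build_bert_input(seg_a_tokens, seg_b_tokens):
    """Assemble the BERT input template with segment and position ids.
    The model sums token + segment + position embeddings element-wise."""
    tokens = ["[CLS]"] + seg_a_tokens + ["[SEP]"] + seg_b_tokens + ["[SEP]"]
    # Segment A covers [CLS] ... first [SEP]; segment B covers the rest.
    segment_ids = [0] * (len(seg_a_tokens) + 2) + [1] * (len(seg_b_tokens) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids
```

For example, `build_bert_input(["what", "is", "ir"], ["text", "ranking"])` yields an 8-token sequence with the segment boundary right after the first [SEP].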

Pretraining Objectives

  1. Masked Language Model (MLM): Predicting “masked” tokens using bidirectional context.
  2. Next Sentence Prediction (NSP): Predicting if segment B follows segment A.

Pretrain-then-Fine-tune

The standard recipe: pretrain on massive corpora using self-supervision (MLM), then fine-tune on task-specific labeled data (e.g., MS MARCO).


3.2 MonoBERT: The Baseline Reranker

The first adaptation of BERT for ranking treats query and document as two segments in a Cross-Encoder setup.

monoBERT Scoring

The relevance score s_i = P(Relevant = 1 | d_i, q) is computed by a single-layer, fully-connected network (plus softmax) over the final [CLS] representation: s_i = softmax(T_[CLS] W + b)_1, where T_[CLS] ∈ R^D and W ∈ R^{D×2}.

Training

Trained using pointwise Cross-Entropy Loss: L = −Σ_{j∈J_pos} log(s_j) − Σ_{j∈J_neg} log(1 − s_j).
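Both the classification head and the pointwise loss are simple enough to write out directly; the sketch below uses plain lists in place of tensors and is illustrative only:

```python
import math

def mono_score(cls_vec, W, b):
    """Single-layer head over T_[CLS]: softmax(W t + b); returns P(relevant).
    W is a 2 x D weight matrix (two rows: non-relevant, relevant)."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)            # probability of the "relevant" class

def pointwise_ce(scores, labels):
    """L = -sum_{pos} log(s_j) - sum_{neg} log(1 - s_j)."""
    return -sum(math.log(s) if y == 1 else math.log(1.0 - s)
                for s, y in zip(scores, labels))
```

With identical weight rows the two logits tie and the score is exactly 0.5, a quick sanity check on the softmax.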

Key Findings

  • Data Hunger: monoBERT requires significant training data (thousands of labeled query-document pairs) to beat BM25.
  • Position matters: Removing position embeddings significantly degrades performance.
  • No “help” needed: Interpolating with BM25 scores often doesn’t help monoBERT on MS MARCO once candidates are selected.

3.3 Passage to Document Ranking (Handling Long Texts)

BERT’s 512-token limit creates challenges for “full-length” documents (news articles, papers).

3.3.1 Birch (Sentence-level)

  • Approach: Rerank individual sentences and aggregate scores.
  • Aggregation: Documents are scored by combining the original document score with the top-k sentence scores: s_f = α · s_doc + (1 − α) · Σ_{i=1}^{k} w_i · s_i.
  • Intuition: The highest-scoring sentence is a good proxy for document relevance. Supports zero-shot cross-domain transfer (e.g., training on tweets, testing on news).
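The Birch aggregation above is a one-liner in code. In the sketch below the interpolation weight `alpha` and the per-sentence weights are illustrative hyperparameters (in practice they are tuned on held-out data):

```python
def birch_score(doc_score, sentence_scores, alpha=0.5, weights=(1.0, 0.5, 0.25)):
    """s_f = alpha * s_doc + (1 - alpha) * sum_i w_i * s_i
    over the top-|weights| highest-scoring sentences."""
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    return alpha * doc_score + (1 - alpha) * sum(w * s for w, s in zip(weights, top))
```

Only the best few sentences contribute, which is exactly the "best sentence as proxy" intuition.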

3.3.2 BERT-MaxP (Passage-level)

  • Approach: Segment documents into overlapping passages (e.g., 150 words, stride 75).
  • Aggregation: score(d) = max_i score(p_i) (MaxP). FirstP (score of the first passage) and SumP (sum of passage scores) are alternatives.
  • Query Representation: Sentence-long natural language “descriptions” outperform keyword “titles” because BERT exploits non-content words (prepositions, etc.) for deeper context.
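The segmentation and the three aggregation variants can be sketched as follows (word-level splitting here is a simplification of the actual tokenization):

```python
def segment(words, length=150, stride=75):
    """Split a document into overlapping passages of `length` words."""
    return [words[i:i + length]
            for i in range(0, max(len(words) - length, 0) + 1, stride)]

def aggregate(passage_scores, mode="max"):
    """Combine per-passage scores into a single document score."""
    if mode == "max":
        return max(passage_scores)   # MaxP: best passage wins
    if mode == "first":
        return passage_scores[0]     # FirstP: lead passage only
    return sum(passage_scores)       # SumP: all passages contribute
```

A 300-word document with length 150 and stride 75 yields three overlapping passages.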

3.3.3 CEDR (Contextualized Embeddings)

  • Approach: Uses all contextual term embeddings [T_1, …, T_n], not just T_[CLS].
  • Design: Feeds BERT embeddings into pre-BERT interaction models like KNRM or PACRR to build similarity matrices.
  • Aggregation: Long documents are split into chunks processed independently; the contextual term representations are concatenated across chunks (with the [CLS] vectors averaged) before feeding the interaction model.

3.3.4 PARADE (Representation Aggregation)

  • Approach: Aggregates the per-passage [CLS] representations (rather than per-passage scores) from document passages.
  • Model: A hierarchical transformer (PARADE-Transformer) or CNN (PARADE-CNN) combines the passage [CLS] vectors into a single document representation that feeds a classification head.
  • Advantage: End-to-end differentiable, unifies training/inference, more effective than simple score aggregation.

3.4 Multi-Stage Reranking & Pipelines

3.4.1 Pairwise Reranking (duoT5 / duoBERT)

Instead of a pointwise probability, estimate P(d_i ≻ d_j | d_i, d_j, q): the probability that document d_i is more relevant than d_j.

  • Cost: Complexity O(k²) in the number of candidates k, requiring a multi-stage approach where monoBERT filters to a smaller set (e.g., 50) before duoBERT processes it.
  • Loss: Pairwise hinge/logistic loss.
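One way the pairwise probabilities are turned back into a single ranking is sum aggregation, where each document's score is the sum of its win probabilities over every other candidate. A sketch, with `pair_prob` standing in for the duoBERT/duoT5 model call:

```python
from itertools import permutations

def duo_rerank(candidates, pair_prob):
    """SUM aggregation: s_i = sum_{j != i} p(d_i > d_j | q).
    Makes k*(k-1) calls to pair_prob, hence the O(k^2) cost."""
    scores = {d: 0.0 for d in candidates}
    for di, dj in permutations(candidates, 2):
        scores[di] += pair_prob(di, dj)
    return sorted(candidates, key=scores.get, reverse=True)
```

The quadratic number of `pair_prob` calls is exactly why the candidate set must be small before this stage runs.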

3.4.2 Cascade Transformers

Treats transformer layers as a pipeline with early exits.

  • Intuition: Discard clear non-relevant candidates after a few layers (e.g., 4 or 6) rather than running all 12 layers on 1000 documents.
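The early-exit idea can be sketched as successive pruning of the candidate pool. Everything here is hypothetical scaffolding: `partial_score(d, layer)` stands in for a classifier head attached after the given transformer layer, and the exit layers and keep fraction are illustrative choices:

```python
def cascade_rerank(candidates, partial_score, exit_layers=(4, 8, 12), keep_frac=0.5):
    """Early-exit cascade sketch: score the pool with a shallow (cheap) head,
    discard the tail, and only run deeper layers on the survivors."""
    pool = list(candidates)
    for i, layer in enumerate(exit_layers):
        pool.sort(key=lambda d: partial_score(d, layer), reverse=True)
        if i < len(exit_layers) - 1:            # prune at every exit but the last
            pool = pool[:max(1, int(len(pool) * keep_frac))]
    return pool
```

Most candidates only ever pay for the first few layers, which is where the latency savings come from.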

3.5 Beyond BERT

3.5.1 Knowledge Distillation

Distilling a large “Teacher” into a “Student” (e.g., TinyBERT, DistilBERT).

  • Objective: Minimize MSE between teacher and student logits or hidden states.
  • Efficiency: Can achieve up to 9x speedup with minimal loss in effectiveness.
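The MSE objective between teacher and student logits mentioned above is a one-liner (lists stand in for tensors here):

```python
def distill_mse(teacher_logits, student_logits):
    """Distillation objective: mean squared error between the teacher's and
    student's logits for the same input."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```

The same form applies when matching intermediate hidden states instead of output logits.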

3.5.2 Local Architectures (TK, TKL)

  • Transformer Kernel (TK): Small, from-scratch transformers with local (windowed) attention to avoid the quadratic self-attention cost, combined with interaction kernels (as in KNRM) for scoring.
  • Conformer Kernel (CK): Mixes convolutions and attention for higher efficiency.

3.5.3 Ranking with Sequence-to-Sequence (monoT5)

  • Concept: “Text-to-text” paradigm for everything.
  • Encoding: Input template "Query: {q} Document: {d} Relevant:".
  • Decoding: Model generates the token "true" or "false".
  • Score Calculation: Apply a softmax to only the logits of the "true" and "false" tokens; the relevance score is the resulting probability assigned to "true".
  • Benefit: Highly effective and extremely data-efficient (great for few-shot).
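The monoT5 score calculation reduces to a two-way softmax over the "true" and "false" logits at the first decoding step:

```python
import math

def monot5_score(logit_true, logit_false):
    """Softmax restricted to the 'true'/'false' logits; score = P('true')."""
    m = max(logit_true, logit_false)      # subtract max for numerical stability
    e_t = math.exp(logit_true - m)
    e_f = math.exp(logit_false - m)
    return e_t / (e_t + e_f)
```

Equal logits give exactly 0.5; the further the "true" logit pulls ahead, the closer the score gets to 1.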

3.5.4 Generative Query Likelihood

Ranking by the query likelihood P(q | d) using generative models like BART or GPT-2; the "reverse" of standard relevance classification, which models P(Relevant | d, q).
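Under a seq2seq or autoregressive LM, the query likelihood factorizes token by token: log P(q | d) = Σ_t log P(q_t | q_<t, d). A sketch, with `token_prob` as a stand-in for the model's per-token probability:

```python
import math

def query_log_likelihood(query_tokens, token_prob, doc):
    """log P(q | d) = sum_t log P(q_t | q_<t, d), accumulated left to right.
    token_prob(token, prefix, doc) is a hypothetical stand-in for the LM."""
    return sum(math.log(token_prob(tok, tuple(query_tokens[:t]), doc))
               for t, tok in enumerate(query_tokens))
```

Documents are then ranked by this log-likelihood, with no relevance classifier involved.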


Summary Takeaways

  1. Relevance Classification is the dominant paradigm (Cross-Encoders).
  2. Aggregation Matters: Moving from passage scores to passage representations (PARADE) improves document ranking.
  3. Efficiency Tradeoffs: Pairwise models (duoT5) add quality but require a second stage to manage latency.
  4. Generative Future: monoT5 and seq2seq models are currently state-of-the-art for many tasks.