IR-PTR Ch4: Refining Query and Document Representations

Overview

This chapter explores techniques to mitigate the vocabulary mismatch problem, where the terms used in queries differ from those in relevant documents. While Transformers solve this via semantic matching in reranking, the initial candidate generation stage (e.g., BM25) remains a bottleneck. Refining query and document representations allows these techniques to bridge the gap between classical Information Retrieval (exact match) and neural approaches.

4.1 General Remarks on Expansion

Both Query Expansion and Document Expansion aim to align representations by adding or reweighting terms.

Feature	Query Expansion	Document Expansion
Advantage	Flexibility, short experimental cycles, can aggregate evidence (post-retrieval).	Richer context for transformers, embarrassingly parallel, pushes inference to indexing time.
Disadvantage	Longer queries increase retrieval latency.	Indexing time/cost, requires re-indexing for any model change.

Intuition

Document expansion is like “predictive annotation”—adding terms a document should have been tagged with to be findable.

4.2 Pseudo-Relevance Feedback with Contextualized Embeddings: CEQE

Pseudo-relevance feedback (PRF) assumes the top-k results of an initial search are relevant and uses them to expand the query.

Rocchio Algorithm

One of the earliest PRF methods, performing manipulations in the vector space model.

$q_{n e w} = α q_{o l d} + β \frac{1}{∣ D _{re l} ∣} \sum_{d \in D_{re l}} d - γ \frac{1}{∣ D _{n re l} ∣} \sum_{d \in D_{n re l}} d$

CEQE (Contextualized Embeddings for Query Expansion)

Standard PRF with BERT for IR is difficult because BERT expects natural language, not keyword lists. CEQE uses BERT’s contextual embeddings from the 11th layer to calculate the probability of an expansion term $w$ .

CEQE Probability

$p (w, Q, D) = \sum_{D} p (w ∣ Q, D) p (Q ∣ D) p (D)$ Where $p (w ∣ Q, D)$ is calculated using cosine similarity between term mentions and a query centroid (or term-based representation with pooling).

4.3 Document Expansion via Query Prediction: doc2query

Also known as docTTTTTquery when using T5.

doc2query

A sequence-to-sequence model (like T5) is trained to generate potential queries given a document as input. These queries are then appended to the original document before indexing.

Key Findings:

New vs. Copied Terms: Approximately 31% of predicted terms are “new” (not in the doc), helping bridge the vocabulary mismatch. 69% are “copied” (term reweighting).
Effectiveness: Often achieves the effectiveness of non-BERT neural models using only basic keyword search.
Independence: The technique is a “free boost” for first-stage retrieval that doesn’t require GPU inference at query time.

4.4 Term Reweighting as Regression: DeepCT

Unlike doc2query, which reweights terms indirectly by repetition, DeepCT directly predicts term importance.

Query Term Recall (QTR)

The label used to train DeepCT is the fraction of relevant queries containing term $t$ : $QTR (t, d) = \frac{∣ Q _{d, t} ∣}{∣ Q _{d} ∣}$

Mechanism:

Regression: A BERT-based model takes document $d$ and outputs importance score $\overset{y}{^}_{t, d}$ for each term $t$ .
Indexing: Scores are rescaled (e.g., 0-100) and treated as term frequencies in a standard Inverted Index.
Efficiency: Only one inference pass per document is needed, compared to multiple sampling passes for doc2query.

4.5 Term Reweighting with Weak Supervision: HDCT

HDCT (Hierarchical Document Term Weighting) extends DeepCT for long documents.

Workflow

Split document into passages.

Pass each passage through BERT to get term weights $\overset{y}{^}_{t, p}$ .

Aggregate passage weights into a document-level weight: $D (d) = \sum pw i \times P (p i)$ .

Use “Sum” or “Decay” (discounting later passages) for weights.

Weak Supervision: Since passage-level judgments are rare, HDCT uses document titles or pseudo-relevant documents to generate synthetic training labels.

4.6 Combining Expansion and Reweighting: DeepImpact

DeepImpact combines doc2query’s expansion with a scoring model to obtain the “best of both worlds.”

Expand: Generate terms with doc2query-T5.
Weight: Use BERT and an MLP to predict “impact” weights for both original and expansion terms.
Index: Store quantized weights (impacts) in the term frequency position.

Intuition

DeepImpact allows keyword retrieval to approach the effectiveness of Neural Reranking (like monoBERT) while being an order of magnitude faster and requiring no query-time neural inference.

4.7 Connection to Sparse Retrieval

These techniques refine textual representations but still utilize the Inverted Index. They represent a bridge toward Learned Sparse Retrieval (e.g., SPLADE), where the model learns a high-dimensional sparse vector across the entire vocabulary rather than just weighting existing/predicted terms.

Summary

Query refinement (CEQE) is powerful but computationally expensive at query time.
Document refinement (doc2query, DeepCT, DeepImpact) pushes the “heavy lifting” to indexing time.
These methods solve Information Retrieval’s oldest problem—vocabulary mismatch—within the framework of efficient classical search engines.

Study Notes

Explorer

IR-PTR Ch4 - Refining Query and Document Representations

IR-PTR Ch4: Refining Query and Document Representations

Overview

4.1 General Remarks on Expansion

4.2 Pseudo-Relevance Feedback with Contextualized Embeddings: CEQE

CEQE (Contextualized Embeddings for Query Expansion)

4.3 Document Expansion via Query Prediction: doc2query

Key Findings:

4.4 Term Reweighting as Regression: DeepCT

Mechanism:

4.5 Term Reweighting with Weak Supervision: HDCT

4.6 Combining Expansion and Reweighting: DeepImpact

4.7 Connection to Sparse Retrieval

Summary

Graph View

Table of Contents

Backlinks