DocT5Query
DocT5Query (also known as docTTTTTquery) is a document expansion method that uses a sequence-to-sequence model (specifically T5) to generate synthetic queries for each document. These generated queries are appended to the document text before indexing with a standard sparse retriever such as BM25.
Bridging the Vocabulary Gap
Traditional lexical search requires exact term overlap between the query and the document: a search for “car” will not match a document that only says “automobile.” DocT5Query instead predicts what questions a document might answer. By adding these predicted questions (and their vocabulary) to the document, the model helps bridge the gap between user queries and document language, effectively performing “expansion at index time.”
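A toy illustration of the gap being bridged (the documents and queries here are invented for demonstration, and term counting is a crude stand-in for BM25 scoring): an exact-match retriever misses most of a relevant document's terms until predicted questions are appended to it.

```python
import re

def tokens(text):
    """Lowercase word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))

def term_overlap(query, doc):
    """Number of query terms appearing in the document (crude proxy for lexical match)."""
    return len(tokens(query) & tokens(doc))

doc = "The Hubble telescope captured images of distant galaxies."
query = "space photos of galaxies"

print(term_overlap(query, doc))  # only "of" and "galaxies" match -> 2

# Append synthetic queries an expansion model might generate for this document.
expanded = doc + " what did the hubble telescope photograph? space photos of galaxies"
print(term_overlap(query, expanded))  # "space" and "photos" now also match -> 4
```

The expanded document now contains the query's vocabulary even though the original author never used those words, which is exactly the effect DocT5Query aims for at index time.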
Key Mechanism
- Training: A T5 model is trained on a dataset of (Query, Document) pairs to predict the query given the document.
- Generation: For every document in the collection, the trained model samples a fixed number of synthetic queries (e.g., 40 per passage), typically with top-k sampling to encourage diverse phrasings.
- Appending: The generated queries are concatenated to the original document text.
- Indexing: The expanded documents are indexed using a traditional inverted index (e.g., Lucene, Pyserini).
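The generation and appending steps can be sketched as follows. This is a non-authoritative sketch: the checkpoint name (`castorini/doc2query-t5-base-msmarco`, a publicly shared doc2query-T5 model) and the decoding settings are assumptions, and the model-dependent part is isolated in its own function so the appending logic runs without downloading any weights.

```python
def expand_document(doc: str, queries: list[str]) -> str:
    """Appending step: concatenate generated queries onto the original text."""
    return doc + " " + " ".join(queries)

def generate_queries(doc: str, n: int = 40) -> list[str]:
    """Generation step: sample n synthetic queries from a trained T5 model.

    Requires the `transformers` package; checkpoint name and sampling
    parameters are assumptions, not values fixed by the method itself.
    """
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    name = "castorini/doc2query-t5-base-msmarco"  # assumed public checkpoint
    tok = T5Tokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)
    input_ids = tok(doc, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(
        input_ids,
        max_length=64,
        do_sample=True,   # top-k sampling for query diversity
        top_k=10,
        num_return_sequences=n,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]

# Demonstrate appending with hand-written stand-ins for model output.
doc = "The Amazon rainforest produces a large share of the planet's oxygen."
fake_queries = [
    "how much oxygen does the amazon rainforest produce",
    "why is the amazon rainforest important",
]
print(expand_document(doc, fake_queries))
```

The expanded string is what gets handed to the indexer; the original document and its synthetic queries are indexed as a single field, so no changes to the retriever itself are needed.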
Connections
- Extends: BM25 by adding semantic context via expansion.
- Comparison: Unlike DeepCT, which re-weights existing terms, DocT5Query adds new terms.
- Complementary to: Neural Reranking, where DocT5Query provides a better set of initial candidates.