DocT5Query
DocT5Query (also known as docTTTTTquery) is a document expansion method that uses a sequence-to-sequence model (specifically T5) to generate synthetic queries for each document. These generated queries are appended to the document text before indexing with a standard sparse retriever such as BM25.
Bridging the Vocabulary Gap
Traditional lexical search requires exact term overlap between the query and the document: a search for “car” will not match a document that only says “automobile.” DocT5Query instead predicts what questions a document might answer. By adding these predicted questions (and their vocabulary) to the document, the model helps bridge the gap between user queries and document language, effectively performing “expansion at index time.”
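A toy illustration of the gap being bridged (the documents and queries here are invented for demonstration, and term counting is a crude stand-in for BM25 scoring): an exact-match retriever misses most of a relevant document's terms until predicted questions are appended to it.

```python
import re

def tokens(text):
    """Lowercase word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))

def term_overlap(query, doc):
    """Number of query terms appearing in the document (crude proxy for lexical match)."""
    return len(tokens(query) & tokens(doc))

doc = "The Hubble telescope captured images of distant galaxies."
query = "space photos of galaxies"

print(term_overlap(query, doc))  # only "of" and "galaxies" match -> 2

# Append synthetic queries an expansion model might generate for this document.
expanded = doc + " what did the hubble telescope photograph? space photos of galaxies"
print(term_overlap(query, expanded))  # "space" and "photos" now also match -> 4
```

The expanded document now contains the query's vocabulary even though the original author never used those words, which is exactly the effect DocT5Query aims for at index time.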
Key Mechanism
- Training: A T5 model is trained on a dataset of (Query, Document) pairs to predict the query given the document.
- Generation: For every document in the collection, the trained model samples a fixed number of synthetic queries (e.g., 40 per passage), typically with top-k sampling to encourage diverse phrasings.
- Appending: The generated queries are concatenated to the original document text.
- Indexing: The expanded documents are indexed using a traditional inverted index (e.g., Lucene, Pyserini).
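The generation and appending steps can be sketched as follows. This is a non-authoritative sketch: the checkpoint name (`castorini/doc2query-t5-base-msmarco`, a publicly shared doc2query-T5 model) and the decoding settings are assumptions, and the model-dependent part is isolated in its own function so the appending logic runs without downloading any weights.

```python
def expand_document(doc: str, queries: list[str]) -> str:
    """Appending step: concatenate generated queries onto the original text."""
    return doc + " " + " ".join(queries)

def generate_queries(doc: str, n: int = 40) -> list[str]:
    """Generation step: sample n synthetic queries from a trained T5 model.

    Requires the `transformers` package; checkpoint name and sampling
    parameters are assumptions, not values fixed by the method itself.
    """
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    name = "castorini/doc2query-t5-base-msmarco"  # assumed public checkpoint
    tok = T5Tokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)
    input_ids = tok(doc, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(
        input_ids,
        max_length=64,
        do_sample=True,   # top-k sampling for query diversity
        top_k=10,
        num_return_sequences=n,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]

# Demonstrate appending with hand-written stand-ins for model output.
doc = "The Amazon rainforest produces a large share of the planet's oxygen."
fake_queries = [
    "how much oxygen does the amazon rainforest produce",
    "why is the amazon rainforest important",
]
print(expand_document(doc, fake_queries))
```

The expanded string is what gets handed to the indexer; the original document and its synthetic queries are indexed as a single field, so no changes to the retriever itself are needed.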
Connections
- Extends: BM25 by adding semantic context via expansion.
- Comparison: Unlike DeepCT, which re-weights existing terms, DocT5Query adds new terms.
- Complementary to: Neural Reranking, where DocT5Query provides a better set of initial candidates.