DSI
DSI (Differentiable Search Index)
DSI is a “generative retrieval” paradigm where a single neural model (e.g., T5) maps a query directly to a Document ID (docid). The model itself acts as the search index.
Model as Index
Phases in DSI
- Indexing Phase: The model is trained to memorize the mapping: document text → docid.
- Retrieval Phase: At inference time, given a query, the model generates the most likely docid.
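Both phases can be framed as ordinary text-to-text training pairs for a seq2seq model. The sketch below is a minimal illustration, not the reference implementation: the `Example` class, the function names, the `index:` / `query:` task prefixes, and the `max_tokens` truncation are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # text fed to the encoder
    target: str  # docid string the decoder is trained to emit


def indexing_examples(corpus, max_tokens=32):
    """Indexing phase: one (document text -> docid) pair per document.

    `corpus` maps docid strings to document text. Truncating to the first
    `max_tokens` whitespace tokens is an illustrative choice.
    """
    examples = []
    for docid, text in corpus.items():
        truncated = " ".join(text.split()[:max_tokens])
        examples.append(Example(source=f"index: {truncated}", target=docid))
    return examples


def retrieval_examples(labeled_queries):
    """Retrieval phase: one (query -> docid) pair per labeled query."""
    return [Example(source=f"query: {query}", target=docid)
            for query, docid in labeled_queries]


if __name__ == "__main__":
    corpus = {"17": "the quick brown fox jumps",
              "42": "neural retrieval with transformers"}
    for ex in indexing_examples(corpus, max_tokens=3):
        print(ex)
    for ex in retrieval_examples([("what is neural retrieval", "42")]):
        print(ex)
```

Because both kinds of pairs share the same target space (docid strings), they can be mixed in a single training set for one model; at inference time only the query-style input is used.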
Identifier Design (DocIDs)
The choice of DocID is critical for performance:
- Unstructured IDs: An arbitrary integer per document. The ID carries no information about the content, which makes the mapping hard for the model to learn.
- Semantic IDs: Hierarchical IDs (e.g., 1.4.2) derived from clustering document content. This allows the model to predict the “category” first.
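One way to derive such hierarchical IDs is to cluster document embeddings recursively and record the cluster index chosen at each level. The following is a rough sketch under stated assumptions: `kmeans` is a tiny pure-Python Lloyd's algorithm written for this example, `semantic_ids` is a hypothetical helper, and the branching factor `k` and `leaf_size` are illustrative values. The embeddings are assumed to come from some external encoder.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: _dist2(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = _mean(members)
    return labels


def semantic_ids(embeddings, k=2, leaf_size=2, seed=0):
    """Map each document name to a dotted hierarchical ID such as '1.0'."""
    ids = {}

    def assign_leaf(names, prefix):
        for i, name in enumerate(names):
            ids[name] = ".".join(str(p) for p in prefix + [i])

    def recurse(names, prefix):
        if len(names) <= leaf_size:
            assign_leaf(names, prefix)
            return
        labels = kmeans([embeddings[n] for n in names],
                        min(k, len(names)), seed=seed)
        groups = [[n for n, lab in zip(names, labels) if lab == c]
                  for c in range(k)]
        if max(len(g) for g in groups) == len(names):
            # Clustering failed to split the set; stop to avoid looping.
            assign_leaf(names, prefix)
            return
        for c, group in enumerate(groups):
            if group:
                recurse(group, prefix + [c])

    recurse(sorted(embeddings), [])
    return ids


if __name__ == "__main__":
    toy = {"a": [0, 0], "b": [0, 1], "c": [10, 10], "d": [10, 11]}
    print(semantic_ids(toy))
```

Documents that land in the same cluster share an ID prefix, so the first tokens the model generates narrow the search to a region of the collection before the final component picks out a single document.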
Pros and Cons
- Pros: End-to-end differentiable; no separate ANN index required.
- Cons: Difficult to update (adding a new document requires retraining or fine-tuning); hard to scale beyond relatively small collections, since every document must be memorized in the model's parameters.
Connections
- Category: Generative Retrieval.
- Comparison: Different from DPR (search-by-embedding) or GENRE (search-by-entity-name).
- Components: Semantic DocIDs are built with hierarchical clustering of document content.