DPR (Dense Passage Retrieval)
DPR is a retrieval model that maps queries and documents into a shared dense vector space using BERT-based bi-encoders. Retrieval is performed using maximum inner product search (MIPS) on these embeddings.
Scoring via Dual Encoders
The relevance score is the dot product of two independently computed embeddings:

$$\mathrm{sim}(q, d) = E_Q(q)^\top E_D(d)$$

where:
- $E_Q(q)$ — Dense representation of the query (Query Encoder)
- $E_D(d)$ — Dense representation of the document (Document Encoder)
- Both encoders are typically BERT-based.
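A minimal sketch of the scoring step, using random NumPy vectors as stand-ins for the real BERT encoder outputs: scoring an entire collection against one query is just a matrix-vector product.

```python
import numpy as np

# Toy embeddings; in DPR these would come from the two BERT encoders.
rng = np.random.default_rng(0)
dim = 8
q = rng.normal(size=dim)          # E_Q(q): query embedding
docs = rng.normal(size=(5, dim))  # E_D(d): one row per document

# Relevance score is a plain dot product, so scoring the whole
# collection is one matrix-vector product.
scores = docs @ q
best = int(np.argmax(scores))     # index of the highest-scoring document
```

Because the score is bilinear in the two embeddings, the document side can be computed once and reused across all future queries.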
Key Components
- Bi-Encoder Architecture: Unlike Cross-Encoders, the query and document do not see each other during encoding.
- Indexing: Document embeddings are pre-computed and stored in an ANN index (e.g., FAISS).
- Training: Uses Contrastive Learning with an InfoNCE-like objective.
- Negative Sampling: Crucially uses In-batch Negatives — for a batch of queries and their relevant documents, the other documents in the batch serve as negatives for each query.
Semantic Search
Because it uses dense vectors, DPR can find documents that are topically relevant but don’t share any words with the query (solving the “lexical mismatch” problem of BM25).
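Retrieval over the pre-computed index can be sketched as exact MIPS by brute force; in production the document matrix would live in an ANN index (e.g. FAISS) so search stays fast at millions of passages. All embeddings here are toy random values:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_docs, k = 16, 100, 3

index = rng.normal(size=(n_docs, dim))  # pre-computed E_D(d) for the corpus
query = rng.normal(size=dim)            # E_Q(q), computed at query time

scores = index @ query                  # inner product with every document
top_k = np.argsort(-scores)[:k]         # ids of the k best passages
```

Note that nothing here inspects tokens: two texts land near each other purely because their encoders map them to similar vectors, which is exactly how DPR sidesteps lexical mismatch.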
Connections
- Trained via: Contrastive Learning, Hard Negative Mining.
- Efficiency: Uses Approximate Nearest Neighbor for retrieval.
- Role: Often the first stage in Multi-Stage Ranking.
- Comparison: Much faster at inference than MonoBERT — documents are encoded once offline, so per-query scoring is just a dot product rather than a full transformer pass per query–document pair.