COIL (Contextualized Inverted List)
COIL is a retrieval model that combines the efficiency of exact lexical matching with the power of contextualized token representations. It stores low-dimensional BERT-based token embeddings directly in an inverted index, allowing for “matching by embedding” within the framework of traditional sparse retrieval.
The Bridge Between Sparse and Dense
- Sparse (BM25): Matches exact tokens, but forgets context (polysemy).
- Dense (DPR): Captures global context, but might miss exact keyword matches.
- COIL: Matches only identical tokens (the lexical constraint) but scores each match by the similarity of their contextual embeddings. It asks: "Is the word 'bank' in the query used in the same sense as 'bank' in the document?"
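The sense check above can be made concrete with a toy example. The vectors here are hand-crafted stand-ins for contextual embeddings, not real BERT outputs; the point is only that a dot product between same-token vectors separates word senses:

```python
# Illustrative only: hand-crafted 3-d vectors stand in for contextual
# embeddings of the token "bank" appearing in different sentences.
import numpy as np

bank_query   = np.array([0.9, 0.1, 0.0])  # "open a bank account" (finance sense)
bank_finance = np.array([0.8, 0.2, 0.1])  # "the bank approved the loan"
bank_river   = np.array([0.1, 0.0, 0.9])  # "fishing on the river bank"

# Same surface token, but the dot product separates the senses:
print(bank_query @ bank_finance)  # high similarity -> strong match
print(bank_query @ bank_river)    # low similarity  -> weak match
```

A pure lexical matcher like BM25 would score both documents identically on "bank"; COIL's embedding check downweights the river sense.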
Key Mechanism
- Encoding: Each token in the corpus is encoded by a BERT-like transformer into a contextual vector.
- Indexing: The inverted index stores entries of the form term -> (doc_id, vector).
- Retrieval: At query time, the system retrieves documents containing the query terms and computes the similarity (e.g., dot product) between query token vectors and document token vectors.
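The mechanism above can be sketched end to end. This is a minimal illustration, not the paper's implementation: the encode function uses random vectors in place of a BERT-like encoder, and the documents and dimensionality are made up. Scoring follows the COIL rule of taking, for each query token, the maximum similarity among document tokens with the same surface form, then summing per document:

```python
# Minimal COIL-style index-and-score sketch. Toy random vectors stand in
# for the contextual token embeddings a BERT-like encoder would produce.
from collections import defaultdict
import numpy as np

DIM = 8  # COIL stores low-dimensional token vectors (e.g., 8-32 dims)
rng = np.random.default_rng(0)

def encode(tokens):
    """Stand-in for a BERT-like encoder: one contextual vector per token."""
    return [rng.standard_normal(DIM).astype(np.float32) for _ in tokens]

# --- Indexing: inverted index maps term -> [(doc_id, vector), ...] ---
docs = {0: ["bank", "river"], 1: ["bank", "loan"]}
index = defaultdict(list)
for doc_id, tokens in docs.items():
    for tok, vec in zip(tokens, encode(tokens)):
        index[tok].append((doc_id, vec))

# --- Retrieval: exact lexical match gates which vectors are compared ---
def score(query_tokens):
    scores = defaultdict(float)
    for tok, q_vec in zip(query_tokens, encode(query_tokens)):
        per_doc = defaultdict(list)
        for doc_id, d_vec in index.get(tok, []):  # same-term postings only
            per_doc[doc_id].append(float(q_vec @ d_vec))
        for doc_id, sims in per_doc.items():
            scores[doc_id] += max(sims)  # max over repeated occurrences
    return dict(scores)

print(score(["bank", "loan"]))
```

Because only same-term postings are ever compared, the index stays as small and fast to traverse as a classical inverted list, with one low-dimensional dot product replacing each term-frequency lookup.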
Connections
- Comparison: A more efficient relative of ColBERT: ColBERT performs late interaction across all query-document token pairs, while COIL restricts interaction to exact lexical matches.
- Category: Learned Sparse Retrieval.
- Successors: Influenced models like SPLADE which further sparsify the representation.