IR Lecture 3: Retrieval Models
1. Introduction to Retrieval Models
Retrieval models provide a mathematical framework for defining query-document matching. They include assumptions about relevance and serve as the basis for ranking algorithms.
Retrieval models are generally categorized into two paradigms:
- Lexical Matching: matching based on word occurrences (terms).
  - Vector Space Model (VSM): TF-IDF weighting.
  - Probabilistic Models: BM25.
  - Language Models: Query Likelihood.
- Semantic Matching: matching based on meaning/representations.
  - Distributed Representations: neural models (e.g., word embeddings, BERT).
2. Vector Space Model & Term Weighting
In the Vector Space Model (VSM), documents and queries are represented as vectors in a high-dimensional space where each dimension corresponds to a term in the vocabulary.
2.1 Lexical Incidence Matrix
The simplest representation is binary, indicating presence (1) or absence (0) of a term. Similarity is often measured using Cosine Similarity:

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\|\,\|\vec{d}\|} = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^2}\,\sqrt{\sum_{i} d_i^2}}$$
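As a toy illustration, cosine similarity over binary incidence vectors can be computed directly (the vocabulary and vectors below are invented for the example):

```python
import math

# Toy vocabulary and binary incidence vectors (1 = term present, 0 = absent).
vocab = ["retrieval", "model", "neural", "ranking"]
doc = [1, 1, 0, 1]    # document contains: retrieval, model, ranking
query = [1, 0, 0, 1]  # query contains: retrieval, ranking

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(round(cosine(doc, query), 4))  # 2 / (sqrt(3) * sqrt(2)) ≈ 0.8165
```

Note that with binary vectors the dot product is just the number of shared terms, so cosine similarity here only rewards term overlap, normalized by vector lengths.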
2.2 Term Frequency (TF)
Raw frequency counts are more informative than binary indicators, but the marginal evidence from each additional occurrence diminishes.
Retrieval Axioms: Term Frequency
- TFC1: Higher score for documents with more query term occurrences.
- TFC2 (Saturation): The increase in score due to TF should be sub-linear (the difference between 1 and 2 occurrences is more significant than the difference between 100 and 101).
- TFC3: If total occurrences are equal, documents covering more distinct query terms should be preferred.
Common sub-linear transformation (logarithmic):

$$\mathrm{tf}(t, d) = \begin{cases} 1 + \log c(t, d) & \text{if } c(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $c(t, d)$ is the raw count of term $t$ in document $d$.
2.3 Inverse Document Frequency (IDF)
According to Zipf’s Law, a small number of words accounts for a very large share of all term occurrences. These highly frequent words (e.g., “the”, “and”) are poor discriminators.
Term Discrimination Constraint (TDC)
Terms popular across the entire collection should be penalized. The standard penalty is the Inverse Document Frequency:

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$

where:
- $N$ — total number of documents in the collection.
- $\mathrm{df}(t)$ — number of documents containing term $t$.
2.4 TF-IDF Weighting
The weight of term $t$ in document $d$ combines both factors:

$$w(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) = (1 + \log c(t, d)) \cdot \log \frac{N}{\mathrm{df}(t)}$$
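A minimal sketch of this weighting over a toy collection (document contents are invented for illustration; a zero weight is returned for absent or unknown terms):

```python
import math

# Toy collection; in practice N and df come from the inverted index.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "cat and dog".split(),
}
N = len(docs)
df = {}  # document frequency of each term
for tokens in docs.values():
    for t in set(tokens):
        df[t] = df.get(t, 0) + 1

def tfidf(term, tokens):
    """w(t, d) = (1 + log c(t, d)) * log(N / df(t)); 0 if the term is absent."""
    c = tokens.count(term)
    if c == 0 or term not in df:
        return 0.0
    return (1 + math.log(c)) * math.log(N / df[term])

print(round(tfidf("cat", docs["d1"]), 4))  # (1 + log 1) * log(3/2) ≈ 0.4055
print(round(tfidf("mat", docs["d1"]), 4))  # rarer term, df = 1: log 3 ≈ 1.0986
```

The rarer term "mat" receives a higher weight than "cat" at the same raw frequency, which is exactly the TDC behavior: collection-wide popularity is penalized.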
3. Probabilistic Models: BM25
BM25 (Best Matching 25) is the most widely used weighting function in IR. It effectively balances TF, IDF, and document length.
3.1 The BM25 Formula
For a query $q$ containing terms $t_1, \dots, t_n$, the score for document $d$ is:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{c(t, d) \cdot (k_1 + 1)}{c(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:
- $c(t, d)$ — term frequency of $t$ in document $d$.
- $|d|$ — length of document $d$ (number of tokens).
- $\mathrm{avgdl}$ — average document length in the collection.
- $k_1$ — term frequency saturation parameter (typically $1.2$ to $2.0$). Controls how quickly the TF effect saturates.
- $b$ — length normalization parameter (typically $0.75$). $b = 1$ is full normalization, $b = 0$ is no normalization.
3.2 Intuition
- Saturation: As $c(t, d)$ increases, the term's contribution approaches an upper bound of $\mathrm{idf}(t) \cdot (k_1 + 1)$, so repeated occurrences yield diminishing gains.
- Length Normalization: Longer documents are expected to have higher TFs naturally; we normalize to avoid bias toward long documents that aren’t necessarily more relevant.
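The scoring function can be sketched in a few lines. This is a simplified illustration using the plain $\log(N/\mathrm{df})$ IDF from Section 2.3 rather than the probabilistic (BIM-derived) IDF variant some BM25 implementations use; the toy documents are invented:

```python
import math

def bm25_score(query_terms, doc_tokens, df, N, avgdl, k1=1.2, b=0.75):
    """Sum over query terms of idf(t) times a saturated, length-normalized TF."""
    score = 0.0
    dl = len(doc_tokens)
    for t in query_terms:
        c = doc_tokens.count(t)
        if c == 0 or t not in df:
            continue  # terms absent from the document contribute nothing
        idf = math.log(N / df[t])
        norm = k1 * (1 - b + b * dl / avgdl)  # length-normalized saturation denominator
        score += idf * (c * (k1 + 1)) / (c + norm)
    return score

# Toy collection.
docs = {"d1": "the cat sat on the mat".split(),
        "d2": "a dog chased the cat".split(),
        "d3": "the dog barked".split()}
N = len(docs)
df = {}
for toks in docs.values():
    for t in set(toks):
        df[t] = df.get(t, 0) + 1
avgdl = sum(len(toks) for toks in docs.values()) / N

s1 = bm25_score(["cat"], docs["d1"], df, N, avgdl)
s2 = bm25_score(["cat"], docs["d2"], df, N, avgdl)
print(s1, s2)  # d2 is shorter with the same TF, so it scores higher
```

With equal term frequency, the shorter document d2 outscores d1: the $b$-controlled normalization penalizes length, which is the second intuition above.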
4. Language Models for IR (LMIR)
Instead of matching vectors, we treat retrieval as a generative process. We estimate a Language Model for each document and ask: “What is the probability that this document model generated the query?”
4.1 Query Likelihood Model
Documents are ranked by the probability $P(d \mid q)$. Using Bayes’ Rule:

$$P(d \mid q) = \frac{P(q \mid d)\, P(d)}{P(q)} \propto P(q \mid d)\, P(d)$$

Assuming a uniform prior $P(d)$, we rank by $P(q \mid d)$.
Under the multinomial assumption (terms generated independently):

$$P(q \mid d) = \prod_{t \in q} P(t \mid \theta_d)^{c(t, q)}$$

In log-space (to avoid underflow):

$$\log P(q \mid d) = \sum_{t \in q} c(t, q) \log P(t \mid \theta_d)$$
4.2 Estimation and Smoothing
The Maximum Likelihood Estimate (MLE) for term $t$ in document $d$ is:

$$P_{\mathrm{ML}}(t \mid \theta_d) = \frac{c(t, d)}{|d|}$$
Problem: If a query term $t$ is missing from document $d$, then $P_{\mathrm{ML}}(t \mid \theta_d) = 0$, making the entire product 0. Solution: Smoothing — adjusting estimates to avoid zero probabilities and incorporate background knowledge (the collection model $P(t \mid C)$).
Smoothing Methods

- Jelinek-Mercer (Linear Interpolation):

  $$P_\lambda(t \mid \theta_d) = (1 - \lambda)\, P_{\mathrm{ML}}(t \mid \theta_d) + \lambda\, P(t \mid C)$$

  - Small $\lambda$ (e.g., 0.1) highlights document-specific content (precision).
  - Large $\lambda$ (e.g., 0.7) works better for long queries (recall).

- Dirichlet Prior Smoothing:

  $$P_\mu(t \mid \theta_d) = \frac{c(t, d) + \mu\, P(t \mid C)}{|d| + \mu}$$

  - $\mu$ is the smoothing parameter (often $\mu \approx 2000$).
  - It performs “Bayesian” smoothing where the amount of smoothing depends on the document length $|d|$: short documents are smoothed more heavily.

- Absolute Discounting: subtracts a constant $\delta$ from each seen count and redistributes the freed mass to unseen terms in proportion to $P(t \mid C)$:

  $$P_\delta(t \mid \theta_d) = \frac{\max(c(t, d) - \delta, 0)}{|d|} + \frac{\delta\, |d|_u}{|d|}\, P(t \mid C)$$

  where $|d|_u$ is the number of distinct terms in $d$.
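As a sketch, the query-likelihood score with Dirichlet prior smoothing might be computed as follows (the toy collection is invented, and the small $\mu$ is chosen only so the effect is visible at this scale; $\mu \approx 2000$ is more typical on real collections):

```python
import math

def dirichlet_log_likelihood(query_terms, doc_tokens, collection_prob, mu=2000):
    """log P(q | d) with P(t | theta_d) = (c(t,d) + mu * P(t|C)) / (|d| + mu)."""
    dl = len(doc_tokens)
    logp = 0.0
    for t in query_terms:
        p = (doc_tokens.count(t) + mu * collection_prob.get(t, 0.0)) / (dl + mu)
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection model
        logp += math.log(p)
    return logp

# Collection model P(t | C) estimated from the pooled term counts.
docs = {"d1": "the cat sat on the mat".split(), "d2": "the dog sat".split()}
pooled = [t for toks in docs.values() for t in toks]
p_c = {t: pooled.count(t) / len(pooled) for t in set(pooled)}

q = ["cat", "sat"]
scores = {d: dirichlet_log_likelihood(q, toks, p_c, mu=10)
          for d, toks in docs.items()}
print(scores)  # d1 contains both query terms, so it ranks higher
```

Because smoothing backs off to $P(t \mid C)$, d2 still gets a nonzero (but lower) score even though it never mentions "cat" — exactly the zero-probability fix described above.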
5. Model Comparison Summary
| Model | Foundation | Key Components |
|---|---|---|
| VSM (TF-IDF) | Geometry | TF, IDF, Cosine Sim |
| BM25 | Probabilistic (BIM) | Saturation-TF, IDF, Length Norm |
| LMIR | Probability/Generative | Term distribution, Smoothing |
Choice of Model
BM25 and LMIR with Dirichlet smoothing are generally state-of-the-art for lexical retrieval and perform similarly in practice. BM25 is easier to tune (parameters $k_1$ and $b$), while LMIR is more theoretically grounded for extensions (e.g., translation models).
6. Optimization: Skip Pointers
To speed up query processing (especially conjunctive queries), inverted lists can be augmented with skip pointers.
- They allow jumping over large portions of a postings list when the current document ID is smaller than the one being evaluated on another list.
- Trade-off: more skips → less data read per query, but more overhead in pointer storage. Optimal skip distance is typically around 100 bytes.
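A minimal sketch of postings-list intersection with skip pointers, using the common heuristic of evenly spaced skips every $\sqrt{L}$ entries (the doc-ID lists below are invented):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, taking a skip whenever the
    skip target does not overshoot the candidate on the other list."""
    skip1 = max(1, int(math.sqrt(len(p1))))  # evenly spaced skips, ~sqrt(L) apart
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1  # safe: every skipped entry is < p2[j]
            else:
                i += 1      # plain one-step advance
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 3, 5, 7, 9, 11, 13, 15, 17], [3, 9, 15, 100]))
# [3, 9, 15]
```

The skip is safe because everything jumped over is strictly smaller than the skip target, which is itself no larger than the doc ID being matched on the other list, so no common document can be missed.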
7. Update Strategies
- Index Merging: Create small new index and merge with old one periodically.
- Geometric Partitioning: Maintain multiple indexes of geometrically increasing size ($I_0, I_1, I_2, \dots$). When index $I_k$ reaches its size limit, merge it into $I_{k+1}$.
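The cascade of merges can be sketched with a toy in-memory analogue, where each "index" is just a sorted list of doc IDs and level $k$ holds at most $\mathrm{cap}_0 \cdot r^k$ entries (the class name and capacities are invented for illustration):

```python
import bisect

class GeometricIndex:
    """Toy geometric partitioning: level k holds at most cap0 * ratio**k
    doc IDs; an overflowing level is merged into the next, cascading upward."""

    def __init__(self, cap0=2, ratio=2):
        self.cap0 = cap0
        self.ratio = ratio
        self.levels = [[]]  # levels[k] is a sorted list of doc IDs

    def add(self, doc_id):
        bisect.insort(self.levels[0], doc_id)  # new docs enter the smallest index
        k = 0
        while len(self.levels[k]) > self.cap0 * self.ratio ** k:
            if k + 1 == len(self.levels):
                self.levels.append([])
            # "Index merging": combine I_k into I_{k+1}, then empty I_k.
            self.levels[k + 1] = sorted(self.levels[k] + self.levels[k + 1])
            self.levels[k] = []
            k += 1

    def search(self, doc_id):
        # A query must consult every partial index.
        return any(doc_id in level for level in self.levels)

gi = GeometricIndex()
for d in range(10):
    gi.add(d)
print([len(level) for level in gi.levels])  # docs accumulate in larger levels
```

Each document is rewritten only $O(\log n)$ times over its lifetime, which is the appeal of geometric partitioning over repeatedly merging everything into one big index.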