Chapter 7: Retrieval Models
7.1 Overview of Retrieval Models
Retrieval models provide a mathematical framework to formalize the process of deciding if a piece of text is relevant to an information need. Good models produce rankings that correlate with human relevance decisions, leading to high effectiveness.
Relevance
- Topical Relevance: Whether a document and query are “about the same thing.”
- User Relevance: A broader concept incorporating factors like age, language, novelty, and target audience.
- Binary vs. Multi-valued: While relevance is often multi-level in reality, many models assume binary relevance (relevant/not relevant) for simplicity, often calculating a probability of relevance to represent uncertainty.
7.1.1 Boolean Retrieval
The oldest model, also known as exact-match retrieval. Documents are retrieved only if they exactly match the query specification.
- Outcome: Binary (Matches or doesn’t). No inherent ranking among retrieved documents.
- Operators: `AND`, `OR`, `NOT`, proximity operators, wildcards.
- Pros: Predictable, easy to explain, efficient, allows complex metadata filtering.
- Cons: Effectiveness depends entirely on the user; “searching by numbers” (too many or too few results); lacks term weighting.
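Exact-match retrieval over an inverted index can be sketched in a few lines. This is a minimal illustration (the documents and IDs are invented, not from the chapter); each operator maps directly to a set operation on posting lists:

```python
# Minimal sketch of exact-match Boolean retrieval over an inverted index.
# Documents and IDs are illustrative only.
docs = {
    1: "information retrieval models",
    2: "boolean retrieval and ranking",
    3: "vector space model",
}

index = {}  # term -> set of document IDs (posting list as a set)
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(a, b): return index.get(a, set()) & index.get(b, set())
def OR(a, b):  return index.get(a, set()) | index.get(b, set())
def NOT(a):    return set(docs) - index.get(a, set())

print(sorted(AND("retrieval", "models")))  # docs containing both terms
print(sorted(OR("boolean", "vector")))     # docs containing either term
```

The binary outcome is visible here: a document either lands in the result set or it doesn't, and nothing orders the survivors.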
7.1.2 The Vector Space Model (VSM)
The Vector Space Model represents documents and queries as vectors in a $t$-dimensional space, where $t$ is the number of index terms.
Cosine Similarity
The most successful similarity measure for ranking in VSM:

$$\text{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \cdot q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}$$

where $d_{ij}$ is the weight of term $j$ in document $D_i$, and $q_j$ is the weight of term $j$ in the query.
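A toy sketch of the cosine ranking step, using sparse dicts keyed by term as vectors (the weights below are illustrative):

```python
import math

# Cosine similarity between sparse term-weight vectors (dicts keyed by term).
def cosine(d, q):
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) * \
           math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

d = {"retrieval": 0.8, "models": 0.6}  # illustrative document weights
q = {"retrieval": 1.0}                 # illustrative query weights
print(round(cosine(d, q), 3))
```

Dividing by the vector norms is what gives VSM its implicit length normalization: a long document cannot win simply by repeating every term.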
Term Weighting (TF-IDF):
- Term Frequency (tf): Reflects term importance within a document.
  - Standard: $tf_{ik} = f_{ik}$, the raw count of term $k$ in document $i$.
  - Log-normalized (to reduce impact of frequent terms): $tf_{ik} = 1 + \log f_{ik}$ when $f_{ik} > 0$, else $0$.
- Inverse Document Frequency (idf): Reflects term discriminative power in the collection.
  - $idf_k = \log \frac{N}{n_k}$, where $N$ is total documents and $n_k$ is the number of documents containing term $k$.
TF-IDF
A term is important if it occurs frequently in a specific document but rarely across the rest of the collection.
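The tf-idf idea can be made concrete with a tiny collection (the documents below are invented). This sketch uses the log-normalized tf and $\log(N/n_k)$ idf described above:

```python
import math

# TF-IDF weighting over a toy collection: log-normalized tf times idf.
docs = {
    1: ["cat", "cat", "dog"],
    2: ["dog", "bird"],
    3: ["cat", "bird", "bird"],
}
N = len(docs)

df = {}  # document frequency: number of docs containing each term
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1

def tf_idf(term, doc_id):
    f = docs[doc_id].count(term)
    if f == 0 or term not in df:
        return 0.0
    return (1 + math.log(f)) * math.log(N / df[term])

# "cat" occurs twice in doc 1 but appears in 2 of 3 docs -> moderate weight
print(round(tf_idf("cat", 1), 3))
```

A term appearing in every document gets $idf = \log(N/N) = 0$, so no amount of within-document repetition can make it matter, which is exactly the intuition in the callout.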
Rocchio Algorithm (Relevance Feedback): Used to modify the query vector $Q$ into $Q'$ based on the relevant set $Rel$ and non-relevant set $Nonrel$:

$$q'_j = \alpha \cdot q_j + \beta \cdot \frac{1}{|Rel|} \sum_{D_i \in Rel} d_{ij} - \gamma \cdot \frac{1}{|Nonrel|} \sum_{D_i \in Nonrel} d_{ij}$$
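A minimal sketch of the Rocchio update on sparse vectors; the $\alpha$, $\beta$, $\gamma$ defaults and the example weights are illustrative, not prescribed values:

```python
# Rocchio update: move the query vector toward the centroid of relevant
# docs and away from the centroid of non-relevant ones.
# alpha/beta/gamma defaults here are illustrative choices.
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(q) | {t for d in rel for t in d} | {t for d in nonrel for t in d}
    new_q = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if rel:
            w += beta * sum(d.get(t, 0.0) for d in rel) / len(rel)
        if nonrel:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrel) / len(nonrel)
        new_q[t] = max(w, 0.0)  # negative weights are commonly clipped to 0
    return new_q

q = {"retrieval": 1.0}
rel = [{"retrieval": 0.5, "models": 0.8}]  # one judged-relevant document
print(rocchio(q, rel, []))
```

Note how "models", absent from the original query, enters the modified query via the relevant document: Rocchio is also a simple form of query expansion.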
7.2 Probabilistic Models
The dominant paradigm today, based on representing the uncertainty inherent in Information Retrieval.
7.2.1 Probability Ranking Principle (PRP)
Principle
If a system ranks documents in order of decreasing probability of relevance, the overall effectiveness of the system will be the best obtainable.
7.2.2 Binary Independence Model (BIM)
Treats IR as a classification problem. Documents are viewed as binary vectors (presence/absence of terms).
Assumptions:
- Binary Relevance.
- Naïve Bayes Assumption: Terms occur independently in relevant and non-relevant sets.
BIM Scoring Function
Derived from the likelihood ratio and Bayes’ Rule:

$$\sum_{i \,:\, d_i = q_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$

where $p_i = P(d_i = 1 \mid R)$ and $s_i = P(d_i = 1 \mid NR)$.
In the absence of relevance information, the weight simplifies to an IDF-like variant: $\log \frac{N - n_i + 0.5}{n_i + 0.5}$.
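The IDF-like simplification is a one-liner; the collection sizes below are made up for illustration:

```python
import math

# BIM term weight with no relevance information:
# log((N - n_i + 0.5) / (n_i + 0.5)), the IDF-like variant above.
def bim_weight(N, n_i):
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

# A rare term (10 of 10,000 docs) far outweighs a common one (5,000 of 10,000).
print(round(bim_weight(10000, 10), 2))
print(round(bim_weight(10000, 5000), 2))  # exactly 0: term in half the docs
```

A term occurring in exactly half the collection gets weight 0, and a term in more than half gets a negative weight, i.e. its presence counts as weak evidence against relevance.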
7.2.3 The BM25 Ranking Algorithm
BM25 (Best Match 25) is a robust and widely used ranking algorithm that extends the Binary Independence Model by adding term frequency and document length normalization.
BM25 Score
$$\text{BM25}(D, Q) = \sum_{i \in Q} \log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)$$

Where:
- $f_i$: term frequency in document.
- $qf_i$: term frequency in query.
- $dl$: document length; $avdl$: average document length.
- $k_1$, $k_2$, $b$: empirical parameters (typical: $k_1 = 1.2$, $b = 0.75$, $k_2$ between 0 and 1000).
- The first factor is the BIM weight in the absence of relevance information.
Parameter k1
$k_1$ controls tf saturation: as the term frequency $f_i$ increases, each additional occurrence contributes less to the score, and a smaller $k_1$ makes this saturation kick in faster.
7.3 Ranking Based on Language Models
Treats each document as being generated from a language model (a probability distribution over words).
7.3.1 Query Likelihood Model
Ranks documents by the probability of generating the query text from the document’s language model $D$:

$$P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$$
7.3.2 Smoothing
Critical to avoid zero probabilities for missing query terms and to improve estimation.
Jelinek-Mercer Smoothing
A linear interpolation between the document MLE and the collection model:

$$P(q_i \mid D) = (1 - \lambda) \frac{f_{q_i, D}}{|D|} + \lambda \frac{c_{q_i}}{|C|}$$

where $f_{q_i, D}$ is the frequency of $q_i$ in $D$, and $c_{q_i}$ is its frequency in the collection $C$.
Dirichlet Smoothing
Uses a document-length dependent weighting, generally more effective for short queries:

$$P(q_i \mid D) = \frac{f_{q_i, D} + \mu \frac{c_{q_i}}{|C|}}{|D| + \mu}$$

Best results in TREC usually seen with $\mu$ around 2000.
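A minimal sketch of Dirichlet-smoothed query likelihood, scoring in log space to avoid underflow; the collection and documents are toy data:

```python
import math

# Query log-likelihood with Dirichlet smoothing, summed over query terms.
# mu=2000 follows the typical TREC setting mentioned above.
def query_log_likelihood(query, doc, collection, mu=2000.0):
    dl = len(doc)
    cl = len(collection)
    score = 0.0
    for q in query:
        f = doc.count(q)           # term frequency in the document
        cf = collection.count(q)   # term frequency in the whole collection
        p = (f + mu * cf / cl) / (dl + mu)
        score += math.log(p)  # assumes cf > 0; terms absent from the
    return score              # collection must be dropped upstream

collection = ["retrieval"] * 50 + ["models"] * 30 + ["boolean"] * 20
doc = ["retrieval", "models", "retrieval"]
print(query_log_likelihood(["retrieval", "models"], doc, collection))
```

Even a document containing no occurrence of a query term gets a nonzero probability from the collection component, which is exactly the zero-probability problem smoothing exists to solve.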
7.3.3 Relevance Models and KL-Divergence
Generalizes the Query Likelihood Model by comparing two probability distributions: the Relevance Model ($R$) and the Document Model ($D$).
KL-Divergence Ranking

$$KL(R \,\|\, D) = \sum_{w} P(w \mid R) \log \frac{P(w \mid R)}{P(w \mid D)}$$

Since the $P(w \mid R) \log P(w \mid R)$ part is constant across documents, ranking by negative KL-divergence is equivalent to ranking by $\sum_w P(w \mid R) \log P(w \mid D)$. This framework provides a formal basis for pseudo-relevance feedback and query expansion.
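The document-dependent part of the KL ranking reduces to a weighted sum of log probabilities. A toy sketch, with invented relevance-model and document-model distributions (and terms missing from a document model simply skipped, where a smoothed model would normally supply them):

```python
import math

# Document-dependent part of negative KL-divergence ranking:
# sum_w P(w|R) * log P(w|D). Distributions below are illustrative.
def kl_score(rel_model, doc_model):
    return sum(p_r * math.log(doc_model[w])
               for w, p_r in rel_model.items()
               if w in doc_model and p_r > 0)

rel_model = {"retrieval": 0.6, "models": 0.4}
doc_a = {"retrieval": 0.5, "models": 0.3, "other": 0.2}
doc_b = {"retrieval": 0.1, "models": 0.1, "other": 0.8}
print(kl_score(rel_model, doc_a) > kl_score(rel_model, doc_b))  # True
```

With $P(w \mid R)$ estimated from the original query alone this reduces to query likelihood; estimating it from feedback documents instead is what turns the framework into pseudo-relevance feedback.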
Comparison Summary
| Model | Term Weighting | Length Norm | Relevance Assumption |
|---|---|---|---|
| Boolean | None | None | Exact match |
| VSM | TF-IDF (Cosine) | Cosine Norm | Similarity in Vector Space |
| BIM | Probabilistic (IDF-like) | None | Binary / Naïve Bayes |
| BM25 | Adv. TF-IDF | Via $b$ parameter | PRP optimized |
| LM | Smoothing-based | Implicit in Smoothing | Query Generation Probability |