Chapter 7: Retrieval Models

7.1 Overview of Retrieval Models

Retrieval models provide a mathematical framework to formalize the process of deciding if a piece of text is relevant to an information need. Good models produce rankings that correlate with human relevance decisions, leading to high effectiveness.

Relevance

  • Topical Relevance: Whether a document and query are “about the same thing.”
  • User Relevance: A broader concept incorporating factors like age, language, novelty, and target audience.
  • Binary vs. Multi-valued: While relevance is often multi-level in reality, many models assume binary relevance (relevant/not relevant) for simplicity, often calculating a probability of relevance to represent uncertainty.

7.1.1 Boolean Retrieval

The oldest model, also known as exact-match retrieval. Documents are retrieved only if they exactly match the query specification.

  • Outcome: Binary (Matches or doesn’t). No inherent ranking among retrieved documents.
  • Operators: AND, OR, NOT, proximity operators, wildcards.
  • Pros: Predictable, easy to explain, efficient, allows complex metadata filtering.
  • Cons: Effectiveness depends entirely on the user; “searching by numbers” (too many or too few results); lacks term weighting.

7.1.2 The Vector Space Model (VSM)

The Vector Space Model represents documents and queries as vectors in a $t$-dimensional space, where $t$ is the number of index terms.

Cosine Similarity

The most successful similarity measure for ranking in VSM:

$$\mathrm{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \cdot q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}$$

Where $d_{ij}$ is the weight of term $j$ in document $D_i$, and $q_j$ is the weight of term $j$ in the query.
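
A minimal sketch of cosine similarity on sparse term-weight vectors (the dict representation and toy weights are illustrative; any TF-IDF weighting could be plugged in):

```python
import math

def cosine(doc_vec, query_vec):
    """Cosine similarity between two sparse vectors (dicts: term -> weight)."""
    # Dot product over terms shared by document and query.
    dot = sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy example: a vector pointing in the same direction as the query scores 1.0,
# a vector with no overlapping terms scores 0.0.
q = {"retrieval": 1.0, "models": 0.5}
d1 = {"retrieval": 2.0, "models": 1.0}   # same direction as q
d2 = {"boolean": 1.0}                     # no overlap with q
```

Because cosine normalizes by vector length, a long document with the same term proportions as a short one receives the same score.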

Term Weighting (TF-IDF):

  1. Term Frequency (tf): Reflects term importance within a document.
    • Standard: $tf_{ik} = f_{ik}$, the raw count of term $k$ in document $D_i$.
    • Log-normalized (to reduce the impact of frequent terms): $tf_{ik} = \log(f_{ik}) + 1$ (taken as 0 when $f_{ik} = 0$).
  2. Inverse Document Frequency (idf): Reflects term discriminative power in the collection.
    • $idf_k = \log(N / n_k)$, where $N$ is the total number of documents and $n_k$ is the number of documents containing term $k$.

TF-IDF

A term is important if it occurs frequently in a specific document but rarely across the rest of the collection. The combined weight is $d_{ik} = tf_{ik} \cdot idf_k$.
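
A sketch of the log-tf × idf weight using the formulas above (the collection statistics below are toy values):

```python
import math

def tf_weight(f):
    """Log-normalized term frequency: log(f) + 1, or 0 when the term is absent."""
    return math.log(f) + 1 if f > 0 else 0.0

def idf_weight(N, n_k):
    """Inverse document frequency: log(N / n_k)."""
    return math.log(N / n_k)

# Toy collection of 4 documents; the term appears in 2 of them,
# and 3 times within the document being weighted.
N, n_k = 4, 2
w = tf_weight(3) * idf_weight(N, n_k)
```

Note that a term occurring in every document gets $idf = \log(N/N) = 0$, so it contributes nothing to the score regardless of its frequency in the document.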

Rocchio Algorithm (Relevance Feedback): Used to modify the query vector $Q$ into $Q'$ based on the relevant set $Rel$ and non-relevant set $Nonrel$:

$$Q' = \alpha\, Q + \frac{\beta}{|Rel|} \sum_{D_j \in Rel} D_j - \frac{\gamma}{|Nonrel|} \sum_{D_j \in Nonrel} D_j$$

Where $\alpha$, $\beta$, and $\gamma$ control the relative weight of the original query, positive feedback, and negative feedback.
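
A minimal sketch of the Rocchio update on sparse vectors; the default $\alpha$, $\beta$, $\gamma$ values here are common illustrative choices, not prescribed by the model:

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + beta*mean(relevant) - gamma*mean(nonrelevant).
    Vectors are dicts term -> weight; negative weights are clipped to 0."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for d in relevant:                     # pull toward relevant documents
        for t, w in d.items():
            new_q[t] += beta * w / len(relevant)
    for d in nonrelevant:                  # push away from non-relevant ones
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in new_q.items() if w > 0}
```

Terms that appear in relevant documents but not in the original query get positive weight, so the update performs query expansion as a side effect.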


7.2 Probabilistic Models

The dominant paradigm today, based on explicitly modeling the uncertainty inherent in information retrieval.

7.2.1 Probability Ranking Principle (PRP)

Principle

If a system ranks documents in order of decreasing probability of relevance, the overall effectiveness of the system will be the best obtainable.

7.2.2 Binary Independence Model (BIM)

Treats IR as a classification problem. Documents are viewed as binary vectors (presence/absence of terms).

Assumptions:

  • Binary Relevance.
  • Naïve Bayes Assumption: Terms occur independently in relevant and non-relevant sets.

BIM Scoring Function

Derived from the likelihood ratio and Bayes' Rule, summing over terms that appear in both query and document:

$$\sum_{i:\, d_i = q_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$

Where $p_i = P(t_i \mid R)$ is the probability that term $i$ occurs in a relevant document, and $s_i = P(t_i \mid NR)$ the probability that it occurs in a non-relevant one.

In the absence of relevance information, the weight simplifies to an IDF-like variant: $\log \frac{N - n_i + 0.5}{n_i + 0.5}$.
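
With no relevance information, the per-term weight can be computed directly from collection statistics; the numbers below are illustrative:

```python
import math

def bim_weight(N, n_i):
    """IDF-like BIM term weight: log((N - n_i + 0.5) / (n_i + 0.5))."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

# In a 1000-document collection, a rare term gets a large positive weight,
# while a term present in most documents gets a negative weight
# (its presence argues against relevance).
rare = bim_weight(1000, 10)
common = bim_weight(1000, 900)
```

The 0.5 terms act as a small smoothing correction, keeping the weight finite when $n_i = 0$ or $n_i = N$.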

7.2.3 The BM25 Ranking Algorithm

BM25 (Best Match 25) is a robust and widely used ranking algorithm that extends the Binary Independence Model with term frequency and document length normalization.

BM25 Score

$$BM25(Q, D) = \sum_{i \in Q} \log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \quad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)$$

Where:

  • $f_i$: term frequency in the document.
  • $qf_i$: term frequency in the query.
  • $dl$: document length; $avdl$: average document length.
  • $n_i$: number of documents containing term $i$; $N$: total number of documents.
  • $k_1$, $k_2$, $b$: empirical parameters (typical: $k_1 = 1.2$, $b = 0.75$, $k_2 \in [0, 1000]$).

Parameter k1

The parameter $k_1$ controls tf saturation: as $f_i$ increases, the marginal contribution of additional occurrences of the term decreases.
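
A sketch of the BM25 score for a single document, following the formula above (helper names and the toy statistics are mine; the $k_2$ query-frequency component matters only for long queries):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, N, avdl,
               k1=1.2, b=0.75, k2=100.0):
    """Score one document (list of terms) against a query (list of terms).
    doc_freqs: term -> number of documents containing it; N: collection size."""
    f = Counter(doc_terms)        # term frequencies in the document
    qf = Counter(query_terms)     # term frequencies in the query
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avdl)   # length-normalized saturation point
    score = 0.0
    for term in qf:
        n_i = doc_freqs.get(term, 0)
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
        tf_part = ((k1 + 1) * f[term]) / (K + f[term])
        qtf_part = ((k2 + 1) * qf[term]) / (k2 + qf[term])
        score += idf * tf_part * qtf_part
    return score
```

Because $K$ grows with document length when $b > 0$, a long document needs proportionally more occurrences of a term to reach the same tf contribution as a short one.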


7.3 Ranking Based on Language Models

Treats each document as being generated from a language model (a probability distribution over words).

7.3.1 Query Likelihood Model

Ranks documents by the probability of generating the query text from the document's language model:

$$P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$$

Where the query $Q$ consists of terms $q_1, \ldots, q_n$.
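
An unsmoothed maximum-likelihood sketch (toy data), which also illustrates the zero-probability problem that motivates smoothing: a single missing query term zeroes the whole score:

```python
from collections import Counter

def query_likelihood_mle(query_terms, doc_terms):
    """P(Q|D) under the unsmoothed document language model:
    the product of f(q_i, D) / |D| over the query terms."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    p = 1.0
    for q in query_terms:
        p *= f[q] / dl
    return p

doc = ["tropical", "fish", "tank", "fish"]
p_match = query_likelihood_mle(["tropical", "fish"], doc)     # (1/4) * (2/4)
p_zero = query_likelihood_mle(["tropical", "aquarium"], doc)  # missing term -> 0
```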

7.3.2 Smoothing

Critical to avoid zero probabilities for missing query terms and to improve estimation.

Jelinek-Mercer Smoothing

A linear interpolation between the document MLE and the collection model:

$$P(q_i \mid D) = (1 - \lambda)\, \frac{f_{q_i, D}}{|D|} + \lambda\, \frac{c_{q_i}}{|C|}$$

Where $f_{q_i,D}$ is the frequency of $q_i$ in $D$, $c_{q_i}$ its frequency in the collection, and $|C|$ the total number of word occurrences in the collection.

Dirichlet Smoothing

Uses a document-length dependent weighting, generally more effective for short queries:

$$P(q_i \mid D) = \frac{f_{q_i, D} + \mu\, \frac{c_{q_i}}{|C|}}{|D| + \mu}$$

Best results in TREC experiments are usually seen with $\mu \approx 2000$.
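
Both smoothing schemes, written as per-term probability estimates following the formulas above (the collection probabilities below are illustrative values, not estimated from real data):

```python
import math
from collections import Counter

def jm_prob(f_qd, dl, p_coll, lam=0.1):
    """Jelinek-Mercer: (1 - lambda) * f/|D| + lambda * P(q|C)."""
    return (1 - lam) * (f_qd / dl) + lam * p_coll

def dirichlet_prob(f_qd, dl, p_coll, mu=2000):
    """Dirichlet: (f + mu * P(q|C)) / (|D| + mu)."""
    return (f_qd + mu * p_coll) / (dl + mu)

def query_log_likelihood(query_terms, doc_terms, coll_probs, prob_fn):
    """Sum of log-probabilities of the query terms under the smoothed model."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    return sum(math.log(prob_fn(f[q], dl, coll_probs[q])) for q in query_terms)

doc = ["tropical", "fish", "tank", "fish"]
coll = {"tropical": 0.01, "fish": 0.02, "aquarium": 0.001}
# A query term missing from the document no longer zeroes the score:
s = query_log_likelihood(["fish", "aquarium"], doc, coll, dirichlet_prob)
```

Note that in the Dirichlet estimate the effective interpolation weight $\mu / (|D| + \mu)$ shrinks as the document grows, so long documents rely less on the collection model.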

7.3.3 Relevance Models and KL-Divergence

Generalizes the Query Likelihood Model by comparing two probability distributions: the Relevance Model $P(w \mid R)$ and the Document Model $P(w \mid D)$.

KL-Divergence Ranking

Documents are ranked by the negative KL divergence of the document model from the relevance model:

$$-KL(R \,\|\, D) = \sum_{w} P(w \mid R) \log P(w \mid D) - \sum_{w} P(w \mid R) \log P(w \mid R)$$

Since the second sum does not depend on the document, ranking reduces to $\sum_w P(w \mid R) \log P(w \mid D)$. This framework provides a formal basis for pseudo-relevance feedback and query expansion.
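
A sketch of the rank-equivalent score $\sum_w P(w \mid R) \log P(w \mid D)$ with a Dirichlet-smoothed document model; the relevance model here is a hypothetical hand-set distribution (in practice it would be estimated from the query and feedback documents):

```python
import math
from collections import Counter

def kl_score(rel_model, doc_terms, coll_probs, mu=2000):
    """Rank-equivalent negative KL divergence: sum_w P(w|R) * log P(w|D),
    with Dirichlet smoothing of the document model."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w, p_r in rel_model.items():
        p_d = (f[w] + mu * coll_probs[w]) / (dl + mu)
        score += p_r * math.log(p_d)
    return score

# Hypothetical relevance model concentrated on two words.
rel = {"tropical": 0.6, "fish": 0.4}
coll = {"tropical": 0.001, "fish": 0.002}
d_good = ["tropical", "fish", "tank"]
d_bad = ["boolean", "retrieval", "model"]
```

With query likelihood as a special case (a relevance model that is just the query's MLE), this scoring also accommodates expanded relevance models from pseudo-relevance feedback without changing the ranking function.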


Comparison Summary

Model     Term Weighting              Length Norm             Relevance Assumption
Boolean   None                        None                    Exact match
VSM       TF-IDF (Cosine)             Cosine norm             Similarity in vector space
BIM       Probabilistic (IDF-like)    None                    Binary / Naïve Bayes
BM25      Adv. TF-IDF based           Via b parameter         PRP optimized
LM        Smoothing-based             Implicit in smoothing   Query generation probability