Chapter 7: Retrieval Models

7.1 Overview of Retrieval Models

Retrieval models provide a mathematical framework to formalize the process of deciding if a piece of text is relevant to an information need. Good models produce rankings that correlate with human relevance decisions, leading to high effectiveness.

Relevance

  • Topical Relevance: Whether a document and query are “about the same thing.”
  • User Relevance: A broader concept incorporating factors like age, language, novelty, and target audience.
  • Binary vs. Multi-valued: While relevance is often multi-level in reality, many models assume binary relevance (relevant/not relevant) for simplicity, often calculating a probability of relevance to represent uncertainty.

7.1.1 Boolean Retrieval

The oldest model, also known as exact-match retrieval. Documents are retrieved only if they exactly match the query specification.

  • Outcome: Binary (Matches or doesn’t). No inherent ranking among retrieved documents.
  • Operators: AND, OR, NOT, proximity operators, wildcards.
  • Pros: Predictable, easy to explain, efficient, allows complex metadata filtering.
  • Cons: Effectiveness depends entirely on the user; “searching by numbers” (too many or too few results); lacks term weighting.

7.1.2 The Vector Space Model (VSM)

The Vector Space Model represents documents and queries as vectors in a $t$-dimensional space, where $t$ is the number of index terms.

Cosine Similarity

The most successful similarity measure for ranking in VSM:

$$\mathrm{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \cdot q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}$$

Where $d_{ij}$ is the weight of term $j$ in document $D_i$, and $q_j$ is the weight of term $j$ in the query.
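
A minimal sketch of cosine similarity on sparse term-weight vectors (the dict representation and toy weights are illustrative; any TF-IDF weighting could be plugged in):

```python
import math

def cosine(doc_vec, query_vec):
    """Cosine similarity between two sparse vectors (dicts: term -> weight)."""
    # Dot product over terms shared by document and query.
    dot = sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy example: a vector pointing in the same direction as the query scores 1.0,
# a vector with no overlapping terms scores 0.0.
q = {"retrieval": 1.0, "models": 0.5}
d1 = {"retrieval": 2.0, "models": 1.0}   # same direction as q
d2 = {"boolean": 1.0}                     # no overlap with q
```

Because cosine normalizes by vector length, a long document with the same term proportions as a short one receives the same score.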

Term Weighting (TF-IDF):

  1. Term Frequency (tf): Reflects term importance within a document.
    • Standard: $tf_{ik} = f_{ik}$, the raw count of term $k$ in document $D_i$.
    • Log-normalized (to reduce the impact of frequent terms): $tf_{ik} = \log(f_{ik}) + 1$ (taken as 0 when $f_{ik} = 0$).
  2. Inverse Document Frequency (idf): Reflects term discriminative power in the collection.
    • $idf_k = \log(N / n_k)$, where $N$ is the total number of documents and $n_k$ is the number of documents containing term $k$.

TF-IDF

A term is important if it occurs frequently in a specific document but rarely across the rest of the collection. The combined weight is $d_{ik} = tf_{ik} \cdot idf_k$.
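
A sketch of the log-tf × idf weight using the formulas above (the collection statistics below are toy values):

```python
import math

def tf_weight(f):
    """Log-normalized term frequency: log(f) + 1, or 0 when the term is absent."""
    return math.log(f) + 1 if f > 0 else 0.0

def idf_weight(N, n_k):
    """Inverse document frequency: log(N / n_k)."""
    return math.log(N / n_k)

# Toy collection of 4 documents; the term appears in 2 of them,
# and 3 times within the document being weighted.
N, n_k = 4, 2
w = tf_weight(3) * idf_weight(N, n_k)
```

Note that a term occurring in every document gets $idf = \log(N/N) = 0$, so it contributes nothing to the score regardless of its frequency in the document.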

Rocchio Algorithm (Relevance Feedback): Used to modify the query vector $Q$ into $Q'$ based on the relevant set $Rel$ and non-relevant set $Nonrel$:

$$Q' = \alpha\, Q + \frac{\beta}{|Rel|} \sum_{D_j \in Rel} D_j - \frac{\gamma}{|Nonrel|} \sum_{D_j \in Nonrel} D_j$$

Where $\alpha$, $\beta$, and $\gamma$ control the relative weight of the original query, positive feedback, and negative feedback.
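
A minimal sketch of the Rocchio update on sparse vectors; the default $\alpha$, $\beta$, $\gamma$ values here are common illustrative choices, not prescribed by the model:

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + beta*mean(relevant) - gamma*mean(nonrelevant).
    Vectors are dicts term -> weight; negative weights are clipped to 0."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for d in relevant:                     # pull toward relevant documents
        for t, w in d.items():
            new_q[t] += beta * w / len(relevant)
    for d in nonrelevant:                  # push away from non-relevant ones
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in new_q.items() if w > 0}
```

Terms that appear in relevant documents but not in the original query get positive weight, so the update performs query expansion as a side effect.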


7.2 Probabilistic Models

The dominant paradigm today, based on explicitly modeling the uncertainty inherent in information retrieval.

7.2.1 Probability Ranking Principle (PRP)

Principle

If a system ranks documents in order of decreasing probability of relevance, the overall effectiveness of the system will be the best obtainable.

7.2.2 Binary Independence Model (BIM)

Treats IR as a classification problem. Documents are viewed as binary vectors (presence/absence of terms).

Assumptions:

  • Binary Relevance.
  • Naïve Bayes Assumption: Terms occur independently in relevant and non-relevant sets.

BIM Scoring Function

Derived from the likelihood ratio and Bayes' Rule, summing over terms that appear in both query and document:

$$\sum_{i:\, d_i = q_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$

Where $p_i = P(t_i \mid R)$ is the probability that term $i$ occurs in a relevant document, and $s_i = P(t_i \mid NR)$ the probability that it occurs in a non-relevant one.

In the absence of relevance information, the weight simplifies to an IDF-like variant: $\log \frac{N - n_i + 0.5}{n_i + 0.5}$.
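
With no relevance information, the per-term weight can be computed directly from collection statistics; the numbers below are illustrative:

```python
import math

def bim_weight(N, n_i):
    """IDF-like BIM term weight: log((N - n_i + 0.5) / (n_i + 0.5))."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

# In a 1000-document collection, a rare term gets a large positive weight,
# while a term present in most documents gets a negative weight
# (its presence argues against relevance).
rare = bim_weight(1000, 10)
common = bim_weight(1000, 900)
```

The 0.5 terms act as a small smoothing correction, keeping the weight finite when $n_i = 0$ or $n_i = N$.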

7.2.3 The BM25 Ranking Algorithm

BM25 (Best Match 25) is a robust and widely used ranking algorithm that extends the Binary Independence Model with term frequency and document length normalization.

BM25 Score

$$BM25(Q, D) = \sum_{i \in Q} \log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \quad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)$$

Where:

  • $f_i$: term frequency in the document.
  • $qf_i$: term frequency in the query.
  • $dl$: document length; $avdl$: average document length.
  • $n_i$: number of documents containing term $i$; $N$: total number of documents.
  • $k_1$, $k_2$, $b$: empirical parameters (typical: $k_1 = 1.2$, $b = 0.75$, $k_2 \in [0, 1000]$).

Parameter k1

The parameter $k_1$ controls tf saturation: as $f_i$ increases, the marginal contribution of additional occurrences of the term decreases.
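
A sketch of the BM25 score for a single document, following the formula above (helper names and the toy statistics are mine; the $k_2$ query-frequency component matters only for long queries):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, N, avdl,
               k1=1.2, b=0.75, k2=100.0):
    """Score one document (list of terms) against a query (list of terms).
    doc_freqs: term -> number of documents containing it; N: collection size."""
    f = Counter(doc_terms)        # term frequencies in the document
    qf = Counter(query_terms)     # term frequencies in the query
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avdl)   # length-normalized saturation point
    score = 0.0
    for term in qf:
        n_i = doc_freqs.get(term, 0)
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
        tf_part = ((k1 + 1) * f[term]) / (K + f[term])
        qtf_part = ((k2 + 1) * qf[term]) / (k2 + qf[term])
        score += idf * tf_part * qtf_part
    return score
```

Because $K$ grows with document length when $b > 0$, a long document needs proportionally more occurrences of a term to reach the same tf contribution as a short one.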


7.3 Ranking Based on Language Models

Treats each document as being generated from a language model (a probability distribution over words).

7.3.1 Query Likelihood Model

Ranks documents by the probability of generating the query text from the document's language model:

$$P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$$

Where the query $Q$ consists of terms $q_1, \ldots, q_n$.
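
An unsmoothed maximum-likelihood sketch (toy data), which also illustrates the zero-probability problem that motivates smoothing: a single missing query term zeroes the whole score:

```python
from collections import Counter

def query_likelihood_mle(query_terms, doc_terms):
    """P(Q|D) under the unsmoothed document language model:
    the product of f(q_i, D) / |D| over the query terms."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    p = 1.0
    for q in query_terms:
        p *= f[q] / dl
    return p

doc = ["tropical", "fish", "tank", "fish"]
p_match = query_likelihood_mle(["tropical", "fish"], doc)     # (1/4) * (2/4)
p_zero = query_likelihood_mle(["tropical", "aquarium"], doc)  # missing term -> 0
```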

7.3.2 Smoothing

Critical to avoid zero probabilities for missing query terms and to improve estimation.

Jelinek-Mercer Smoothing

A linear interpolation between the document MLE and the collection model:

$$P(q_i \mid D) = (1 - \lambda)\, \frac{f_{q_i, D}}{|D|} + \lambda\, \frac{c_{q_i}}{|C|}$$

Where $f_{q_i,D}$ is the frequency of $q_i$ in $D$, $c_{q_i}$ its frequency in the collection, and $|C|$ the total number of word occurrences in the collection.

Dirichlet Smoothing

Uses a document-length dependent weighting, generally more effective for short queries:

$$P(q_i \mid D) = \frac{f_{q_i, D} + \mu\, \frac{c_{q_i}}{|C|}}{|D| + \mu}$$

Best results in TREC experiments are usually seen with $\mu \approx 2000$.
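
Both smoothing schemes, written as per-term probability estimates following the formulas above (the collection probabilities below are illustrative values, not estimated from real data):

```python
import math
from collections import Counter

def jm_prob(f_qd, dl, p_coll, lam=0.1):
    """Jelinek-Mercer: (1 - lambda) * f/|D| + lambda * P(q|C)."""
    return (1 - lam) * (f_qd / dl) + lam * p_coll

def dirichlet_prob(f_qd, dl, p_coll, mu=2000):
    """Dirichlet: (f + mu * P(q|C)) / (|D| + mu)."""
    return (f_qd + mu * p_coll) / (dl + mu)

def query_log_likelihood(query_terms, doc_terms, coll_probs, prob_fn):
    """Sum of log-probabilities of the query terms under the smoothed model."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    return sum(math.log(prob_fn(f[q], dl, coll_probs[q])) for q in query_terms)

doc = ["tropical", "fish", "tank", "fish"]
coll = {"tropical": 0.01, "fish": 0.02, "aquarium": 0.001}
# A query term missing from the document no longer zeroes the score:
s = query_log_likelihood(["fish", "aquarium"], doc, coll, dirichlet_prob)
```

Note that in the Dirichlet estimate the effective interpolation weight $\mu / (|D| + \mu)$ shrinks as the document grows, so long documents rely less on the collection model.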

7.3.3 Relevance Models and KL-Divergence

Generalizes the Query Likelihood Model by comparing two probability distributions: the Relevance Model $P(w \mid R)$ and the Document Model $P(w \mid D)$.

KL-Divergence Ranking

Documents are ranked by the negative KL divergence of the document model from the relevance model:

$$-KL(R \,\|\, D) = \sum_{w} P(w \mid R) \log P(w \mid D) - \sum_{w} P(w \mid R) \log P(w \mid R)$$

Since the second sum does not depend on the document, ranking reduces to $\sum_w P(w \mid R) \log P(w \mid D)$. This framework provides a formal basis for pseudo-relevance feedback and query expansion.
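
A sketch of the rank-equivalent score $\sum_w P(w \mid R) \log P(w \mid D)$ with a Dirichlet-smoothed document model; the relevance model here is a hypothetical hand-set distribution (in practice it would be estimated from the query and feedback documents):

```python
import math
from collections import Counter

def kl_score(rel_model, doc_terms, coll_probs, mu=2000):
    """Rank-equivalent negative KL divergence: sum_w P(w|R) * log P(w|D),
    with Dirichlet smoothing of the document model."""
    f = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w, p_r in rel_model.items():
        p_d = (f[w] + mu * coll_probs[w]) / (dl + mu)
        score += p_r * math.log(p_d)
    return score

# Hypothetical relevance model concentrated on two words.
rel = {"tropical": 0.6, "fish": 0.4}
coll = {"tropical": 0.001, "fish": 0.002}
d_good = ["tropical", "fish", "tank"]
d_bad = ["boolean", "retrieval", "model"]
```

With query likelihood as a special case (a relevance model that is just the query's MLE), this scoring also accommodates expanded relevance models from pseudo-relevance feedback without changing the ranking function.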


Comparison Summary

Model     Term Weighting              Length Norm             Relevance Assumption
Boolean   None                        None                    Exact match
VSM       TF-IDF (Cosine)             Cosine norm             Similarity in vector space
BIM       Probabilistic (IDF-like)    None                    Binary / Naïve Bayes
BM25      Adv. TF-IDF based           Via b parameter         PRP optimized
LM        Smoothing-based             Implicit in smoothing   Query generation probability