IR Lecture 3: Retrieval Models

1. Introduction to Retrieval Models

Retrieval models provide a mathematical framework for defining query-document matching. They include assumptions about relevance and serve as the basis for ranking algorithms.

Retrieval models are generally categorized into two paradigms:

  1. Lexical Matching: Matching based on word occurrences (terms).
    • Vector Space Model (VSM): TF-IDF weighting.
    • Probabilistic Models: BM25.
    • Language Models: Query Likelihood.
  2. Semantic Matching: Matching based on meaning/representations.
    • Distributed Representations: Neural models (e.g., word embeddings, BERT).

2. Vector Space Model & Term Weighting

In the Vector Space Model (VSM), documents and queries are represented as vectors in a high-dimensional space where each dimension corresponds to a term in the vocabulary.

2.1 Lexical Incidence Matrix

The simplest representation is binary, indicating presence (1) or absence (0) of a term. Similarity is often measured using Cosine Similarity:

cos(q, d) = (q · d) / (|q| · |d|) = Σ_i q_i d_i / ( √(Σ_i q_i²) · √(Σ_i d_i²) )
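As a minimal sketch, cosine similarity over raw term-count vectors can be computed directly from token lists (the whitespace tokenization here is a simplifying assumption):

```python
import math
from collections import Counter

def cosine_similarity(query_tokens, doc_tokens):
    """Cosine similarity between raw term-count (bag-of-words) vectors."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(cnt * d[t] for t, cnt in q.items())   # d[t] is 0 for unseen terms
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0          # avoid division by zero for empty inputs
    return dot / (norm_q * norm_d)
```

Because the measure normalizes by vector length, a long document is not rewarded merely for repeating every term more often.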

2.2 Term Frequency (TF)

Raw frequency counts are better than binary indicators but have diminishing returns.

Retrieval Axioms: Term Frequency

  • TFC1: Higher score for documents with more query term occurrences.
  • TFC2 (Saturation): The increase in score due to TF should be sub-linear (the difference between 1 and 2 occurrences is more significant than the difference between 100 and 101).
  • TFC3: If total occurrences are equal, documents covering more distinct query terms should be preferred.

Common sub-linear transformation (logarithmic):

tf(t, d) = 1 + log f(t, d)   if f(t, d) > 0, and 0 otherwise

2.3 Inverse Document Frequency (IDF)

According to Zipf’s Law, a word’s frequency is roughly inversely proportional to its frequency rank, so a small set of words appears very frequently. These words (e.g., “the”, “and”) are poor discriminators.

Term Discrimination Constraint (TDC)

Terms popular across the entire collection should be penalized. The standard Inverse Document Frequency is:

idf(t) = log( N / df(t) )

where:

  • N — total number of documents in collection.
  • df(t) — number of documents containing term t.

2.4 TF-IDF Weighting

The weight of a term t in document d combines both factors:

w(t, d) = tf(t, d) × idf(t) = (1 + log f(t, d)) · log( N / df(t) )
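A minimal sketch of the combined weighting, assuming the document frequencies `df` and collection size `N` are given (the toy values used below are hypothetical):

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, N, df):
    """Sub-linear TF times IDF for each distinct term in a document.

    N  -- total number of documents in the collection
    df -- dict mapping term -> document frequency (assumed precomputed)
    """
    weights = {}
    for term, c in Counter(doc_tokens).items():
        tf = 1 + math.log(c)                 # sub-linear TF (saturation, TFC2)
        idf = math.log(N / df.get(term, 1))  # rarer terms get higher weight
        weights[term] = tf * idf
    return weights
```

Note that a term appearing in every document (df = N) gets idf = log 1 = 0, so it contributes nothing regardless of its TF.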


3. Probabilistic Models: BM25

BM25 (Best Matching 25) is the most widely used weighting function in IR. It effectively balances TF, IDF, and document length.

3.1 The BM25 Formula

For a query Q containing terms q_1, …, q_n, the score for document D is:

score(D, Q) = Σ_{i=1..n} idf(q_i) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / avgdl) ]

where:

  • f(q_i, D) — term frequency of q_i in document D.
  • |D| — length of document D (number of tokens).
  • avgdl — average document length in the collection.
  • k1 — term frequency saturation parameter (typically 1.2 to 2.0). Controls how quickly the TF effect saturates.
  • b — length normalization parameter (typically 0.75). b = 1 is full normalization, b = 0 is no normalization.
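The scoring function can be sketched directly; the input dictionaries (`doc_counts`, `df`) and the simple log(N / df) IDF variant are assumptions for illustration (production systems often use a smoothed IDF):

```python
import math

def bm25_score(query_terms, doc_counts, doc_len, avgdl, N, df, k1=1.2, b=0.75):
    """BM25 score of one document for a query (sketch).

    doc_counts -- dict term -> frequency in the document
    df         -- dict term -> document frequency in the collection
    """
    score = 0.0
    for t in query_terms:
        f = doc_counts.get(t, 0)
        if f == 0:
            continue                               # absent terms contribute 0
        idf = math.log(N / df[t])                  # simple IDF variant (assumed)
        norm = k1 * (1 - b + b * doc_len / avgdl)  # length normalization
        score += idf * f * (k1 + 1) / (f + norm)   # saturating TF component
    return score
```

As f grows, each term's contribution approaches the asymptote idf(t) · (k1 + 1), which is the saturation behavior described below.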

3.2 Intuition

  • Saturation: As f(q_i, D) increases, the term’s contribution approaches a fixed limit, so each additional occurrence matters less.
  • Length Normalization: Longer documents are expected to have higher TFs naturally; normalizing by |D| / avgdl avoids bias toward long documents that aren’t necessarily more relevant.

4. Language Models for IR (LMIR)

Instead of matching vectors, we treat retrieval as a generative process. We estimate a Language Model for each document and ask: “What is the probability that this document model generated the query?”

4.1 Query Likelihood Model

The documents are ranked by the probability P(D | Q). Using Bayes’ Rule:

P(D | Q) = P(Q | D) · P(D) / P(Q) ∝ P(Q | D) · P(D)

Assuming a uniform prior P(D), we rank by P(Q | D).

Under the Multinomial Assumption (terms generated independently):

P(Q | D) = ∏_{t ∈ Q} P(t | D)^{c(t, Q)}

In log-space (to avoid underflow):

log P(Q | D) = Σ_{t ∈ Q} c(t, Q) · log P(t | D)

4.2 Estimation and Smoothing

The Maximum Likelihood Estimate (MLE) for a term t in a document D is:

P_ML(t | D) = c(t, D) / |D|

Problem: If a query term t is missing from document D, then P_ML(t | D) = 0, making the entire product 0. Solution: Smoothing — adjusting estimates to avoid zero probabilities and incorporate background knowledge (the collection model P(t | C)).

Smoothing Methods

  1. Jelinek-Mercer (Linear Interpolation):

    P(t | D) = (1 − λ) · P_ML(t | D) + λ · P(t | C)

    • Small λ (e.g., 0.1) highlights document-specific content (precision).
    • Large λ (e.g., 0.7) works better for long queries (recall).
  2. Dirichlet Prior Smoothing:

    P(t | D) = ( c(t, D) + μ · P(t | C) ) / ( |D| + μ )

    • μ is the smoothing parameter (often μ ≈ 2000).
    • It performs “Bayesian” smoothing where the amount of smoothing depends on the document length |D|.
  3. Absolute Discounting: Subtracts a constant δ from seen counts and redistributes the freed mass to unseen terms proportionally to P(t | C):

    P(t | D) = max(c(t, D) − δ, 0) / |D| + σ · P(t | C)

    where σ depends on δ and the number of distinct terms in D.
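Putting query likelihood and Dirichlet smoothing together, a minimal sketch (the tokenized input lists and in-memory collection model are simplifying assumptions):

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, collection_tokens, mu=2000):
    """log P(Q | D) under the multinomial model with Dirichlet-prior smoothing."""
    d = Counter(doc_tokens)
    c = Counter(collection_tokens)
    doc_len = sum(d.values())
    coll_len = sum(c.values())
    score = 0.0
    for t in query_tokens:
        p_coll = c[t] / coll_len              # background model P(t | C)
        if p_coll == 0:
            return float("-inf")              # term unseen even in the collection
        # Smoothed estimate: no query term yields probability 0 anymore.
        p = (d[t] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score
```

With smoothing, a document missing one query term is penalized (it falls back to the background probability) rather than eliminated outright.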


5. Model Comparison Summary

Model        | Foundation             | Key Components
VSM (TF-IDF) | Geometry               | TF, IDF, Cosine Sim
BM25         | Probabilistic (BIM)    | Saturation-TF, IDF, Length Norm
LMIR         | Probability/Generative | Term distribution, Smoothing

Choice of Model

BM25 and LMIR with Dirichlet smoothing are generally state-of-the-art for lexical retrieval and perform similarly in practice. BM25 is easier to tune (parameters k1 and b), while LMIR is more theoretically grounded for extensions (e.g., translation models).


6. Optimization: Skip Pointers

To speed up query processing, inverted lists contain skip pointers.

  • Allow jumping over large portions of a postings list during intersection when the current docID is smaller than the docID being evaluated on the other list.
  • Trade-off: more skips mean less data read, but more overhead in pointer storage. Optimal skip distance is typically around 100 bytes.
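A sketch of postings-list intersection with skip pointers. For simplicity it uses the common every-√L placement heuristic (a stand-in for the byte-based distance mentioned above) and represents each list as a sorted array of docIDs:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists using skip pointers (sketch).

    A skip of length sqrt(len(list)) is a common placement heuristic.
    """
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j];
            # everything skipped is then guaranteed not to match.
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

The skip is safe because it is taken only when the skipped-over docIDs are all strictly smaller than the candidate on the other list.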

7. Update Strategies

  • Index Merging: Create small new index and merge with old one periodically.
  • Geometric Partitioning: Maintain multiple indexes of geometrically increasing size (I_0, I_1, I_2, …). When index I_k reaches its size limit, merge it into I_{k+1}.
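The merge cascade works like carrying in a binary counter. A toy sketch, with sub-indexes represented as plain Python lists (a stand-in for real on-disk indexes, where the merge would combine posting lists):

```python
def geometric_insert(indexes, new_index):
    """Geometric (logarithmic) merge sketch.

    indexes[k] is either None or a sub-index of generation k, roughly
    2**k times the size of a freshly built one. Merging is modeled
    as list concatenation for illustration.
    """
    k, carry = 0, new_index
    # Cascade: each occupied slot is merged into the carry, like
    # carry propagation when incrementing a binary counter.
    while k < len(indexes) and indexes[k] is not None:
        carry = indexes[k] + carry
        indexes[k] = None
        k += 1
    if k == len(indexes):
        indexes.append(None)
    indexes[k] = carry
    return indexes
```

Each document batch is thus merged O(log n) times over the index's lifetime, instead of being rewritten on every update as in a single monolithic index.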