IR Lecture 4: IR Evaluation

Overview

IR evaluation measures how effectively a system matches users with relevant information. Since user satisfaction is hard to measure directly, relevance is the primary proxy. Evaluation follows the scientific method: design a system, run retrieval, compare results against human judgments.


1. The Cranfield Paradigm

Cranfield Paradigm

The standard framework for IR evaluation (Cleverdon, 1960s). Ensures comparability and repeatability using static test collections.

Components of a test collection:

  1. Corpus (Documents): Representative collection of documents
  2. Topics (Queries): Set of information needs (usually 50+ for statistical significance)
  3. Relevance Judgments (Qrels): Ground truth — which documents are relevant to which queries

Depth-k Pooling

Judging every document in a large corpus for every query is infeasible. Instead:

  1. Take the top-k results from multiple different retrieval systems
  2. Union these results and remove duplicates
  3. Have human judges assess only the pooled documents
  4. Assumption: Unjudged documents are considered not relevant
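The pooling steps above can be sketched in a few lines of Python (the run lists and pool depth here are illustrative):

```python
def depth_k_pool(runs, k=2):
    """Union of the top-k documents from each system's ranked run."""
    pool = set()
    for run in runs:
        pool.update(run[:k])   # take top-k; duplicates collapse in the set
    return pool

# Illustrative runs: each list is one system's ranking of doc IDs.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
pool = depth_k_pool(runs, k=2)
print(sorted(pool))  # only these pooled docs go to the human judges
```

Only documents in the returned pool receive judgments; anything outside it is assumed non-relevant.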

Pool Bias

If a new system retrieves documents no pooled system found, those docs won't have judgments and are treated as non-relevant, which can underestimate the new system's effectiveness. Leave-one-out tests: remove one system from the pool, re-evaluate, and check whether the system ranking stays stable (using Kendall's τ correlation).


2. Set-Based Metrics

These treat retrieved documents as an unordered set.

Precision and Recall

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

where TP = relevant retrieved, FP = non-relevant retrieved, FN = relevant not retrieved.

Precision-Recall Trade-off

Retrieving more documents increases recall but often decreases precision. The optimal balance depends on the task: legal search needs high recall; web search needs high precision.

F-Measure

Harmonic mean of Precision and Recall:

F_β = (1 + β^2) · P · R / (β^2 · P + R)

  • β = 1: balanced (F1 = 2PR / (P + R))
  • β > 1: emphasizes recall
  • β < 1: emphasizes precision
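The set-based definitions above can be sketched as follows (the doc-ID sets are illustrative):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall, and F_beta given doc-ID sets."""
    tp = len(retrieved & relevant)                 # relevant retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
p, r, f1 = precision_recall_f(retrieved, relevant)
print(p, r, round(f1, 3))  # 0.5 0.6666666666666666 0.571
```

Setting beta above or below 1 shifts the harmonic mean toward recall or precision, as in the bullets above.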

3. Rank-Based Metrics

IR systems return ranked lists — we need metrics that reward relevant documents at the top.

Precision at K (P@K)

P@K = (# relevant documents in top K) / K

Simple and intuitive. Ignores documents below rank K and doesn't consider the order within the top K.

Mean Average Precision (MAP)

Average Precision (AP):

AP = (1 / R) · Σ_{k=1..n} P@k · rel(k)

where rel(k) = 1 if the document at rank k is relevant (0 otherwise) and R is the total number of relevant documents. MAP is the mean of AP over all queries.

Worked Example

Ranking: [R, N, R, N, R] (R=relevant, N=non-relevant), 3 relevant total.

  • Rank 1: relevant, P@1 = 1/1 = 1.0
  • Rank 2: not relevant, skip
  • Rank 3: relevant, P@3 = 2/3 ≈ 0.667
  • Rank 4: not relevant, skip
  • Rank 5: relevant, P@5 = 3/5 = 0.6
  • AP = (1.0 + 0.667 + 0.6) / 3 ≈ 0.756
Mean Reciprocal Rank (MRR)

MRR = (1 / |Q|) · Σ_{i=1..|Q|} 1 / rank_i

where rank_i is the position of the first relevant document for query i. Measures where the first relevant document appears. Good for navigational queries / QA.
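A minimal MRR sketch (the two example queries are illustrative):

```python
def mean_reciprocal_rank(rankings):
    """MRR over queries; each ranking is a list of booleans (relevant?)."""
    total = 0.0
    for ranking in rankings:
        for i, rel in enumerate(ranking, start=1):
            if rel:
                total += 1 / i   # reciprocal rank of the first hit
                break            # only the first relevant doc counts
    return total / len(rankings)

# Two queries: first relevant doc at rank 1 and at rank 3 -> (1 + 1/3) / 2.
print(mean_reciprocal_rank([[True, False], [False, False, True]]))  # 0.6666666666666666
```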


4. Graded Relevance: DCG and NDCG

Binary relevance (relevant/not relevant) is often too coarse. Graded relevance allows degrees: e.g., 0=not relevant, 1=somewhat, 2=highly, 3=perfectly relevant.

Discounted Cumulative Gain (DCG)

DCG@K = Σ_{i=1..K} (2^rel_i − 1) / log2(i + 1)

where:

  • rel_i — relevance grade at position i
  • 2^rel_i − 1 — gain (exponential: highly relevant docs contribute much more)
  • 1 / log2(i + 1) — discount (lower positions contribute less)

Normalized DCG (NDCG)

NDCG@K = DCG@K / IDCG@K

where IDCG = DCG of the ideal (perfect) ranking. Normalizes scores to [0, 1].

Worked NDCG Example

Ranking: [rel=3, rel=2, rel=0, rel=1]:

DCG = 7/log2(2) + 3/log2(3) + 0/log2(4) + 1/log2(5) ≈ 7 + 1.893 + 0 + 0.431 ≈ 9.323

Ideal: [3, 2, 1, 0]:

IDCG = 7 + 1.893 + 0.5 + 0 ≈ 9.393

NDCG = 9.323 / 9.393 ≈ 0.993

5. User Browsing Models

More sophisticated metrics model user behavior:

Rank-Biased Precision (RBP)

RBP = (1 − p) · Σ_{i=1..d} rel_i · p^(i−1)

User views rank 1, then continues to the next rank with probability p (persistence). Higher p = more patient user.

Expected Reciprocal Rank (ERR)

ERR = Σ_{r=1..n} (1/r) · R_r · Π_{i=1..r−1} (1 − R_i)

where R_i is the probability of being satisfied at position i (commonly R_i = (2^rel_i − 1) / 2^rel_max for graded relevance).

Models cascade behavior: once a user is satisfied, they stop browsing. A highly relevant document at rank 2 reduces the value of a relevant document at rank 3.
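Both user models can be sketched as follows; the example rankings, the persistence p = 0.8, and the maximum grade of 3 are illustrative choices:

```python
def rbp(grades, p=0.8):
    """Rank-biased precision with binary relevance (0/1) and persistence p."""
    return (1 - p) * sum(g * p**(i - 1) for i, g in enumerate(grades, start=1))

def err(grades, max_grade=3):
    """Expected reciprocal rank under the cascade model."""
    not_satisfied = 1.0   # probability the user is still browsing
    total = 0.0
    for r, g in enumerate(grades, start=1):
        sat = (2**g - 1) / 2**max_grade      # satisfaction probability at rank r
        total += not_satisfied * sat / r
        not_satisfied *= 1 - sat             # user continues only if not yet satisfied
    return total

print(round(rbp([1, 0, 1], p=0.8), 3))  # relevant docs at ranks 1 and 3 -> 0.328
print(round(err([3, 2, 0, 1]), 3))      # 0.901
```

Note how in ERR the highly relevant document at rank 1 (satisfaction probability 7/8) leaves little probability mass for lower ranks, exactly the cascade effect described above.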


6. Evaluating RAG Systems

For Retrieval-Augmented Generation systems, evaluation splits into:

Retrieval component: Standard IR metrics (NDCG, MAP, Recall)

Generation component:

  • Faithfulness / Groundedness: Is the answer supported by retrieved documents?
  • Answer relevance: Does the answer address the query?
  • Nugget-based evaluation: Does the answer contain key information nuggets?
  • LLM-as-a-judge: Using an LLM to evaluate answer quality

7. Statistical Significance

Always Test Significance

A 2% improvement in MAP might be noise. Use statistical tests to confirm results are meaningful.

Common tests:

  • Paired t-test: Compare system A vs system B across queries. H0: no difference.
  • Wilcoxon signed-rank test: Non-parametric alternative
  • Bootstrap test: Resample queries, compute metric many times
  • p < 0.05: Standard threshold for significance
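A minimal sketch of the bootstrap idea (the per-query MAP scores below are illustrative): resample query-level score differences with replacement and check how often system A's advantage disappears.

```python
import random

def bootstrap_pvalue(scores_a, scores_b, iters=10000, seed=0):
    """One-sided bootstrap test on per-query score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    count = 0
    for _ in range(iters):
        sample = [rng.choice(diffs) for _ in diffs]   # resample queries with replacement
        if sum(sample) / len(sample) <= 0:            # advantage erased in this resample?
            count += 1
    return count / iters

# Illustrative per-query MAP scores for two systems over 8 queries.
a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.57]
b = [0.40, 0.50, 0.33, 0.55, 0.45, 0.50, 0.38, 0.51]
print(bootstrap_pvalue(a, b))  # small value -> A's improvement is unlikely to be noise
```

With so few queries this is only a toy; real evaluations use the 50+ topics mentioned above, and scipy provides the paired t-test and Wilcoxon alternatives.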

Kendall’s τ: Measures agreement between two system rankings (used for pool reusability).
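Kendall's τ compares concordant and discordant pairs between two rankings; a minimal sketch (ignoring ties, with made-up system names and ranks) as used in a leave-one-out pool test:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings (maps: system name -> rank), no ties."""
    concordant = discordant = 0
    for s, t in combinations(list(rank_a), 2):
        # Same sign -> the pair is ordered the same way in both rankings.
        agree = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# System rankings before and after removing one run from the pool.
full = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
holdout = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(full, holdout))  # 5 concordant, 1 discordant pair -> ~0.667
```

A τ close to 1 after leave-one-out indicates the pool is reusable for systems that did not contribute to it.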


8. Summary: Metric Selection Guide

Scenario → Recommended Metric:

  • Binary relevance, care about top results → MAP
  • Graded relevance → NDCG
  • Finding the first answer (QA, navigational) → MRR
  • Quick sanity check → P@10
  • User model with persistence → RBP
  • User model with satisfaction → ERR