IR Lecture 4: IR Evaluation

Overview

IR evaluation measures how effectively a system matches users with relevant information. Since user satisfaction is hard to measure directly, relevance is the primary proxy. Evaluation follows the scientific method: design a system, run retrieval, compare results against human judgments.


1. The Cranfield Paradigm

Cranfield Paradigm

The standard framework for IR evaluation (Cleverdon, 1960s). Ensures comparability and repeatability using static test collections.

Components of a test collection:

  1. Corpus (Documents): Representative collection of documents
  2. Topics (Queries): Set of information needs (usually 50+ for statistical significance)
  3. Relevance Judgments (Qrels): Ground truth — which documents are relevant to which queries

Depth-k Pooling

Judging every document in a large corpus for every query is infeasible. Instead:

  1. Take the top-k results from multiple different retrieval systems
  2. Union these results and remove duplicates
  3. Have human judges assess only the pooled documents
  4. Assumption: Unjudged documents are considered not relevant
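The pooling steps above can be sketched in a few lines of Python (the run lists and pool depth here are illustrative):

```python
def depth_k_pool(runs, k=2):
    """Union of the top-k documents from each system's ranked run."""
    pool = set()
    for run in runs:
        pool.update(run[:k])   # take top-k; duplicates collapse in the set
    return pool

# Illustrative runs: each list is one system's ranking of doc IDs.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
pool = depth_k_pool(runs, k=2)
print(sorted(pool))  # only these pooled docs go to the human judges
```

Only documents in the returned pool receive judgments; anything outside it is assumed non-relevant.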

Pool Bias

If a new system retrieves documents no pooled system found, those docs won't have judgments and are treated as non-relevant, which can underestimate the new system's effectiveness. Leave-one-out tests: remove one system from the pool, re-evaluate, and check whether the system ranking stays stable (using Kendall's τ correlation).


2. Set-Based Metrics

These treat retrieved documents as an unordered set.

Precision and Recall

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

where TP = relevant retrieved, FP = non-relevant retrieved, FN = relevant not retrieved.

Precision-Recall Trade-off

Retrieving more documents increases recall but often decreases precision. The optimal balance depends on the task: legal search needs high recall; web search needs high precision.

F-Measure

Harmonic mean of Precision and Recall:

F_β = (1 + β^2) · P · R / (β^2 · P + R)

  • β = 1: balanced (F1 = 2PR / (P + R))
  • β > 1: emphasizes recall
  • β < 1: emphasizes precision
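The set-based definitions above can be sketched as follows (the doc-ID sets are illustrative):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall, and F_beta given doc-ID sets."""
    tp = len(retrieved & relevant)                 # relevant retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
p, r, f1 = precision_recall_f(retrieved, relevant)
print(p, r, round(f1, 3))  # 0.5 0.6666666666666666 0.571
```

Setting beta above or below 1 shifts the harmonic mean toward recall or precision, as in the bullets above.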

3. Rank-Based Metrics

IR systems return ranked lists — we need metrics that reward relevant documents at the top.

Precision at K (P@K)

P@K = (# relevant documents in top K) / K

Simple and intuitive. Ignores documents below rank K and doesn't consider the order within the top K.

Mean Average Precision (MAP)

Average Precision (AP):

AP = (1 / R) · Σ_{k=1..n} P@k · rel(k)

where rel(k) = 1 if the document at rank k is relevant (0 otherwise) and R is the total number of relevant documents. MAP is the mean of AP over all queries.

Worked Example

Ranking: [R, N, R, N, R] (R=relevant, N=non-relevant), 3 relevant total.

  • Rank 1: relevant, P@1 = 1/1 = 1.0
  • Rank 2: not relevant, skip
  • Rank 3: relevant, P@3 = 2/3 ≈ 0.667
  • Rank 4: not relevant, skip
  • Rank 5: relevant, P@5 = 3/5 = 0.6
  • AP = (1.0 + 0.667 + 0.6) / 3 ≈ 0.756
Mean Reciprocal Rank (MRR)

MRR = (1 / |Q|) · Σ_{i=1..|Q|} 1 / rank_i

where rank_i is the position of the first relevant document for query i. Measures where the first relevant document appears. Good for navigational queries / QA.
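A minimal MRR sketch (the two example queries are illustrative):

```python
def mean_reciprocal_rank(rankings):
    """MRR over queries; each ranking is a list of booleans (relevant?)."""
    total = 0.0
    for ranking in rankings:
        for i, rel in enumerate(ranking, start=1):
            if rel:
                total += 1 / i   # reciprocal rank of the first hit
                break            # only the first relevant doc counts
    return total / len(rankings)

# Two queries: first relevant doc at rank 1 and at rank 3 -> (1 + 1/3) / 2.
print(mean_reciprocal_rank([[True, False], [False, False, True]]))  # 0.6666666666666666
```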


4. Graded Relevance: DCG and NDCG

Binary relevance (relevant/not relevant) is often too coarse. Graded relevance allows degrees: e.g., 0=not relevant, 1=somewhat, 2=highly, 3=perfectly relevant.

Discounted Cumulative Gain (DCG)

DCG@K = Σ_{i=1..K} (2^rel_i − 1) / log2(i + 1)

where:

  • rel_i — relevance grade at position i
  • 2^rel_i − 1 — gain (exponential: highly relevant docs contribute much more)
  • 1 / log2(i + 1) — discount (lower positions contribute less)

Normalized DCG (NDCG)

NDCG@K = DCG@K / IDCG@K

where IDCG = DCG of the ideal (perfect) ranking. Normalizes scores to [0, 1].

Worked NDCG Example

Ranking: [rel=3, rel=2, rel=0, rel=1]:

DCG = 7/log2(2) + 3/log2(3) + 0/log2(4) + 1/log2(5) ≈ 7 + 1.893 + 0 + 0.431 ≈ 9.323

Ideal: [3, 2, 1, 0]:

IDCG = 7 + 1.893 + 0.5 + 0 ≈ 9.393

NDCG = 9.323 / 9.393 ≈ 0.993

5. User Browsing Models

More sophisticated metrics model user behavior:

Rank-Biased Precision (RBP)

RBP = (1 − p) · Σ_{i=1..d} rel_i · p^(i−1)

User views rank 1, then continues to the next rank with probability p (persistence). Higher p = more patient user.

Expected Reciprocal Rank (ERR)

ERR = Σ_{r=1..n} (1/r) · R_r · Π_{i=1..r−1} (1 − R_i)

where R_i is the probability of being satisfied at position i (commonly R_i = (2^rel_i − 1) / 2^rel_max for graded relevance).

Models cascade behavior: once a user is satisfied, they stop browsing. A highly relevant document at rank 2 reduces the value of a relevant document at rank 3.
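Both user models can be sketched as follows; the example rankings, the persistence p = 0.8, and the maximum grade of 3 are illustrative choices:

```python
def rbp(grades, p=0.8):
    """Rank-biased precision with binary relevance (0/1) and persistence p."""
    return (1 - p) * sum(g * p**(i - 1) for i, g in enumerate(grades, start=1))

def err(grades, max_grade=3):
    """Expected reciprocal rank under the cascade model."""
    not_satisfied = 1.0   # probability the user is still browsing
    total = 0.0
    for r, g in enumerate(grades, start=1):
        sat = (2**g - 1) / 2**max_grade      # satisfaction probability at rank r
        total += not_satisfied * sat / r
        not_satisfied *= 1 - sat             # user continues only if not yet satisfied
    return total

print(round(rbp([1, 0, 1], p=0.8), 3))  # relevant docs at ranks 1 and 3 -> 0.328
print(round(err([3, 2, 0, 1]), 3))      # 0.901
```

Note how in ERR the highly relevant document at rank 1 (satisfaction probability 7/8) leaves little probability mass for lower ranks, exactly the cascade effect described above.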


6. Evaluating RAG Systems

For Retrieval-Augmented Generation systems, evaluation splits into:

Retrieval component: Standard IR metrics (NDCG, MAP, Recall)

Generation component:

  • Faithfulness / Groundedness: Is the answer supported by retrieved documents?
  • Answer relevance: Does the answer address the query?
  • Nugget-based evaluation: Does the answer contain key information nuggets?
  • LLM-as-a-judge: Using an LLM to evaluate answer quality

7. Statistical Significance

Always Test Significance

A 2% improvement in MAP might be noise. Use statistical tests to confirm results are meaningful.

Common tests:

  • Paired t-test: Compare system A vs system B across queries. H0: no difference.
  • Wilcoxon signed-rank test: Non-parametric alternative
  • Bootstrap test: Resample queries, compute metric many times
  • p < 0.05: Standard threshold for significance
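A minimal sketch of the bootstrap idea (the per-query MAP scores below are illustrative): resample query-level score differences with replacement and check how often system A's advantage disappears.

```python
import random

def bootstrap_pvalue(scores_a, scores_b, iters=10000, seed=0):
    """One-sided bootstrap test on per-query score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    count = 0
    for _ in range(iters):
        sample = [rng.choice(diffs) for _ in diffs]   # resample queries with replacement
        if sum(sample) / len(sample) <= 0:            # advantage erased in this resample?
            count += 1
    return count / iters

# Illustrative per-query MAP scores for two systems over 8 queries.
a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.57]
b = [0.40, 0.50, 0.33, 0.55, 0.45, 0.50, 0.38, 0.51]
print(bootstrap_pvalue(a, b))  # small value -> A's improvement is unlikely to be noise
```

With so few queries this is only a toy; real evaluations use the 50+ topics mentioned above, and scipy provides the paired t-test and Wilcoxon alternatives.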

Kendall’s τ: Measures agreement between two system rankings (used for pool reusability).
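Kendall's τ compares concordant and discordant pairs between two rankings; a minimal sketch (ignoring ties, with made-up system names and ranks) as used in a leave-one-out pool test:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings (maps: system name -> rank), no ties."""
    concordant = discordant = 0
    for s, t in combinations(list(rank_a), 2):
        # Same sign -> the pair is ordered the same way in both rankings.
        agree = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# System rankings before and after removing one run from the pool.
full = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
holdout = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(full, holdout))  # 5 concordant, 1 discordant pair -> ~0.667
```

A τ close to 1 after leave-one-out indicates the pool is reusable for systems that did not contribute to it.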


8. Summary: Metric Selection Guide

Scenario → Recommended Metric:

  • Binary relevance, care about top results → MAP
  • Graded relevance → NDCG
  • Finding the first answer (QA, navigational) → MRR
  • Quick sanity check → P@10
  • User model with persistence → RBP
  • User model with satisfaction → ERR