IR Lecture 4: IR Evaluation
Overview
IR evaluation measures how effectively a system matches users with relevant information. Since user satisfaction is hard to measure directly, relevance is the primary proxy. Evaluation follows the scientific method: design a system, run retrieval, compare results against human judgments.
1. The Cranfield Paradigm
Cranfield Paradigm
The standard framework for IR evaluation (Cleverdon, 1960s). Ensures comparability and repeatability using static test collections.
Components of a test collection:
- Corpus (Documents): Representative collection of documents
- Topics (Queries): Set of information needs (usually 50+ for statistical significance)
- Relevance Judgments (Qrels): Ground truth — which documents are relevant to which queries
Depth-k Pooling
Judging every document in a large corpus for every query is infeasible. Instead:
- Take the top-$k$ results from multiple different retrieval systems
- Union these results and remove duplicates
- Have human judges assess only the pooled documents
- Assumption: Unjudged documents are considered not relevant
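The pooling steps above can be sketched in a few lines of Python (the run data and function name here are illustrative, not from a specific TREC toolkit):

```python
# Sketch of depth-k pooling: union the top-k documents from each system's run.
def build_pool(runs, k):
    """runs: one ranked list of doc IDs per retrieval system."""
    pool = set()
    for run in runs:
        pool.update(run[:k])   # take top-k; the set collapses duplicates
    return pool

runs = [
    ["d1", "d2", "d3", "d4"],  # system A's ranking
    ["d2", "d5", "d1", "d6"],  # system B's ranking
]
pool = build_pool(runs, k=2)   # only these docs go to the human judges
# Documents outside the pool (d3, d4, d6 at k=2) are assumed non-relevant.
```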
Pool Bias
If a new system retrieves documents no pooled system found, those docs won’t have judgments. Leave-one-out tests: remove one system from the pool, re-evaluate, and check whether the system ranking stays stable (using Kendall’s $\tau$ correlation).
2. Set-Based Metrics
These treat retrieved documents as an unordered set.
Precision and Recall
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where $TP$ = relevant retrieved, $FP$ = non-relevant retrieved, $FN$ = relevant not retrieved.
Precision-Recall Trade-off
Retrieving more documents increases recall but often decreases precision. The optimal balance depends on the task: legal search needs high recall; web search needs high precision.
F-Measure
The (weighted) harmonic mean of Precision and Recall:

$$F_\beta = \frac{(1 + \beta^2) \, P \cdot R}{\beta^2 P + R}$$

- $\beta = 1$: balanced ($F_1 = \frac{2PR}{P + R}$)
- $\beta > 1$: emphasizes recall
- $\beta < 1$: emphasizes precision
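A minimal sketch of these set-based metrics (function names are my own, not from a standard library):

```python
# Set-based evaluation: precision, recall, and F-beta over document ID sets.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                    # relevant retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    return p, r

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean; beta > 1 weights recall more heavily."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d7"})
# p = 2/4 = 0.5, r = 2/3; F1 = 4/7 ≈ 0.571
```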
3. Rank-Based Metrics
IR systems return ranked lists — we need metrics that reward relevant documents at the top.
Precision at K (P@K)
$$P@K = \frac{\#\{\text{relevant documents in top } K\}}{K}$$

Simple and intuitive. Ignores documents below rank $K$ and doesn’t consider order within the top $K$.
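As a one-liner over a binary relevance list (a sketch; the function name is my own):

```python
# P@K: fraction of the top-K ranked documents that are relevant.
def precision_at_k(ranking, k):
    """ranking: list of 0/1 relevance labels in rank order."""
    return sum(ranking[:k]) / k

precision_at_k([1, 0, 1, 0, 1], 3)   # 2 relevant in top 3 → 2/3
```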
Mean Average Precision (MAP)
Average Precision (AP):

$$AP = \frac{1}{|\text{Rel}|} \sum_{k=1}^{n} P@k \cdot \text{rel}(k)$$

where $\text{rel}(k) = 1$ if the document at rank $k$ is relevant (0 otherwise) and $|\text{Rel}|$ is the total number of relevant documents. MAP is the mean of AP across all queries.
Worked Example
Ranking: [R, N, R, N, R] (R=relevant, N=non-relevant), 3 relevant total.
- Rank 1: relevant → $P@1 = 1/1 = 1.0$ ✓
- Rank 2: not relevant, skip
- Rank 3: relevant → $P@3 = 2/3 \approx 0.667$ ✓
- Rank 4: not relevant, skip
- Rank 5: relevant → $P@5 = 3/5 = 0.6$ ✓

$$AP = \frac{1.0 + 0.667 + 0.6}{3} \approx 0.756$$
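The worked example can be checked with a short sketch (function name is my own):

```python
# AP over a binary ranking: average of P@k at each relevant rank k.
def average_precision(ranking, num_relevant):
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i      # P@i at this relevant document
    return total / num_relevant

ap = average_precision([1, 0, 1, 0, 1], num_relevant=3)
# (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```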
Mean Reciprocal Rank (MRR)
Measures where the first relevant document appears. Good for navigational queries / QA.

$$MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$

where $\text{rank}_q$ is the rank of the first relevant document for query $q$.
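A minimal sketch, assuming binary relevance lists per query (names are my own):

```python
# Reciprocal rank: 1/rank of the first relevant document (0 if none found).
def reciprocal_rank(ranking):
    for i, rel in enumerate(ranking, start=1):
        if rel:
            return 1 / i
    return 0.0

# MRR = mean of reciprocal ranks across queries.
rankings = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
# (1/3 + 1 + 1/2) / 3 ≈ 0.611
```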
4. Graded Relevance: DCG and NDCG
Binary relevance (relevant/not relevant) is often too coarse. Graded relevance allows degrees: e.g., 0=not relevant, 1=somewhat, 2=highly, 3=perfectly relevant.
Discounted Cumulative Gain (DCG)
$$DCG@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where:
- $rel_i$ — relevance grade at position $i$
- $2^{rel_i} - 1$ — gain (exponential: highly relevant docs contribute much more)
- $\frac{1}{\log_2(i + 1)}$ — discount (lower positions contribute less)
Normalized DCG (NDCG)
$$NDCG@K = \frac{DCG@K}{IDCG@K}$$

where $IDCG@K$ is the DCG of the ideal (perfect) ranking. This normalizes the score to $[0, 1]$.
Worked NDCG Example
Ranking: [rel=3, rel=2, rel=0, rel=1]; ideal ordering: [3, 2, 1, 0].

$$DCG@4 = \frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} \approx 7 + 1.893 + 0 + 0.431 = 9.323$$

$$IDCG@4 = 7 + 1.893 + \frac{1}{\log_2 4} + 0 \approx 9.393$$

$$NDCG@4 = \frac{9.323}{9.393} \approx 0.993$$
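The same computation as a sketch (function names are my own):

```python
import math

# DCG@k with exponential gain 2^rel - 1 and log2(i + 1) position discount.
def dcg(grades, k):
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    ideal = sorted(grades, reverse=True)   # best possible ordering of the same grades
    return dcg(grades, k) / dcg(ideal, k)

ndcg([3, 2, 0, 1], k=4)   # ≈ 0.993
```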
5. User Browsing Models
More sophisticated metrics model user behavior:
Rank-Biased Precision (RBP)
$$RBP = (1 - p) \sum_{i=1}^{\infty} p^{\,i-1} \cdot r_i$$

where $r_i$ is the relevance of the document at rank $i$. The user views rank 1 and continues to the next rank with probability $p$ (persistence). Higher $p$ = more patient user.
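A sketch over binary relevance, truncating the infinite sum at a fixed depth (truncation depth and function name are my own choices):

```python
# Rank-biased precision: geometrically discounted relevance, persistence p.
def rbp(ranking, p=0.8, depth=100):
    return (1 - p) * sum(rel * p ** (i - 1)
                         for i, rel in enumerate(ranking[:depth], start=1))

rbp([1, 0, 1], p=0.8)   # 0.2 * (1 + 0.8^2) = 0.328
```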
Expected Reciprocal Rank (ERR)
$$ERR = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)$$

where $R_i$ is the probability of being satisfied at position $i$.
Models cascade behavior: once a user is satisfied, they stop browsing. A highly relevant document at rank 2 reduces the value of a relevant document at rank 3.
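A sketch of the cascade computation. I assume the common mapping $R_i = (2^{rel_i} - 1) / 2^{g_{max}}$ from graded relevance to satisfaction probability; the source doesn't specify one:

```python
# ERR under the cascade model: a user scans down and stops once satisfied.
def err(grades, max_grade=3):
    score, p_reach = 0.0, 1.0          # p_reach: prob. the user reaches rank r
    for r, g in enumerate(grades, start=1):
        sat = (2 ** g - 1) / 2 ** max_grade   # assumed satisfaction mapping
        score += p_reach * sat / r
        p_reach *= 1 - sat             # user continues only if unsatisfied
    return score

err([3, 2, 0, 1])
```

Note how the rel=3 document at rank 1 (satisfaction 7/8) leaves little probability mass for later ranks, which is exactly the cascade effect described above.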
6. Evaluating RAG Systems
For Retrieval-Augmented Generation systems, evaluation splits into:
Retrieval component: Standard IR metrics (NDCG, MAP, Recall)
Generation component:
- Faithfulness / Groundedness: Is the answer supported by retrieved documents?
- Answer relevance: Does the answer address the query?
- Nugget-based evaluation: Does the answer contain key information nuggets?
- LLM-as-a-judge: Using an LLM to evaluate answer quality
7. Statistical Significance
Always Test Significance
A 2% improvement in MAP might be noise. Use statistical tests to confirm results are meaningful.
Common tests:
- Paired t-test: Compare system A vs system B across queries. $H_0$: no difference.
- Wilcoxon signed-rank test: Non-parametric alternative
- Bootstrap test: Resample queries, compute metric many times
- $p < 0.05$: Standard threshold for significance
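The bootstrap test is easy to sketch in pure Python (parameters and function name are my own; a one-sided test on per-query metric differences):

```python
import random

# One-sided bootstrap test on per-query metric differences (system A - B).
def bootstrap_p(deltas, n_resamples=10_000, seed=0):
    """Resample queries with replacement; p ~ fraction of resampled means <= 0."""
    rng = random.Random(seed)
    n = len(deltas)
    hits = sum(
        1 for _ in range(n_resamples)
        if sum(rng.choice(deltas) for _ in range(n)) / n <= 0
    )
    return hits / n_resamples

# A consistently ahead of B across queries → very small p-value.
bootstrap_p([0.10, 0.20, 0.05, 0.30])
```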
Kendall’s $\tau$: Measures agreement between two system rankings (used for pool reusability).
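A pure-Python sketch of Kendall's $\tau$ over concordant/discordant pairs, as used in the leave-one-out pool test (the data and function name are illustrative; ties are ignored in this simplified version):

```python
from itertools import combinations

# Kendall's tau: (concordant - discordant) / total pairs, over two rankings
# of the same set of systems.
def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping system name -> rank position."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        agree = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Two evaluations that disagree only on the order of systems C and D:
a = {"A": 1, "B": 2, "C": 3, "D": 4}
b = {"A": 1, "B": 2, "C": 4, "D": 3}
kendall_tau(a, b)   # 5 concordant, 1 discordant pair → 4/6 ≈ 0.667
```

A $\tau$ close to 1 after removing a system from the pool suggests the collection remains reusable for systems that did not contribute to it.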