Doubly Robust Estimation
Definition
Doubly Robust (DR) Estimation is a technique that combines two estimation methods—a direct method (learned model) and inverse propensity weighting (IPW)—in a way that is unbiased if either component is correct.
In the context of ranking:
Intuition
The Problem We’re Solving
We have two approaches to unbiased estimation:
-
Direct Method (DM): Learn a relevance model , predict for all items
- ✓ Low variance (no clicks needed)
- ✗ Biased if model is wrong
-
Inverse Propensity Weighting (IPS): Reweight observed clicks
- ✓ Unbiased (theoretically)
- ✗ High variance (when propensities are low)
Goal: Combine them to get the best of both.
The Key Insight
The decomposition:
The second term (error of DM) is the treatment residual. We can estimate this residual using IPS:
If DM is perfect, residual is zero → no need for IPS correction.
If DM is wrong, IPS fixes it with unbiased residual correction.
Mathematical Formulation
The DR Estimator
For ranking evaluation:
Where:
- = predicted relevance of item at position in ranking
- = observed click (0 or 1)
- = examination propensity
Variance Analysis
DM variance:
IPS variance:
When propensities are low, IPS variance explodes.
DR variance:
Key: If DM is decent (error is small), DR variance is much lower than IPS.
Why “Doubly Robust”?
Unbiasedness Guarantee
DR is unbiased if either:
- The direct method is correct: for all , OR
- The propensity model is correct: is accurate
You get unbiasedness even if one is wrong (as long as the other is right).
This is the “doubly robust” property—dual sources of insurance.
Proof Sketch
Assume true value is and we have an item at position :
Case 1: DM is correct ()
For observed items: ✓
Case 2: Propensities are correct (regardless of ) ✓
Practical Advantages
1. Robustness
If your propensity estimates are noisy, DR still works if the learned model is decent.
If your learned model is noisy, DR still works if propensities are accurate.
Mutual insurance policy.
2. Variance Control
By leveraging a learned model, DR dramatically reduces the variance compared to pure IPS:
IPS Variance: ▓▓▓▓▓▓▓▓▓ (high)
DR Variance: ▓▓▓ (low)
Accuracy: Both unbiased
3. Scalability
DM can be applied to all items (even those not in logs), so:
- Evaluate new rankings with unseen items
- No zero-propensity problem
4. Easy Implementation
Just fit a relevance model, get propensities, apply the formula.
Comparison: IPS vs DM vs DR
| Aspect | IPS | Direct Method | Doubly Robust |
|---|---|---|---|
| Requires correct propensities | ✓ Required | ✗ Not needed | ◐ Helps but not required |
| Requires correct model | ✗ Not needed | ✓ Required | ◐ Helps but not required |
| Variance | High | Low | Low |
| Bias | Low | High | Low |
| Can handle new items | ✗ (if zero prop) | ✓ | ✓ |
| Practical deployment | Risky | Good | Best |
Design Choices
1. Which Model to Use?
Common choices for the direct method:
- Click models (RegressionEM, EM-based PBM)
- Two-tower neural networks
- LambdaMART or other learned-to-rank models
Principle: Use a model that captures relevance well but doesn’t overfit to position.
2. Treatment of Observed vs. Unobserved
Option A: Only apply IPS correction to observed items
Option B: Include counterfactual for unobserved items
Most common: Option A (only correct for observed).
3. Clipping & Regularization
Even in DR, extreme propensities can cause instability:
Common:
When DR Fails
1. Both Components Wrong
If the learned model and propensities are both misspecified, DR is biased.
Reality:
Items are relevant if they match user intent
Position bias is [100%, 80%, 60%, ...]
DR Model:
Learned relevance from clicks (conflates popularity with relevance)
Propensity estimates are wrong (estimated [90%, 70%, 50%, ...])
Result: DR is doubly wrong
2. High Correlation in Errors
If DM errors are correlated with low propensities (confounding), DR can amplify bias.
3. Severe Identifiability Issues
If the click model is unidentifiable (multiple solutions), which one was used in DM?
Different training initializations might converge to different models, each with valid likelihood but different biases.
Implementation Considerations
Algorithm: Training with DR
Input: Historical logs D with (query, ranking, clicks)
Examination propensities P(Exam_k)
1. Fit direct method (e.g., click model or neural network)
on click data to get relevance estimates r̂_d
2. For evaluation:
For new ranking y with items [d_1, ..., d_n]:
DCG = 0
for position k = 1 to n:
if (d_k, k) in historical logs:
click_kdk = observed click value
else:
click_dk = 0 (counterfactual: no observation)
dcg_contrib = r̂_dk / log2(k+1)
if observed:
correction = (r̂_dk - click_dk) / P(Exam_k) / log2(k+1)
dcg_contrib -= correction
DCG += dcg_contrib
return DCG
3. Optimize a new ranking model on DR signals (e.g., via gradient descent)
Propensity Smoothing
In practice, propensity estimates are noisy. Smooth them:
P(Exam_k) = (counts_k + α) / (total + α · #ranks)
Adds pseudocounts to prevent extreme estimates.
Variants & Extensions
Normalized DR
Some formulations normalize by propensities:
Trimmed DR
Remove observations with extreme propensities:
Augmented IPW
A related technique that augments IPS with an outcome model:
Similar in spirit to DR.
Connections
- Foundation: Combines Inverse Propensity Weighting + learned model
- Click Models: Provides the direct method component
- Causal Inference: Core technique in observational causal inference
- Counterfactual Evaluation: Used in Counterfactual Learning to Rank
- Off-Policy Learning: Applied to ranking from logged interactions
Appears In
- Unbiased Learning to Rank
- Counterfactual Learning to Rank
- Click Models (when combined with IPS)
- Production ranking systems (A/B testing, offline evaluation)