Doubly Robust Estimation

Definition

Doubly Robust (DR) Estimation is a technique that combines two estimation methods—a direct method (learned model) and inverse propensity weighting (IPW)—in a way that is unbiased if either component is correct.

In the context of ranking, DR uses a learned relevance model to predict the value of any ranking, then corrects that prediction on observed items with an IPS-weighted residual.

Intuition

The Problem We’re Solving

We have two approaches to unbiased estimation:

  1. Direct Method (DM): Learn a relevance model $\hat{r}_d$, predict relevance for all items

    • ✓ Low variance (no clicks needed)
    • ✗ Biased if model is wrong
  2. Inverse Propensity Scoring (IPS): Reweight observed clicks by inverse examination propensities

    • ✓ Unbiased (theoretically)
    • ✗ High variance (when propensities are low)

Goal: Combine them to get the best of both.

The Key Insight

The decomposition:

$$V(y) = \hat{V}_{DM}(y) + \big(V(y) - \hat{V}_{DM}(y)\big)$$

The second term (the error of DM) is the treatment residual. We can estimate this residual using IPS, since in expectation $\mathbb{E}\!\left[\frac{o_k\,(c_{d_k} - \hat{r}_{d_k})}{P(E_k)}\right] = r_{d_k} - \hat{r}_{d_k}$ when the propensities are correct.
If DM is perfect, residual is zero → no need for IPS correction.
If DM is wrong, IPS fixes it with unbiased residual correction.
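A quick numeric check of this insight (all numbers below are illustrative, not from the text): when the propensities are right, the expected IPS correction equals the DM residual exactly, so the combined estimate is unbiased even though the model is wrong.

```python
r_true = 0.8    # true click probability given examination
r_hat = 0.5     # direct-method prediction (misspecified)
p = 0.25        # examination propensity (correct)

# DM alone is biased by the residual:
residual = r_true - r_hat

# In expectation, the IPS term recovers exactly that residual:
# E[o * (c - r_hat) / p] = p * (r_true - r_hat) / p = residual
expected_correction = p * (r_true - r_hat) / p

expected_dr = r_hat + expected_correction
print(expected_dr)   # 0.8 -- the DM bias is cancelled
```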

Mathematical Formulation

The DR Estimator

For ranking evaluation:

$$\widehat{DCG}_{DR}(y) = \sum_{k=1}^{n} \frac{1}{\log_2(k+1)} \left[ \hat{r}_{d_k} + \frac{o_k\,\big(c_{d_k} - \hat{r}_{d_k}\big)}{\hat{P}(E_k)} \right]$$

Where:

  • $\hat{r}_{d_k}$ = predicted relevance of item $d_k$ at position $k$ in ranking $y$
  • $c_{d_k}$ = observed click (0 or 1)
  • $\hat{P}(E_k)$ = examination propensity at position $k$
  • $o_k$ = indicator that position $k$ was observed in the logs (the correction applies only to observed items)
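The estimator above can be sketched as runnable code. The function name `dr_dcg`, its argument layout, and the toy inputs are illustrative assumptions, not from any library:

```python
from math import log2

def dr_dcg(items, r_hat, clicks, propensities):
    """DR estimate of DCG: model prediction plus an IPS-weighted
    residual correction for items that were actually observed."""
    dcg = 0.0
    for k, d in enumerate(items, start=1):
        weight = 1.0 / log2(k + 1)          # standard DCG position discount
        contrib = r_hat[d]                  # direct-method term
        if d in clicks:                     # observed: add the correction
            contrib += (clicks[d] - r_hat[d]) / propensities[k - 1]
        dcg += weight * contrib
    return dcg

# Toy usage: item "a" was shown at rank 1 and clicked; "b" was never observed.
estimate = dr_dcg(
    items=["a", "b"],
    r_hat={"a": 0.5, "b": 0.2},
    clicks={"a": 1},
    propensities=[0.9, 0.5],
)
print(round(estimate, 4))
```

Unobserved items fall back to the model prediction alone, which is what lets DR score rankings containing items absent from the logs.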

Variance Analysis

DM variance:

$$\mathrm{Var}\big[\hat{V}_{DM}\big] \approx 0$$

(The estimate is a deterministic function of the model, so click noise contributes nothing; the model's error shows up as bias, not variance.)

IPS variance (single item with a Bernoulli click):

$$\mathrm{Var}\!\left[\frac{o_k\, c_{d_k}}{P(E_k)}\right] = \frac{r_{d_k}}{P(E_k)} - r_{d_k}^2$$

When propensities are low, IPS variance explodes.

DR variance:

$$\mathrm{Var}\!\left[\hat{r}_{d_k} + \frac{o_k\,(c_{d_k} - \hat{r}_{d_k})}{P(E_k)}\right] = \frac{\mathbb{E}\big[(c_{d_k} - \hat{r}_{d_k})^2 \,\big|\, E_k\big]}{P(E_k)} - (r_{d_k} - \hat{r}_{d_k})^2$$

Key: if DM is decent (the expected squared residual is small), DR variance is much lower than IPS. Setting $\hat{r}_{d_k} = 0$ recovers the IPS variance exactly, so IPS is just DR with a trivial zero model.
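A small simulation makes this concrete (all parameters are illustrative): both estimators are approximately unbiased, but with a low propensity and a reasonable model the DR variance is far smaller.

```python
import random
from statistics import mean, pvariance

random.seed(0)
r_true, r_hat, p = 0.9, 0.85, 0.05   # low propensity; model close but imperfect

def simulate(estimator, n=20000):
    """Draw n independent (examination, click) outcomes and apply the estimator."""
    vals = []
    for _ in range(n):
        examined = random.random() < p
        click = 1 if (examined and random.random() < r_true) else 0
        vals.append(estimator(examined, click))
    return vals

ips = simulate(lambda o, c: c / p if o else 0.0)
dr = simulate(lambda o, c: r_hat + ((c - r_hat) / p if o else 0.0))

print(f"IPS: mean={mean(ips):.3f} var={pvariance(ips):.2f}")
print(f"DR:  mean={mean(dr):.3f} var={pvariance(dr):.2f}")
```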

Why “Doubly Robust”?

Unbiasedness Guarantee

DR is unbiased if either:

  1. The direct method is correct: $\hat{r}_d = r_d$ for all items $d$, OR
  2. The propensity model is correct: $\hat{P}(E_k) = P(E_k)$ at every position $k$

You get unbiasedness even if one is wrong (as long as the other is right).

This is the “doubly robust” property—dual sources of insurance.

Proof Sketch

Assume the true relevance is $r_d$ and consider an item $d$ shown at position $k$. Taking the expectation of the per-item DR term:

$$\mathbb{E}\big[\hat{V}_{DR}\big] = \hat{r}_d + \mathbb{E}\!\left[\frac{o_k\,(c_d - \hat{r}_d)}{\hat{P}(E_k)}\right] = \hat{r}_d + \frac{P(E_k)}{\hat{P}(E_k)}\,\big(r_d - \hat{r}_d\big)$$

Case 1: DM is correct ($\hat{r}_d = r_d$)

The residual $r_d - \hat{r}_d$ is zero, so the correction term vanishes in expectation and $\mathbb{E}[\hat{V}_{DR}] = r_d$, no matter how wrong $\hat{P}(E_k)$ is.

Case 2: Propensities are correct ($\hat{P}(E_k) = P(E_k)$, regardless of $\hat{r}_d$)

The ratio $P(E_k)/\hat{P}(E_k)$ is 1, so $\mathbb{E}[\hat{V}_{DR}] = \hat{r}_d + (r_d - \hat{r}_d) = r_d$.
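Both cases can be checked mechanically by enumerating the two random events (examination, then click) and taking an exact expectation; `expected_dr` and all parameter values below are illustrative.

```python
def expected_dr(r_true, p_true, r_hat, p_hat):
    """Exact E[DR] for one item: sum over the four (examined, clicked)
    outcomes, weighting each estimator value by its true probability."""
    e = 0.0
    for o in (0, 1):
        p_o = p_true if o else (1 - p_true)
        for c in (0, 1):
            # clicks only happen on examined items
            p_c = (r_true if c else 1 - r_true) if o else (1.0 if c == 0 else 0.0)
            value = r_hat + o * (c - r_hat) / p_hat
            e += p_o * p_c * value
    return e

r_true, p_true = 0.7, 0.2
# Case 1: model correct, propensities wrong -> still unbiased
print(expected_dr(r_true, p_true, r_hat=0.7, p_hat=0.05))
# Case 2: propensities correct, model wrong -> still unbiased
print(expected_dr(r_true, p_true, r_hat=0.1, p_hat=0.2))
# Both wrong -> biased
print(expected_dr(r_true, p_true, r_hat=0.1, p_hat=0.05))
```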

Practical Advantages

1. Robustness

If your propensity estimates are noisy, DR still works if the learned model is decent.
If your learned model is noisy, DR still works if propensities are accurate.

Mutual insurance policy.

2. Variance Control

By leveraging a learned model, DR dramatically reduces the variance compared to pure IPS:

IPS Variance:    ▓▓▓▓▓▓▓▓▓ (high)
DR Variance:     ▓▓▓ (low)
Accuracy:        Both unbiased

3. Scalability

DM can be applied to all items (even those not in logs), so:

  • Evaluate new rankings with unseen items
  • No zero-propensity problem

4. Easy Implementation

Just fit a relevance model, get propensities, apply the formula.

Comparison: IPS vs DM vs DR

| Aspect | IPS | Direct Method | Doubly Robust |
| --- | --- | --- | --- |
| Requires correct propensities | ✓ Required | ✗ Not needed | ◐ Helps but not required |
| Requires correct model | ✗ Not needed | ✓ Required | ◐ Helps but not required |
| Variance | High | Low | Low |
| Bias | Low | High | Low |
| Can handle new items | ✗ (if zero propensity) | ✓ | ✓ |
| Practical deployment | Risky | Good | Best |

Design Choices

1. Which Model to Use?

Common choices for the direct method:

  • Click models (RegressionEM, EM-based PBM)
  • Two-tower neural networks
  • LambdaMART or other learning-to-rank models

Principle: Use a model that captures relevance well but doesn’t overfit to position.

2. Treatment of Observed vs. Unobserved

Option A: Apply the IPS correction only to observed items; unobserved items contribute their model prediction $\hat{r}_{d_k}$ alone

Option B: Include a counterfactual correction term for unobserved items as well

Most common: Option A (only correct for observed items).

3. Clipping & Regularization

Even in DR, extreme propensities can cause instability.

Common: clip propensities from below, $\hat{P}(E_k) \leftarrow \max\big(\hat{P}(E_k), \tau\big)$, which bounds the magnitude of the correction term at the cost of a small bias.
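A minimal sketch of the clipping step (the function name and the threshold `tau=0.01` are illustrative choices):

```python
def clip_propensities(props, tau=0.01):
    """Floor each propensity at tau to bound the IPS correction term.
    Larger tau: less variance, more bias."""
    return [max(p_k, tau) for p_k in props]

print(clip_propensities([0.9, 0.2, 0.004, 0.0005]))
# -> [0.9, 0.2, 0.01, 0.01]
```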

When DR Fails

1. Both Components Wrong

If the learned model and propensities are both misspecified, DR is biased.

Reality:
  Items are relevant if they match user intent
  Position bias is [100%, 80%, 60%, ...]

DR Model:
  Learned relevance from clicks (conflates popularity with relevance)
  Propensity estimates are wrong (estimated [90%, 70%, 50%, ...])

Result: DR is doubly wrong

2. High Correlation in Errors

If DM errors are correlated with low propensities (confounding), DR can amplify bias.

3. Severe Identifiability Issues

If the click model is unidentifiable (multiple solutions), which one was used in DM?

Different training initializations might converge to different models, each with valid likelihood but different biases.

Implementation Considerations

Algorithm: Training with DR

Input: Historical logs D with (query, ranking, clicks)
       Examination propensities P(Exam_k)

1. Fit direct method (e.g., click model or neural network)
   on click data to get relevance estimates r̂_d

2. For evaluation:
   For new ranking y with items [d_1, ..., d_n]:
     DCG = 0
     for position k = 1 to n:
        observed = (d_k, k) in historical logs
        if observed:
          click_dk = observed click value
        else:
          click_dk = 0  (counterfactual: no observation)

        dcg_contrib = r̂_dk / log2(k+1)
        if observed:
          correction = (r̂_dk - click_dk) / P(Exam_k) / log2(k+1)
          dcg_contrib -= correction
       
       DCG += dcg_contrib
   
   return DCG

3. Optimize a new ranking model on DR signals (e.g., via gradient descent)

Propensity Smoothing

In practice, propensity estimates are noisy. Smooth them:

P(Exam_k) = (counts_k + α) / (total + α · #ranks)

Adds pseudocounts to prevent extreme estimates.
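The smoothing formula above, as a short sketch (the function name and example counts are illustrative):

```python
def smooth_propensities(counts, alpha=1.0):
    """Additive smoothing: (counts_k + alpha) / (total + alpha * num_ranks).
    Pseudocounts pull extreme estimates toward the uniform distribution."""
    total = sum(counts)
    num_ranks = len(counts)
    return [(c_k + alpha) / (total + alpha * num_ranks) for c_k in counts]

# With no data at all, every rank gets the uniform estimate:
print(smooth_propensities([0, 0, 0, 0]))   # -> [0.25, 0.25, 0.25, 0.25]
print(smooth_propensities([96, 3, 1, 0]))  # top-heavy, but no estimate is zero
```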

Variants & Extensions

Normalized DR

Some formulations self-normalize the IPS correction by the sum of the inverse propensity weights (in the spirit of self-normalized IPS), trading a small bias for a further variance reduction.

Trimmed DR

Drop observations whose estimated propensities fall below a threshold $\tau$, i.e., keep only correction terms with $\hat{P}(E_k) \geq \tau$.

Augmented IPW

A closely related technique from the causal-inference literature that augments IPW with an outcome (regression) model. In this setting it is essentially the same estimator as DR.

Connections

  • Foundation: Combines Inverse Propensity Weighting + learned model
  • Click Models: Provides the direct method component
  • Causal Inference: Core technique in observational causal inference
  • Counterfactual Evaluation: Used in Counterfactual Learning to Rank
  • Off-Policy Learning: Applied to ranking from logged interactions

Appears In