Counterfactual Learning to Rank

Definition

Counterfactual Learning to Rank (CLTR) is the problem of training a ranking model using interaction data (clicks, conversions, etc.) from a different ranking system (the logging policy), while accounting for the biases introduced by that different system.

Motivation: We want to evaluate and improve a new ranking policy using historical data collected under an old (possibly inferior) policy.

Old Policy (Logging Policy)
    ↓
  (logs interactions)
    ↓
New Policy (Counterfactual)
    ↓
  (evaluate/train using old data)

The “counterfactual” aspect: We’re asking “what would happen if we deployed the new policy?” using data from when the old policy was active.

Intuition

The Problem

Imagine a search engine:

  • Current ranking system: Ranks by PageRank
  • New ranking system: Ranks by neural learned model
  • Historical logs: Clicks collected under PageRank (biased by it)
  • Goal: Can we use PageRank clicks to evaluate/train the neural model?

Naive approach: Train directly on the clicks
Result: The neural model learns that “PageRank-favored items are good,” not that “these items are truly relevant”

Better approach: Use the fact that clicks are biased by position, correct for that bias using Inverse Propensity Weighting, and train an unbiased model.

Core Insight

The key is to separate:

  • Click data = function of (old policy’s ranking, true relevance, Position Bias)
  • True relevance = what we want to learn
  • Old policy’s ranking = what we want to correct for

By modeling and removing the effect of the old policy + position bias, we recover true relevance.
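This separation can be seen in a small simulation under the position-based model, where a click requires both examination and relevance. The numbers below (`exam`, `rel`) are illustrative assumptions, not data from the text:

```python
import random

random.seed(0)

# Assumed position-based model (PBM): P(click at rank k) = exam[k] * rel[k].
exam = [1.0, 0.5, 0.2]   # examination probability at ranks 1..3 (illustrative)
rel = [0.3, 0.5, 0.9]    # true relevance of the item shown at each rank

n = 200_000
clicks = [0, 0, 0]
for _ in range(n):
    for k in range(3):
        if random.random() < exam[k] * rel[k]:
            clicks[k] += 1

naive_ctr = [c / n for c in clicks]                          # confounded by position
ips_rel = [c / (n * exam[k]) for k, c in enumerate(clicks)]  # divide out exam prob

# Naive CTR says the rank-1 item is best (position bias wins);
# the IPS-corrected estimate recovers the true ordering: the rank-3 item is best.
```

Dividing the click rate by the examination probability is exactly "removing the effect of position bias" described above.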

Mathematical Framework

The Counterfactual Setup

Available data: Interactions logged under the logging policy π₀

Goal: Train and evaluate a new policy π₁

Challenge: We only have data from π₀, not π₁.

Unbiased Evaluation via IPS

To evaluate π₁ using data from π₀:

R̂_IPS(π₁) = Σ_{d : c_d = 1} λ(rank(d | y₁)) / p(rank(d | y₀))

Where:

  • y₀ = ranking shown under π₀
  • y₁ = ranking produced by the new policy π₁
  • c_d = click on document d observed in the logs
  • p(k) = examination propensity at position k under y₀'s ranking
  • λ(k) = rank-based gain of position k (e.g., a DCG weight)

The indicator c_d = 1 ensures we only evaluate on items we have click data for.
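The IPS evaluation above can be sketched in a few lines, assuming the position-based model; the function names and the log tuple layout are illustrative:

```python
from math import log2

def dcg_weight(rank):
    """Rank-based gain lambda(k) for a 1-indexed rank; here a DCG-style weight."""
    return 1.0 / log2(rank + 1)

def ips_value(logs, new_rank_of, prop):
    """IPS estimate of the new ranking's value from logged clicks.
    logs: (doc, rank_under_y0, clicked) tuples from the logging policy.
    new_rank_of: doc -> rank under the new ranking y1.
    prop: rank -> examination propensity under y0."""
    total = 0.0
    for doc, rank_y0, clicked in logs:
        # The indicator: only clicked docs that the new ranking also shows count.
        if clicked and doc in new_rank_of:
            total += dcg_weight(new_rank_of[doc]) / prop[rank_y0]
    return total
```

For example, a click observed at a rank with propensity 0.5 counts double, compensating for the clicks at that rank that were never observed because the user did not look.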

Doubly Robust Formulation

Combine a learned relevance model r̂(d) with Inverse Propensity Weighting:

R̂_DR(π₁) = Σ_{d ∈ y₁} λ(rank(d | y₁)) · r̂(d)  +  Σ_{d : o_d = 1} λ(rank(d | y₁)) · (c_d − r̂(d)) / p(rank(d | y₀))

First term: direct prediction r̂(d) for all items in the new ranking
Second term: IPS correction on the prediction errors, applied to examined items (o_d = 1)

Result: Unbiased if either the relevance model OR the propensities are correct.
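A sketch of the doubly robust combination, under the simplifying assumption that the examination indicator is logged (real CLTR estimators often treat examination as latent); all names are illustrative:

```python
from math import log2

def dr_value(docs, new_rank_of, prop, rel_hat, logs):
    """Doubly robust estimate of the new ranking's value.
    docs: documents in the new ranking y1.
    new_rank_of: doc -> rank under y1.
    prop: rank -> examination propensity under y0.
    rel_hat: doc -> relevance prediction from a learned model.
    logs: doc -> (rank_under_y0, examined, clicked); 'examined' is assumed
    observable here, a simplification relative to production systems."""
    total = 0.0
    for doc in docs:
        weight = 1.0 / log2(new_rank_of[doc] + 1)  # DCG-style position gain
        estimate = rel_hat[doc]                    # direct-method term
        if doc in logs:
            rank_y0, examined, clicked = logs[doc]
            if examined:
                # IPS-weighted correction on the model's prediction error.
                estimate += (clicked - rel_hat[doc]) / prop[rank_y0]
        total += weight * estimate
    return total
```

If the propensities are correct, the correction's expectation equals the model's error, so prediction errors cancel; if instead `rel_hat` is correct, the correction has expectation zero. Either condition suffices, hence "doubly robust".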

Key Problem: Off-Policy Bias

The Mismatch

Data Generation (Logging Policy π₀)
    Query q
      ↓
    Ranking y₀ ← π₀(q)
      ↓
    Position Bias: Exam_k
      ↓
    True Relevance: Rel_d
      ↓
    Click: Exam_k · Rel_d
      ↓
    Historical Log

---

Evaluation (New Policy π₁)
    Query q
      ↓
    Ranking y₁ ← π₁(q)
      ↓
    We want to know: Reward(y₁)
    But we have: Data from y₀!

If y₀ and y₁ are different, we need to account for:

  1. Position bias in y₀: Items at high positions in y₀ got more clicks
  2. Different positions in y₁: Items appear at different positions in y₁ than they did in y₀
  3. Items not in y₀: If y₁ contains items that never appeared in y₀, we have no direct data for them

Selection Bias

Even if we correctly estimate relevance, we need to account for the fact that the new policy makes different choices:

  • Old policy ranked [A, B, C, D, E]
  • New policy ranks [B, A, E, D, C]
  • Clicks on A in position 1 (old) don’t directly tell us about A in position 2 (new)

This is why Inverse Propensity Weighting must account for both rankings: clicks are de-biased using propensities under y₀, while the evaluation weights each item by the position it receives under y₁.

Solutions

Solution 1: Online Randomization

The most direct approach: deploy the new policy online with some randomization, collect new data, and train.

Advantages:

  • Gets true data under the new policy
  • No need for complex counterfactual machinery

Disadvantages:

  • Requires online deployment (risky, resource-intensive)
  • Harms user experience during collection
  • Slow feedback loop

Solution 2: Offline Evaluation with IPS

Use historical logs with careful IPS weighting:

  1. Estimate position bias from historical logs (or randomization data)
  2. Correct for bias using IPS: weight each click by 1/p(k), where k is the position at which the item was shown under y₀
  3. Train new model on unbiased relevance estimates
  4. Evaluate on held-out test set using IPS
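Step 2 amounts to re-weighting training examples by inverse propensity. A minimal positive-only, pointwise sketch (the function and field names are illustrative):

```python
def ips_training_examples(logs, prop):
    """logs: iterable of (features, rank_shown, clicked) from the logging policy.
    prop: rank -> examination propensity under the logging ranking.
    Returns (features, label, weight) triples: each click becomes a positive
    example up-weighted by 1/propensity, so clicks at rarely examined
    positions count for more."""
    examples = []
    for features, rank, clicked in logs:
        if clicked:
            examples.append((features, 1.0, 1.0 / prop[rank]))
    return examples
```

Any learner that accepts per-example weights (e.g., weighted logistic regression or a gradient-boosted ranker) can consume these triples directly.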

Advantages:

  • Offline, no user impact
  • Theoretically sound

Disadvantages:

  • High variance (due to IPS)
  • Depends on accurate propensity estimation
  • Fails for zero-propensity items

Solution 3: Doubly Robust Estimation

Combine IPS with a learned relevance model:

  1. Estimate position bias from logs
  2. Train relevance model via click models or Doubly Robust Estimation
  3. Evaluate new policy using DR formula
  4. Train new model using evaluation signal

Advantages:

  • Lower variance than pure IPS
  • More robust to propensity errors
  • Practical and scalable

Disadvantages:

  • Requires learning two models
  • Still vulnerable to model misspecification

Practical Considerations

Propensity Estimation

Critical step: Accurately estimate the examination propensities p(k) for the logging policy.

Methods:

  • Online randomization (Ideal): Run RandTop-k or RandPair experiments
  • Intervention harvesting (Good): Use historical A/B tests
  • Click model (Reasonable): Fit EM-based Click Models
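The randomization route can be sketched as follows: when documents are randomly assigned to ranks, expected relevance is the same at every rank, so per-rank click-rate ratios estimate examination probabilities (function name and log layout are illustrative):

```python
def estimate_propensities(swap_logs, max_rank):
    """swap_logs: (rank_shown, clicked) pairs for documents that were randomly
    assigned to ranks, so relevance is equal in expectation at every rank.
    Returns rank -> propensity estimate, normalized so that p[1] = 1.0."""
    clicks = {k: 0 for k in range(1, max_rank + 1)}
    shows = {k: 0 for k in range(1, max_rank + 1)}
    for rank, clicked in swap_logs:
        shows[rank] += 1
        clicks[rank] += int(clicked)
    ctr_at_1 = clicks[1] / shows[1]
    # The CTR ratio to rank 1 cancels the shared average relevance.
    return {k: (clicks[k] / shows[k]) / ctr_at_1 for k in shows}
```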

Handling Unseen Items

Items in y₁ that never appeared in y₀ have zero propensity.

Problem: The IPS weight 1/p is undefined (division by zero)
Solutions:

  • Use DR estimation (the direct-method term supplies estimates for unseen items)
  • Restrict to items seen in logs
  • Use item features to extrapolate relevance
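Alongside the options above, a standard safeguard is propensity clipping, which bounds the IPS weight at the cost of some bias (the threshold `tau` is an assumed tuning constant, not from the text):

```python
def clipped_ips_weight(propensity, tau=0.1):
    """Cap the IPS weight at 1/tau: near-zero (or exactly zero) propensities
    no longer produce unbounded or undefined weights, trading a controlled
    amount of bias for much lower variance."""
    return 1.0 / max(propensity, tau)
```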

Multi-Step Deployment

In practice, counterfactual learning is often an iterative process:

  1. Collect baseline logs under policy π₀
  2. Develop new policy π₁ using CLTR
  3. Deploy π₁ with some randomization
  4. Collect new logs under π₁
  5. Use these new logs for CLTR to develop π₂
  6. Repeat

Each step provides fresh, less-biased data for the next iteration.

Assumptions & Failures

Critical Assumptions

  1. Correct user model: Users behave as modeled (e.g., PBM)
  2. Stable relevance: Item relevance doesn’t change between logging and new policy
  3. Overlap: Items in new policy were seen in logs (or extrapolatable)
  4. No hidden confounders: Position is the only confounder

Common Failures

  • Wrong user model: Using PBM when cascading behavior dominates
  • Temporal drift: Item relevance changes between logs and deployment
  • Distribution shift: New policy queries/items are different from logged distribution
  • Cascading bias: IPS assuming PBM breaks under cascading

Real-World Example

Scenario: E-commerce search engine

Old System (Logging Policy)
    ↓
  Ranks by: BM25 + manual rules
    ↓
  Logs: 1M clicks on various queries
    ↓
  Observed patterns:
    - Items at rank 1 get ~20% CTR
    - Items at rank 5 get ~8% CTR
    - Position bias is strong
    ↓
  Estimate: Exam probabilities [1.0, 1.0, 0.9, 0.8, 0.7, ...]
    ↓

New System (Counterfactual)
    ↓
  Train neural ranker using:
    - Raw clicks (naive): Learns position bias ❌
    - IPS-corrected clicks: Learns true relevance ✓
    ↓
  Evaluate using DR:
    - Relevance model from neural ranker
    - IPS correction on errors
    - Estimate: New policy would improve DCG by X%
    ↓
  Deploy & measure: +X% in online A/B test

Connections

Appears In