Unbiased Learning to Rank
Definition
Unbiased Learning to Rank (ULTR)
Unbiased Learning to Rank is the family of methods that train a ranking model from implicit feedback (clicks) while correcting for the systematic biases that clicks carry — above all Position Bias. The goal is to recover the true relevance signal from observed clicks, which are a biased function of relevance, position, and presentation.
The defining contract: produce an estimator of ranking quality (e.g. DCG) whose expectation over clicks equals the true relevance-based metric, so that minimising the empirical loss optimises the model toward genuine relevance rather than toward “whatever the logging system already ranked highly.”
Intuition
Clicks are cheap and abundant, but they lie. A user clicks an item only if they examine it and find it relevant (the Examination Hypothesis). Because users scan top-down, items at high ranks are examined far more often, so they collect clicks regardless of true relevance. Training naively on raw clicks therefore teaches the model to reproduce the logging policy’s ranking, not relevance — a self-reinforcing feedback loop.
ULTR breaks the loop by treating clicks as observations drawn through a known bias model (the propensities). If we know how a click was distorted, we can mathematically undo the distortion. The workhorse is Inverse Propensity Weighting: a click at a rarely-examined low rank is worth more relevance evidence than a click at rank 1, because it survived a much harsher examination filter.
Two complementary stances:
- Counterfactual / offline: reuse historical logs from a fixed logging policy and reweight (Counterfactual Learning to Rank).
- Online: actively intervene (randomise) to measure propensities, then learn.
Mathematical Formulation
Click generation (PBM). Under the Examination Hypothesis with the Position-Based Click Model, a click on document shown at rank factorises:
where:
- — observed click on at rank (binary)
- — examination event, depends only on rank (position bias)
- — true relevance of to the query, independent of rank
- — the propensity at rank
Unbiased relevance estimate (IPW). Dividing the click by its propensity removes the position-dependent factor in expectation:
IPW additive ranking loss. A model producing scores is trained on the clicked documents only, each weighted by its inverse propensity. For an additive metric (rank-based loss , e.g. the rank of in ‘s ranking):
where:
- — set of logged query sessions
- — documents that received a click in that session
- — the rank at which was displayed by the logging policy
- — loss for placing the relevant at the rank assigned by (e.g. a Pairwise Learning to Rank hinge / LambdaRank-style NDCG term)
- — examination propensity at the displayed rank
Unbiasedness of the IPW loss
The expectation over clicks reduces to the ideal full-information loss computed on truly relevant documents — the exactly cancels the examination factor, so is optimised toward true relevance.
Positivity is mandatory
Every displayed rank must have , otherwise the weight is undefined and that position contributes no gradient. Top- logging (ranks beyond never shown) violates this; it forces either stochastic logging policies, propensity clipping, or a Doubly Robust Estimation fallback.
Key Properties / Variants
- Bias–variance tension. IPW is unbiased but high-variance: small at deep ranks inflates weights and amplifies click noise. Propensity clipping trades a little bias for large variance reduction.
- Where the propensities come from.
- Online randomisation (gold standard): RandTop- or RandPair swaps measure directly.
- Intervention harvesting: mine naturally-occurring rank swaps in historical A/B logs.
- Jointly learned (Dual Learning): estimate the propensity model and the ranker together via EM-style alternation, no interventions needed.
- User-model variants beyond PBM. PBM assumes examination depends only on rank. Richer biases need richer models: Cascading Position Bias (sequential scan that stops on satisfaction), Trust Bias (top items clicked even when irrelevant), Item Selection Bias (items never shown get no feedback), Outlier Bias (visually distinctive items over-examined).
- Doubly Robust ULTR. Combine a learned relevance/imputation model with the IPW correction: unbiased if either the propensities or the imputation model is correct, and lower variance than pure IPW (Doubly Robust Estimation).
- Relation to causal/off-policy RL. ULTR is off-policy evaluation: the logging policy is the behaviour policy, the new ranker is the target policy, and IPW is Importance Sampling over rankings.
Algorithm: Counterfactual ULTR with IPW (offline)
──────────────────────────────────────────────────
Input: click logs {(q, ranking y0, clicks c)}, propensities p_k
Output: ranking model f_θ
# 1) Estimate / load propensities
Estimate p_k for each rank k # randomization OR intervention harvesting
# OR jointly via EM (Dual Learning)
# 2) Train ranker on IPW-weighted clicks
Initialize θ
Loop until converged:
Sample session (q, y0, c) from logs
ranking ← f_θ(q) # current model's ranking
L ← 0
For each clicked d in session:
k ← rank of d in displayed y0
w ← min(1 / p_k, τ) # inverse propensity, clipped
L ← L + w · Δ(d | ranking) # pairwise / ΔNDCG loss term
θ ← θ − α · ∇_θ L # gradient step (SGD)
# 3) Offline evaluation (unbiased DCG estimate)
DCG_IPW(f) ← Σ_sessions Σ_{d: c_d=1} c_d / (p_{k(d)} · log2(rank_f(d)+1))
return f_θConnections
- Built on: Examination Hypothesis, Position-Based Click Model, Click Models
- Core correction: Inverse Propensity Weighting / Importance Sampling
- Offline framing: Counterfactual Learning to Rank
- Variance reduction: Doubly Robust Estimation
- Biases addressed: Position Bias, Cascading Position Bias, Trust Bias, Item Selection Bias, Outlier Bias
- Loss machinery inherited from: Learning to Rank, Pairwise Learning to Rank, LambdaRank