Listwise LTR

Definition

Listwise Learning to Rank (LTR) computes the loss over the entire ranked list of documents for a query. Unlike pointwise methods (which score single documents) or pairwise methods (which compare pairs), listwise methods optimize the quality of the ranking as a whole.

Two main families:

  1. Probabilistic: Model ranking distributions using the Plackett-Luce model
  2. Metric-based: Directly approximate or optimize ranking metrics like NDCG

Intuition

The core insight: A ranking is a permutation of documents, not just individual scores or pairwise comparisons. By considering all documents together, we can:

  1. Directly optimize ranking metrics (NDCG, MAP)
  2. Account for position-sensitive importance (top-10 matters more than bottom-10)
  3. Avoid pathological cases where pointwise/pairwise methods produce poor results

Query: "machine learning"
Documents: A (rel=2), B (rel=0), C (rel=1)

A listwise loss scores the whole permutation, rather than independent scores or isolated pairwise comparisons:

- Ranking [A, C, B]: relevances [2, 1, 0] in rank order; a good ranking, low loss
- Ranking [B, A, C]: relevances [0, 2, 1] in rank order; a poor ranking, high loss
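The toy example above can be made concrete with NDCG. This sketch assumes the common exponential gain 2^rel - 1 and log2 position discount; `dcg` and `ndcg` are illustrative helpers, not library functions:

```python
import math

def dcg(rels):
    # DCG with gain 2^rel - 1 and a log2(position + 1) discount
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    # Normalize by the DCG of the ideal (label-sorted) ordering
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Relevances in ranked order for the example query
good = ndcg([2, 1, 0])  # ranking [A, C, B]: ideal order
bad = ndcg([0, 2, 1])   # ranking [B, A, C]: best doc buried
print(good, bad)        # good is 1.0; bad is lower
```

A listwise loss evaluates exactly this kind of whole-list quantity, rather than per-document or per-pair errors.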

Mathematical Formulation

Probabilistic Listwise: Plackett-Luce Model

The Plackett-Luce (PL) model generates random rankings by sequential sampling without replacement:

Top-1 Probability (for a single document)

Given scores s_1, ..., s_n, the probability that document i is selected first:

    P(i) = exp(s_i) / sum_{j=1}^{n} exp(s_j)

This is the softmax function: the probability of selecting document i is proportional to exp(s_i) relative to all other documents.

Full Ranking Probability

For a ranking pi = (pi(1), ..., pi(n)) of n documents:

    P(pi) = prod_{k=1}^{n} [ exp(s_{pi(k)}) / sum_{j=k}^{n} exp(s_{pi(j)}) ]

At each step k, select document pi(k) from the remaining documents with probability proportional to exp(s_{pi(k)}).

Example with 3 documents, for pi = (A, C, B):

    P(pi) = [exp(s_A) / (exp(s_A) + exp(s_B) + exp(s_C))]
          * [exp(s_C) / (exp(s_B) + exp(s_C))]
          * [exp(s_B) / exp(s_B)]
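The sequential sampling-without-replacement definition can be sketched directly; `pl_prob` is an illustrative helper name:

```python
import math
from itertools import permutations

def pl_prob(ranking, scores):
    # Plackett-Luce probability of one ranking: at each step,
    # pick the next document from those remaining with
    # probability proportional to exp(score)
    remaining = list(ranking)
    prob = 1.0
    for doc in ranking:
        z = sum(math.exp(scores[d]) for d in remaining)
        prob *= math.exp(scores[doc]) / z
        remaining.remove(doc)
    return prob

scores = {"A": 2.0, "B": 0.0, "C": 1.0}
# The PL model defines a proper distribution over all 3! rankings
total = sum(pl_prob(list(p), scores) for p in permutations(scores))
print(total)  # sums to 1.0 up to floating point
```

Rankings that place high-scoring documents early receive higher probability, which is why maximizing P(pi*) encourages correct orderings.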

Computational Challenge

The number of possible rankings of n documents is n!, which grows factorially. For n = 10, there are 10! = 3,628,800 possible rankings. Computing the full distribution over permutations is infeasible.

ListNet (Cao et al., 2007)

ListNet simplifies by considering only top-1 probabilities instead of full ranking distributions.

Target Distribution

Based on relevance labels y_1, ..., y_n, compute the softmax:

    P_y(i) = exp(y_i) / sum_j exp(y_j)

This is the probability that document i would be selected first if we sampled from the label distribution.

Predicted Distribution

From the model scores s_1, ..., s_n:

    P_s(i) = exp(s_i) / sum_j exp(s_j)

ListNet Loss

Cross-entropy between target and predicted distributions:

    L = - sum_i P_y(i) * log P_s(i)

Expanding:

    L = - sum_i [exp(y_i) / sum_j exp(y_j)] * log [exp(s_i) / sum_j exp(s_j)]

Advantage: Computationally efficient; just two softmaxes and a cross-entropy, O(n) per query.

Limitation: Only the top-1 probabilities are matched; the structure of the rest of the ranking is ignored.
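A minimal sketch of the ListNet loss under these definitions, in plain Python rather than a tensor library (`listnet_loss` is an illustrative name):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(labels, scores):
    # Cross-entropy between the top-1 label distribution P_y
    # and the top-1 score distribution P_s
    p_y = softmax(labels)
    p_s = softmax(scores)
    return -sum(py * math.log(ps) for py, ps in zip(p_y, p_s))

labels = [2.0, 0.0, 1.0]                      # relevance of A, B, C
good = listnet_loss(labels, [3.0, 0.1, 1.5])  # scores agree with labels
bad = listnet_loss(labels, [0.1, 3.0, 1.5])   # best doc scored lowest
print(good < bad)  # True: aligned scores give lower loss
```

In a real implementation the same computation runs over batched score tensors so gradients flow back into the scoring model.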

ListMLE (Xia et al., 2008)

ListMLE directly maximizes the probability of the ground-truth ranking under the Plackett-Luce model.

The Ground-Truth Ranking

Order documents by their relevance labels in descending order:

    pi* = argsort_descending(y_1, ..., y_n)

Example: If labels are y = (2, 0, 1) for documents (d_1, d_2, d_3), then pi* = (d_1, d_3, d_2).

ListMLE Loss

The negative log-likelihood of the ground-truth ranking under the Plackett-Luce model:

    L = - log P(pi* | s) = - sum_{k=1}^{n} log [ exp(s_{pi*(k)}) / sum_{j=k}^{n} exp(s_{pi*(j)}) ]

Interpretation: At each position, maximize the probability that the ground-truth document appears at that position, given the documents still remaining.

Step 1 (k = 1): Among all documents, the best should rank first

Step 2 (k = 2): Among the remaining documents, the second-best should rank second

And so on.

Advantage: Directly models full ranking structure.

Complexity: O(n log n) for sorting plus O(n) for the loss (using suffix sums of exponentiated scores).

Note: When labels contain ties, any permutation consistent with the labels is a valid ground truth; in practice one is chosen (or sampled) per query.
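The per-position terms above can be sketched as follows, with the log-sum-exp computed explicitly for numerical stability (`listmle_loss` is an illustrative name):

```python
import math

def listmle_loss(labels, scores):
    # Negative log-likelihood of the ground-truth ranking
    # (documents sorted by label, descending) under Plackett-Luce
    order = sorted(range(len(labels)), key=lambda i: -labels[i])
    loss = 0.0
    for k in range(len(order)):
        rest = [scores[i] for i in order[k:]]  # docs not yet placed
        m = max(rest)                          # stable log-sum-exp
        lse = m + math.log(sum(math.exp(s - m) for s in rest))
        loss += lse - scores[order[k]]
    return loss

labels = [2.0, 0.0, 1.0]
good = listmle_loss(labels, [3.0, 0.1, 1.5])  # scores match label order
bad = listmle_loss(labels, [0.1, 3.0, 1.5])   # scores disagree
print(good, bad)
```

This naive loop recomputes each denominator; precomputing suffix sums of the exponentiated scores makes the loss O(n) after the O(n log n) sort.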

Metric-Based Listwise: ApproxNDCG

ApproxNDCG directly approximates the NDCG metric using smooth approximation.

The Ranking Function (Non-differentiable)

The rank of document i in the list sorted by score:

    rank(i) = 1 + sum_{j != i} 1[s_j > s_i]

where 1[.] is the indicator function.

Problem: The indicator function is piecewise constant, so its gradient with respect to the scores is zero wherever it is defined.

Smooth Approximation

Replace the indicator with a sigmoid:

    1[s_j > s_i]  ~=  sigma((s_j - s_i) / tau),   sigma(x) = 1 / (1 + exp(-x))

where tau > 0 is a temperature controlling the sharpness of the approximation. Approximate rank:

    rank_hat(i) = 1 + sum_{j != i} sigma((s_j - s_i) / tau)

ApproxNDCG Loss

Plug the approximate rank into the DCG formula:

    DCG_hat = sum_i (2^{y_i} - 1) / log2(1 + rank_hat(i))

The loss is the negated (normalized) approximation, to minimize:

    L = - DCG_hat / IDCG

Advantage: Directly optimizes NDCG metric.

Limitation: Assumes metric weights decrease smoothly with rank (not valid for fairness or absolute cutoff metrics).
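A sketch of the smoothed objective, assuming temperature tau = 1 and the standard exponential-gain DCG (`approx_ndcg_loss` is an illustrative name):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def approx_ndcg_loss(labels, scores, tau=1.0):
    # Replace each document's hard rank with a sigmoid-based
    # approximate rank, then evaluate DCG at those soft ranks
    n = len(scores)
    dcg = 0.0
    for i in range(n):
        r = 1.0 + sum(sigmoid((scores[j] - scores[i]) / tau)
                      for j in range(n) if j != i)
        dcg += (2**labels[i] - 1) / math.log2(1 + r)
    # ideal DCG uses hard ranks of the label-sorted ordering
    ideal = sum((2**y - 1) / math.log2(p + 2)
                for p, y in enumerate(sorted(labels, reverse=True)))
    return -dcg / ideal  # negated: minimizing raises approx. NDCG

labels = [2, 0, 1]
good = approx_ndcg_loss(labels, [3.0, 0.1, 1.5])  # well-ordered scores
bad = approx_ndcg_loss(labels, [0.1, 3.0, 1.5])   # best doc scored lowest
print(good, bad)
```

Because every term is a composition of sigmoids, exponentials, and logarithms, the whole objective is differentiable in the scores, which is the entire point of the approximation.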

Metric-Based Listwise: LambdaRank

LambdaRank (Burges et al., 2006) bridges pairwise and listwise thinking by scaling pairwise gradients by their impact on the ranking metric.

The Insight

To train with gradient descent, we only need the gradient of the loss, not the loss itself. LambdaRank therefore defines the gradients (the "lambdas") directly:

  1. Start from the gradient of a differentiable pairwise loss (the RankNet logistic loss)
  2. Scale it by how much swapping the pair changes the ranking metric

LambdaRank Gradient

For a pair (i, j) where document i is more relevant than document j:

    lambda_{ij} = -sigma / (1 + exp(sigma * (s_i - s_j))) * |Delta NDCG_{ij}|

where Delta NDCG_{ij} is the change in the ranking metric (e.g., NDCG) if documents i and j swap positions in the current ranking, and sigma is the shape parameter of the RankNet sigmoid.

Key Effect: Pairs involving top positions (which affect NDCG more) receive larger gradients.
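The lambda computation can be sketched as below: the pairwise RankNet gradient is scaled by the |delta NDCG| of swapping the pair in the current score-sorted ranking (illustrative helper names, sigma = 1):

```python
import math

def lambdas(labels, scores, sigma=1.0):
    # Per-document lambdas: RankNet pairwise gradient scaled by
    # |delta NDCG| of swapping the pair in the current ranking
    n = len(scores)
    ranking = sorted(range(n), key=lambda d: -scores[d])
    pos = {d: p for p, d in enumerate(ranking)}
    ideal = sum((2**y - 1) / math.log2(p + 2)
                for p, y in enumerate(sorted(labels, reverse=True)))
    lam = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where i is more relevant
            # |delta NDCG| if docs i and j swap positions
            gain = (2**labels[i] - 1) - (2**labels[j] - 1)
            disc = 1 / math.log2(pos[i] + 2) - 1 / math.log2(pos[j] + 2)
            delta = abs(gain * disc) / ideal
            grad = -sigma / (1 + math.exp(sigma * (scores[i] - scores[j])))
            lam[i] += grad * delta  # pushes s_i up under s <- s - lr * lam
            lam[j] -= grad * delta  # pushes s_j down
    return lam

# Best doc (label 2) has the lowest score: its lambda is strongly negative
lam = lambdas([2, 0, 1], [0.1, 3.0, 1.5])
print(lam)
```

Pairs whose swap would move a relevant document out of (or into) a top position get a large |delta NDCG| and hence a large lambda, which is the listwise signal riding on a pairwise gradient.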

LambdaMART

LambdaMART applies LambdaRank with MART (Multiple Additive Regression Trees). This remains one of the strongest baseline LTR methods in practice.

Key Properties

  • Ranking-Aware: Considers entire document lists
  • Metric-Aligned: Can directly optimize ranking metrics (in metric-based variants)
  • Position-Sensitive: Can weight top positions more heavily
  • Theoretically Grounded: Probabilistic variants based on Plackett-Luce model
  • Computationally Intensive: Higher per-batch cost than pairwise methods

Advantages

  1. Direct Metric Optimization: Metric-based methods optimize what you actually care about
  2. Position-Aware: Can naturally emphasize top-ranked documents
  3. Theoretically Sound: Probabilistic methods grounded in ranking theory
  4. Better Quality: Generally outperforms pointwise/pairwise when computational budget allows
  5. Handles Ties: Works well when multiple valid rankings exist

Limitations

  1. Computational Cost: More expensive per training batch than pairwise methods
  2. Scalability: The full PL distribution over n! permutations is intractable for large n
  3. Metric Assumptions: ApproxNDCG assumes smooth metric decay (not all metrics qualify)
  4. Target Probabilities: ListNet's softmax over relevance labels is a somewhat arbitrary choice of target distribution
  5. Memory: Typically need all documents per query in memory

Comparison of Listwise Methods

| Method     | Input           | Output                   | Metric alignment    | Computational cost |
|------------|-----------------|--------------------------|---------------------|--------------------|
| ListNet    | Labels          | Probability distribution | Top-1 alignment     | O(n)               |
| ListMLE    | Ranking order   | PL probability           | Full ranking        | O(n log n)         |
| ApproxNDCG | Scores          | Approximate NDCG         | Direct NDCG approx. | O(n^2)             |
| LambdaRank | Scores + metric | Metric-scaled gradients  | Any metric          | O(n^2)             |

Modern Extensions

  • Deep Listwise: Neural networks for end-to-end learning
  • Differentiable Sorting: Research on fully differentiable sorting
  • Multi-Task Learning: Combine multiple ranking objectives (NDCG, MAP, MRR)
  • Domain Adaptation: Transfer learning across different ranking tasks

When to Use

Use listwise LTR when:

  • NDCG or other position-sensitive metrics are critical
  • You have all documents in memory (small lists)
  • You can afford computational cost
  • Top-position ranking quality matters most
  • You want theoretically justified optimization

Avoid listwise LTR when:

  • Memory is extremely constrained
  • You have massive lists (thousands of candidates per query)
  • Metric optimization not critical
  • Pairwise methods already working well


References

  • Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In ICML.
  • Xia, F., Liu, T. Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise approach to learning to rank. In ICML.
  • Burges, C., Shaked, T., Renshaw, E., et al. (2005). Learning to rank using gradient descent. In ICML.
  • Burges, C. J., Ragno, R., & Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In NeurIPS.
  • Qin, T., Liu, T. Y., & Li, H. (2010). A general approximation framework for direct optimization of information retrieval measures. Information Retrieval, 13(4), 375-397.
  • Luce, R. D. (2012). Individual choice behavior: A theoretical analysis. Courier Corporation.
  • Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2), 193-202.
  • Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in IR, 3(3), 225-331.