Multiple Additive Regression Trees

Definition

Multiple Additive Regression Trees (MART)

MART is gradient-boosted regression trees applied to ranking. It builds an additive ensemble of shallow regression trees, where each new tree is fit to the negative gradient (the pseudo-residuals) of the loss with respect to the model’s current scores. The final scoring function is the sum of all tree outputs, and it underlies LambdaMART — the tree-based realization of the listwise LambdaRank objective.

Intuition

A single regression tree is a weak learner: it partitions the feature space into a few regions and predicts a constant in each. MART makes it strong by boosting — fitting trees sequentially, each one correcting the errors left by the ensemble so far.

The key trick is gradient boosting: instead of fixing a specific loss form, MART treats the current ensemble’s scores as parameters and does functional gradient descent on the loss. At each round it computes the gradient of the loss w.r.t. each training point’s score, and trains the next tree to predict that (negated) gradient. Adding a small step in that direction reduces the loss — exactly like Gradient Descent, but in function space and with trees as the descent direction.

For ranking, the loss is not differentiable (it depends on the metric, e.g. NDCG). MART sidesteps this: in LambdaMART the per-document gradient is supplied directly by the LambdaRank “lambdas” (metric-weighted pairwise gradients), so MART never needs an explicit loss — only the gradients.

Mathematical Formulation

MART builds an additive model after boosting rounds:

where:

  • — feature vector of a query-document pair
  • — the -th regression tree (a piecewise-constant function)
  • — leaf/step value(s) chosen to minimize the loss along
  • shrinkage (learning rate) that scales each tree’s contribution
  • — final ranking score (documents sorted by descending )

At round , each tree is fit to the negative gradient of the loss evaluated at the current scores (the pseudo-residuals):

where:

  • — pseudo-residual (descent direction) for training example at round
  • — label / relevance grade for example
  • — per-example loss (e.g. squared error for plain regression)

The tree is trained by least-squares to predict , and within each leaf region the optimal constant is

For LambdaMART, the residual is replaced by the LambdaRank lambda (sum of metric-scaled pairwise gradients), so the loss need never be written down explicitly:

where:

  • — pseudo-gradient for document (used in place of )
  • — change in NDCG from swapping documents and
  • — current ensemble score of document

Key Properties / Variants

  • Additive ensemble of weak learners: final score is a sum of many shallow trees; trees are fixed once added (no backprop through earlier trees).
  • Loss-agnostic via gradients: works for any differentiable loss (squared error, logistic) — and even for non-differentiable ranking objectives, by supplying gradients directly (LambdaMART).
  • Regularization knobs: shrinkage (small values like 0.1 generalize better but need more rounds), tree depth / number of leaves, number of trees , and subsampling of data/features (stochastic gradient boosting).
  • Handles raw features well: insensitive to feature scaling and monotone transforms; handles mixed numeric features common in Learning to Rank feature sets.
  • Strong LTR baseline: LambdaMART (MART + LambdaRank gradients) won the Yahoo! Learning to Rank Challenge and remains a hard-to-beat baseline for tabular ranking features, often competitive with neural rankers.
  • Variants: GBRT/GBDT (the general gradient-boosting family), stochastic gradient boosting (row/column subsampling), and efficient implementations (XGBoost, LightGBM) used to train MART/LambdaMART at scale.
Algorithm: MART (Gradient-Boosted Regression Trees)
───────────────────────────────────────────────────
Input: training data {(x_i, y_i)}, loss L, rounds M,
       shrinkage ν, tree size (max leaves J)
 
Initialize F_0(x) = argmin_c Σ_i L(y_i, c)   (constant model)
 
For m = 1 ... M:
    # 1. Pseudo-residuals = negative gradient at current scores
    For each i:
        r_im = -[ ∂L(y_i, F(x_i)) / ∂F(x_i) ]_{F = F_{m-1}}
        # (LambdaMART: set r_im = λ_i, the metric-weighted lambdas)
 
    # 2. Fit a J-leaf regression tree h_m to the targets r_im
    Fit h_m by least squares to {(x_i, r_im)}
 
    # 3. Per-leaf optimal step (line search within each region R_jm)
    For each leaf region R_jm:
        γ_jm = argmin_γ Σ_{x_i ∈ R_jm} L(y_i, F_{m-1}(x_i) + γ)
 
    # 4. Additive update with shrinkage
    F_m(x) = F_{m-1}(x) + ν * Σ_j γ_jm * 1[x ∈ R_jm]
 
Return F_M(x)   # rank documents by descending F_M

Connections

Appears In