Direct Preference Optimization (DPO)

Lecture context

Optimize directly on preferred-vs-rejected pairs without a separate reward model.

Definition

DPO is a preference-tuning objective that aligns a generative model on (preferred, rejected) response pairs without ever training an explicit reward model and without an RL loop. It is the standard “no-RL” alternative to RLHF: instead of learning a reward and then running PPO against it, DPO shows that the optimal RLHF policy has a closed form, and uses that fact to fold reward learning and policy optimization into a single classification loss on the preference pairs.

In generative recommendation it is one of four ways to shape the training objective once items are tokens — alongside SFT, self-supervised/contrastive learning, and reward-based RL (e.g. GRPO). DPO directly teaches the model to rank a preferred next-item identifier (a positive Semantic ID) above a rejected one.

Intuition

The reward model is hiding inside the policy

RLHF is two stages: (1) fit a reward model to preference data, (2) optimize the policy against that reward with a KL leash to a frozen reference model $\pi_{\text{ref}}$. DPO’s key observation is that for the standard KL-regularized RLHF objective, the optimal policy and the reward are related in closed form, so the reward can be written as a function of the policy itself (specifically the log-ratio $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, up to a term that depends only on $x$).

Substituting that into the Bradley–Terry preference model collapses the whole pipeline into one logistic-regression-style loss: push up the log-probability of the preferred response $y_w$ relative to the reference, push down the rejected response $y_l$. No reward network, no sampling, no on-policy rollouts: just a supervised loss over pairs. This is why the slides list DPO as “no reward model needed; training is stable,” in direct contrast to RL which is “reward-driven… needs feedback and is unstable to train.”

Mathematical Formulation

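The KL-regularized RLHF objective DPO starts from is

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, \mathbb{D}_{\text{KL}}\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big] \Big]$$

Its optimal policy is $\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\big( \tfrac{1}{\beta} r(x, y) \big)$, which can be inverted to express the reward as $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$. Plugging this into the Bradley–Terry model makes $Z(x)$ cancel and yields the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$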

where:

  • $x$: the prompt / context (in RecSys: the user interaction history, or its tokenized Semantic ID sequence)
  • $y_w$: the preferred (“winning”) response; in GenRec, the positive next-item identifier the user actually engaged with
  • $y_l$: the rejected (“losing”) response; a negative / dispreferred item identifier
  • $\pi_\theta$: the policy being trained (the generative model)
  • $\pi_{\text{ref}}$: the frozen reference policy, usually the SFT checkpoint; the KL anchor that keeps $\pi_\theta$ from drifting
  • $\beta$: temperature controlling how hard the KL constraint pulls toward $\pi_{\text{ref}}$ (larger = stay closer to reference)
  • $\sigma$: the logistic sigmoid; $\mathbb{D}_{\text{KL}}$: Kullback–Leibler divergence; $Z(x)$: the (cancelled) partition function
  • $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$: the implicit reward DPO optimizes; the loss is a binary classifier on $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$
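
The cancellation of the partition function can be checked in a few lines. The sketch below assumes the standard Bradley–Terry form, in which the probability that $y_w$ is preferred to $y_l$ is a sigmoid of the reward difference, and uses the inverted reward $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$. The symbol $\succ$ ("is preferred to") is notation introduced here for illustration.

```latex
\begin{align*}
p(y_w \succ y_l \mid x)
  &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\
  &= \sigma\Big( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x)
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x) \Big) \\
  &= \sigma\Big( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big)
\end{align*}
```

Because both responses share the same context $x$, the two $\beta \log Z(x)$ terms are identical and drop out. Replacing $\pi^*$ with the trainable $\pi_\theta$ and taking the negative log-likelihood of the observed preferences gives the DPO loss.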

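The gradient is informative:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big]$$

It raises $\log \pi_\theta(y_w \mid x)$ and lowers $\log \pi_\theta(y_l \mid x)$, weighted by how badly the current implicit reward ranks the pair (the $\sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)$ term is large exactly when the model is wrong). This is an automatic hard-example weighting that a naive log-likelihood objective lacks.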
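
To make the hard-example weighting concrete, here is a minimal numeric sketch in plain Python. It assumes the per-pair weight is the sigmoid of the implicit reward of the rejected response minus that of the preferred one; the function name `dpo_gradient_weight` and the reward values are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid (fine for the small inputs used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def dpo_gradient_weight(reward_w: float, reward_l: float) -> float:
    """Per-pair scale on the DPO gradient: sigma(r_l - r_w).

    reward_w / reward_l are the implicit rewards of the preferred and
    rejected responses (illustrative numbers, not real model outputs).
    """
    return sigmoid(reward_l - reward_w)


# Pair the model already ranks correctly: preferred reward is much higher.
well_ranked = dpo_gradient_weight(reward_w=2.0, reward_l=-2.0)
# Pair the model ranks the wrong way round.
mis_ranked = dpo_gradient_weight(reward_w=-2.0, reward_l=2.0)
# Start of training, policy equals reference: both implicit rewards are 0.
untrained = dpo_gradient_weight(reward_w=0.0, reward_l=0.0)

print(f"well ranked: {well_ranked:.3f}")
print(f"untrained:   {untrained:.3f}")
print(f"mis-ranked:  {mis_ranked:.3f}")
```

The mis-ranked pair receives a weight close to 1 and the well-ranked pair a weight close to 0, so most of the update is spent on pairs the model currently gets wrong.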

Key Properties / Variants

  • No reward model, no RL loop. Reward learning and policy optimization are merged into one supervised loss; there is no separate network and no PPO-style sampling. This is the main reason the lecture flags DPO as more stable and cheaper to train than RL.
  • Reference model is required. The frozen $\pi_{\text{ref}}$ (typically the SFT model) appears in every term; it both defines the implicit reward and regularizes the update. DPO is normally run after an SFT stage.
  • Off-policy / offline. It learns from a fixed dataset of pre-collected preference pairs — no fresh on-policy rollouts are needed, unlike GRPO or PPO.
  • $\beta$ trades fit vs. drift. Small $\beta$ lets the policy move far from the reference (sharper preferences, more overfitting/degeneracy risk); large $\beta$ keeps it conservative.
  • Position in the GenRec objective menu (RS-L03b §4.1.3): the four training-objective choices are SFT (positives only, weak margin), SSL/contrastive (template-robust), RL (encodes explicit negatives & non-differentiable metrics, but unstable), and DPO (direct preferred-vs-rejected pairs, stable). RecSys variants named in the lectures: LettinGo, RosePO, SPRec, and S-DPO (softmax/multi-negative DPO for sequential recommendation); listed alongside GRPO and Rec-R1 as preference/RL fine-tuning for generative recommenders.
  • What a “pair” is in RecSys. $x$ = user history; $y_w$ = a positive item (its Semantic ID / identifier sequence); $y_l$ = a negative: a non-interacted, low-reward, or invalid item ID. This lets DPO inject the explicit-negative signal that plain next-item SFT (positives-only cross-entropy) cannot represent.
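
As an illustration of what a RecSys preference pair might look like in code, the sketch below builds (history, positive, negative) triples from a toy catalog. Everything here is hypothetical: the `SemanticID` tuples, the `PreferencePair` type, the `build_pairs` helper, and in particular the choice of uniform sampling from non-interacted items, which is only one simple way to pick negatives and is not taken from the lectures.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple

# Hypothetical item identifier: a short tuple of codebook indices.
SemanticID = Tuple[int, ...]


class PreferencePair(NamedTuple):
    x: Tuple[SemanticID, ...]  # user interaction history (the context)
    y_w: SemanticID            # preferred item: the one the user engaged with
    y_l: SemanticID            # rejected item: a sampled negative


def build_pairs(
    history: Sequence[SemanticID],
    positive: SemanticID,
    catalog: Sequence[SemanticID],
    num_negatives: int,
    rng: random.Random,
) -> List[PreferencePair]:
    """Pair one positive with negatives drawn from non-interacted items."""
    seen = set(history) | {positive}
    pool = [item for item in catalog if item not in seen]
    # Assumption: uniform sampling without replacement from unseen items.
    negatives = rng.sample(pool, k=min(num_negatives, len(pool)))
    return [PreferencePair(tuple(history), positive, neg) for neg in negatives]


catalog = [(1, 4, 2), (1, 4, 7), (3, 0, 5), (3, 9, 1), (6, 2, 2)]
history = [(1, 4, 2), (3, 0, 5)]
positive = (1, 4, 7)

pairs = build_pairs(history, positive, catalog, num_negatives=2, rng=random.Random(0))
for pair in pairs:
    print(pair)
```

Each resulting triple shares the same context and positive and differs only in the negative, which is the shape the DPO loss consumes.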
Algorithm: DPO (offline preference tuning)
──────────────────────────────────────────────
Inputs: SFT model π_ref (frozen), preference data D = {(x, y_w, y_l)}, β
Initialize π_θ ← π_ref
 
Loop over minibatches {(x, y_w, y_l)} ~ D:
    # log-probs under both models (teacher-forced over the token sequence)
    lp_w_θ   = log π_θ(y_w | x);   lp_l_θ   = log π_θ(y_l | x)
    lp_w_ref = log π_ref(y_w | x); lp_l_ref = log π_ref(y_l | x)   # no grad
 
    # implicit reward log-ratios
    Δ_w = lp_w_θ - lp_w_ref
    Δ_l = lp_l_θ - lp_l_ref
 
    loss = -log σ( β * (Δ_w - Δ_l) )      # Bradley–Terry classification
    θ ← θ - η ∇_θ loss
return π_θ
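
The loss computation in the loop above can be sketched without any deep-learning framework. The version below is a plain-Python illustration that takes per-token log-probabilities as given and only evaluates the loss; it does not compute gradients or update parameters, which in practice would be handled by an autodiff library. The function names and the numbers in the example are assumptions made for illustration.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """log pi(y | x) under teacher forcing: the sum of per-token log-probs."""
    return sum(token_logprobs)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def dpo_loss(
    lp_w_theta: float,
    lp_l_theta: float,
    lp_w_ref: float,
    lp_l_ref: float,
    beta: float,
) -> float:
    """-log sigma(beta * (delta_w - delta_l)) for a single preference pair."""
    delta_w = lp_w_theta - lp_w_ref  # log-ratio for the preferred response
    delta_l = lp_l_theta - lp_l_ref  # log-ratio for the rejected response
    return -log_sigmoid(beta * (delta_w - delta_l))


# Illustrative per-token log-probs for a 3-token item identifier.
lp_w_ref = sequence_logprob([-1.0, -1.0, -1.0])
lp_l_ref = sequence_logprob([-2.0, -1.5, -1.5])

# At initialization the policy equals the reference, so the margin is 0.
loss_at_init = dpo_loss(lp_w_ref, lp_l_ref, lp_w_ref, lp_l_ref, beta=0.1)

# After the policy raises y_w by 1 nat and lowers y_l by 1 nat.
loss_after = dpo_loss(lp_w_ref + 1.0, lp_l_ref - 1.0, lp_w_ref, lp_l_ref, beta=0.1)

print(f"loss at init:  {loss_at_init:.4f}")
print(f"loss after:    {loss_after:.4f}")
```

With the policy initialized to the reference, the margin is zero and the loss equals $\log 2$ regardless of $\beta$; it falls as the policy's log-ratio for $y_w$ rises above that for $y_l$.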

Connections

Appears In