Direct Preference Optimization (DPO)

Lecture context

Optimize directly on preferred-vs-rejected pairs without a separate reward model.

Definition

DPO is a preference-tuning objective that aligns a generative model on (preferred, rejected) response pairs without ever training an explicit reward model and without an RL loop. It is the standard “no-RL” alternative to RLHF: instead of learning a reward $r_{ϕ}$ and then running PPO against it, DPO shows that the optimal RLHF policy has a closed form, and uses that fact to fold reward learning and policy optimization into a single classification loss on the preference pairs.

In generative recommendation it is one of four ways to shape the training objective once items are tokens — alongside SFT, self-supervised/contrastive learning, and reward-based RL (e.g. GRPO). DPO directly teaches the model to rank a preferred next-item identifier (a positive Semantic ID) above a rejected one.

Intuition

The reward model is hiding inside the policy

RLHF is two stages: (1) fit a reward model to preference data, (2) optimize the policy against that reward with a KL leash to a frozen reference model $π_{ref}$. DPO’s key observation is that for the standard KL-regularized RLHF objective, the optimal policy and the reward are related in closed form, so the reward can be written as a function of the policy itself (specifically the log-ratio $\log \frac{π_{θ}}{π_{ref}}$).

Substituting that into the Bradley–Terry preference model collapses the whole pipeline into one logistic-regression-style loss: push up the log-probability of the preferred response $y_{w}$ relative to the reference, push down the rejected response $y_{l}$. No reward network, no sampling, no on-policy rollouts: just a supervised loss over pairs. This is why the slides list DPO as “no reward model needed; training is stable,” in direct contrast to RL which is “reward-driven… needs feedback and is unstable to train.”

Mathematical Formulation

The KL-regularized RLHF objective DPO starts from is

$$\max_{π_{θ}} \; \mathbb{E}_{x \sim D,\; y \sim π_{θ} (y ∣ x)} [r (x, y)] - β D_{KL} (π_{θ} (y ∣ x) ∥ π_{ref} (y ∣ x)) .$$

Its optimal policy is $π^{*} (y ∣ x) = \frac{1}{Z (x)} π_{ref} (y ∣ x) \exp (\frac{1}{β} r (x, y))$, which can be inverted to express the reward as $r (x, y) = β \log \frac{π^{*} (y ∣ x)}{π_{ref} (y ∣ x)} + β \log Z (x)$. Plugging this into the Bradley–Terry model $P (y_{w} ≻ y_{l}) = σ (r (x, y_{w}) - r (x, y_{l}))$ makes $Z (x)$ cancel and yields the DPO loss:

$$L_{DPO} (π_{θ}; π_{ref}) = - \mathbb{E}_{(x, y_{w}, y_{l}) \sim D} \left[ \log σ \left( β \log \frac{π_{θ} (y_{w} ∣ x)}{π_{ref} (y_{w} ∣ x)} - β \log \frac{π_{θ} (y_{l} ∣ x)}{π_{ref} (y_{l} ∣ x)} \right) \right]$$

where:

$x$ — the prompt / context (in RecSys: the user interaction history, or its tokenized Semantic ID sequence)
$y_{w}$ — the preferred (“winning”) response; in GenRec, the positive next-item identifier the user actually engaged with
$y_{l}$ — the rejected (“losing”) response; a negative / dispreferred item identifier
$π_{θ}$ — the policy being trained (the generative model)
$π_{ref}$ — the frozen reference policy, usually the SFT checkpoint; the KL anchor that keeps $π_{θ}$ from drifting
$β$ — temperature controlling how hard the KL constraint pulls toward $π_{ref}$ (larger $β$ = stay closer to reference)
$σ$ — the logistic sigmoid; $D_{KL}$ — Kullback–Leibler divergence; $Z (x)$ — the (cancelled) partition function
$\hat{r}_{θ} (x, y) = β \log \frac{π_{θ} (y ∣ x)}{π_{ref} (y ∣ x)}$ — the implicit reward DPO optimizes; the loss is a binary classifier on $\hat{r}_{θ} (x, y_{w}) - \hat{r}_{θ} (x, y_{l})$
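
The cancellation of $Z (x)$ can be written out explicitly. The following is a worked step that uses only the inverted reward expression and the Bradley–Terry model stated above; nothing new is assumed beyond standard LaTeX notation for the same symbols.

```latex
\begin{aligned}
r(x, y_w) - r(x, y_l)
  &= \Big[ \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} + \beta \log Z(x) \Big]
   - \Big[ \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{ref}(y_l \mid x)} + \beta \log Z(x) \Big] \\
  &= \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{ref}(y_w \mid x)}
   - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}, \\[4pt]
P(y_w \succ y_l)
  &= \sigma\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{ref}(y_w \mid x)}
   - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \Big).
\end{aligned}
```

Because $β \log Z (x)$ depends only on $x$, it appears in both rewards and drops out of the difference. Replacing $π^{*}$ with the trainable $π_{θ}$ and taking the negative log-likelihood of this probability over $D$ gives $L_{DPO}$.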

The gradient is informative:

$$\nabla_{θ} L_{DPO} = - β \, \mathbb{E}_{(x, y_{w}, y_{l}) \sim D} \left[ σ \left( \hat{r}_{θ} (x, y_{l}) - \hat{r}_{θ} (x, y_{w}) \right) \left( \nabla_{θ} \log π_{θ} (y_{w} ∣ x) - \nabla_{θ} \log π_{θ} (y_{l} ∣ x) \right) \right]$$

It raises $\log π_{θ} (y_{w} ∣ x)$ and lowers $\log π_{θ} (y_{l} ∣ x)$, weighted by how badly the current implicit reward ranks the pair: the $σ (\cdot)$ term approaches 1 when the model ranks the pair wrongly and approaches 0 once the pair is ranked correctly by a wide margin. This is an automatic hard-example weighting that a naive log-likelihood objective lacks.
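
To make the weighting concrete, here is a minimal sketch that evaluates the per-pair loss and the gradient weight $σ (\hat{r}_{θ} (x, y_{l}) - \hat{r}_{θ} (x, y_{w}))$ from four scalar log-probabilities. The function names, the log-probability values, and the default `beta=0.1` are illustrative assumptions, not values from the lecture.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Implicit reward: beta * log(pi_theta / pi_ref), given both log-probs."""
    return beta * (logp_policy - logp_ref)


def dpo_loss_and_weight(lp_w, lp_l, lp_w_ref, lp_l_ref, beta=0.1):
    """Return (loss, gradient weight) for a single (y_w, y_l) pair.

    All inputs are sequence log-probabilities; the values used below
    are made up for illustration.
    """
    r_w = implicit_reward(lp_w, lp_w_ref, beta)
    r_l = implicit_reward(lp_l, lp_l_ref, beta)
    loss = -math.log(sigmoid(r_w - r_l))   # Bradley-Terry negative log-likelihood
    weight = sigmoid(r_l - r_w)            # large when the pair is mis-ranked
    return loss, weight


# Policy equal to the reference: both implicit rewards are zero.
print(dpo_loss_and_weight(-5.0, -5.0, -5.0, -5.0))
# Policy already prefers y_w relative to the reference.
print(dpo_loss_and_weight(-2.0, -8.0, -5.0, -5.0))
# Policy prefers y_l relative to the reference (mis-ranked pair).
print(dpo_loss_and_weight(-8.0, -2.0, -5.0, -5.0))
```

At initialization, where $π_{θ} = π_{ref}$, the weight is exactly 0.5 and the loss is $\log 2$. The mis-ranked pair gets a weight above 0.5 and the correctly ranked pair a weight below 0.5, which is the hard-example weighting described above.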

Key Properties / Variants

No reward model, no RL loop. Reward learning and policy optimization are merged into one supervised loss; there is no separate $r_{ϕ}$ network and no PPO-style sampling. This is the main reason the lecture flags DPO as more stable and cheaper to train than RL.
Reference model is required. The frozen $π_{ref}$ (typically the SFT model) appears in every term; it both defines the implicit reward and regularizes the update. DPO is normally run after an SFT stage.
Off-policy / offline. It learns from a fixed dataset of pre-collected preference pairs $D$ — no fresh on-policy rollouts are needed, unlike GRPO or PPO.
$β$ trades fit vs. drift. Small $β$ lets the policy move far from the reference (sharper preferences, more overfitting/degeneracy risk); large $β$ keeps it conservative.
Position in the GenRec objective menu (RS-L03b §4.1.3): the four training-objective choices are SFT (positives only, weak margin), SSL/contrastive (template-robust), RL (encodes explicit negatives & non-differentiable metrics, but unstable), and DPO (direct preferred-vs-rejected pairs, stable). RecSys variants named in the lectures: LettinGo, RosePO, SPRec, and S-DPO (softmax/multi-negative DPO for sequential recommendation); listed alongside GRPO and Rec-R1 as preference/RL fine-tuning for generative recommenders.
What a “pair” is in RecSys. $x$ = user history; $y_{w}$ = a positive item (its Semantic ID / identifier sequence); $y_{l}$ = a negative — a non-interacted, low-reward, or invalid item ID. This lets DPO inject the explicit-negative signal that plain next-item SFT (positives-only cross-entropy) cannot represent.
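
As a rough sketch of what such pairs could look like in code, the snippet below builds $(x, y_{w}, y_{l})$ triples by sampling non-interacted items as negatives. Everything here is a hypothetical illustration: the `PreferencePair` type, the `build_pairs` helper, the toy catalog, and the three-token Semantic IDs are assumptions, and the lecture does not prescribe a particular negative-sampling scheme.

```python
import random
from typing import Dict, List, NamedTuple, Sequence, Tuple

SemanticID = Tuple[int, ...]  # hypothetical: an item as a short token tuple


class PreferencePair(NamedTuple):
    x: Tuple[SemanticID, ...]  # user history as a sequence of Semantic IDs
    y_w: SemanticID            # preferred item (the one actually engaged with)
    y_l: SemanticID            # rejected item (here: a non-interacted item)


def build_pairs(
    history: Sequence[str],
    positive: str,
    catalog: Dict[str, SemanticID],
    num_negatives: int = 2,
    seed: int = 0,
) -> List[PreferencePair]:
    """Pair one positive with sampled non-interacted negatives."""
    rng = random.Random(seed)
    seen = set(history) | {positive}
    # Candidates are items the user has not interacted with.
    candidates = sorted(item for item in catalog if item not in seen)
    negatives = rng.sample(candidates, k=min(num_negatives, len(candidates)))
    x = tuple(catalog[item] for item in history)
    return [PreferencePair(x, catalog[positive], catalog[n]) for n in negatives]


# Toy catalog with made-up Semantic IDs.
catalog = {
    "i1": (3, 1, 4), "i2": (1, 5, 9), "i3": (2, 6, 5),
    "i4": (3, 5, 8), "i5": (9, 7, 9),
}
pairs = build_pairs(history=["i1", "i2"], positive="i3", catalog=catalog)
for pair in pairs:
    print(pair)
```

Low-reward or invalid item IDs could be used as $y_{l}$ in the same structure; only the candidate-selection line would change.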

Algorithm: DPO (offline preference tuning)
──────────────────────────────────────────────
Inputs: SFT model π_ref (frozen), preference data D = {(x, y_w, y_l)}, β
Initialize π_θ ← π_ref
 
Loop over minibatches {(x, y_w, y_l)} ~ D:
    # log-probs under both models (teacher-forced over the token sequence)
    lp_w_θ   = log π_θ(y_w | x);   lp_l_θ   = log π_θ(y_l | x)
    lp_w_ref = log π_ref(y_w | x); lp_l_ref = log π_ref(y_l | x)   # no grad
 
    # implicit reward log-ratios
    Δ_w = lp_w_θ - lp_w_ref
    Δ_l = lp_l_θ - lp_l_ref
 
    loss = -log σ( β * (Δ_w - Δ_l) )      # Bradley–Terry classification
    θ ← θ - η ∇_θ loss
return π_θ
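
The loop above can be sketched as runnable Python on a deliberately tiny problem. As a simplifying assumption, the policy here is a context-free softmax over a handful of item indices (so $x$ is dropped and $\log π_{θ} (y)$ is one log-softmax entry), updates are made per pair rather than per minibatch, and the gradient is written by hand. A real generative recommender would sum token log-probabilities under teacher forcing and rely on automatic differentiation. The names `dpo_train`, `ref_logits`, and the hyperparameter values are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dpo_train(
    ref_logits: Sequence[float],
    pairs: Sequence[Tuple[int, int]],  # (index of y_w, index of y_l)
    beta: float = 0.5,
    lr: float = 0.5,
    epochs: int = 50,
) -> Tuple[List[float], List[float]]:
    theta = list(ref_logits)            # initialize pi_theta <- pi_ref
    ref_logp = log_softmax(ref_logits)  # frozen reference log-probs
    mean_losses = []
    for _ in range(epochs):
        total = 0.0
        for y_w, y_l in pairs:
            logp = log_softmax(theta)
            delta_w = logp[y_w] - ref_logp[y_w]   # log-ratio for preferred
            delta_l = logp[y_l] - ref_logp[y_l]   # log-ratio for rejected
            margin = beta * (delta_w - delta_l)
            total += -math.log(sigmoid(margin))   # Bradley-Terry loss
            # For a softmax over logits, grad log pi(y) = onehot(y) - softmax,
            # so the softmax terms cancel in the y_w minus y_l difference.
            step = lr * beta * sigmoid(-margin)
            theta[y_w] += step
            theta[y_l] -= step
        mean_losses.append(total / len(pairs))
    return theta, mean_losses


ref = [0.0, 0.0, 0.0, 0.0]
theta, losses = dpo_train(ref, pairs=[(0, 1), (0, 2)])
print("first/last mean loss:", losses[0], losses[-1])
print("log-prob of item 0:", log_softmax(ref)[0], "->", log_softmax(theta)[0])
```

In this toy setting the preferred item's logit only ever increases and the rejected items' logits only decrease, so the mean loss falls from epoch to epoch while the step size shrinks as the margin grows.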

Connections

Replaces the two-stage pipeline of: Reinforcement Learning from Human Feedback (reward model + PPO)
Alternative to: GRPO (on-policy, group-relative, sampling-based reward fine-tuning) for the same “go beyond cross-entropy” goal
Usually preceded by: Supervised Fine-Tuning (SFT) (provides the reference policy $π_{ref}$)
Sits in the objective menu beside: Contrastive Learning / self-supervised pretraining, Negative Sampling
Foundations: an instance of off-policy preference optimization; uses the KL-regularized objective and a logistic (Bradley–Terry) preference model
Applied over: Semantic IDs generated by a Generative Recommender (e.g. TIGER-style token sequences)
Contrast in stability with: RL (reward-driven, unstable to train per the lecture)
