Classifier-Free Guidance

Definition

Classifier-Free Guidance (CFG)

Classifier-free guidance is a technique for conditional sampling from diffusion models that steers the reverse (denoising) process toward samples with a desired property $y$ without training a separate classifier $p (y ∣ x)$. A single noise-prediction network $ϵ_{θ}$ is trained jointly on conditioned inputs $ϵ_{θ} (x_{k}, y, k)$ and unconditioned inputs $ϵ_{θ} (x_{k}, \emptyset, k)$ (by randomly dropping the condition $y$ during training). At sampling time, the two predictions are linearly extrapolated with a guidance weight $ω$ to amplify the influence of $y$. In the Decision Diffuser, $y$ is a return level, a constraint, or a skill, and $x$ is a state trajectory.

Intuition

One Network, Two Modes

A diffusion model generates data by starting from pure Gaussian noise and iteratively denoising it. To make that generation conditional (e.g. “produce a high-return trajectory”), the older approach (classifier guidance) trained a separate classifier on noisy data and pushed samples uphill along $\nabla_{x} \log p (y ∣ x)$. Training a classifier on noisy inputs is awkward, however, and adds a second model.

CFG avoids this. During training, the same network sometimes sees the condition $y$ and sometimes sees a null token $\emptyset$ (the condition is “dropped”), so it learns to denoise both conditionally and unconditionally. At inference, the difference between the conditional and unconditional noise predictions, $ϵ_{θ} (x_{k}, y, k) - ϵ_{θ} (x_{k}, \emptyset, k)$, is proportional to $- \nabla_{x} \log p (y ∣ x)$: it carries the same information a classifier gradient would have supplied, with the sign flipped because the network predicts the noise to remove rather than the score. We then over-emphasize that direction by the weight $ω$, sharpening how strongly the sample obeys $y$.

Mathematical Formulation

A diffusion model defines a forward (noising) process that gradually corrupts data $x_{0}$ into noise over $K$ steps, and learns to reverse it.

Forward Diffusion (DDPM)

$q (x_{k} ∣ x_{k - 1}) = \mathcal{N} (x_{k}; \sqrt{1 - β_{k}} \, x_{k - 1}, β_{k} I), \qquad x_{k} = \sqrt{\bar{α}_{k}} \, x_{0} + \sqrt{1 - \bar{α}_{k}} \, ϵ$

where:

$x_{0}$ — clean sample (in Decision Diffuser, a state trajectory $(s_{t}, \dots, s_{t + H})$ )

$k = 1, \dots, K$ — diffusion timestep (not the RL time index)

$β_{k} \in (0, 1)$ — noise variance schedule

$\bar{α}_{k} = \prod_{i = 1}^{k} (1 - β_{i})$ — cumulative signal-retention factor

$ϵ \sim N (0, I)$ — the noise actually injected

The network $ϵ_{θ}$ is trained to predict that injected noise, with the condition $y$ randomly replaced by $\emptyset$ with probability $1 - p$ (dropout).
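The closed-form noising step $x_{k} = \sqrt{\bar{α}_{k}} x_{0} + \sqrt{1 - \bar{α}_{k}} ϵ$ that produces the network's training inputs can be sketched in NumPy as follows. This is a minimal illustration, not the Decision Diffuser code: the linear schedule, its endpoints, and the names `make_schedule` and `q_sample` are assumptions made for the example.

```python
import numpy as np

def make_schedule(K=100, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule (an assumed choice) and the cumulative alpha-bar."""
    betas = np.linspace(beta_min, beta_max, K)
    alpha_bars = np.cumprod(1.0 - betas)  # alpha_bar_k = prod_{i<=k} (1 - beta_i)
    return betas, alpha_bars

def q_sample(x0, k, alpha_bars, rng):
    """Draw x_k ~ q(x_k | x_0) in one shot; k is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)   # the noise actually injected
    a_bar = alpha_bars[k - 1]
    x_k = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_k, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = np.ones((8, 3))  # toy "trajectory": horizon 8, state dimension 3
x_k, eps = q_sample(x0, k=50, alpha_bars=alpha_bars, rng=rng)
```

Returning `eps` alongside `x_k` matters because the injected noise is the regression target for $ϵ_{θ}$.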

Denoising Training Loss (with condition dropout)

$L (θ) = \mathbb{E}_{k, x_{0}, ϵ, β} \left[ \lVert ϵ - ϵ_{θ} (x_{k}, (1 - β) y + β \emptyset, k) \rVert^{2} \right]$

where:

$ϵ_{θ}$ — noise-prediction network (a temporal U-Net in Decision Diffuser)

$y$ — the conditioning property (return, constraint, skill), projected to a latent $z$ via an MLP

$\emptyset$ — null / “unconditioned” token

$β \sim \mathrm{Bernoulli} (1 - p)$ — binary dropout mask (unrelated to the noise schedule $β_{k}$); $β = 1$ drops the condition so the model also learns $ϵ_{θ} (x_{k}, \emptyset, k)$
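One way to picture the loss with condition dropout is the NumPy sketch below. It is illustrative only: `toy_eps_theta` is a hypothetical linear stand-in for the temporal U-Net, the condition is assumed to be already embedded as a vector, and `p_keep` plays the role of $p$ (so the condition is dropped with probability $1 - p$).

```python
import numpy as np

def toy_eps_theta(x_k, cond, W):
    """Hypothetical stand-in for the network: linear in x_k plus the condition embedding."""
    return x_k @ W + cond

def cfg_training_loss(x0, y_emb, null_emb, alpha_bars, W, p_keep, rng):
    batch = x0.shape[0]
    k = rng.integers(1, len(alpha_bars) + 1, size=batch)   # k ~ Uniform{1, ..., K}
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[k - 1][:, None]
    x_k = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps  # forward noising
    # Dropout mask: 1 means "replace y by the null token", with probability 1 - p_keep.
    drop = (rng.random(batch) < 1.0 - p_keep).astype(float)[:, None]
    cond = (1.0 - drop) * y_emb + drop * null_emb
    pred = toy_eps_theta(x_k, cond, W)
    loss = np.mean(np.sum((eps - pred) ** 2, axis=1))        # squared L2 norm, batch mean
    return loss, drop

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))
x0 = rng.standard_normal((512, 4))
y_emb = np.ones((512, 4))      # assumed: condition already projected to a latent
null_emb = np.zeros(4)         # assumed representation of the null token
W = np.zeros((4, 4))
loss, drop = cfg_training_loss(x0, y_emb, null_emb, alpha_bars, W, p_keep=0.75, rng=rng)
```

Because the mask is drawn per sample, each minibatch mixes conditioned and unconditioned examples, which is what lets one set of weights serve both modes.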

At sampling time, the conditional and unconditional predictions are combined into a single guided noise estimate:

Classifier-Free Guided Noise Prediction

$\hat{ϵ}_{θ} (x_{k}, y, k) = ϵ_{θ} (x_{k}, \emptyset, k) + ω (ϵ_{θ} (x_{k}, y, k) - ϵ_{θ} (x_{k}, \emptyset, k))$

where:

$\hat{ϵ}_{θ}$ — guided noise used in the reverse step to compute $x_{k - 1}$

$ϵ_{θ} (x_{k}, \emptyset, k)$ — unconditional prediction

$ϵ_{θ} (x_{k}, y, k)$ — conditional prediction

$ω \geq 0$ — guidance weight: $ω = 0$ gives unconditional sampling, $ω = 1$ gives ordinary conditional sampling, $ω > 1$ amplifies adherence to $y$ (sharper conditioning, less diversity)
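The three regimes of $ω$ are easy to check numerically. The snippet below is a small sketch with made-up prediction vectors; `guided_noise` is a name chosen here for illustration.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, omega):
    """Classifier-free guided noise: extrapolate from unconditional toward conditional."""
    return eps_uncond + omega * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])  # made-up unconditional prediction
eps_cond = np.array([1.0, 3.0])    # made-up conditional prediction

for omega in (0.0, 1.0, 2.0):
    print(omega, guided_noise(eps_uncond, eps_cond, omega))
```

With these inputs, $ω = 0$ returns the unconditional vector, $ω = 1$ returns the conditional one, and $ω = 2$ steps twice as far along their difference, giving $(2, 5)$.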

Why the Extrapolation Works

Score-matching theory says $ϵ_{θ} (x_{k}, k) \propto - \nabla_{x} \log p (x_{k})$. By Bayes’ rule, $\log p (x_{k} ∣ y) = \log p (x_{k}) + \log p (y ∣ x_{k}) - \log p (y)$, and $\log p (y)$ does not depend on $x_{k}$, so the gap between the conditional and unconditional scores equals the classifier gradient $\nabla_{x} \log p (y ∣ x_{k})$. CFG reconstructs that gradient without a classifier and scales it by $ω$, which gives the same update form as classifier guidance with strength $ω$.
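Spelled out as a worked step, under the idealized assumption that the trained network equals the scaled score exactly, and writing $σ_{k} = \sqrt{1 - \bar{α}_{k}}$ (a shorthand introduced here, not used elsewhere in these notes):

$ϵ_{θ} (x_{k}, y, k) - ϵ_{θ} (x_{k}, \emptyset, k) = - σ_{k} \left[ \nabla_{x_{k}} \log p (x_{k} ∣ y) - \nabla_{x_{k}} \log p (x_{k}) \right] = - σ_{k} \nabla_{x_{k}} \log p (y ∣ x_{k})$

$\hat{ϵ}_{θ} (x_{k}, y, k) = - σ_{k} \left[ \nabla_{x_{k}} \log p (x_{k}) + ω \nabla_{x_{k}} \log p (y ∣ x_{k}) \right]$

Read as a score, the bracket in the second line is the gradient of $\log \left[ p (x_{k}) \, p (y ∣ x_{k})^{ω} \right]$. For $ω = 1$ this is the ordinary conditional score; for $ω > 1$ the sampler is steered toward a sharpened version of the conditional distribution rather than $p (x_{k} ∣ y)$ itself.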

Key Properties / Variants

No separate classifier: avoids training a noise-robust classifier $p (y ∣ x_{k})$; one network handles both conditional and unconditional generation.
Guidance weight $ω$ trades fidelity vs diversity: larger $ω$ produces samples that more strongly satisfy $y$ but reduces sample diversity (and can introduce artifacts).
Condition dropout probability $1 - p$: a hyperparameter; the model must see enough unconditioned examples to learn a usable $ϵ_{θ} (x_{k}, \emptyset, k)$.
Composable conditions: because conditioning is just an input to $ϵ_{θ}$ , multiple guidance signals (e.g. several constraints) can be combined — the property the Decision Diffuser exploits to satisfy combinations of constraints, which classifier guidance struggles with.
vs classifier guidance: the original Diffuser (Janner et al., 2022) uses classifier guidance — it perturbs the reverse process by the gradient of a learned return predictor — and generates state-action pairs; Decision Diffuser switches to CFG and generates states only (with an Inverse Dynamics Model recovering actions).
Low-temperature sampling is typically combined with CFG at inference: scaling down the variance of the Gaussian noise injected at each reverse step yields more deterministic, higher-quality plans.
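The composability property can be illustrated with a short sketch. The composition rule used here, summing each condition's guidance direction relative to the shared unconditional prediction, is an assumption made for the example rather than something these notes state; `composed_guided_noise` is likewise a hypothetical name.

```python
import numpy as np

def composed_guided_noise(eps_uncond, eps_conds, omega):
    """Combine several conditions by summing their guidance directions (assumed rule)."""
    direction = sum(e - eps_uncond for e in eps_conds)
    return eps_uncond + omega * direction

eps_uncond = np.array([0.0, 0.0])
eps_c1 = np.array([1.0, 0.0])  # made-up prediction under constraint 1
eps_c2 = np.array([0.0, 2.0])  # made-up prediction under constraint 2

single = composed_guided_noise(eps_uncond, [eps_c1], omega=1.5)
both = composed_guided_noise(eps_uncond, [eps_c1, eps_c2], omega=1.5)
```

With a single condition the function reduces to the ordinary guided estimate; with two, each constraint contributes its own push, at the cost of one extra forward pass per condition.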

Sampling procedure with CFG inside the reverse diffusion loop:

Algorithm: Conditional Sampling with Classifier-Free Guidance
─────────────────────────────────────────────────────────────
Input: trained ε_θ, condition y, guidance weight ω, schedule {β_k}
Sample x_K ~ N(0, I)                       # start from pure noise
 
Loop for k = K, K-1, ..., 1:
    # two forward passes through the SAME network
    ε_cond   ← ε_θ(x_k, y, k)              # conditional prediction
    ε_uncond ← ε_θ(x_k, ∅, k)              # unconditioned prediction
 
    # classifier-free guided noise (extrapolate)
    ε̂ ← ε_uncond + ω * (ε_cond - ε_uncond)
 
    # one reverse (denoising) step using ε̂
    x_{k-1} ← reverse_step(x_k, ε̂, k)      # optionally low-temperature
end Loop
 
return x_0                                 # e.g. a generated state trajectory
# Decision Diffuser then applies inverse dynamics: a_t = f_φ(s_t, s_{t+1})
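
The pseudocode above can be turned into a runnable toy, sketched below under several assumptions. `reverse_step` is taken to be the standard DDPM ancestral update with variance $β_{k}$, scaled by a `temperature` factor for low-temperature sampling. The trained network is replaced by `toy_eps_theta`, a hypothetical analytic noise predictor for Gaussian data centred at $y$ (or at 0 for the null token), which takes $\bar{α}_{k}$ directly instead of $k$. None of these names come from the Decision Diffuser code.

```python
import numpy as np

def toy_eps_theta(x_k, y, a_bar, data_std=0.1):
    """Hypothetical analytic stand-in for the trained network.

    Exact noise prediction when clean data is N(y, data_std^2 I);
    y=None plays the role of the null token (data centred at 0).
    """
    mean = np.zeros(x_k.shape[-1]) if y is None else np.asarray(y, dtype=float)
    var_k = a_bar * data_std**2 + (1.0 - a_bar)  # marginal variance of x_k
    return np.sqrt(1.0 - a_bar) * (x_k - np.sqrt(a_bar) * mean) / var_k

def cfg_sample(y, omega, betas, n, dim, rng, temperature=1.0):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((n, dim))            # x_K ~ N(0, I)
    for k in range(len(betas), 0, -1):           # k = K, K-1, ..., 1
        b, a, a_bar = betas[k - 1], alphas[k - 1], alpha_bars[k - 1]
        eps_cond = toy_eps_theta(x, y, a_bar)        # conditional pass
        eps_uncond = toy_eps_theta(x, None, a_bar)   # unconditioned pass
        eps_hat = eps_uncond + omega * (eps_cond - eps_uncond)
        # Assumed reverse step: DDPM posterior mean plus scaled Gaussian noise.
        mean = (x - b / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a)
        noise = rng.standard_normal(x.shape) if k > 1 else 0.0  # no noise on the last step
        x = mean + np.sqrt(temperature * b) * noise
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)
y = np.array([1.0, -0.5])
samples_w1 = cfg_sample(y, omega=1.0, betas=betas, n=2000, dim=2, rng=rng)
samples_w2 = cfg_sample(y, omega=2.0, betas=betas, n=2000, dim=2, rng=rng, temperature=0.5)
```

In this toy the guided samples cluster around $ω y$, so $ω = 2$ overshoots the conditional mean, and `temperature=0.5` visibly tightens the spread. Both effects are properties of the Gaussian stand-in and only illustrate the qualitative behaviour of a real model.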

Connections

Conditioning mechanism used by: Decision Diffuser
Action recovery after generation: Inverse Dynamics Model
Sits inside: Offline Reinforcement Learning (generate high-return plans from a fixed dataset)
Conceptual sibling: classifier guidance (original Diffuser, Janner et al.) — uses an explicit return-predictor gradient
Related conditioning idea in RL: Decision Transformer conditions on return-to-go via the input sequence rather than diffusion guidance
Contrast with value-based offline RL: Conservative Q-Learning (CQL)
Builds on denoising diffusion probabilistic models (DDPM) from generative vision

Appears In

RL-L11 - SAC, Decision Transformer & Diffuser
