Diffusion Models

Definition

Diffusion Model

A diffusion model is a generative model that learns a data distribution by reversing a fixed noising process. A forward (diffusion) process gradually corrupts a data point into pure Gaussian noise over $T$ steps; a learned reverse (denoising) process removes noise step-by-step to sample new data from noise. In recommendation it is the non-LLM generative backbone of slide RS-L03b’s “Generative Model (e.g., LLM, Diffusion)” box: it can either denoise recommender embeddings back onto the existing item pool, or generate new item content.

Intuition

Learn to denoise, then run the camera backwards

Imagine slowly adding static to a photo until nothing is left but noise: that is the fixed forward process, no learning required. The model’s only job is to learn the reverse step: given a noisy image at noise level $t$, predict the noise that was added, and subtract a bit of it. Stack such tiny denoising steps and you can start from pure noise and walk back to a clean sample. Because each step solves an easy regression problem (predict the noise), training is stable, unlike a GAN’s adversarial game. For recommendation, “the photo” can be a user/item embedding (denoise it toward a real catalogue item, conditioned on collaborative signal) or actual content like a fashion image.

Mathematical Formulation

Forward noising, reverse denoising, and the training loss (DDPM)

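Forward process (fixed, no parameters): add Gaussian noise on a variance schedule $\beta_1, \dots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

A convenient closed form lets us jump to any step in one shot (with $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$):

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big), \quad\text{i.e.}\quad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$

Reverse process (learned): a network parameterizes each denoising step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

Training objective: instead of the full variational bound, DDPM trains a noise-predictor $\epsilon_\theta$ with a simple MSE:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2\Big]$$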

where:

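  • $x_0$: clean data sample (an image, or a recommender embedding)
  • $x_t$: the sample after $t$ noising steps; $x_T$ is approximately pure Gaussian noise
  • $\beta_t$: variance schedule (how much noise is added at step $t$)
  • $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$: cumulative signal retained up to step $t$
  • $\epsilon \sim \mathcal{N}(0, I)$: the Gaussian noise actually added; $\epsilon_\theta$: the network’s prediction of it
  • $t$: diffusion step, sampled uniformly from $\{1, \dots, T\}$ during training
  • $\mu_\theta, \Sigma_\theta$: mean/covariance of the learned reverse step (recoverable from $\epsilon_\theta$)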
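
The closed-form forward jump and the noise-prediction loss can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it uses plain Python lists to stay dependency-free (a real model would use an array library), the linear schedule endpoints are illustrative assumptions, and the function names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`, `noise_prediction_loss`) are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_1..beta_T, linearly spaced (endpoints are assumptions); needs T >= 2."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bars[t - 1]  # t is 1-indexed, the list is 0-indexed
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))

# Toy usage: one training example for a 3-dimensional "embedding".
rng = random.Random(0)
T = 1000
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, -0.5, 0.25]
t = rng.randint(1, T)                      # t drawn uniformly from {1, ..., T}
xt, eps = q_sample(x0, t, alpha_bars, rng)
# A trained network would supply eps_pred = eps_theta(xt, t); here we only
# check the degenerate case of a perfect prediction.
loss_if_perfect = noise_prediction_loss(eps, eps)
```

In a real training loop the loss would be averaged over a batch of $(x_0, t, \epsilon)$ triples and minimized by gradient descent on the network parameters.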

Conditioning. To make sampling controllable (the recommendation use), the denoiser takes a condition $c$ (e.g., user history or a CF embedding), giving $\epsilon_\theta(x_t, t, c)$, optionally combined with Classifier-Free Guidance.
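
As a rough sketch of how Classifier-Free Guidance is usually wired in: the same network is queried with and without the condition, and the two noise predictions are blended. The convention below (scale 0 gives the unconditional prediction, scale 1 the conditional one, larger values extrapolate past it) is one common choice, not the only one. `toy_eps_model` is a hypothetical stand-in for a trained denoiser, and all function names are assumptions.

```python
import random

def maybe_drop_condition(cond, p_drop, rng):
    """Training-time trick: hide the condition with probability p_drop so one
    network learns both conditional and unconditional noise predictions."""
    return None if rng.random() < p_drop else cond

def guided_eps(eps_model, x_t, t, cond, guidance_scale):
    """Blend predictions: eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_uncond = eps_model(x_t, t, None)
    eps_cond = eps_model(x_t, t, cond)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def toy_eps_model(x_t, t, cond):
    """Hypothetical stand-in for a trained network; it only has the right shape."""
    shift = [0.0] * len(x_t) if cond is None else cond
    return [x - s for x, s in zip(x_t, shift)]

x_t = [0.5, -1.0]
user_cond = [0.2, 0.4]  # e.g. a CF embedding of the user (illustrative values)
eps_uncond_only = guided_eps(toy_eps_model, x_t, 10, user_cond, guidance_scale=0.0)
eps_plain = guided_eps(toy_eps_model, x_t, 10, user_cond, guidance_scale=1.0)
eps_strong = guided_eps(toy_eps_model, x_t, 10, user_cond, guidance_scale=3.0)
```

A larger guidance scale pushes samples harder toward the condition, typically at some cost in diversity.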

Key Properties / Variants

  • Sampling (ancestral) algorithm. Generation is the reverse loop, one denoising step at a time:
Algorithm: DDPM Sampling (conditioned on c)
──────────────────────────────────────────────
x_T ~ N(0, I)                       # start from pure noise
for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1 else z = 0
    eps = eps_θ(x_t, t, c)          # predict the noise (c = user/CF condition)
    # one reverse step: subtract a scaled portion of predicted noise
    x_{t-1} = (1/√α_t) * ( x_t - (β_t / √(1-ᾱ_t)) * eps ) + √β_t * z
return x_0                          # clean sample (embedding or content)
  • Why it is stable to train. The loss is a plain per-step MSE on noise — no adversarial min-max (unlike GANs), no autoregressive token ordering. This is the key contrast with the autoregressive semantic-ID decoders (TIGER-style) that dominate the rest of the GenRec lecture.
  • Three roles “generative” plays in RecSys (RS-L04 slide 3 explicitly disambiguates the term):
    1. Generate item identifiers — autoregressive over semantic IDs (TIGER); not a diffusion model. This is the main lecture focus.
    2. Diffusion for embedding denoising (DDRM, SIGIR 2024) — diffusion denoises user/item embeddings; collaborative signal conditions the reverse process; output is grounded in the existing item pool, so no new item content is created.
    3. Diffusion for content generation (DiFashion, SIGIR 2024) — generates new item content (fashion images) conditioned on user history + constraints.
  • Trainable or frozen. On RS-L03b’s generative-recommender diagram the backbone carries both flame and snowflake icons — a diffusion denoiser can be trained on platform data or used as a frozen pretrained generator.
  • Continuous vs. discrete output. Diffusion operates naturally on continuous vectors (embeddings, pixels). To recommend a real item it must be grounded: either denoise toward a clean embedding $x_0$ and look up the nearest catalogue embedding (DDRM), or pair with a retrieval/ranking step. This is analogous to the validity/grounding problem the autoregressive route solves with a trie.
  • Latent diffusion. Running the process in a compressed latent space (rather than raw pixels/full embeddings) cuts cost — the standard trick for image generators and applicable to large recommender embedding spaces.
  • Cost. Sampling needs many sequential reverse steps ($T$ can be hundreds), so inference latency is a real concern under a recommendation serving budget, mirroring the decoding-cost limitation of generative recommenders generally.
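
The sampling loop and the nearest-catalogue grounding described in the list above can be sketched together. This is a toy illustration under stated assumptions, not DDRM's actual procedure: `oracle_eps_model` is a hypothetical stand-in that already knows the clean target (a trained network would replace it), the schedule and catalogue values are made up, and plain lists are used to avoid dependencies.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def ddpm_sample(eps_model, cond, dim, betas, rng):
    """Ancestral sampling: start at x_T ~ N(0, I) and apply T reverse steps."""
    alpha_bars = cumulative_alphas(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        beta, abar = betas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, t, cond)
        coef = beta / math.sqrt(1.0 - abar)
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta) for xi, ei in zip(x, eps)]
        if t > 1:   # add fresh noise on every step except the last
            x = [m + math.sqrt(beta) * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

def nearest_item(x0, catalogue):
    """Ground a continuous sample: id of the closest catalogue embedding."""
    def sq_dist(item_id):
        return sum((a - b) ** 2 for a, b in zip(x0, catalogue[item_id]))
    return min(catalogue, key=sq_dist)

T = 200
betas = [1e-4 + i * (0.05 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = cumulative_alphas(betas)

def oracle_eps_model(x_t, t, cond):
    """Hypothetical denoiser that treats the condition as the clean target."""
    abar = alpha_bars[t - 1]
    return [(x - math.sqrt(abar) * c) / math.sqrt(1.0 - abar) for x, c in zip(x_t, cond)]

catalogue = {"item_a": [1.0, 0.0, 0.0], "item_b": [0.0, 1.0, 0.0], "item_c": [0.0, 0.0, 1.0]}
cond = [0.1, 0.9, 0.0]   # illustrative collaborative signal
sample = ddpm_sample(oracle_eps_model, cond, 3, betas, random.Random(0))
recommended = nearest_item(sample, catalogue)
```

The loop makes the cost point concrete: each of the $T$ iterations calls the denoiser once and depends on the previous one, so the steps cannot be parallelized across $t$.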

Connections

Appears In