Decision Diffuser

An Offline Reinforcement Learning method that uses diffusion models to generate future state trajectories, then derives actions via a separately trained inverse dynamics model. Unlike Decision Transformer, it generates states rather than actions directly, and can flexibly combine multiple conditioning signals (returns, constraints, skills).

Core Idea

Generating Trajectories, Not Actions

Instead of directly predicting actions, the Decision Diffuser generates a plan: a sequence of future states (s_t, s_{t+1}, …, s_{t+H}). A separate inverse dynamics model then computes which action to take: a_t = f_φ(s_t, s_{t+1}). This separation allows flexible conditioning on different objectives.
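The control loop implied above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `denoise_states` and `inv_dyn` are hypothetical stand-ins for the trained diffusion planner and inverse dynamics model.

```python
import numpy as np

STATE_DIM, HORIZON = 2, 8

def denoise_states(s_t, cond):
    """Stand-in for the diffusion model: returns a (HORIZON, STATE_DIM)
    plan of future states conditioned on `cond` (e.g., a target return)."""
    rng = np.random.default_rng(0)
    drift = np.linspace(0, cond, HORIZON)[:, None]   # toy "plan" drifting upward
    return s_t + drift + 0.01 * rng.standard_normal((HORIZON, STATE_DIM))

def inv_dyn(s_t, s_next):
    """Stand-in inverse dynamics f(s_t, s_{t+1}) -> a_t."""
    return s_next - s_t                              # toy action = state delta

def act(s_t, target_return):
    plan = denoise_states(s_t, target_return)        # 1. generate state plan
    return inv_dyn(s_t, plan[0])                     # 2. act toward first planned state

action = act(np.zeros(STATE_DIM), target_return=1.0)
print(action.shape)  # (2,)
```

In practice only the first action of each plan is executed and a fresh plan is generated at the next step (receding-horizon control).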

Architecture

  • Diffusion model ε_θ: temporal U-Net that generates state trajectories
    • Conditioning information (e.g., the target return) is projected to a latent vector via an MLP
    • Conditioning is randomly dropped with some probability during training (for classifier-free guidance)
  • Inverse dynamics model f_φ: MLP that predicts a_t from (s_t, s_{t+1})

Training

The diffusion model is trained to predict the noise added during the forward diffusion process (the standard denoising objective). The inverse dynamics model is trained separately with supervised learning on (s_t, a_t, s_{t+1}) tuples from the offline dataset.
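A toy version of the two separate losses, with placeholder models standing in for the temporal U-Net and the MLP (all names and the noise schedule here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alphas_bar = np.linspace(0.99, 0.01, 100)   # toy cumulative noise schedule

def diffusion_loss(x0, eps_model):
    """Standard denoising objective: predict the noise added at step k."""
    k = rng.integers(len(alphas_bar))
    eps = rng.standard_normal(x0.shape)
    x_k = np.sqrt(alphas_bar[k]) * x0 + np.sqrt(1 - alphas_bar[k]) * eps
    return np.mean((eps_model(x_k, k) - eps) ** 2)       # ||eps_hat - eps||^2

def inv_dyn_loss(s, s_next, a, f):
    """Supervised regression of actions from consecutive state pairs."""
    return np.mean((f(s, s_next) - a) ** 2)

# dummy models / data just to make the losses computable
eps_model = lambda x, k: np.zeros_like(x)
f = lambda s, s2: s2 - s
x0 = rng.standard_normal((8, 2))
loss_diff = diffusion_loss(x0, eps_model)
loss_inv = inv_dyn_loss(x0[:-1], x0[1:], x0[1:] - x0[:-1], f)
print(loss_inv)  # 0.0 (the toy f exactly recovers the toy actions)
```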

Conditioning (Classifier-Free Guidance)

The diffusion model can be conditioned on:

  • Returns: generate high-reward trajectories
  • Constraints: e.g., block-stacking height requirements in manipulation tasks
  • Skills: generate behavior that interpolates between learned skills

Uses classifier-free guidance: during training, the conditioning is randomly dropped (replaced with a null token) with some probability, so at inference the model can trade off conditioned against unconditioned noise predictions via a guidance weight.
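The standard classifier-free guidance combination at sampling time looks like this. The `eps_model(x, k, cond)` signature is an assumption; `cond=None` plays the role of the null token:

```python
import numpy as np

def guided_noise(eps_model, x_k, k, cond, w=1.2):
    """Combine conditioned and unconditioned noise predictions.
    w > 1 pushes samples toward the conditioning signal."""
    eps_uncond = eps_model(x_k, k, None)    # null-token prediction
    eps_cond = eps_model(x_k, k, cond)      # conditioned prediction
    return eps_uncond + w * (eps_cond - eps_uncond)

# toy model: conditioning simply shifts the predicted noise by `cond`
eps_model = lambda x, k, c: x + (0.0 if c is None else c)
out = guided_noise(eps_model, np.zeros(3), k=10, cond=1.0, w=2.0)
print(out)  # [2. 2. 2.]
```

With multiple conditioning signals (e.g., a return and a constraint), the same combination can be applied with each signal's conditioned prediction, which is what makes composing objectives natural in this framework.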

Low-Temperature Sampling

During denoising at inference, the variance of the noise added at each reverse step is scaled down by a factor α < 1 (low-temperature sampling), producing more deterministic, focused trajectory plans.
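A toy single denoising step showing the effect of the variance scaling (the step itself is simplified; `alpha` is the assumed temperature factor):

```python
import numpy as np

def denoise_step(mean, sigma, alpha=0.5, rng=None):
    """One simplified reverse-diffusion step: sample around `mean` with the
    step noise scaled by `alpha` (alpha=0 would be fully deterministic)."""
    rng = rng or np.random.default_rng()
    return mean + alpha * sigma * rng.standard_normal(np.shape(mean))

rng = np.random.default_rng(0)
hot = [denoise_step(np.zeros(1), 1.0, alpha=1.0, rng=rng) for _ in range(1000)]
cold = [denoise_step(np.zeros(1), 1.0, alpha=0.3, rng=rng) for _ in range(1000)]
print(np.std(hot) > np.std(cold))  # True: lower alpha, tighter samples
```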

Results

  • Competitive with or outperforms baselines (behavioral cloning, Conservative Q-Learning (CQL), Decision Transformer) on offline RL benchmarks
  • Better at single constraints than baselines
  • The only method tested that can effectively combine multiple constraints at once
  • Can generate behavior ‘in between’ separately learned skills

Key Differences from Decision Transformer

| Aspect | Decision Transformer | Decision Diffuser |
| --- | --- | --- |
| Generative model | Autoregressive transformer | Diffusion model |
| Generates | Actions directly | State trajectories |
| Action derivation | Direct output | Inverse dynamics model |
| Conditioning | Return-to-go only | Returns, constraints, skills |
| Multi-objective | Limited | Natural (classifier-free guidance) |

Connections

  • Alternative to Decision Transformer for offline RL
  • Uses diffusion models (from generative AI / image generation)
  • Related to “Diffuser” (Janner et al., 2022) which uses classifier guidance and predicts state-action pairs
  • Builds on denoising diffusion probabilistic models (DDPM)
  • Part of the broader trend of applying sequence/generative models to RL

Appears In