Decision Transformer

An approach to Offline Reinforcement Learning that casts RL as a sequence modeling problem. Instead of estimating value functions or computing policy gradients, it trains a GPT-style autoregressive transformer on trajectories, conditioning on desired return-to-go to generate actions.

Core Idea

RL as Sequence Modeling

Instead of asking “what action maximizes expected future reward?”, the Decision Transformer asks: “given that I want total return R̂_t, and I’m in state s_t, what action should I take?” This reframes RL as a conditional generation problem — no Bellman equations, no temporal difference learning.

Trajectory Representation

Trajectories are preprocessed into triples of (return-to-go, state, action):

τ = (R̂_1, s_1, a_1, R̂_2, s_2, a_2, …, R̂_T, s_T, a_T)

where R̂_t = Σ_{t′=t}^{T} r_{t′} is the return-to-go (total remaining reward from timestep t).
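The preprocessing above can be sketched in a few lines of Python. The helper names (`returns_to_go`, `to_triples`) are illustrative, not from the paper; only the reward-to-R̂ computation is specified by the text.

```python
def returns_to_go(rewards):
    """R̂_t = sum of rewards from timestep t to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):  # accumulate from the end of the trajectory
        running += r
        rtg.append(running)
    return list(reversed(rtg))

def to_triples(states, actions, rewards):
    """Interleave each timestep into a (return-to-go, state, action) triple."""
    return list(zip(returns_to_go(rewards), states, actions))
```

For example, `returns_to_go([1.0, 2.0, 3.0])` yields `[6.0, 5.0, 3.0]`: each entry is the total reward remaining from that timestep onward.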

Architecture

  • Uses a GPT (causal transformer) architecture
  • Input: sequence of triples, each embedded and fed as tokens
  • Each modality (return, state, action) has its own linear embedding layer
  • Positional encoding shared across the triple at each timestep
  • Output: predicted action given context
  • Trained with standard cross-entropy (discrete) or MSE (continuous) loss on actions
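A minimal numpy sketch of the input pipeline described above: one linear embedding per modality, and a positional vector shared by all three tokens of a timestep. The dimensions and random weights are placeholders; a real model would learn them and feed the resulting sequence to a causal transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, state_dim, act_dim, T = 8, 4, 2, 3  # illustrative sizes

# One linear embedding per modality (weights here are random placeholders).
W_rtg = rng.standard_normal((1, d_model))
W_state = rng.standard_normal((state_dim, d_model))
W_action = rng.standard_normal((act_dim, d_model))
pos = rng.standard_normal((T, d_model))  # one positional vector per timestep

def embed(rtgs, states, actions):
    """Interleave (R̂, s, a) tokens; the triple at timestep t shares pos[t]."""
    tokens = []
    for t in range(len(states)):
        tokens.append(np.array([rtgs[t]]) @ W_rtg + pos[t])
        tokens.append(states[t] @ W_state + pos[t])
        tokens.append(actions[t] @ W_action + pos[t])
    return np.stack(tokens)  # shape (3*T, d_model): input to the causal transformer

seq = embed([6.0, 5.0, 3.0],
            rng.standard_normal((T, state_dim)),
            rng.standard_normal((T, act_dim)))
assert seq.shape == (3 * T, d_model)
```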

Inference (Test Time)

  1. Set desired return-to-go to a target performance level
  2. Observe current state
  3. Model predicts action a_t
  4. Execute a_t, observe next state s_{t+1} and reward r_t
  5. Update: R̂_{t+1} = R̂_t − r_t
  6. Repeat

Controlling Performance

By conditioning on different return-to-go values, you can control the agent’s behavior: a high R̂ produces expert-level behavior, while a lower R̂ produces more conservative behavior.

Key Properties

  • No value estimation: avoids issues with bootstrapping, deadly triad, etc.
  • Offline: learns entirely from logged data, no environment interaction needed
  • Simple training: standard supervised learning (next-token prediction)
  • Hindsight conditioning: learns from suboptimal data by conditioning on actual achieved returns
  • Limitation: shares weaknesses of Monte Carlo Methods — relies on full trajectory returns, not bootstrapped estimates

Comparison with Standard RL

| Aspect | Standard RL | Decision Transformer |
| --- | --- | --- |
| Objective | Maximize expected return | Predict actions given desired return |
| Training | TD/MC + policy gradients | Supervised (next-token prediction) |
| Value functions | Required | Not needed |
| Bellman equations | Core component | Not used |
| Data | On-policy or off-policy | Offline dataset |
| Architecture | Various | Transformer (GPT) |
