Decision Transformer
An approach to Offline Reinforcement Learning that casts RL as a sequence modeling problem. Instead of estimating value functions or computing policy gradients, it trains a GPT-style autoregressive transformer on trajectories, conditioning on desired return-to-go to generate actions.
Core Idea
RL as Sequence Modeling
Instead of asking “what action maximizes expected future reward?”, the Decision Transformer asks: “given that I want total return $\hat{R}_t$, and I’m in state $s_t$, what action $a_t$ should I take?” This reframes RL as a conditional generation problem — no Bellman equations, no temporal difference learning.
Trajectory Representation
Trajectories are preprocessed into triples of (return-to-go, state, action):

$$\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)$$

where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go (total remaining reward from timestep $t$).
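The return-to-go preprocessing above can be sketched as a single backward pass over each trajectory's rewards; `returns_to_go` is an illustrative helper name, not from the paper's code:

```python
def returns_to_go(rewards):
    """Compute R_t = sum of rewards from timestep t to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example: a 4-step trajectory with rewards [1, 0, 2, 1]
print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```

Each trajectory in the offline dataset is relabeled this way once, before training.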
Architecture
- Uses a GPT (causal transformer) architecture
- Input: sequence of triples, each embedded and fed as tokens
- Each modality (return, state, action) has its own linear embedding layer
- Positional encoding shared across the triple at each timestep
- Output: predicted action given context
- Trained with standard cross-entropy (discrete) or MSE (continuous) loss on actions
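A minimal sketch of how the input sequence is assembled from the three modalities (per-modality linear embeddings, one positional embedding shared across each timestep's triple). Dimensions, weight names, and the random initialization are illustrative assumptions, not the paper's code; the resulting token sequence would then be fed to a causal transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim, d_model, T = 4, 2, 8, 3

# One linear embedding per modality (illustrative random weights).
W_rtg = rng.normal(size=(1, d_model))
W_state = rng.normal(size=(state_dim, d_model))
W_action = rng.normal(size=(act_dim, d_model))
# One positional (timestep) embedding, shared by all three tokens of a triple.
W_pos = rng.normal(size=(T, d_model))

def embed_trajectory(rtgs, states, actions):
    """Interleave (return-to-go, state, action) triples into one token sequence."""
    tokens = []
    for t in range(len(states)):
        pos = W_pos[t]
        tokens.append(np.array([rtgs[t]]) @ W_rtg + pos)
        tokens.append(states[t] @ W_state + pos)
        tokens.append(actions[t] @ W_action + pos)
    return np.stack(tokens)  # shape: (3*T, d_model)

seq = embed_trajectory(
    rtgs=[3.0, 2.0, 1.0],
    states=rng.normal(size=(T, state_dim)),
    actions=rng.normal(size=(T, act_dim)),
)
print(seq.shape)  # (9, 8): three tokens per timestep
```

Training then reduces to supervised next-token prediction on the action positions only.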
Inference (Test Time)
- Set desired return-to-go $\hat{R}_1$ to a target performance level
- Observe current state $s_t$
- Model predicts action $a_t$
- Execute $a_t$, observe $r_t$, $s_{t+1}$
- Update: $\hat{R}_{t+1} = \hat{R}_t - r_t$
- Repeat
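The loop above can be sketched as follows; `predict_action` and `env_step` are stand-ins (a fixed policy and a toy environment), not the paper's API — the point is the bookkeeping, especially decrementing the return-to-go by each observed reward:

```python
def predict_action(context):
    # Stand-in for the transformer policy: always action 0.
    return 0

def env_step(state, action):
    # Toy environment: reward 1 per step, episode ends after 5 steps.
    next_state = state + 1
    return next_state, 1.0, next_state >= 5

def rollout(target_return):
    state, done = 0, False
    rtg = target_return                   # desired return-to-go
    context = []                          # grows with (rtg, state, action) triples
    total = 0.0
    while not done:
        action = predict_action(context)  # model conditions on the context so far
        context.append((rtg, state, action))
        state, reward, done = env_step(state, action)
        total += reward
        rtg -= reward                     # decrement return-to-go by observed reward
    return total

print(rollout(target_return=5.0))  # 5.0
```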
Controlling Performance
By conditioning on different return-to-go values, you can control the agent’s behavior: a high $\hat{R}$ produces expert-level behavior, a lower $\hat{R}$ produces more conservative behavior.
Key Properties
- No value estimation: avoids issues with bootstrapping, deadly triad, etc.
- Offline: learns entirely from logged data, no environment interaction needed
- Simple training: standard supervised learning (next-token prediction)
- Hindsight conditioning: learns from suboptimal data by conditioning on actual achieved returns
- Limitation: shares weaknesses of Monte Carlo Methods — relies on full trajectory returns, not bootstrapped estimates
Comparison with Standard RL
| Aspect | Standard RL | Decision Transformer |
|---|---|---|
| Objective | Maximize expected return | Predict actions given desired return |
| Training | TD/MC + policy gradients | Supervised (next-token prediction) |
| Value functions | Required | Not needed |
| Bellman equations | Core component | Not used |
| Data | On-policy or off-policy | Offline dataset |
| Architecture | Various | Transformer (GPT) |
Connections
- Alternative to value-based Offline Reinforcement Learning (e.g., Conservative Q-Learning (CQL))
- Related to Decision Diffuser (uses diffusion instead of transformers)
- Related to “Upside-Down RL” (single state input, no sequence model)
- Related to “Trajectory Transformer” (concurrent work, also predicts states and returns)
- Builds on transformer/GPT architecture from NLP
Appears In
- RL-L11 - SAC, Decision Transformer & Diffuser
- Chen et al., “Decision Transformer: Reinforcement Learning via Sequence Modeling” (2021)