Decision Transformer

An approach to Offline Reinforcement Learning that casts RL as a sequence modeling problem. Instead of estimating value functions or computing policy gradients, it trains a GPT-style autoregressive transformer on trajectories, conditioning on desired return-to-go to generate actions.

Core Idea

RL as Sequence Modeling

Instead of asking “what action maximizes expected future reward?”, the Decision Transformer asks: “given that I want total return R̂_t, and I’m in state s_t, what action should I take?” This reframes RL as a conditional generation problem — no Bellman equations, no temporal difference learning.

Trajectory Representation

Trajectories are preprocessed into triples of (return-to-go, state, action):

τ = (R̂_1, s_1, a_1, R̂_2, s_2, a_2, …, R̂_T, s_T, a_T)

where R̂_t = Σ_{t′=t}^{T} r_{t′} is the return-to-go (total remaining reward from timestep t).
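The preprocessing above can be sketched in a few lines of Python. The helper names (`returns_to_go`, `to_triples`) are illustrative, not from the paper; only the reward-to-R̂ computation is specified by the text.

```python
def returns_to_go(rewards):
    """R̂_t = sum of rewards from timestep t to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):  # accumulate from the end of the trajectory
        running += r
        rtg.append(running)
    return list(reversed(rtg))

def to_triples(states, actions, rewards):
    """Interleave each timestep into a (return-to-go, state, action) triple."""
    return list(zip(returns_to_go(rewards), states, actions))
```

For example, `returns_to_go([1.0, 2.0, 3.0])` yields `[6.0, 5.0, 3.0]`: each entry is the total reward remaining from that timestep onward.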

Architecture

  • Uses a GPT (causal transformer) architecture
  • Input: sequence of triples, each embedded and fed as tokens
  • Each modality (return, state, action) has its own linear embedding layer
  • Positional encoding shared across the triple at each timestep
  • Output: predicted action given context
  • Trained with standard cross-entropy (discrete) or MSE (continuous) loss on actions
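A minimal numpy sketch of the input pipeline described above: one linear embedding per modality, and a positional vector shared by all three tokens of a timestep. The dimensions and random weights are placeholders; a real model would learn them and feed the resulting sequence to a causal transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, state_dim, act_dim, T = 8, 4, 2, 3  # illustrative sizes

# One linear embedding per modality (weights here are random placeholders).
W_rtg = rng.standard_normal((1, d_model))
W_state = rng.standard_normal((state_dim, d_model))
W_action = rng.standard_normal((act_dim, d_model))
pos = rng.standard_normal((T, d_model))  # one positional vector per timestep

def embed(rtgs, states, actions):
    """Interleave (R̂, s, a) tokens; the triple at timestep t shares pos[t]."""
    tokens = []
    for t in range(len(states)):
        tokens.append(np.array([rtgs[t]]) @ W_rtg + pos[t])
        tokens.append(states[t] @ W_state + pos[t])
        tokens.append(actions[t] @ W_action + pos[t])
    return np.stack(tokens)  # shape (3*T, d_model): input to the causal transformer

seq = embed([6.0, 5.0, 3.0],
            rng.standard_normal((T, state_dim)),
            rng.standard_normal((T, act_dim)))
assert seq.shape == (3 * T, d_model)
```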

Inference (Test Time)

  1. Set desired return-to-go to a target performance level
  2. Observe current state
  3. Model predicts action a_t
  4. Execute a_t, observe next state s_{t+1} and reward r_t
  5. Update: R̂_{t+1} = R̂_t − r_t
  6. Repeat

Controlling Performance

By conditioning on different return-to-go values, you can control the agent’s behavior: a high R̂ produces expert-level behavior, while a lower R̂ produces more conservative behavior.

Key Properties

  • No value estimation: avoids issues with bootstrapping, deadly triad, etc.
  • Offline: learns entirely from logged data, no environment interaction needed
  • Simple training: standard supervised learning (next-token prediction)
  • Hindsight conditioning: learns from suboptimal data by conditioning on actual achieved returns
  • Limitation: shares weaknesses of Monte Carlo Methods — relies on full trajectory returns, not bootstrapped estimates

Comparison with Standard RL

| Aspect | Standard RL | Decision Transformer |
| --- | --- | --- |
| Objective | Maximize expected return | Predict actions given desired return |
| Training | TD/MC + policy gradients | Supervised (next-token prediction) |
| Value functions | Required | Not needed |
| Bellman equations | Core component | Not used |
| Data | On-policy or off-policy | Offline dataset |
| Architecture | Various | Transformer (GPT) |
