The Deadly Triad
Definition
The Deadly Triad refers to the combination of three elements that, when present together, can cause RL algorithms to become unstable and diverge:
- Function Approximation — parameterized value function (not tabular)
- Bootstrapping — updating estimates based on other estimates (e.g., TD, not MC)
- Off-policy learning — learning about a policy different from the one generating data
Any Two Are Fine
Any two of these three elements can be combined safely:
- FA + Bootstrapping + On-policy → Semi-gradient TD converges (linear case)
- FA + No bootstrapping + Off-policy → MC with importance sampling converges
- Tabular + Bootstrapping + Off-policy → Q-learning converges (under standard step-size conditions)
It’s the combination of all three that can cause divergence.
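The first row can be checked directly. Below is a minimal sketch (assumptions: the 5-state random walk task, one-hot features — the tabular special case of linear FA — and expected synchronous updates rather than sampled transitions) showing on-policy semi-gradient TD(0) converging to the true values:

```python
import numpy as np

# 5-state random walk: states 0..4, both ends terminal,
# reward +1 on exiting right, 0 otherwise, gamma = 1.
# One-hot features make linear semi-gradient TD(0) equivalent to tabular TD.
n, gamma, alpha = 5, 1.0, 0.5
w = np.zeros(n)  # weights for one-hot features = per-state value estimates

for _ in range(500):  # expected (synchronous) TD(0) sweeps
    v = w.copy()
    for s in range(n):
        vl = v[s - 1] if s > 0 else 0.0        # left neighbour (terminal: 0)
        vr = v[s + 1] if s < n - 1 else 0.0    # right neighbour (terminal: 0)
        r_right = 1.0 if s == n - 1 else 0.0   # +1 only when exiting right
        target = 0.5 * (0.0 + gamma * vl) + 0.5 * (r_right + gamma * vr)
        w[s] += alpha * (target - v[s])        # semi-gradient TD(0) update

true_v = np.arange(1, n + 1) / (n + 1)  # known true values 1/6 .. 5/6
print(np.max(np.abs(w - true_v)))       # → ~0: converged to the true values
```

With on-policy updating the weighting over states matches the data distribution, which is exactly what makes the TD fixed point stable here.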
Why It Diverges
The Feedback Loop from Hell
- Off-policy data comes from a different distribution than what the target policy visits
- Function approximation generalizes — updating one state changes values of others
- Bootstrapping uses these (potentially wrong) estimates as targets
- Result: errors can amplify rather than decrease. The updates push weights in a direction that increases error on states the target policy actually visits.
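This loop can be seen in miniature in the classic two-state "w → 2w" example (in the spirit of Tsitsiklis and Van Roy; γ = 0.99 and α = 0.1 here are assumed values). One shared parameter w gives a state the value w and its successor the value 2w; if off-policy sampling only ever updates the first state, each TD update multiplies w by 1 + α(2γ − 1) > 1:

```python
gamma, alpha = 0.99, 0.1
w = 1.0  # single shared parameter: v(s1) = w, v(s2) = 2 * w

for _ in range(100):
    # TD(0) update at s1 only (off-policy sampling never corrects s2):
    # delta = r + gamma * v(s2) - v(s1), with r = 0
    delta = gamma * (2 * w) - w
    w += alpha * delta * 1.0  # gradient of v(s1) w.r.t. w is 1

print(w)  # grows without bound: each step multiplies w by 1 + 0.1 * 0.98
```

All three elements are present: generalization (one parameter covers both states), bootstrapping (the 2w estimate appears in the target), and off-policy sampling (s2 is never itself updated).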
Baird’s Counterexample
The classic demonstration of deadly triad divergence:
- 7-state episodic MDP: six "upper" states and one "lower" state, with all rewards zero
- Semi-gradient TD with linear FA
- Off-policy: the behavior policy visits all states uniformly, while the target policy always takes the action leading to the lower state
- Result: weights diverge to infinity, even though the true value function (zero everywhere) is exactly representable by the features
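The divergence is easy to reproduce. Below is a sketch of the expected (synchronous) semi-gradient TD(0) updates on Baird's MDP, using the standard construction (γ = 0.99 and the conventional initial weights; the step size is an assumption, with the state-visitation weighting absorbed into it):

```python
import numpy as np

# Baird's counterexample: 7 states, 8 weights, all rewards zero.
# Upper states 0-5: v(s) = 2*w[s] + w[7]; lower state 6: v = w[6] + 2*w[7].
X = np.zeros((7, 8))
for s in range(6):
    X[s, s], X[s, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha = 0.99, 0.01
w = np.ones(8)
w[6] = 10.0  # conventional initialization

# The target policy always takes the "solid" action, which leads to state 6.
# With importance-sampling corrections, every state contributes an expected
# TD(0) update toward the state-6 bootstrap target each sweep
# (the uniform state weighting is folded into alpha):
for _ in range(1000):
    v = X @ w
    dw = np.zeros(8)
    for s in range(7):
        delta = 0.0 + gamma * v[6] - v[s]  # all rewards are zero
        dw += alpha * delta * X[s]
    w += dw

print(np.linalg.norm(w))  # blows up, even though w = 0 gives zero error
```

Each sweep inflates w[7] (shared by every state), which inflates the bootstrap target v(state 6), which inflates the next round of TD errors: the feedback loop described above, in eight dimensions.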
Solutions and Mitigations
| Approach | How it helps |
|---|---|
| Use on-policy data | Removes off-policy element |
| Don’t bootstrap (MC) | Removes bootstrapping element |
| Use tabular | Removes FA element (but defeats the purpose for large problems) |
| Gradient-TD Methods (GTD, GTD2, TDC) | True gradient methods that converge even off-policy with linear FA |
| Experience Replay + Target Network (DQN) | Stabilizes training (doesn’t guarantee convergence but works in practice) |
| Emphatic-TD | Reweights updates to correct for off-policy distribution |
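The Gradient-TD row can be illustrated on the same Baird problem. Below is a minimal expected-update TDC sketch (the step sizes α = 0.005 and β = 0.05 are assumptions in the range typically used for this demonstration); unlike semi-gradient TD, the weights stay bounded:

```python
import numpy as np

# TDC (a gradient-TD method) on Baird's counterexample: the same setup
# where semi-gradient TD diverges, but TDC keeps the weights bounded.
X = np.zeros((7, 8))
for s in range(6):
    X[s, s], X[s, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha, beta = 0.99, 0.005, 0.05  # beta >> alpha: h is the fast timescale
w = np.ones(8); w[6] = 10.0             # same initialization that made TD blow up
h = np.zeros(8)                          # auxiliary weights estimating E[delta | s]

for _ in range(1000):
    v = X @ w
    dw = np.zeros(8); dh = np.zeros(8)
    for s in range(7):                   # expected update, uniform state weighting
        x = X[s]
        delta = gamma * v[6] - v[s]      # rewards are zero; target leads to state 6
        dw += (delta * x - gamma * X[6] * (x @ h)) / 7.0  # TDC correction term
        dh += (delta - x @ h) * x / 7.0
    w += alpha * dw
    h += beta * dh

print(np.linalg.norm(w))  # stays bounded, in contrast to semi-gradient TD
```

The extra term `gamma * X[6] * (x @ h)` is what turns the semi-gradient into a true gradient direction (on the projected Bellman error objective), which is why convergence is restored even with all three triad elements present.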
Connections
- Components: Function Approximation, Bootstrapping, On-Policy vs Off-Policy
- Demonstrated by: Baird’s counterexample, Tsitsiklis & Van Roy examples
- Solutions: Gradient-TD Methods, LSTD
- Practical workaround: Deep Q-Network (DQN) (experience replay + target networks)