The Deadly Triad
Definition
The Deadly Triad refers to the combination of three elements that, when present together, can cause RL algorithms to become unstable and diverge:
- Function Approximation — parameterized value function (not tabular)
- Bootstrapping — updating estimates based on other estimates (e.g., TD, not MC)
- Off-policy learning — learning about a policy different from the one generating data
Any Two Are Fine
Any two of these three elements can be combined safely:
- FA + Bootstrapping + On-policy → Semi-gradient TD converges (linear case)
- FA + No bootstrapping + Off-policy → MC with importance sampling converges
- Tabular + Bootstrapping + Off-policy → Q-learning converges (under standard step-size conditions)
It’s the combination of all three that can cause divergence.
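The first row can be checked directly. Below is a minimal sketch (assumptions: the 5-state random walk task, one-hot features — the tabular special case of linear FA — and expected synchronous updates rather than sampled transitions) showing on-policy semi-gradient TD(0) converging to the true values:

```python
import numpy as np

# 5-state random walk: states 0..4, both ends terminal,
# reward +1 on exiting right, 0 otherwise, gamma = 1.
# One-hot features make linear semi-gradient TD(0) equivalent to tabular TD.
n, gamma, alpha = 5, 1.0, 0.5
w = np.zeros(n)  # weights for one-hot features = per-state value estimates

for _ in range(500):  # expected (synchronous) TD(0) sweeps
    v = w.copy()
    for s in range(n):
        vl = v[s - 1] if s > 0 else 0.0        # left neighbour (terminal: 0)
        vr = v[s + 1] if s < n - 1 else 0.0    # right neighbour (terminal: 0)
        r_right = 1.0 if s == n - 1 else 0.0   # +1 only when exiting right
        target = 0.5 * (0.0 + gamma * vl) + 0.5 * (r_right + gamma * vr)
        w[s] += alpha * (target - v[s])        # semi-gradient TD(0) update

true_v = np.arange(1, n + 1) / (n + 1)  # known true values 1/6 .. 5/6
print(np.max(np.abs(w - true_v)))       # → ~0: converged to the true values
```

With on-policy updating the weighting over states matches the data distribution, which is exactly what makes the TD fixed point stable here.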
Why It Diverges
The Feedback Loop from Hell
- Off-policy data comes from a different distribution than what the target policy visits
- Function approximation generalizes — updating one state changes values of others
- Bootstrapping uses these (potentially wrong) estimates as targets
- Result: errors can amplify rather than decrease. The updates push weights in a direction that increases error on states the target policy actually visits.
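This loop can be seen in miniature in the classic two-state "w → 2w" example (in the spirit of Tsitsiklis and Van Roy; γ = 0.99 and α = 0.1 here are assumed values). One shared parameter w gives a state the value w and its successor the value 2w; if off-policy sampling only ever updates the first state, each TD update multiplies w by 1 + α(2γ − 1) > 1:

```python
gamma, alpha = 0.99, 0.1
w = 1.0  # single shared parameter: v(s1) = w, v(s2) = 2 * w

for _ in range(100):
    # TD(0) update at s1 only (off-policy sampling never corrects s2):
    # delta = r + gamma * v(s2) - v(s1), with r = 0
    delta = gamma * (2 * w) - w
    w += alpha * delta * 1.0  # gradient of v(s1) w.r.t. w is 1

print(w)  # grows without bound: each step multiplies w by 1 + 0.1 * 0.98
```

All three elements are present: generalization (one parameter covers both states), bootstrapping (the 2w estimate appears in the target), and off-policy sampling (s2 is never itself updated).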
Baird’s Counterexample
The classic demonstration of deadly triad divergence:
- 7-state episodic MDP: six "upper" states and one "lower" state, with all rewards zero
- Semi-gradient TD with linear FA
- Off-policy: the behavior policy visits all states uniformly, while the target policy always takes the action leading to the lower state
- Result: weights diverge to infinity, even though the true value function (zero everywhere) is exactly representable by the features
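The divergence is easy to reproduce. Below is a sketch of the expected (synchronous) semi-gradient TD(0) updates on Baird's MDP, using the standard construction (γ = 0.99 and the conventional initial weights; the step size is an assumption, with the state-visitation weighting absorbed into it):

```python
import numpy as np

# Baird's counterexample: 7 states, 8 weights, all rewards zero.
# Upper states 0-5: v(s) = 2*w[s] + w[7]; lower state 6: v = w[6] + 2*w[7].
X = np.zeros((7, 8))
for s in range(6):
    X[s, s], X[s, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha = 0.99, 0.01
w = np.ones(8)
w[6] = 10.0  # conventional initialization

# The target policy always takes the "solid" action, which leads to state 6.
# With importance-sampling corrections, every state contributes an expected
# TD(0) update toward the state-6 bootstrap target each sweep
# (the uniform state weighting is folded into alpha):
for _ in range(1000):
    v = X @ w
    dw = np.zeros(8)
    for s in range(7):
        delta = 0.0 + gamma * v[6] - v[s]  # all rewards are zero
        dw += alpha * delta * X[s]
    w += dw

print(np.linalg.norm(w))  # blows up, even though w = 0 gives zero error
```

Each sweep inflates w[7] (shared by every state), which inflates the bootstrap target v(state 6), which inflates the next round of TD errors: the feedback loop described above, in eight dimensions.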
Solutions and Mitigations
| Approach | How it helps |
|---|---|
| Use on-policy data | Removes off-policy element |
| Don’t bootstrap (MC) | Removes bootstrapping element |
| Use tabular | Removes FA element (but defeats the purpose for large problems) |
| Gradient-TD Methods (GTD, GTD2, TDC) | True gradient methods that converge even off-policy with linear FA |
| Experience Replay + Target Network (DQN) | Stabilizes training (doesn’t guarantee convergence but works in practice) |
| Emphatic-TD | Reweights updates to correct for off-policy distribution |
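The Gradient-TD row can be illustrated on the same Baird problem. Below is a minimal expected-update TDC sketch (the step sizes α = 0.005 and β = 0.05 are assumptions in the range typically used for this demonstration); unlike semi-gradient TD, the weights stay bounded:

```python
import numpy as np

# TDC (a gradient-TD method) on Baird's counterexample: the same setup
# where semi-gradient TD diverges, but TDC keeps the weights bounded.
X = np.zeros((7, 8))
for s in range(6):
    X[s, s], X[s, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha, beta = 0.99, 0.005, 0.05  # beta >> alpha: h is the fast timescale
w = np.ones(8); w[6] = 10.0             # same initialization that made TD blow up
h = np.zeros(8)                          # auxiliary weights estimating E[delta | s]

for _ in range(1000):
    v = X @ w
    dw = np.zeros(8); dh = np.zeros(8)
    for s in range(7):                   # expected update, uniform state weighting
        x = X[s]
        delta = gamma * v[6] - v[s]      # rewards are zero; target leads to state 6
        dw += (delta * x - gamma * X[6] * (x @ h)) / 7.0  # TDC correction term
        dh += (delta - x @ h) * x / 7.0
    w += alpha * dw
    h += beta * dh

print(np.linalg.norm(w))  # stays bounded, in contrast to semi-gradient TD
```

The extra term `gamma * X[6] * (x @ h)` is what turns the semi-gradient into a true gradient direction (on the projected Bellman error objective), which is why convergence is restored even with all three triad elements present.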
Connections
- Components: Function Approximation, Bootstrapping, On-Policy vs Off-Policy
- Demonstrated by: Baird’s counterexample, Tsitsiklis & Van Roy examples
- Solutions: Gradient-TD Methods, LSTD
- Practical workaround: Deep Q-Network (DQN) (experience replay + target networks)