Off-Policy Divergence
The phenomenon in which Semi-Gradient Methods with Function Approximation, trained on off-policy data, drive the weights to grow without bound, even though the underlying prediction problem is simple and well defined.
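The update in question can be written explicitly. Assuming a linear value estimate $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and a per-step importance sampling ratio $\rho_t = \pi(A_t \mid S_t) / b(A_t \mid S_t)$, off-policy semi-gradient TD(0) is:

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \rho_t \left[ R_{t+1} + \gamma \, \mathbf{w}_t^\top \mathbf{x}(S_{t+1}) - \mathbf{w}_t^\top \mathbf{x}(S_t) \right] \mathbf{x}(S_t)
$$

The "semi" refers to the gradient being taken only through $\hat{v}(S_t, \mathbf{w})$, not through the bootstrapped target; combined with off-policy state weighting, this is what permits divergence.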
Baird’s Counterexample
The classic demonstration: a 7-state MDP in which semi-gradient TD(0) with linear FA diverges under off-policy updates, even though all rewards are zero and the true value function is exactly representable. All three elements of the Deadly Triad are present:
- Linear function approximation ✓
- Bootstrapping (TD) ✓
- Off-policy (behavior ≠ target policy) ✓
Even with small step sizes and otherwise favorable conditions, the weights diverge to infinity. This motivated Gradient-TD Methods, which follow a true gradient of a projected Bellman error and are provably convergent in this setting.
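The divergence is easy to reproduce. Below is a minimal simulation sketch of Baird's counterexample, following the standard setup (7 states, 8 weights; six "upper" states and one lower state; a "dashed" action jumping to a random upper state and a "solid" action jumping to the lower state; behavior policy takes dashed with probability 6/7, target policy always takes solid). The constants (step size, discount, initial weights) are conventional choices, not unique to the result:

```python
import numpy as np

# Baird's 7-state counterexample (standard setup; rewards are all zero).
# States 0-5 are the "upper" states; state 6 is the lower state.
def feature(s):
    """Linear features: 8 weights for 7 states (an overparameterized basis)."""
    x = np.zeros(8)
    if s < 6:            # upper state: v_hat = 2*w[s] + w[7]
        x[s] = 2.0
        x[7] = 1.0
    else:                # lower state: v_hat = w[6] + 2*w[7]
        x[6] = 1.0
        x[7] = 2.0
    return x

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])  # conventional initialization
norm0 = np.linalg.norm(w)

s = int(rng.integers(7))
for _ in range(5000):
    # Behavior policy: "dashed" (random upper state) w.p. 6/7,
    # "solid" (lower state) w.p. 1/7. Target policy: always "solid".
    if rng.random() < 6 / 7:
        solid, s_next = False, int(rng.integers(6))
    else:
        solid, s_next = True, 6
    rho = 7.0 if solid else 0.0            # importance ratio pi/b
    x, x_next = feature(s), feature(s_next)
    delta = gamma * (w @ x_next) - w @ x   # TD error (all rewards zero)
    w += alpha * rho * delta * x           # semi-gradient TD(0) update
    s = s_next

print(np.linalg.norm(w) / norm0)  # the weight norm has blown up
```

Note that the exact solution ($\mathbf{w} = \mathbf{0}$) is representable, yet the semi-gradient update moves away from it: the off-policy state distribution does not match the target policy's, so the usual stability argument for on-policy TD no longer applies.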