Chapter 11: Off-policy Methods with Approximation
Overview
Off-policy learning with function approximation is significantly harder than the on-policy case. Tabular off-policy methods extend naturally to semi-gradient form, but the resulting methods do not converge as robustly as their on-policy counterparts. This chapter explores why these methods diverge, introduces the Deadly Triad, analyzes the geometry of linear value-function approximation, and presents algorithms with stronger convergence guarantees, such as Gradient-TD Methods and Emphatic-TD.
11.1 Semi-gradient Methods
To convert tabular off-policy algorithms to semi-gradient form, we replace the state-value/action-value array updates with updates to a weight vector $\mathbf{w} \in \mathbb{R}^d$.
Importance Sampling Ratio
For off-policy learning, we use the per-step importance sampling ratio:
$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$
Semi-gradient Off-policy TD(0)
The weight update is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$
where $\delta_t$ is the TD error:
- Episodic: $\delta_t \doteq R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
- Continuing (average reward): $\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
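As a minimal sketch of this update with linear features $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ (the function name and the toy transition are illustrative, not from the text):

```python
import numpy as np

def semi_gradient_off_policy_td0(w, x_t, x_next, reward, rho, alpha, gamma):
    """One semi-gradient off-policy TD(0) update with linear features.

    rho is the importance sampling ratio pi(A_t|S_t) / b(A_t|S_t).
    """
    delta = reward + gamma * (w @ x_next) - (w @ x_t)  # episodic TD error
    # "Semi-gradient": only the gradient of v_hat(S_t, w), i.e. x_t, appears;
    # the target's dependence on w is ignored.
    return w + alpha * rho * delta * x_t

# One update on a single illustrative transition:
w = semi_gradient_off_policy_td0(
    w=np.zeros(3),
    x_t=np.array([1.0, 0.0, 0.0]),
    x_next=np.array([0.0, 1.0, 0.0]),
    reward=1.0, rho=2.0, alpha=0.1, gamma=0.9,
)
```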
Semi-gradient Expected Sarsa
This algorithm does not require importance sampling because its TD error takes an expectation over the target policy $\pi$:
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \qquad \delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$$
11.2 Examples of Off-policy Divergence
Simple off-policy semi-gradient methods can be unstable and diverge to infinity.
The w-to-2w Counterexample
Consider two states with features $x = 1$ and $x = 2$ under a single parameter $w$, so the value estimates are $w$ and $2w$. A transition from the first state to the second yields reward 0. The semi-gradient TD(0) update on this transition is:
$$w_{t+1} = w_t + \alpha\left(0 + \gamma\, 2w_t - w_t\right) \cdot 1 = \left(1 + \alpha(2\gamma - 1)\right) w_t$$
Divergence Condition
If $\gamma > 0.5$, the constant $1 + \alpha(2\gamma - 1)$ is greater than 1, and $w_t$ diverges to $+\infty$ or $-\infty$ (for any $w_0 \neq 0$) if this transition is updated repeatedly.
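A few lines of simulation show the geometric growth (the constants $\alpha = 0.1$, $\gamma = 0.9$ are illustrative):

```python
# Repeated w-to-2w updates: each one multiplies w by 1 + alpha*(2*gamma - 1).
alpha, gamma = 0.1, 0.9          # gamma > 0.5, so the multiplier is 1.08 > 1
w = 1.0
for _ in range(100):
    delta = 0.0 + gamma * 2 * w - w   # reward 0, v(s) = w, v(s') = 2w
    w += alpha * delta * 1.0          # gradient = feature of first state = 1
# After 100 updates, w has grown by a factor of 1.08**100 (about 2200).
```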
Baird’s Counterexample
A 7-state MDP with two actions and $\gamma = 0.99$. The dashed action takes the system to one of the six upper states with equal probability; the solid action takes it to the seventh state. The behavior policy selects dashed and solid with probabilities 6/7 and 1/7, producing a uniform next-state distribution, while the target policy always chooses the solid action. All rewards are zero, so $v_\pi(s) = 0$ for all states, which is exactly representable with $\mathbf{w} = \mathbf{0}$.
```mermaid
graph TD
    S1[s1: 2w1 + w8] -->|solid| S7[s7: w7 + 2w8]
    S2[s2: 2w2 + w8] -->|solid| S7
    S3[s3: 2w3 + w8] -->|solid| S7
    S4[s4: 2w4 + w8] -->|solid| S7
    S5[s5: 2w5 + w8] -->|solid| S7
    S6[s6: 2w6 + w8] -->|solid| S7
    S7 -->|solid| S7
    S1 -.->|dashed| S123456
    S2 -.->|dashed| S123456
    S3 -.->|dashed| S123456
    S4 -.->|dashed| S123456
    S5 -.->|dashed| S123456
    S6 -.->|dashed| S123456
    S7 -.->|dashed| S123456
    subgraph S123456 [Upper States]
        S1
        S2
        S3
        S4
        S5
        S6
    end
```
- Result: Semi-gradient TD(0) and even semi-gradient DP (expected updates) diverge to infinity for any positive step size $\alpha > 0$.
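The divergence can be reproduced in a few lines. This is a sketch of the expected (DP-style) semi-gradient update, averaged uniformly over states, with an illustrative step size; the feature construction and initialization follow the description above:

```python
import numpy as np

# Features for Baird's counterexample: 7 states, 8 weights.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0       # upper states s1..s6: value estimate 2*w_i + w_8
    X[i, 7] = 1.0
X[6, 6] = 1.0           # state s7: value estimate w_7 + 2*w_8
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.1
w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0])

for _ in range(1000):
    v = X @ w
    # Under the target policy the solid action always leads to s7 with
    # reward 0, so every state's expected TD error is gamma*v[s7] - v[s].
    delta = gamma * v[6] - v
    w = w + (alpha / 7) * X.T @ delta   # uniform average over the 7 states

# Despite v_pi = 0 being exactly representable (w = 0), the weights blow up.
```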
11.3 The Deadly Triad
Stability is jeopardized when three elements are combined:
- Function Approximation: Linear or non-linear (ANNs).
- Bootstrapping: Targets based on existing estimates (TD, DP).
- Off-Policy Learning: Training on a distribution different from the target policy.
The Triad
Any two of these are safe, but the combination of all three often leads to Off-Policy Divergence. We cannot give up function approximation (scalability) or bootstrapping (efficiency), so we must improve off-policy learning methods.
11.4 Linear Value-function Geometry
We can view value functions as vectors in an $|\mathcal{S}|$-dimensional space. Linear approximation restricts these to a $d$-dimensional subspace ($d \ll |\mathcal{S}|$).
Key Operators
- Bellman Operator ($B_\pi$): Takes a value function and produces the expected one-step return: $(B_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v(s')\right]$.
- Projection Operator ($\Pi$): Projects any value function back into the representable subspace: $\Pi v \doteq v_{\mathbf{w}}$, where $\mathbf{w} = \arg\min_{\mathbf{w}} \lVert v - v_{\mathbf{w}} \rVert_\mu^2$.
- Projection Matrix: $\Pi = \mathbf{X}\left(\mathbf{X}^\top \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{D}$, where $\mathbf{X}$ is the $|\mathcal{S}| \times d$ feature matrix and $\mathbf{D}$ is diagonal with the state distribution $\mu$ on its diagonal.
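The projection matrix is easy to check numerically. A small sketch (the 3-state, 2-feature example and the distribution $\mu$ are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 0.0],          # feature matrix: one row per state
              [0.0, 1.0],
              [1.0, 1.0]])
mu = np.array([0.5, 0.25, 0.25])   # state distribution
D = np.diag(mu)

# Projection onto the representable subspace, weighted by mu.
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v = np.array([3.0, -1.0, 2.0])     # an arbitrary value function
v_proj = Pi @ v                    # nearest representable value function
```

As a projection must be, `Pi` is idempotent: projecting `v_proj` again leaves it unchanged.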
Error Measures
- Mean Square Value Error: Distance to the true value function: $\overline{\mathrm{VE}}(\mathbf{w}) \doteq \lVert v_{\mathbf{w}} - v_\pi \rVert_\mu^2$.
- Mean Square Bellman Error: Distance between the value function and its image under $B_\pi$: $\overline{\mathrm{BE}}(\mathbf{w}) \doteq \lVert \bar{\delta}_{\mathbf{w}} \rVert_\mu^2$, where $\bar{\delta}_{\mathbf{w}} \doteq B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}$ is the Bellman error vector.
- Mean Square Projected Bellman Error: Norm of the Bellman error vector after projection back into the subspace: $\overline{\mathrm{PBE}}(\mathbf{w}) \doteq \lVert \Pi \bar{\delta}_{\mathbf{w}} \rVert_\mu^2$.
The TD Fixed Point
The point where $\overline{\mathrm{PBE}}(\mathbf{w}) = 0$, i.e., where $\Pi B_\pi v_{\mathbf{w}} = v_{\mathbf{w}}$, is the TD Fixed Point $\mathbf{w}_{\mathrm{TD}}$.
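For linear features the TD fixed point can be computed directly as $\mathbf{w}_{\mathrm{TD}} = \mathbf{A}^{-1}\mathbf{b}$, with $\mathbf{A} = \mathbf{X}^\top \mathbf{D}(\mathbf{I} - \gamma \mathbf{P})\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^\top \mathbf{D}\mathbf{r}$. A sketch on a made-up 3-state cycle (all numbers illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],    # deterministic 3-state cycle
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, 0.0])     # expected reward on leaving each state
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
mu = np.ones(3) / 3               # stationary distribution of the cycle
D = np.diag(mu)

A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)      # the TD fixed point

# At w_TD the Bellman error vector is orthogonal to the subspace (in the
# mu-weighted sense), so its projection is zero: PBE = 0.
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D
delta_bar = r + gamma * P @ (X @ w_td) - X @ w_td
```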
11.5 Gradient Descent in the Bellman Error
Attempting true Stochastic Gradient Descent (SGD) on the BE yields the following algorithm.
Residual-Gradient Algorithm
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \left(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\right)$$
Unlike the semi-gradient methods, this uses the full gradient, including the gradient through the bootstrapping target.
Double Sampling Problem
To get an unbiased estimate of the gradient, one needs two independent samples of the next state $S_{t+1}$ from the same state $S_t$: one for the TD error $\delta_t$ and one for the gradient term. This is only possible in simulated environments (where states can be revisited at will) or in deterministic systems, where the two samples necessarily coincide.
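The bias from reusing one sample can be seen numerically. A toy sketch (the distribution is made up): suppose the TD error from some state is $\pm 1$ with equal probability, so the BE-style quantity $(\mathbb{E}[\delta])^2$ is 0, yet squaring a single sample always gives 1:

```python
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.choice([-1.0, 1.0], size=100_000)   # sampled TD errors, mean 0

# Same sample used twice: estimates E[delta^2] = 1, not (E[delta])^2 = 0.
single = np.mean(deltas * deltas)

# Two independent samples per state: unbiased estimate of (E[delta])^2.
pairs = deltas.reshape(-1, 2)
double = np.mean(pairs[:, 0] * pairs[:, 1])
```

The single-sample product converges to $\mathbb{E}[\delta^2] = (\mathbb{E}[\delta])^2 + \mathrm{Var}(\delta)$, which is why naively minimizing the squared TD error differs from minimizing the BE.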
11.6 The Bellman Error is Not Learnable
A quantity is learnable if it can be estimated from the observed sequence of features, actions, and rewards.
- VE is not learnable, but the parameter that minimizes it is learnable (via Monte Carlo).
- BE is not learnable, and its minimizing parameter is also not learnable. Two different MDPs can produce identical data but have different BE-minimizing solutions (A-presplit example).
- PBE is learnable and is the target of Gradient-TD Methods.
11.7 Gradient-TD Methods
Stable $O(d)$ methods that perform SGD on the PBE. They use a second weight vector $\mathbf{v}_t \in \mathbb{R}^d$ to estimate a part of the gradient.
GTD2 Algorithm
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^\top \mathbf{v}_t$$
TDC (Gradient-TD with Correction)
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left(\delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \mathbf{x}_t^\top \mathbf{v}_t\right)$$
Both use the same secondary update:
$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t \left(\delta_t - \mathbf{v}_t^\top \mathbf{x}_t\right) \mathbf{x}_t$$
TDC Derivation
The PBE gradient can be written as
$$\nabla \overline{\mathrm{PBE}}(\mathbf{w}) = 2\, \mathbb{E}\!\left[\rho_t \left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^\top\right] \mathbb{E}\!\left[\mathbf{x}_t \mathbf{x}_t^\top\right]^{-1} \mathbb{E}\!\left[\rho_t \delta_t \mathbf{x}_t\right]$$
The vector $\mathbf{v}_t$ learns the product of the last two factors, $\mathbf{v} \approx \mathbb{E}\left[\mathbf{x}_t \mathbf{x}_t^\top\right]^{-1} \mathbb{E}\left[\rho_t \delta_t \mathbf{x}_t\right]$, the solution to a linear least-squares problem.
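A sketch of one TDC update with linear features (the function name, constants, and toy transition are illustrative; $\beta$ is the secondary step size):

```python
import numpy as np

def tdc_update(w, v, x_t, x_next, reward, rho, alpha, beta, gamma):
    """One TDC update: semi-gradient TD plus a correction term.

    v estimates E[x x^T]^{-1} E[rho * delta * x], the last two factors
    of the PBE gradient.
    """
    delta = reward + gamma * (w @ x_next) - (w @ x_t)
    w_new = w + alpha * rho * (delta * x_t - gamma * x_next * (x_t @ v))
    v_new = v + beta * rho * (delta - v @ x_t) * x_t
    return w_new, v_new

w, v = tdc_update(
    w=np.array([1.0, 1.0]), v=np.array([0.5, 0.0]),
    x_t=np.array([1.0, 0.0]), x_next=np.array([0.0, 1.0]),
    reward=0.0, rho=1.0, alpha=0.1, beta=0.05, gamma=0.9,
)
```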
11.8 Emphatic-TD Methods
Reweight updates to mimic an on-policy distribution, ensuring stability.
One-step Emphatic-TD(0)
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad M_t \doteq \gamma \rho_{t-1} M_{t-1} + I_t$$
- $M_t$: Emphasis (initialized with $M_{-1} = 0$)
- $I_t$: Interest (user-defined importance of states)
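A sketch of the update with linear features (names and the toy transition are illustrative; on the first step $M_{\text{prev}} = 0$, so the previous ratio is irrelevant):

```python
import numpy as np

def emphatic_td0_update(w, M_prev, rho_prev, interest, x_t, x_next,
                        reward, rho_t, alpha, gamma):
    """One-step Emphatic-TD(0) with linear features; returns (w, M_t)."""
    M_t = gamma * rho_prev * M_prev + interest      # emphasis
    delta = reward + gamma * (w @ x_next) - (w @ x_t)
    w_new = w + alpha * M_t * rho_t * delta * x_t   # emphasis-weighted update
    return w_new, M_t

w, M = emphatic_td0_update(
    w=np.zeros(2), M_prev=0.0, rho_prev=1.0, interest=1.0,
    x_t=np.array([1.0, 0.0]), x_next=np.array([0.0, 1.0]),
    reward=1.0, rho_t=2.0, alpha=0.1, gamma=0.9,
)
```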
Summary
| Problem | Solution | Stable? | O(d)? |
|---|---|---|---|
| Off-policy Divergence | On-policy training or SGD | Yes | Yes |
| Deadly Triad | Avoid one of the three | Yes | Varies |
| Bellman Error | Residual Gradient | Yes | Yes (but needs double sampling, and BE is not learnable) |
| Projected Bellman Error | Gradient-TD (TDC/GTD2) | Yes | Yes |
| Distribution Mismatch | Emphatic-TD | Yes | Yes |