Chapter 11: Off-policy Methods with Approximation

Overview

Off-policy learning with function approximation is significantly harder than the on-policy case. Tabular off-policy methods extend naturally to semi-gradient form, but these extensions do not converge as robustly. This chapter explores why such methods can diverge, introduces the Deadly Triad, analyzes the geometry of linear value-function approximation, and presents algorithms with stronger convergence guarantees, such as Gradient-TD Methods and Emphatic-TD.


11.1 Semi-gradient Methods

To convert tabular off-policy algorithms to semi-gradient form, we replace the state-value (or action-value) array updates with updates to a weight vector $\mathbf{w} \in \mathbb{R}^d$, using the approximate value function $\hat{v}(s, \mathbf{w})$ and its gradient.

Importance Sampling Ratio

For off-policy learning, we weight each update by the per-step importance sampling ratio:

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

Semi-gradient Off-policy TD(0)

The weight update is:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$

where $\delta_t$ is the TD error:

  • Episodic: $\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
  • Continuing: $\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
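As a minimal sketch of the episodic update (linear features; all names and numbers here are illustrative, not from the book's pseudocode):

```python
import numpy as np

def off_policy_td0_step(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features.

    w      -- weight vector
    x      -- feature vector of the current state
    x_next -- feature vector of the next state
    rho    -- importance sampling ratio pi(a|s) / b(a|s)
    """
    # TD error: R + gamma * v(s') - v(s), with v(s) = w . x
    delta = reward + gamma * w @ x_next - w @ x
    # Semi-gradient: only the current state's value estimate is differentiated
    return w + alpha * rho * delta * x

w = off_policy_td0_step(np.zeros(2), np.array([1.0, 0.0]),
                        np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```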

Semi-gradient Expected Sarsa

This algorithm does not require importance sampling because its target takes an expectation over the target policy $\pi$:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

with

$$\delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$$
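A sketch of this update with linear action-value features (the helper name and numbers are illustrative):

```python
import numpy as np

def expected_sarsa_step(w, x_sa, x_next_all, pi_next, reward,
                        alpha=0.1, gamma=0.9):
    """One semi-gradient Expected Sarsa update with linear action-value features.

    x_sa       -- feature vector for the current pair (S_t, A_t)
    x_next_all -- rows are feature vectors for (S_{t+1}, a), one per action a
    pi_next    -- target-policy probabilities pi(a | S_{t+1})
    """
    # Expectation over the target policy's actions -- no importance sampling needed
    expected_q = pi_next @ (x_next_all @ w)
    delta = reward + gamma * expected_q - w @ x_sa
    return w + alpha * delta * x_sa

w = expected_sarsa_step(np.zeros(2), np.array([1.0, 0.0]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]),
                        np.array([0.5, 0.5]), reward=1.0)
```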


11.2 Examples of Off-policy Divergence

Even simple off-policy semi-gradient methods can be unstable, with the weights diverging to infinity.

The w-to-2w Counterexample

Consider two states whose feature vectors are the scalars $1$ and $2$, so that their value estimates are $w$ and $2w$ for a single weight $w$. Suppose the first state transitions to the second with reward $0$. The off-policy TD(0) update for this transition is:

$$w_{t+1} = w_t + \alpha \rho_t \big(0 + \gamma\, 2 w_t - w_t\big) \cdot 1 = \big(1 + \alpha \rho_t (2\gamma - 1)\big)\, w_t$$

Divergence Condition

If $\gamma > \tfrac{1}{2}$, the constant $1 + \alpha \rho_t (2\gamma - 1)$ is greater than 1, and repeated updates on this transition alone drive $w$ to $+\infty$ or $-\infty$, depending on its initial sign.
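A quick numerical illustration of the blow-up (step size and discount chosen for illustration):

```python
# Repeatedly applying the w-to-2w update with gamma > 1/2
alpha, gamma, rho = 0.1, 0.9, 1.0
w = 1.0
for _ in range(100):
    delta = 0.0 + gamma * 2.0 * w - w    # reward 0, next state's value is 2w
    w += alpha * rho * delta * 1.0       # feature of the updated state is 1
# Each step multiplies w by 1 + alpha*rho*(2*gamma - 1) = 1.08, so w grows without bound
```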

Baird’s Counterexample

A 7-state MDP with linear features (8 weights for 7 states) in which all rewards are zero. The behavior policy mixes the "dashed" and "solid" actions so that the next-state distribution is uniform over all seven states, while the target policy always chooses the "solid" action, which leads to the 7th state.

graph TD
    S1[s1: 2w1 + w8] -->|solid| S7[s7: w7 + 2w8]
    S2[s2: 2w2 + w8] -->|solid| S7
    S3[s3: 2w3 + w8] -->|solid| S7
    S4[s4: 2w4 + w8] -->|solid| S7
    S5[s5: 2w5 + w8] -->|solid| S7
    S6[s6: 2w6 + w8] -->|solid| S7
    S7 -->|solid| S7
    
    S1 -.->|dashed| S123456
    S2 -.->|dashed| S123456
    S3 -.->|dashed| S123456
    S4 -.->|dashed| S123456
    S5 -.->|dashed| S123456
    S6 -.->|dashed| S123456
    S7 -.->|dashed| S123456
    
    subgraph S123456 [Upper States]
    S1
    S2
    S3
    S4
    S5
    S6
    end
  • Result: Semi-gradient TD(0), and even DP-style expected updates, diverge to infinity for any positive step size $\alpha$, even though the true value function ($v = 0$ everywhere) is exactly representable.
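A numerical sketch of the divergence using expected DP-style sweeps (features, step size, and initialization follow the book's setup; the sweep count here is illustrative):

```python
import numpy as np

# Feature matrix for Baird's counterexample: 7 states, 8 weights
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0       # upper states: value estimate 2*w_{i+1} + w_8
    X[i, 7] = 1.0
X[6, 6] = 1.0           # seventh state: value estimate w_7 + 2*w_8
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # initialization from the book

for _ in range(5000):
    # Expected (DP-style) semi-gradient sweep under the target policy,
    # with states weighted uniformly (the behavior distribution).
    v = X @ w
    delta = gamma * v[6] - v      # all rewards are 0; the solid action leads to state 7
    w = w + (alpha / 7) * X.T @ delta

# The weights grow without bound even though v = 0 is exactly representable
```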

11.3 The Deadly Triad

Stability is jeopardized when three elements are combined:

  1. Function Approximation: Linear or non-linear (ANNs).
  2. Bootstrapping: Targets based on existing estimates (TD, DP).
  3. Off-Policy Learning: Training on a distribution different from the target policy.

The Triad

Any two of these are safe, but the combination of all three often leads to Off-Policy Divergence. We cannot give up function approximation (scalability) or bootstrapping (efficiency), so we must improve off-policy learning methods.


11.4 Linear Value-function Geometry

We can view value functions as vectors in an $|\mathcal{S}|$-dimensional space, with one component per state. Linear approximation restricts the representable functions to a $d$-dimensional subspace ($d \ll |\mathcal{S}|$) spanned by the feature vectors.

Key Operators

  • Bellman Operator ($B_\pi$): takes a value function and produces the expected one-step return: $(B_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v(s')\big]$.
  • Projection Operator ($\Pi$): projects any value function back into the representable subspace: $\Pi v \doteq v_{\mathbf{w}}$ where $\mathbf{w} = \arg\min_{\mathbf{w}} \lVert v - v_{\mathbf{w}} \rVert_\mu^2$.
  • Projection Matrix: for linear features, $\Pi = \mathbf{X}(\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}$, where $\mathbf{X}$ is the feature matrix and $\mathbf{D}$ is diagonal with the state distribution $\mu$ on its diagonal.
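The projection matrix can be computed directly; a toy sketch (states, features, and distribution are illustrative numbers):

```python
import numpy as np

# Toy example: 3 states, 2 features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # feature matrix, one row per state
mu = np.array([0.5, 0.25, 0.25])    # state distribution
D = np.diag(mu)

# Projection matrix onto the representable subspace (mu-weighted least squares)
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v = np.array([1.0, 2.0, 0.0])       # an arbitrary value function
v_proj = Pi @ v                     # closest representable function in mu-norm
```

Note that $\Pi$ is idempotent ($\Pi^2 = \Pi$) and leaves already-representable functions unchanged.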

Error Measures

  1. Mean Square Value Error: $\overline{VE}(\mathbf{w}) \doteq \lVert v_{\mathbf{w}} - v_\pi \rVert_\mu^2$, the distance to the true value function $v_\pi$.
  2. Mean Square Bellman Error: $\overline{BE}(\mathbf{w}) \doteq \lVert B_\pi v_{\mathbf{w}} - v_{\mathbf{w}} \rVert_\mu^2$, the distance between the value function and its image under $B_\pi$.
  3. Mean Square Projected Bellman Error: $\overline{PBE}(\mathbf{w}) \doteq \lVert \Pi (B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}) \rVert_\mu^2$, the distance to the projection of the Bellman error back into the subspace.

The TD Fixed Point

The point where $\overline{PBE}(\mathbf{w}) = 0$, i.e., where $\Pi B_\pi v_{\mathbf{w}} = v_{\mathbf{w}}$, is the TD fixed point $\mathbf{w}_{TD}$.
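In the linear case the TD fixed point can be computed in closed form as $\mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b}$ with $\mathbf{A} = \mathbf{X}^\top \mathbf{D}(\mathbf{I} - \gamma \mathbf{P}_\pi)\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^\top \mathbf{D}\, \mathbf{r}_\pi$. A sketch on a toy chain (all numbers illustrative):

```python
import numpy as np

# Deterministic 3-state cycle under the target policy
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])     # state-transition matrix P_pi
r = np.array([0.0, 0.0, 1.0])       # expected reward on leaving each state
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # linear features
mu = np.full(3, 1/3)                # stationary distribution of the cycle
gamma = 0.9

D = np.diag(mu)
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)        # TD fixed point solves A w = b
```

At this solution the projected Bellman operator leaves $v_{\mathbf{w}}$ unchanged.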


11.5 Gradient Descent in the Bellman Error

Attempting stochastic gradient descent (SGD) directly on the BE yields the following.

Residual-Gradient Algorithm

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \big(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\big)$$

Double Sampling Problem

The BE gradient involves a product of two expectations over the next state. To get an unbiased estimate, one needs two independent samples of the next state from the same state $S_t$. This is possible only in simulated environments (where the state can be reset) or in deterministic systems.
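A sketch of the single-sample (hence biased, unless transitions are deterministic) residual-gradient update with linear features:

```python
import numpy as np

def residual_gradient_step(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99):
    """One single-sample residual-gradient update with linear features.

    An unbiased BE gradient would need the x_next used in `delta` and the
    x_next in the gradient factor to come from two independent samples.
    """
    delta = reward + gamma * w @ x_next - w @ x
    # Unlike the semi-gradient, the next state's value is also differentiated
    return w + alpha * rho * delta * (x - gamma * x_next)

w = residual_gradient_step(np.zeros(2), np.array([1.0, 0.0]),
                           np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```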


11.6 The Bellman Error is Not Learnable

A quantity is learnable if it can be estimated from the observed sequence of features, actions, and rewards.

  • VE is not learnable, but the parameter that minimizes it is learnable (via Monte Carlo).
  • BE is not learnable, and its minimizing parameter is also not learnable. Two different MDPs can produce identical data but have different BE-minimizing solutions (A-presplit example).
  • PBE is learnable and is the target of Gradient-TD Methods.

11.7 Gradient-TD Methods

Stable $O(d)$ methods for minimizing the PBE. They use a second weight vector $\mathbf{v} \in \mathbb{R}^d$ to estimate part of the gradient.

GTD2 Algorithm

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})(\mathbf{x}_t^\top \mathbf{v}_t)$$

TDC (Gradient-TD with Correction)

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \big(\delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} (\mathbf{x}_t^\top \mathbf{v}_t)\big)$$

In both cases the secondary weights are updated by:

$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t (\delta_t - \mathbf{v}_t^\top \mathbf{x}_t)\, \mathbf{x}_t$$

TDC Derivation

The PBE gradient can be written as:

$$\nabla \overline{PBE}(\mathbf{w}) = 2\, \mathbb{E}\big[\rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t)\mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1} \mathbb{E}\big[\rho_t \delta_t \mathbf{x}_t\big]$$

The vector $\mathbf{v}$ learns the rightmost product, $\mathbb{E}[\mathbf{x}_t \mathbf{x}_t^\top]^{-1} \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$, i.e., the least-squares solution of regressing $\rho_t \delta_t$ on the features.
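The TDC update pair can be sketched as follows (hyperparameter values are illustrative):

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, rho,
             alpha=0.005, beta=0.05, gamma=0.99):
    """One TDC (Gradient-TD with correction) update with linear features.

    w -- primary weights
    v -- secondary weights approximating E[x x^T]^{-1} E[rho * delta * x]
    """
    delta = reward + gamma * w @ x_next - w @ x
    w_new = w + alpha * rho * (delta * x - gamma * x_next * (x @ v))
    v_new = v + beta * rho * (delta - v @ x) * x
    return w_new, v_new

w, v = tdc_step(np.zeros(2), np.zeros(2), np.array([1.0, 0.0]),
                np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```

The secondary step size $\beta$ is typically larger than $\alpha$, so $\mathbf{v}$ tracks its target on a faster timescale than $\mathbf{w}$.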


11.8 Emphatic-TD Methods

Reweight updates to mimic an on-policy distribution, ensuring stability.

One-step Emphatic-TD(0)

$$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$$
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$
$$M_t \doteq \gamma \rho_{t-1} M_{t-1} + I_t$$

  • $M_t$: Emphasis, which accumulates how much earlier states were emphasized
  • $I_t$: Interest (user-defined importance of state $S_t$)
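A sketch of one step of this recursion with linear features (function name and numbers are illustrative):

```python
import numpy as np

def emphatic_td0_step(w, M, rho_prev, x, x_next, reward, rho,
                      interest=1.0, alpha=0.01, gamma=0.99):
    """One-step Emphatic-TD(0) with linear features.

    M        -- emphasis carried over from the previous step (start at 0)
    rho_prev -- previous step's importance sampling ratio (start at 1)
    """
    M = gamma * rho_prev * M + interest          # M_t = gamma*rho_{t-1}*M_{t-1} + I_t
    delta = reward + gamma * w @ x_next - w @ x
    w = w + alpha * M * rho * delta * x
    return w, M

w, M = emphatic_td0_step(np.zeros(2), 0.0, 1.0, np.array([1.0, 0.0]),
                         np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```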

Summary

| Problem | Solution | Stable? | O(d)? |
| --- | --- | --- | --- |
| Off-policy divergence | On-policy training or true SGD | Yes | Yes |
| Deadly Triad | Avoid one of the three | Yes | Varies |
| Bellman error | Residual gradient | Yes | No (double sampling) |
| Projected Bellman error | Gradient-TD (TDC/GTD2) | Yes | Yes |
| Distribution mismatch | Emphatic-TD | Yes | Yes |