RL Lecture 7: Off-Policy RL with Approximation

Introduction

The extension of reinforcement learning to the off-policy case with function approximation is significantly harder than the on-policy case. While tabular off-policy methods extend readily to semi-gradient algorithms, these algorithms do not converge as robustly and can often diverge. This note explores the “Deadly Triad” of instability, Baird’s counterexample, and advanced algorithms designed to provide stability.


1. Episodic Semi-Gradient Control (Ch 10.1)

To extend function approximation to control, we use semi-gradient descent to update the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.

Semi-Gradient Sarsa update

The update rule for the weights in semi-gradient Sarsa is:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

Where:

  • $\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the TD error.
  • $\nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the gradient with respect to $\mathbf{w}$.

Pseudocode: Episodic Semi-gradient Sarsa

Definition

Input: a differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Algorithm parameters: step size $\alpha > 0$, small $\varepsilon > 0$
Initialize: weight vector $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)

Loop for each episode:
  Initialize $S$
  Choose $A$ as a function of $\hat{q}(S, \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
  Loop for each step of episode:
    Take action $A$, observe $R, S'$
    If $S'$ is terminal:
      $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
      Go to next episode
    Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
    $S \leftarrow S'$, $A \leftarrow A'$
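As a concrete sketch, the pseudocode can be implemented with linear features. Everything below (the 5-state corridor task, the one-hot feature encoding, step sizes) is an illustrative assumption, not part of the original notes.

```python
import numpy as np

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
N_ACTIONS = 2         # 0 = left, 1 = right
MOVES = (-1, +1)

def feature(s, a):
    """One-hot feature over (state, action) pairs: a linear parameterization
    that happens to be tabular, so the learned values are easy to check."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_hat(w, s, a):
    return w @ feature(s, a)

def eps_greedy(w, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(w, s, a) for a in range(N_ACTIONS)]))

def semigradient_sarsa(alpha=0.1, eps=0.1, gamma=1.0, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = eps_greedy(w, s, eps, rng)
        while True:
            s2 = max(0, s + MOVES[a])          # wall on the left
            r = -1.0                           # -1 per step until termination
            x = feature(s, a)
            if s2 == N_STATES - 1:             # terminal transition
                w += alpha * (r - q_hat(w, s, a)) * x
                break
            a2 = eps_greedy(w, s2, eps, rng)
            w += alpha * (r + gamma * q_hat(w, s2, a2) - q_hat(w, s, a)) * x
            s, a = s2, a2
    return w

w = semigradient_sarsa()
print(round(q_hat(w, 3, 1), 2))   # q(3, right) ≈ -1.0: the final step always earns -1
```

Because the features are one-hot, this reduces exactly to tabular Sarsa, which makes it easy to sanity-check the learned values.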


2. Semi-Gradient Off-Policy Methods (Ch 11.1)

Off-policy methods learn a value function for a target policy $\pi$ from experience generated by a different behavior policy $b$.

Importance Sampling Ratio

To account for the discrepancy between $\pi$ and $b$, we use the per-step importance sampling ratio:

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

Off-Policy Semi-Gradient TD(0) update

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$

Where $\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$.
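A single update step makes the ratio concrete. The policies, features, and numbers below are made-up illustrations, not values from the notes.

```python
import numpy as np

# Hypothetical two-action setup: the target policy always goes right,
# the behavior policy is uniform over both actions.
pi = {"left": 0.0, "right": 1.0}
b  = {"left": 0.5, "right": 0.5}

alpha, gamma = 0.1, 0.9
w  = np.array([0.5, -0.2])           # weight vector
x  = np.array([1.0, 0.0])            # features of S_t
x2 = np.array([0.0, 1.0])            # features of S_{t+1}
a, r = "right", 1.0                  # action taken under b, observed reward

rho = pi[a] / b[a]                   # importance sampling ratio: 2.0
delta = r + gamma * (w @ x2) - (w @ x)   # TD error: 1 + 0.9*(-0.2) - 0.5 = 0.32
w = w + alpha * rho * delta * x          # only S_t's features move
print(rho, round(delta, 2), w)
```

Had the sampled action been "left", $\rho$ would be 0 and the update would vanish, which is exactly how the target policy's distribution is respected.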


3. Examples of Off-Policy Divergence (Ch 11.2)

Instability in off-policy learning can occur even in simple MDPs with linear function approximation.

Simple 1D Example

Consider two states whose estimated values under linear function approximation are $w$ and $2w$, where $w$ is a single shared weight. A transition from the first state to the second with reward 0 gives TD error $\delta = 0 + \gamma \cdot 2w - w = (2\gamma - 1)w$. The semi-gradient update on this transition is $w \leftarrow w + \alpha \delta = \big(1 + \alpha(2\gamma - 1)\big)w$. Off-policy, this transition can be updated repeatedly without the offsetting updates that following the behavior policy would provide, so whenever $\gamma > 1/2$ the weight is multiplied by a constant greater than 1 each time and diverges.
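The instability is easy to reproduce numerically; the step size and discount below are arbitrary illustrative choices satisfying $2\gamma - 1 > 0$.

```python
# w -> 2w transition: each repeated off-policy update multiplies w by
# (1 + alpha*(2*gamma - 1)) = 1.08, so w grows geometrically.
gamma, alpha = 0.9, 0.1
w = 1.0
for _ in range(100):
    delta = 0.0 + gamma * (2 * w) - w     # TD error on the w -> 2w transition
    w += alpha * delta * 1.0              # gradient of v_hat(first state) is 1
print(w)                                  # roughly 2200 after 100 updates
```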

Baird’s Counterexample

This is a famous 7-state episodic MDP designed to show divergence.

The Setup

  • States: 6 “upper” states and 1 “lower” state (state 7).
  • Actions:
    • Dashed action: Transitions to one of the 6 upper states with equal probability.
    • Solid action: Transitions to the 7th state.
  • Policies:
    • Behavior policy $b$: selects the dashed action with probability $6/7$ and the solid action with probability $1/7$, so that the next-state distribution is uniform.
    • Target policy $\pi$: always takes the solid action.
  • Features: linear function approximation with 8 weights for 7 states: $\hat{v}(i) = 2w_i + w_8$ for each upper state $i = 1, \dots, 6$, and $\hat{v}(7) = w_7 + 2w_8$. All rewards are zero and $\gamma = 0.99$.
  • Result: semi-gradient TD(0), and even semi-gradient DP with expected updates, diverge for any positive step size $\alpha$: $\|\mathbf{w}\|$ grows without bound.
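The divergence can be reproduced with a short simulation. The features, policies, and $\gamma$ follow the counterexample; the i.i.d. uniform state sampling (which matches the stationary distribution of $b$ here) and the step size are implementation choices.

```python
import numpy as np

GAMMA, ALPHA, STEPS = 0.99, 0.01, 5000

# Features: v_hat(upper i) = 2*w_i + w_8, v_hat(lower) = w_7 + 2*w_8.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0                          # a conventional initialization

for _ in range(STEPS):
    s = int(rng.integers(7))         # stationary distribution of b is uniform
    solid = rng.random() < 1 / 7     # b: dashed w.p. 6/7, solid w.p. 1/7
    s2 = 6 if solid else int(rng.integers(6))
    rho = 7.0 if solid else 0.0      # pi always takes solid, so rho = 1/(1/7) or 0
    delta = 0.0 + GAMMA * (X[s2] @ w) - X[s] @ w    # all rewards are zero
    w += ALPHA * rho * delta * X[s]                 # semi-gradient TD(0)

print(np.abs(w).max())               # keeps growing as STEPS increases
```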

4. The Deadly Triad (Ch 11.3)

Instability arises when three specific elements are combined. Giving up any one removes the danger of divergence.

The Deadly Triad

  1. Function Approximation: Using non-tabular representations (linear or ANNs).
  2. Bootstrapping: Updating estimates based on other estimates (TD, DP).
  3. Off-policy Training: Training on a distribution different from the target policy’s.
  • Any two together are fine (e.g., on-policy learning with bootstrapping and function approximation is generally stable).
  • All three together cause potential divergence in even the simplest cases.

5. Linear Value-Function Geometry (Ch 11.4)

Understanding the error surfaces in high-dimensional value function space.

Projection Operator ($\Pi$)

The operator that maps an arbitrary value function $v$ to the closest representable function $v_{\mathbf{w}}$ in the weighted norm $\|v\|_\mu^2 \doteq \sum_s \mu(s)\, v(s)^2$:

$$\Pi v \doteq v_{\mathbf{w}}, \quad \text{where} \quad \mathbf{w} \doteq \arg\min_{\mathbf{w} \in \mathbb{R}^d} \| v - v_{\mathbf{w}} \|_\mu^2$$

Error Measures

  • Value Error: $\overline{VE}(\mathbf{w}) \doteq \| v_{\mathbf{w}} - v_\pi \|_\mu^2$, the distance to the true value function $v_\pi$.
  • Bellman Error: $\overline{BE}(\mathbf{w}) \doteq \| \bar{\delta}_{\mathbf{w}} \|_\mu^2$, where $\bar{\delta}_{\mathbf{w}}(s) \doteq \mathbb{E}_\pi[\delta_t \mid S_t = s]$ is the expectation of the TD error.
  • Projected Bellman Error: $\overline{PBE}(\mathbf{w}) \doteq \| \Pi \bar{\delta}_{\mathbf{w}} \|_\mu^2$, the Bellman error projected back into the representable subspace.

Intuition

For linear function approximation, the TD fixed point $\mathbf{w}_{\text{TD}}$ is the unique point where the PBE is exactly zero, even though the (unprojected) Bellman error there is generally nonzero.
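The geometry can be checked numerically. The 3-state chain, features, and rewards below are a made-up example; the projection matrix and the TD fixed point follow the standard linear-FA formulas $\Pi = X(X^\top D X)^{-1} X^\top D$ and $A\mathbf{w} = b$.

```python
import numpy as np

# Made-up 3-state deterministic cycle with 2 linear features per state.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])                    # feature matrix (states x features)
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])               # transitions under pi: 0->1->2->0
r = np.array([0.0, 0.0, 1.0])                 # expected reward on leaving each state
gamma = 0.9
mu = np.full(3, 1 / 3)                        # stationary distribution (uniform cycle)
D = np.diag(mu)

# Projection onto the representable subspace, weighted by mu:
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

# TD fixed point: solve A w = b with A = X^T D (I - gamma P) X, b = X^T D r.
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)

v = X @ w_td
bellman_error = r + gamma * P @ v - v         # expected TD error per state
pbe = Pi @ bellman_error                      # projected Bellman error vector
print(np.round(pbe, 12))                      # all (numerically) zero
```

Note that `bellman_error` itself is nonzero at the fixed point; only its projection into the representable subspace vanishes.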


6. Gradient Descent in Bellman Error (Ch 11.5)

Naive minimization of the Bellman error does not work well: its gradient cannot be estimated from single-sample transitions without two independent ("double") samples of the next state.

Residual-Gradient Algorithm

A true SGD algorithm for minimizing the mean squared Bellman error (MSBE):

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \left( \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \right)$$

Problem: to be unbiased, this requires two independent samples of $S_{t+1}$ from $S_t$ (one inside $\delta_t$, one inside the gradient term). The "naive" residual-gradient algorithm, which reuses the same sample in both places, minimizes the TD error rather than the Bellman error and can converge to the wrong values (e.g., the A-split example).
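One naive residual-gradient step with linear features, using made-up numbers; the same sampled next state appears in both $\delta$ and the gradient term, which is exactly the double-sampling problem.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
w  = np.array([0.5, -0.2])
x  = np.array([1.0, 0.0])     # features of S_t
x2 = np.array([0.0, 1.0])     # features of the single sampled S_{t+1}
r = 1.0

delta = r + gamma * (w @ x2) - (w @ x)    # TD error: 0.32
# Residual gradient descends on delta^2, so the update direction is
# (x - gamma*x2) rather than semi-gradient TD's plain x:
w = w + alpha * delta * (x - gamma * x2)
print(np.round(w, 4))                     # [0.532, -0.2288]
```

The extra $-\gamma \mathbf{x}_{t+1}$ term is what distinguishes this from the semi-gradient update; an unbiased version would need an independent second draw of $S_{t+1}$ for that term.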


7. Bellman Error is Not Learnable (Ch 11.6)

The Bellman Error (BE) cannot be estimated from state-to-state transition data alone; it requires knowledge of the underlying MDP dynamics.

Learnability

A quantity is learnable if it can be determined from the observed distribution of experience (states, actions, rewards).

  • VE is not learnable, but its minimizer is (it’s the same as the Return Error minimizer).
  • BE is not learnable, and its minimizer is also not learnable from experience alone.

8. Gradient-TD Methods (Ch 11.7)

These methods minimize the Projected Bellman Error (PBE) and are true SGD methods with $O(d)$ per-step complexity. They use a second weight vector $\mathbf{v} \in \mathbb{R}^d$ to estimate a part of the gradient.

GTD2

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \right) \left( \mathbf{x}_t^\top \mathbf{v}_t \right)$$

(linear case, with $\mathbf{x}_t \doteq \mathbf{x}(S_t)$ the feature vector and $\mathbf{v}_t$ the second weight vector, updated on a faster time-scale).

TDC (Temporal-Difference with Correction)

Also known as GTD(0).

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} (\mathbf{x}_t^\top \mathbf{v}_t) \right)$$

with the secondary update (shared with GTD2):

$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t \left( \delta_t - \mathbf{v}_t^\top \mathbf{x}_t \right) \mathbf{x}_t$$

Key Properties

  • Convergence: guaranteed to the TD fixed point for linear FA, even off-policy.
  • Two time-scales: $\beta$ (the "fast" scale, for $\mathbf{v}$) should be larger than $\alpha$ (the "slow" scale, for $\mathbf{w}$).
  • Complexity: $O(d)$ per step, the same as standard TD.
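As a closing sketch, TDC can be run on Baird's counterexample, where semi-gradient TD(0) diverges. The step sizes ($\alpha = 0.005$, $\beta = 0.05$) are common illustrative choices for this task, and the i.i.d. uniform state sampling (matching $b$'s stationary distribution) is an implementation simplification.

```python
import numpy as np

GAMMA, ALPHA, BETA, STEPS = 0.99, 0.005, 0.05, 20000

# Baird's features: v_hat(upper i) = 2*w_i + w_8, v_hat(lower) = w_7 + 2*w_8.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0
v = np.zeros(8)                      # second ("correction") weight vector

for _ in range(STEPS):
    s = int(rng.integers(7))         # stationary distribution of b is uniform
    solid = rng.random() < 1 / 7
    s2 = 6 if solid else int(rng.integers(6))
    rho = 7.0 if solid else 0.0      # pi always takes the solid action
    x, x2 = X[s], X[s2]
    delta = GAMMA * (x2 @ w) - x @ w                        # all rewards are zero
    w += ALPHA * rho * (delta * x - GAMMA * x2 * (x @ v))   # TDC main update
    v += BETA * rho * (delta - v @ x) * x                   # fast secondary update

print(np.abs(w).max())               # stays bounded, unlike semi-gradient TD(0)
```

The correction term $-\gamma \mathbf{x}_{t+1}(\mathbf{x}_t^\top \mathbf{v}_t)$ is what removes the positive feedback that drives the semi-gradient version to infinity on this task.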

Created for the RL course vault.