RL Lecture 7: Off-Policy RL with Approximation
Introduction
The extension of reinforcement learning to the off-policy case with function approximation is significantly harder than the on-policy case. While tabular off-policy methods extend readily to semi-gradient algorithms, these algorithms do not converge as robustly and can often diverge. This note explores the “Deadly Triad” of instability, Baird’s counterexample, and advanced algorithms designed to provide stability.
1. Episodic Semi-Gradient Control (Ch 10.1)
To extend function approximation to control, we use semi-gradient descent to update the parameterized action-value function $\hat{q}(s, a, \mathbf{w})$.
Semi-Gradient Sarsa update
The update rule for the weights in semi-gradient Sarsa is:
Formula
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
Where:
- $\delta_t \doteq R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the TD error.
- $\nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the gradient with respect to $\mathbf{w}$.
Pseudocode: Episodic Semi-gradient Sarsa
Definition
Input: a differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Algorithm parameters: step size $\alpha > 0$, small $\varepsilon > 0$
Initialize: weight vector $\mathbf{w} \in \mathbb{R}^d$ arbitrarily
Loop for each episode:
    Initialize $S$
    Choose $A$ as a function of $\hat{q}(S, \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
    Loop for each step of episode:
        Take action $A$, observe $R$, $S'$
        If $S'$ is terminal:
            $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
            Go to next episode
        Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma\, \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
        $S \leftarrow S'$; $A \leftarrow A'$
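As a concrete sketch of the inner-loop update (assuming linear features, so the gradient of $\hat{q}$ is simply the feature vector; the function name is my own):

```python
import numpy as np

def semi_gradient_sarsa_update(w, x_sa, r, x_next_sa, alpha, gamma, terminal=False):
    """One semi-gradient Sarsa step for linear q-hat(s, a, w) = w . x(s, a).

    x_sa and x_next_sa are feature vectors for (S, A) and (S', A').
    For linear function approximation, grad_w q-hat = x(s, a).
    """
    target = r if terminal else r + gamma * float(w @ x_next_sa)
    delta = target - float(w @ x_sa)   # TD error
    # "Semi-gradient": the target is treated as a constant (no gradient through it).
    return w + alpha * delta * x_sa
```

Note that on a terminal transition the target is just the reward, matching the terminal branch of the pseudocode.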
2. Semi-Gradient Off-Policy Methods (Ch 11.1)
Off-policy methods learn a target policy $\pi$ from experience generated by a different behavior policy $b$.
Importance Sampling Ratio
To account for the discrepancy between $\pi$ and $b$, we use the per-step importance sampling ratio:
Formula
$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$
Off-Policy Semi-Gradient TD(0) update
Formula
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \rho_t\, \delta_t\, \nabla \hat{v}(S_t, \mathbf{w}_t)$$
Where $\delta_t \doteq R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$.
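A minimal sketch of this update for linear features (function name is mine; the gradient of $\hat{v}$ is the feature vector):

```python
import numpy as np

def off_policy_td0_update(w, x, r, x_next, rho, alpha, gamma):
    """Off-policy semi-gradient TD(0) for linear v-hat(s, w) = w . x(s).

    rho is the per-step importance sampling ratio pi(A|S) / b(A|S);
    for linear FA the gradient of v-hat is the feature vector x.
    """
    delta = r + gamma * float(w @ x_next) - float(w @ x)  # TD error
    return w + alpha * rho * delta * x                    # update scaled by rho
```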
3. Examples of Off-Policy Divergence (Ch 11.2)
Instability in off-policy learning can occur even in simple MDPs with linear function approximation.
Simple 1D Example
Consider two states whose estimated values are $w$ and $2w$, where $w$ is a single shared parameter. A transition from the first state to the second with reward 0 gives TD error $\delta = 0 + \gamma \cdot 2w - w = (2\gamma - 1)w$. Updating with $w \leftarrow w + \alpha \rho \delta$ (the gradient at the first state is 1) multiplies $w$ by $1 + \alpha\rho(2\gamma - 1)$. Since both states' values depend on $w$, the error in the next step will be larger, leading to divergence if $\gamma > 0.5$.
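This instability is easy to reproduce numerically. The sketch below (names and constants are my own) repeatedly applies the off-policy TD(0) update to just this one transition:

```python
def two_state_divergence(w0, alpha, gamma, rho, steps):
    """Repeated off-policy TD(0) update on the w -> 2w transition (reward 0).

    Each update multiplies w by (1 + alpha * rho * (2 * gamma - 1)),
    so |w| blows up whenever gamma > 0.5 (for w0 != 0).
    """
    w = w0
    for _ in range(steps):
        delta = 0.0 + gamma * (2 * w) - w   # TD error: next state's value is 2w
        w += alpha * rho * delta * 1.0      # gradient of v(s1) w.r.t. w is 1
    return w
```

With $\gamma = 0.9$ the parameter explodes geometrically; with $\gamma = 0.4$ the same loop contracts toward zero.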
Baird’s Counterexample
This is a famous 7-state episodic MDP designed to show divergence.
The Setup
- States: 6 “upper” states and 1 “lower” state (state 7).
- Actions:
- Dashed action: Transitions to one of the 6 upper states with equal probability.
- Solid action: Transitions to the 7th state.
- Policies:
- Behavior policy $b$: selects the dashed action with probability $6/7$ and the solid action with probability $1/7$, so that the next-state distribution is uniform over all seven states.
- Target policy $\pi$: always takes the solid action.
- Features: Linear function approximation with 8 weights for the 7 states: $\hat{v}(i, \mathbf{w}) = 2w_i + w_8$ for the upper states $i = 1, \dots, 6$, and $\hat{v}(7, \mathbf{w}) = w_7 + 2w_8$ for the lower state.
- Result: Semi-gradient TD(0) and even DP updates diverge to infinity for any positive step size $\alpha$ (with $\gamma = 0.99$), as $\|\mathbf{w}\|$ grows unboundedly.
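The divergence can be observed in a small simulation (my own sketch of the standard setup; the initial weight vector $(1,1,1,1,1,1,10,1)$ follows the textbook figure):

```python
import numpy as np

def bairds_counterexample(steps=2000, alpha=0.01, gamma=0.99, seed=0):
    """Semi-gradient off-policy TD(0) on Baird's 7-state counterexample."""
    # Features: upper states i have v = 2*w_i + w_8; the lower state v = w_7 + 2*w_8.
    X = np.zeros((7, 8))
    for i in range(6):
        X[i, i] = 2.0
        X[i, 7] = 1.0
    X[6, 6] = 1.0
    X[6, 7] = 2.0
    w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0])
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        s = rng.integers(7)             # behavior's state distribution is uniform
        if rng.random() < 6.0 / 7.0:    # dashed action: pi(dashed|s) = 0, so rho = 0
            continue
        rho = 7.0                       # solid action: rho = 1 / (1/7)
        delta = 0.0 + gamma * float(w @ X[6]) - float(w @ X[s])  # solid goes to state 7
        w = w + alpha * rho * delta * X[s]
    return w
```

Running this, the weight norm grows by several orders of magnitude instead of settling.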
4. The Deadly Triad (Ch 11.3)
Instability arises when three specific elements are combined. Giving up any one removes the danger of divergence.
The Deadly Triad
- Function Approximation: Using non-tabular representations (linear or ANNs).
- Bootstrapping: Updating estimates based on other estimates (TD, DP).
- Off-policy Training: Training on a distribution different from the target policy’s.
- Any two together are fine (e.g., on-policy learning with bootstrapping and function approximation is generally stable).
- All three together cause potential divergence in even the simplest cases.
5. Linear Value-Function Geometry (Ch 11.4)
Understanding the error surfaces in high-dimensional value function space.
Projection Operator ($\Pi$)
The operator that maps an arbitrary value function $v$ to the representable function closest to it in the weighted norm $\|\cdot\|_\mu^2$:
$$\Pi v \doteq v_{\mathbf{w}}, \quad \text{where} \quad \mathbf{w} \doteq \operatorname*{arg\,min}_{\mathbf{w}} \|v - v_{\mathbf{w}}\|_\mu^2$$
Error Measures
- Value Error (VE): distance to the true value function: $\overline{\mathrm{VE}}(\mathbf{w}) \doteq \|v_{\mathbf{w}} - v_\pi\|_\mu^2$.
- Bellman Error (BE): the norm of the expected TD error $\bar{\delta}_{\mathbf{w}} \doteq B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}$: $\overline{\mathrm{BE}}(\mathbf{w}) \doteq \|\bar{\delta}_{\mathbf{w}}\|_\mu^2$.
- Projected Bellman Error (PBE): the Bellman error projected back into the representable subspace: $\overline{\mathrm{PBE}}(\mathbf{w}) \doteq \|\Pi \bar{\delta}_{\mathbf{w}}\|_\mu^2$.
Intuition
The TD fixed point is the point where the PBE is exactly zero.
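This claim can be checked numerically. The sketch below (a made-up two-state Markov reward process with a single feature) solves for the linear TD fixed point and verifies that the projected Bellman error vanishes there while the Bellman error itself does not:

```python
import numpy as np

# A tiny two-state Markov reward process with one linear feature (values: w, 2w).
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])              # transition probabilities under pi
r = np.array([1.0, 0.0])                # expected rewards per state
gamma = 0.9
X = np.array([[1.0],
              [2.0]])                   # feature matrix (rows = states)

# Stationary distribution mu of P (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
D = np.diag(mu)

# Linear TD fixed point: solve X^T D (X - gamma P X) w = X^T D r.
A = X.T @ D @ (X - gamma * P @ X)
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)

# Bellman error vector at w_td and its projection onto span(X).
v = (X @ w_td).ravel()
bellman_error = r + gamma * P @ v - v
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D   # projection matrix under mu
pbe = Pi @ bellman_error                        # zero at the TD fixed point
```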
6. Gradient Descent in Bellman Error (Ch 11.5)
Naive minimization of the Bellman Error doesn’t work well because it is not easily learnable from samples without independent “double sampling” of the next state.
Residual-Gradient Algorithm
A true SGD algorithm for minimizing the MSBE.
Formula
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \rho_t\, \delta_t\, \left[ \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma\, \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \right]$$
Problem: Requires two independent samples of $S_{t+1}$ from each $S_t$ to be unbiased ("double sampling"). The "naive" residual gradient (using the same sample twice) leads to poor solutions (e.g., the A-split example).
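A sketch of the per-sample update (function name is mine; pass two independently drawn next-state feature vectors for the unbiased version, or the same vector twice for the naive one):

```python
import numpy as np

def residual_gradient_update(w, x, r, x_next_a, x_next_b, rho, alpha, gamma):
    """Residual-gradient step for linear v-hat = w . x, descending the squared
    Bellman/TD error.  x_next_a and x_next_b are features of two samples of the
    next state: independent draws give an unbiased gradient (double sampling);
    reusing one draw for both gives the biased, naive version.
    """
    delta = r + gamma * float(w @ x_next_a) - float(w @ x)  # TD error (sample A)
    grad_term = x - gamma * x_next_b                        # gradient factor (sample B)
    return w + alpha * rho * delta * grad_term
```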
7. Bellman Error is Not Learnable (Ch 11.6)
The Bellman Error (BE) cannot be estimated from state-to-state transition data alone; it requires knowledge of the underlying MDP dynamics.
Learnability
A quantity is learnable if it can be determined from the observed distribution of experience (states, actions, rewards).
- VE is not learnable, but its minimizer is (it’s the same as the Return Error minimizer).
- BE is not learnable, and its minimizer is also not learnable from experience alone.
8. Gradient-TD Methods (Ch 11.7)
These methods minimize the Projected Bellman Error (PBE) and are true SGD methods with $O(d)$ complexity. They use a second weight vector $\mathbf{v} \in \mathbb{R}^d$ to estimate a part of the gradient.
GTD2
Formula
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \rho_t\, (\mathbf{x}_t - \gamma\, \mathbf{x}_{t+1})\, \mathbf{x}_t^\top \mathbf{v}_t$$
with the secondary update
$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta\, \rho_t\, (\delta_t - \mathbf{v}_t^\top \mathbf{x}_t)\, \mathbf{x}_t$$
TDC (Temporal-Difference with Correction)
Also known as GTD(0).
Formula
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \rho_t\, (\delta_t\, \mathbf{x}_t - \gamma\, \mathbf{x}_{t+1}\, \mathbf{x}_t^\top \mathbf{v}_t)$$
using the same secondary update for $\mathbf{v}$ as GTD2.
Key Properties
- Convergence: Guaranteed to the TD fixed point for linear FA even off-policy.
- Two Time-Scale: $\beta$ (the "fast" step size for the secondary weights $\mathbf{v}$) should be larger than $\alpha$ (the "slow" step size for $\mathbf{w}$).
- Complexity: $O(d)$ per step, same as standard TD.
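A minimal sketch of one TDC step with linear features (function name is mine):

```python
import numpy as np

def tdc_update(w, v, x, r, x_next, rho, alpha, beta, gamma):
    """One TDC / GTD(0) step with linear features.

    w: primary weights for v-hat(s) = w . x(s).
    v: secondary weights estimating the expected TD error as a function of x.
    Two time-scales: beta (for v) should exceed alpha (for w).
    """
    delta = r + gamma * float(w @ x_next) - float(w @ x)            # TD error
    w_new = w + alpha * rho * (delta * x - gamma * x_next * float(x @ v))
    v_new = v + beta * rho * (delta - float(x @ v)) * x
    return w_new, v_new
```

Both updates touch only $d$-vectors, so the per-step cost stays $O(d)$.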
Created for the RL course vault.