RL Lecture 7: Off-Policy RL with Approximation

Introduction

The extension of reinforcement learning to the off-policy case with function approximation is significantly harder than the on-policy case. While tabular off-policy methods extend readily to semi-gradient algorithms, these algorithms do not converge as robustly and can often diverge. This note explores the “Deadly Triad” of instability, Baird’s counterexample, and advanced algorithms designed to provide stability.


1. Episodic Semi-Gradient Control (Ch 10.1)

To extend function approximation to control, we use semi-gradient descent to update the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.

Semi-Gradient Sarsa update

The update rule for the weights in semi-gradient Sarsa is:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

Where:

  • $\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the TD error.
  • $\nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$ is the gradient with respect to $\mathbf{w}$.

Pseudocode: Episodic Semi-gradient Sarsa

Definition

Input: a differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Algorithm parameters: step size $\alpha > 0$, small $\varepsilon > 0$
Initialize: weight vector $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)

Loop for each episode:
  Initialize $S$
  Choose $A$ as a function of $\hat{q}(S, \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
  Loop for each step of episode:
    Take action $A$, observe $R, S'$
    If $S'$ is terminal:
      $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
      Go to next episode
    Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha [R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w})] \nabla \hat{q}(S, A, \mathbf{w})$
    $S \leftarrow S'$, $A \leftarrow A'$
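As a concrete sketch, the pseudocode can be implemented with linear features. Everything below (the 5-state corridor task, the one-hot feature encoding, step sizes) is an illustrative assumption, not part of the original notes.

```python
import numpy as np

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
N_ACTIONS = 2         # 0 = left, 1 = right
MOVES = (-1, +1)

def feature(s, a):
    """One-hot feature over (state, action) pairs: a linear parameterization
    that happens to be tabular, so the learned values are easy to check."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_hat(w, s, a):
    return w @ feature(s, a)

def eps_greedy(w, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(w, s, a) for a in range(N_ACTIONS)]))

def semigradient_sarsa(alpha=0.1, eps=0.1, gamma=1.0, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = eps_greedy(w, s, eps, rng)
        while True:
            s2 = max(0, s + MOVES[a])          # wall on the left
            r = -1.0                           # -1 per step until termination
            x = feature(s, a)
            if s2 == N_STATES - 1:             # terminal transition
                w += alpha * (r - q_hat(w, s, a)) * x
                break
            a2 = eps_greedy(w, s2, eps, rng)
            w += alpha * (r + gamma * q_hat(w, s2, a2) - q_hat(w, s, a)) * x
            s, a = s2, a2
    return w

w = semigradient_sarsa()
print(round(q_hat(w, 3, 1), 2))   # q(3, right) ≈ -1.0: the final step always earns -1
```

Because the features are one-hot, this reduces exactly to tabular Sarsa, which makes it easy to sanity-check the learned values.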


2. Semi-Gradient Off-Policy Methods (Ch 11.1)

Off-policy methods learn a value function for a target policy $\pi$ from experience generated by a different behavior policy $b$.

Importance Sampling Ratio

To account for the discrepancy between $\pi$ and $b$, we use the per-step importance sampling ratio:

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

Off-Policy Semi-Gradient TD(0) update

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$

Where $\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$.
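A single update step makes the ratio concrete. The policies, features, and numbers below are made-up illustrations, not values from the notes.

```python
import numpy as np

# Hypothetical two-action setup: the target policy always goes right,
# the behavior policy is uniform over both actions.
pi = {"left": 0.0, "right": 1.0}
b  = {"left": 0.5, "right": 0.5}

alpha, gamma = 0.1, 0.9
w  = np.array([0.5, -0.2])           # weight vector
x  = np.array([1.0, 0.0])            # features of S_t
x2 = np.array([0.0, 1.0])            # features of S_{t+1}
a, r = "right", 1.0                  # action taken under b, observed reward

rho = pi[a] / b[a]                   # importance sampling ratio: 2.0
delta = r + gamma * (w @ x2) - (w @ x)   # TD error: 1 + 0.9*(-0.2) - 0.5 = 0.32
w = w + alpha * rho * delta * x          # only S_t's features move
print(rho, round(delta, 2), w)
```

Had the sampled action been "left", $\rho$ would be 0 and the update would vanish, which is exactly how the target policy's distribution is respected.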


3. Examples of Off-Policy Divergence (Ch 11.2)

Instability in off-policy learning can occur even in simple MDPs with linear function approximation.

Simple 1D Example

Consider two states whose estimated values under linear function approximation are $w$ and $2w$, where $w$ is a single shared weight. A transition from the first state to the second with reward 0 gives TD error $\delta = 0 + \gamma \cdot 2w - w = (2\gamma - 1)w$. The semi-gradient update on this transition is $w \leftarrow w + \alpha \delta = \big(1 + \alpha(2\gamma - 1)\big)w$. Off-policy, this transition can be updated repeatedly without the offsetting updates that following the behavior policy would provide, so whenever $\gamma > 1/2$ the weight is multiplied by a constant greater than 1 each time and diverges.
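The instability is easy to reproduce numerically; the step size and discount below are arbitrary illustrative choices satisfying $2\gamma - 1 > 0$.

```python
# w -> 2w transition: each repeated off-policy update multiplies w by
# (1 + alpha*(2*gamma - 1)) = 1.08, so w grows geometrically.
gamma, alpha = 0.9, 0.1
w = 1.0
for _ in range(100):
    delta = 0.0 + gamma * (2 * w) - w     # TD error on the w -> 2w transition
    w += alpha * delta * 1.0              # gradient of v_hat(first state) is 1
print(w)                                  # roughly 2200 after 100 updates
```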

Baird’s Counterexample

This is a famous 7-state episodic MDP designed to show divergence.

The Setup

  • States: 6 “upper” states and 1 “lower” state (state 7).
  • Actions:
    • Dashed action: Transitions to one of the 6 upper states with equal probability.
    • Solid action: Transitions to the 7th state.
  • Policies:
    • Behavior policy $b$: selects the dashed action with probability $6/7$ and the solid action with probability $1/7$, so that the next-state distribution is uniform.
    • Target policy $\pi$: always takes the solid action.
  • Features: linear function approximation with 8 weights for 7 states: $\hat{v}(i) = 2w_i + w_8$ for each upper state $i = 1, \dots, 6$, and $\hat{v}(7) = w_7 + 2w_8$. All rewards are zero and $\gamma = 0.99$.
  • Result: semi-gradient TD(0), and even semi-gradient DP with expected updates, diverge for any positive step size $\alpha$: $\|\mathbf{w}\|$ grows without bound.
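The divergence can be reproduced with a short simulation. The features, policies, and $\gamma$ follow the counterexample; the i.i.d. uniform state sampling (which matches the stationary distribution of $b$ here) and the step size are implementation choices.

```python
import numpy as np

GAMMA, ALPHA, STEPS = 0.99, 0.01, 5000

# Features: v_hat(upper i) = 2*w_i + w_8, v_hat(lower) = w_7 + 2*w_8.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0                          # a conventional initialization

for _ in range(STEPS):
    s = int(rng.integers(7))         # stationary distribution of b is uniform
    solid = rng.random() < 1 / 7     # b: dashed w.p. 6/7, solid w.p. 1/7
    s2 = 6 if solid else int(rng.integers(6))
    rho = 7.0 if solid else 0.0      # pi always takes solid, so rho = 1/(1/7) or 0
    delta = 0.0 + GAMMA * (X[s2] @ w) - X[s] @ w    # all rewards are zero
    w += ALPHA * rho * delta * X[s]                 # semi-gradient TD(0)

print(np.abs(w).max())               # keeps growing as STEPS increases
```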

4. The Deadly Triad (Ch 11.3)

Instability arises when three specific elements are combined. Giving up any one removes the danger of divergence.

The Deadly Triad

  1. Function Approximation: Using non-tabular representations (linear or ANNs).
  2. Bootstrapping: Updating estimates based on other estimates (TD, DP).
  3. Off-policy Training: Training on a distribution different from the target policy’s.
  • Any two together are fine (e.g., on-policy learning with bootstrapping and function approximation is generally stable).
  • All three together cause potential divergence in even the simplest cases.

5. Linear Value-Function Geometry (Ch 11.4)

Understanding the error surfaces in high-dimensional value function space.

Projection Operator ($\Pi$)

The operator that maps an arbitrary value function $v$ to the closest representable function $v_{\mathbf{w}}$ in the weighted norm $\|v\|_\mu^2 \doteq \sum_s \mu(s)\, v(s)^2$:

$$\Pi v \doteq v_{\mathbf{w}}, \quad \text{where} \quad \mathbf{w} \doteq \arg\min_{\mathbf{w} \in \mathbb{R}^d} \| v - v_{\mathbf{w}} \|_\mu^2$$

Error Measures

  • Value Error: $\overline{VE}(\mathbf{w}) \doteq \| v_{\mathbf{w}} - v_\pi \|_\mu^2$, the distance to the true value function $v_\pi$.
  • Bellman Error: $\overline{BE}(\mathbf{w}) \doteq \| \bar{\delta}_{\mathbf{w}} \|_\mu^2$, where $\bar{\delta}_{\mathbf{w}}(s) \doteq \mathbb{E}_\pi[\delta_t \mid S_t = s]$ is the expectation of the TD error.
  • Projected Bellman Error: $\overline{PBE}(\mathbf{w}) \doteq \| \Pi \bar{\delta}_{\mathbf{w}} \|_\mu^2$, the Bellman error projected back into the representable subspace.

Intuition

For linear function approximation, the TD fixed point $\mathbf{w}_{\text{TD}}$ is the unique point where the PBE is exactly zero, even though the (unprojected) Bellman error there is generally nonzero.
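The geometry can be checked numerically. The 3-state chain, features, and rewards below are a made-up example; the projection matrix and the TD fixed point follow the standard linear-FA formulas $\Pi = X(X^\top D X)^{-1} X^\top D$ and $A\mathbf{w} = b$.

```python
import numpy as np

# Made-up 3-state deterministic cycle with 2 linear features per state.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])                    # feature matrix (states x features)
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])               # transitions under pi: 0->1->2->0
r = np.array([0.0, 0.0, 1.0])                 # expected reward on leaving each state
gamma = 0.9
mu = np.full(3, 1 / 3)                        # stationary distribution (uniform cycle)
D = np.diag(mu)

# Projection onto the representable subspace, weighted by mu:
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

# TD fixed point: solve A w = b with A = X^T D (I - gamma P) X, b = X^T D r.
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)

v = X @ w_td
bellman_error = r + gamma * P @ v - v         # expected TD error per state
pbe = Pi @ bellman_error                      # projected Bellman error vector
print(np.round(pbe, 12))                      # all (numerically) zero
```

Note that `bellman_error` itself is nonzero at the fixed point; only its projection into the representable subspace vanishes.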


6. Gradient Descent in Bellman Error (Ch 11.5)

Naive minimization of the Bellman error does not work well: its gradient cannot be estimated from single-sample transitions without two independent ("double") samples of the next state.

Residual-Gradient Algorithm

A true SGD algorithm for minimizing the mean squared Bellman error (MSBE):

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \left( \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \right)$$

Problem: to be unbiased, this requires two independent samples of $S_{t+1}$ from $S_t$ (one inside $\delta_t$, one inside the gradient term). The "naive" residual-gradient algorithm, which reuses the same sample in both places, minimizes the TD error rather than the Bellman error and can converge to the wrong values (e.g., the A-split example).
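One naive residual-gradient step with linear features, using made-up numbers; the same sampled next state appears in both $\delta$ and the gradient term, which is exactly the double-sampling problem.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
w  = np.array([0.5, -0.2])
x  = np.array([1.0, 0.0])     # features of S_t
x2 = np.array([0.0, 1.0])     # features of the single sampled S_{t+1}
r = 1.0

delta = r + gamma * (w @ x2) - (w @ x)    # TD error: 0.32
# Residual gradient descends on delta^2, so the update direction is
# (x - gamma*x2) rather than semi-gradient TD's plain x:
w = w + alpha * delta * (x - gamma * x2)
print(np.round(w, 4))                     # [0.532, -0.2288]
```

The extra $-\gamma \mathbf{x}_{t+1}$ term is what distinguishes this from the semi-gradient update; an unbiased version would need an independent second draw of $S_{t+1}$ for that term.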


7. Bellman Error is Not Learnable (Ch 11.6)

The Bellman Error (BE) cannot be estimated from state-to-state transition data alone; it requires knowledge of the underlying MDP dynamics.

Learnability

A quantity is learnable if it can be determined from the observed distribution of experience (states, actions, rewards).

  • VE is not learnable, but its minimizer is (it’s the same as the Return Error minimizer).
  • BE is not learnable, and its minimizer is also not learnable from experience alone.

8. Gradient-TD Methods (Ch 11.7)

These methods minimize the Projected Bellman Error (PBE) and are true SGD methods with $O(d)$ per-step complexity. They use a second weight vector $\mathbf{v} \in \mathbb{R}^d$ to estimate a part of the gradient.

GTD2

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \right) \left( \mathbf{x}_t^\top \mathbf{v}_t \right)$$

(linear case, with $\mathbf{x}_t \doteq \mathbf{x}(S_t)$ the feature vector and $\mathbf{v}_t$ the second weight vector, updated on a faster time-scale).

TDC (Temporal-Difference with Correction)

Also known as GTD(0).

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} (\mathbf{x}_t^\top \mathbf{v}_t) \right)$$

with the secondary update (shared with GTD2):

$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t \left( \delta_t - \mathbf{v}_t^\top \mathbf{x}_t \right) \mathbf{x}_t$$

Key Properties

  • Convergence: guaranteed to the TD fixed point for linear FA, even off-policy.
  • Two time-scales: $\beta$ (the "fast" scale, for $\mathbf{v}$) should be larger than $\alpha$ (the "slow" scale, for $\mathbf{w}$).
  • Complexity: $O(d)$ per step, the same as standard TD.
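As a closing sketch, TDC can be run on Baird's counterexample, where semi-gradient TD(0) diverges. The step sizes ($\alpha = 0.005$, $\beta = 0.05$) are common illustrative choices for this task, and the i.i.d. uniform state sampling (matching $b$'s stationary distribution) is an implementation simplification.

```python
import numpy as np

GAMMA, ALPHA, BETA, STEPS = 0.99, 0.005, 0.05, 20000

# Baird's features: v_hat(upper i) = 2*w_i + w_8, v_hat(lower) = w_7 + 2*w_8.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0
v = np.zeros(8)                      # second ("correction") weight vector

for _ in range(STEPS):
    s = int(rng.integers(7))         # stationary distribution of b is uniform
    solid = rng.random() < 1 / 7
    s2 = 6 if solid else int(rng.integers(6))
    rho = 7.0 if solid else 0.0      # pi always takes the solid action
    x, x2 = X[s], X[s2]
    delta = GAMMA * (x2 @ w) - x @ w                        # all rewards are zero
    w += ALPHA * rho * (delta * x - GAMMA * x2 * (x @ v))   # TDC main update
    v += BETA * rho * (delta - v @ x) * x                   # fast secondary update

print(np.abs(w).max())               # stays bounded, unlike semi-gradient TD(0)
```

The correction term $-\gamma \mathbf{x}_{t+1}(\mathbf{x}_t^\top \mathbf{v}_t)$ is what removes the positive feedback that drives the semi-gradient version to infinity on this task.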

Created for the RL course vault.