Chapter 11: Off-policy Methods with Approximation

Overview

Off-policy learning with function approximation is significantly harder than the on-policy case. Tabular off-policy methods extend naturally to semi-gradient form, but these extensions do not converge as robustly. This chapter explores why such methods can diverge, introduces the Deadly Triad, analyzes the geometry of linear value-function approximation, and presents algorithms with stronger convergence guarantees, such as Gradient-TD Methods and Emphatic-TD.


11.1 Semi-gradient Methods

To convert tabular off-policy algorithms to semi-gradient form, we replace the state-value (or action-value) array updates with updates to a weight vector $\mathbf{w} \in \mathbb{R}^d$, using the approximate value function $\hat{v}(s, \mathbf{w})$ and its gradient.

Importance Sampling Ratio

For off-policy learning, we weight each update by the per-step importance sampling ratio:

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

Semi-gradient Off-policy TD(0)

The weight update is:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$

where $\delta_t$ is the TD error:

  • Episodic: $\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
  • Continuing: $\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
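As a minimal sketch of the episodic update (linear features; all names and numbers here are illustrative, not from the book's pseudocode):

```python
import numpy as np

def off_policy_td0_step(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features.

    w      -- weight vector
    x      -- feature vector of the current state
    x_next -- feature vector of the next state
    rho    -- importance sampling ratio pi(a|s) / b(a|s)
    """
    # TD error: R + gamma * v(s') - v(s), with v(s) = w . x
    delta = reward + gamma * w @ x_next - w @ x
    # Semi-gradient: only the current state's value estimate is differentiated
    return w + alpha * rho * delta * x

w = off_policy_td0_step(np.zeros(2), np.array([1.0, 0.0]),
                        np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```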

Semi-gradient Expected Sarsa

This algorithm does not require importance sampling because its target takes an expectation over the target policy $\pi$:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

with

$$\delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$$
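A sketch of this update with linear action-value features (the helper name and numbers are illustrative):

```python
import numpy as np

def expected_sarsa_step(w, x_sa, x_next_all, pi_next, reward,
                        alpha=0.1, gamma=0.9):
    """One semi-gradient Expected Sarsa update with linear action-value features.

    x_sa       -- feature vector for the current pair (S_t, A_t)
    x_next_all -- rows are feature vectors for (S_{t+1}, a), one per action a
    pi_next    -- target-policy probabilities pi(a | S_{t+1})
    """
    # Expectation over the target policy's actions -- no importance sampling needed
    expected_q = pi_next @ (x_next_all @ w)
    delta = reward + gamma * expected_q - w @ x_sa
    return w + alpha * delta * x_sa

w = expected_sarsa_step(np.zeros(2), np.array([1.0, 0.0]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]),
                        np.array([0.5, 0.5]), reward=1.0)
```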


11.2 Examples of Off-policy Divergence

Even simple off-policy semi-gradient methods can be unstable, with the weights diverging to infinity.

The w-to-2w Counterexample

Consider two states whose feature vectors are the scalars $1$ and $2$, so that their value estimates are $w$ and $2w$ for a single weight $w$. Suppose the first state transitions to the second with reward $0$. The off-policy TD(0) update for this transition is:

$$w_{t+1} = w_t + \alpha \rho_t \big(0 + \gamma\, 2 w_t - w_t\big) \cdot 1 = \big(1 + \alpha \rho_t (2\gamma - 1)\big)\, w_t$$

Divergence Condition

If $\gamma > \tfrac{1}{2}$, the constant $1 + \alpha \rho_t (2\gamma - 1)$ is greater than 1, and repeated updates on this transition alone drive $w$ to $+\infty$ or $-\infty$, depending on its initial sign.
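A quick numerical illustration of the blow-up (step size and discount chosen for illustration):

```python
# Repeatedly applying the w-to-2w update with gamma > 1/2
alpha, gamma, rho = 0.1, 0.9, 1.0
w = 1.0
for _ in range(100):
    delta = 0.0 + gamma * 2.0 * w - w    # reward 0, next state's value is 2w
    w += alpha * rho * delta * 1.0       # feature of the updated state is 1
# Each step multiplies w by 1 + alpha*rho*(2*gamma - 1) = 1.08, so w grows without bound
```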

Baird’s Counterexample

A 7-state MDP with linear features (8 weights for 7 states) in which all rewards are zero. The behavior policy mixes the "dashed" and "solid" actions so that the next-state distribution is uniform over all seven states, while the target policy always chooses the "solid" action, which leads to the 7th state.

graph TD
    S1[s1: 2w1 + w8] -->|solid| S7[s7: w7 + 2w8]
    S2[s2: 2w2 + w8] -->|solid| S7
    S3[s3: 2w3 + w8] -->|solid| S7
    S4[s4: 2w4 + w8] -->|solid| S7
    S5[s5: 2w5 + w8] -->|solid| S7
    S6[s6: 2w6 + w8] -->|solid| S7
    S7 -->|solid| S7
    
    S1 -.->|dashed| S123456
    S2 -.->|dashed| S123456
    S3 -.->|dashed| S123456
    S4 -.->|dashed| S123456
    S5 -.->|dashed| S123456
    S6 -.->|dashed| S123456
    S7 -.->|dashed| S123456
    
    subgraph S123456 [Upper States]
    S1
    S2
    S3
    S4
    S5
    S6
    end
  • Result: Semi-gradient TD(0), and even DP-style expected updates, diverge to infinity for any positive step size $\alpha$, even though the true value function ($v = 0$ everywhere) is exactly representable.
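A numerical sketch of the divergence using expected DP-style sweeps (features, step size, and initialization follow the book's setup; the sweep count here is illustrative):

```python
import numpy as np

# Feature matrix for Baird's counterexample: 7 states, 8 weights
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0       # upper states: value estimate 2*w_{i+1} + w_8
    X[i, 7] = 1.0
X[6, 6] = 1.0           # seventh state: value estimate w_7 + 2*w_8
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # initialization from the book

for _ in range(5000):
    # Expected (DP-style) semi-gradient sweep under the target policy,
    # with states weighted uniformly (the behavior distribution).
    v = X @ w
    delta = gamma * v[6] - v      # all rewards are 0; the solid action leads to state 7
    w = w + (alpha / 7) * X.T @ delta

# The weights grow without bound even though v = 0 is exactly representable
```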

11.3 The Deadly Triad

Stability is jeopardized when three elements are combined:

  1. Function Approximation: Linear or non-linear (ANNs).
  2. Bootstrapping: Targets based on existing estimates (TD, DP).
  3. Off-Policy Learning: Training on a distribution different from the target policy.

The Triad

Any two of these are safe, but the combination of all three often leads to Off-Policy Divergence. We cannot give up function approximation (scalability) or bootstrapping (efficiency), so we must improve off-policy learning methods.


11.4 Linear Value-function Geometry

We can view value functions as vectors in an $|\mathcal{S}|$-dimensional space, with one component per state. Linear approximation restricts the representable functions to a $d$-dimensional subspace ($d \ll |\mathcal{S}|$) spanned by the feature vectors.

Key Operators

  • Bellman Operator ($B_\pi$): takes a value function and produces the expected one-step return: $(B_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v(s')\big]$.
  • Projection Operator ($\Pi$): projects any value function back into the representable subspace: $\Pi v \doteq v_{\mathbf{w}}$ where $\mathbf{w} = \arg\min_{\mathbf{w}} \lVert v - v_{\mathbf{w}} \rVert_\mu^2$.
  • Projection Matrix: for linear features, $\Pi = \mathbf{X}(\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}$, where $\mathbf{X}$ is the feature matrix and $\mathbf{D}$ is diagonal with the state distribution $\mu$ on its diagonal.
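The projection matrix can be computed directly; a toy sketch (states, features, and distribution are illustrative numbers):

```python
import numpy as np

# Toy example: 3 states, 2 features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # feature matrix, one row per state
mu = np.array([0.5, 0.25, 0.25])    # state distribution
D = np.diag(mu)

# Projection matrix onto the representable subspace (mu-weighted least squares)
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v = np.array([1.0, 2.0, 0.0])       # an arbitrary value function
v_proj = Pi @ v                     # closest representable function in mu-norm
```

Note that $\Pi$ is idempotent ($\Pi^2 = \Pi$) and leaves already-representable functions unchanged.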

Error Measures

  1. Mean Square Value Error: $\overline{VE}(\mathbf{w}) \doteq \lVert v_{\mathbf{w}} - v_\pi \rVert_\mu^2$, the distance to the true value function $v_\pi$.
  2. Mean Square Bellman Error: $\overline{BE}(\mathbf{w}) \doteq \lVert B_\pi v_{\mathbf{w}} - v_{\mathbf{w}} \rVert_\mu^2$, the distance between the value function and its image under $B_\pi$.
  3. Mean Square Projected Bellman Error: $\overline{PBE}(\mathbf{w}) \doteq \lVert \Pi (B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}) \rVert_\mu^2$, the distance to the projection of the Bellman error back into the subspace.

The TD Fixed Point

The point where $\overline{PBE}(\mathbf{w}) = 0$, i.e., where $\Pi B_\pi v_{\mathbf{w}} = v_{\mathbf{w}}$, is the TD fixed point $\mathbf{w}_{TD}$.
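In the linear case the TD fixed point can be computed in closed form as $\mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b}$ with $\mathbf{A} = \mathbf{X}^\top \mathbf{D}(\mathbf{I} - \gamma \mathbf{P}_\pi)\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^\top \mathbf{D}\, \mathbf{r}_\pi$. A sketch on a toy chain (all numbers illustrative):

```python
import numpy as np

# Deterministic 3-state cycle under the target policy
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])     # state-transition matrix P_pi
r = np.array([0.0, 0.0, 1.0])       # expected reward on leaving each state
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # linear features
mu = np.full(3, 1/3)                # stationary distribution of the cycle
gamma = 0.9

D = np.diag(mu)
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)        # TD fixed point solves A w = b
```

At this solution the projected Bellman operator leaves $v_{\mathbf{w}}$ unchanged.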


11.5 Gradient Descent in the Bellman Error

Attempting stochastic gradient descent (SGD) directly on the BE yields the following.

Residual-Gradient Algorithm

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \big(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\big)$$

Double Sampling Problem

The BE gradient involves a product of two expectations over the next state. To get an unbiased estimate, one needs two independent samples of the next state from the same state $S_t$. This is possible only in simulated environments (where the state can be reset) or in deterministic systems.
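A sketch of the single-sample (hence biased, unless transitions are deterministic) residual-gradient update with linear features:

```python
import numpy as np

def residual_gradient_step(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99):
    """One single-sample residual-gradient update with linear features.

    An unbiased BE gradient would need the x_next used in `delta` and the
    x_next in the gradient factor to come from two independent samples.
    """
    delta = reward + gamma * w @ x_next - w @ x
    # Unlike the semi-gradient, the next state's value is also differentiated
    return w + alpha * rho * delta * (x - gamma * x_next)

w = residual_gradient_step(np.zeros(2), np.array([1.0, 0.0]),
                           np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```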


11.6 The Bellman Error is Not Learnable

A quantity is learnable if it can be estimated from the observed sequence of features, actions, and rewards.

  • VE is not learnable, but the parameter that minimizes it is learnable (via Monte Carlo).
  • BE is not learnable, and its minimizing parameter is also not learnable. Two different MDPs can produce identical data but have different BE-minimizing solutions (A-presplit example).
  • PBE is learnable and is the target of Gradient-TD Methods.

11.7 Gradient-TD Methods

Stable $O(d)$ methods for minimizing the PBE. They use a second weight vector $\mathbf{v} \in \mathbb{R}^d$ to estimate part of the gradient.

GTD2 Algorithm

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})(\mathbf{x}_t^\top \mathbf{v}_t)$$

TDC (Gradient-TD with Correction)

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \big(\delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} (\mathbf{x}_t^\top \mathbf{v}_t)\big)$$

In both cases the secondary weights are updated by:

$$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t (\delta_t - \mathbf{v}_t^\top \mathbf{x}_t)\, \mathbf{x}_t$$

TDC Derivation

The PBE gradient can be written as:

$$\nabla \overline{PBE}(\mathbf{w}) = 2\, \mathbb{E}\big[\rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t)\mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1} \mathbb{E}\big[\rho_t \delta_t \mathbf{x}_t\big]$$

The vector $\mathbf{v}$ learns the rightmost product, $\mathbb{E}[\mathbf{x}_t \mathbf{x}_t^\top]^{-1} \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$, i.e., the least-squares solution of regressing $\rho_t \delta_t$ on the features.
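The TDC update pair can be sketched as follows (hyperparameter values are illustrative):

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, rho,
             alpha=0.005, beta=0.05, gamma=0.99):
    """One TDC (Gradient-TD with correction) update with linear features.

    w -- primary weights
    v -- secondary weights approximating E[x x^T]^{-1} E[rho * delta * x]
    """
    delta = reward + gamma * w @ x_next - w @ x
    w_new = w + alpha * rho * (delta * x - gamma * x_next * (x @ v))
    v_new = v + beta * rho * (delta - v @ x) * x
    return w_new, v_new

w, v = tdc_step(np.zeros(2), np.zeros(2), np.array([1.0, 0.0]),
                np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```

The secondary step size $\beta$ is typically larger than $\alpha$, so $\mathbf{v}$ tracks its target on a faster timescale than $\mathbf{w}$.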


11.8 Emphatic-TD Methods

Reweight updates to mimic an on-policy distribution, ensuring stability.

One-step Emphatic-TD(0)

$$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$$
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)$$
$$M_t \doteq \gamma \rho_{t-1} M_{t-1} + I_t$$

  • $M_t$: Emphasis, which accumulates how much earlier states were emphasized
  • $I_t$: Interest (user-defined importance of state $S_t$)
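A sketch of one step of this recursion with linear features (function name and numbers are illustrative):

```python
import numpy as np

def emphatic_td0_step(w, M, rho_prev, x, x_next, reward, rho,
                      interest=1.0, alpha=0.01, gamma=0.99):
    """One-step Emphatic-TD(0) with linear features.

    M        -- emphasis carried over from the previous step (start at 0)
    rho_prev -- previous step's importance sampling ratio (start at 1)
    """
    M = gamma * rho_prev * M + interest          # M_t = gamma*rho_{t-1}*M_{t-1} + I_t
    delta = reward + gamma * w @ x_next - w @ x
    w = w + alpha * M * rho * delta * x
    return w, M

w, M = emphatic_td0_step(np.zeros(2), 0.0, 1.0, np.array([1.0, 0.0]),
                         np.array([0.0, 1.0]), reward=1.0, rho=1.0)
```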

Summary

| Problem | Solution | Stable? | O(d)? |
| --- | --- | --- | --- |
| Off-policy divergence | On-policy training or true SGD | Yes | Yes |
| Deadly Triad | Avoid one of the three | Yes | Varies |
| Bellman error | Residual gradient | Yes | No (double sampling) |
| Projected Bellman error | Gradient-TD (TDC/GTD2) | Yes | Yes |
| Distribution mismatch | Emphatic-TD | Yes | Yes |