RL Lecture 6: On-Policy TD Learning with Approximation

Overview

This lecture explores how to extend Temporal Difference Learning to large or continuous state spaces using Function Approximation. We focus on on-policy prediction, where the goal is to estimate the value function for a fixed policy using parameterized functional forms instead of tables.


1. Value Function Approximation

In large state spaces, we cannot store a value for every state. Instead, we represent the value function as a parameterized function of a weight vector w ∈ ℝᵈ:

  v̂(s,w) ≈ v_π(s)

Typically d ≪ |𝒮|, so changing one weight affects the estimated values of many states (generalization).

Mean Squared Value Error (VE)

To evaluate the approximation, we use the weighted mean squared error over the state distribution μ:

  VE(w) = Σ_s μ(s) [v_π(s) − v̂(s,w)]²

where μ is usually the on-policy distribution (the fraction of time spent in state s under π).
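As a sketch, the VE for a small finite MDP can be computed directly from the definition above (the arrays and their values here are illustrative assumptions):

```python
import numpy as np

def value_error(mu, v_pi, v_hat):
    """Weighted mean squared value error: sum_s mu(s) * (v_pi(s) - v_hat(s))^2."""
    return np.sum(mu * (v_pi - v_hat) ** 2)

mu = np.array([0.5, 0.3, 0.2])      # on-policy state distribution (sums to 1)
v_pi = np.array([1.0, 2.0, 3.0])    # true values under pi
v_hat = np.array([1.0, 2.5, 2.0])   # approximate values
err = value_error(mu, v_pi, v_hat)  # ≈ 0.275
```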


2. Linear Function Approximation

A common and tractable case is Linear Function Approximation, where the estimate is a linear combination of features:

Linear Value Function

  v̂(s,w) = wᵀx(s) = Σ_i w_i x_i(s)

where x(s) = (x_1(s), …, x_d(s))ᵀ is a feature vector representing state s.

Gradient Descent Updates

For linear methods, the gradient with respect to w is simply the feature vector:

  ∇v̂(s,w) = x(s)

The general Stochastic Gradient Descent (SGD) update rule, with update target U_t, is:

  w ← w + α [U_t − v̂(S_t,w)] ∇v̂(S_t,w)

For linear methods, this simplifies to:

  w ← w + α [U_t − wᵀx(S_t)] x(S_t)
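One SGD step for the linear case is a few lines of NumPy; the example values below are illustrative assumptions:

```python
import numpy as np

def sgd_update(w, x_s, target, alpha):
    """One SGD step for linear v_hat(s,w) = w @ x(s); the gradient is x(s)."""
    v_hat = w @ x_s
    return w + alpha * (target - v_hat) * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, 0.0])            # feature vector of the visited state
w = sgd_update(w, x_s, target=2.0, alpha=0.1)
```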


3. Semi-Gradient TD(0)

When the target depends on the current weights (e.g., in Bootstrapping), the update does not follow the true gradient of the error. We call these Semi-Gradient Methods.

Semi-Gradient TD(0) Update

  w ← w + α [R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)] ∇v̂(S_t,w)

The TD Fixed Point

In the linear case, TD(0) converges to the TD Fixed Point w_TD, which satisfies the linear system A w_TD = b, i.e., w_TD = A⁻¹b, where:

  A = E[x_t (x_t − γ x_{t+1})ᵀ]   and   b = E[R_{t+1} x_t]

Convergence Bound

While Monte Carlo Methods converge to the global minimum of the VE, linear TD(0) converges to a point whose error is bounded relative to the best possible error:

  VE(w_TD) ≤ (1 / (1 − γ)) min_w VE(w)

The expansion factor 1/(1−γ) can be large if γ is close to 1 (e.g., γ = 0.99 gives a factor of 100).

Pseudocode: Linear Semi-Gradient TD(0)

Algorithm: Linear Semi-Gradient TD(0) for estimating v̂ ≈ v_π
──────────────────────────────────────────────────────────────
Input: policy π, step-size α > 0
Input: differentiable v̂(s,w) = w^T · x(s)
Initialize: w arbitrarily (e.g., w = 0)
 
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A ~ π(·|S)
    Take action A, observe R, S'
    If S' is terminal:
      w ← w + α[R - v̂(S,w)] · x(S)
      Go to next episode
    w ← w + α[R + γ·v̂(S',w) - v̂(S,w)] · x(S)
    S ← S'
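A direct NumPy translation of the pseudocode above might look like this; the `env`, `policy`, and `features` interfaces are assumptions for the sketch, not a fixed API:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, alpha=0.1, gamma=1.0,
                      episodes=100, seed=0):
    """Linear semi-gradient TD(0), following the pseudocode above.

    Assumed (hypothetical) interfaces:
      env.reset() -> initial state
      env.step(state, action) -> (reward, next_state, done)
      policy(state, rng) -> action
      features(state) -> np.ndarray of length d
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s, rng)
            r, s2, done = env.step(s, a)
            x = features(s)
            # bootstrap only from non-terminal successors
            target = r if done else r + gamma * (w @ features(s2))
            w += alpha * (target - w @ x) * x   # semi-gradient update
            s = s2
    return w
```

With one-hot (tabular) features this reduces exactly to tabular TD(0).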

4. Feature Construction

The performance of linear methods depends entirely on the choice of features.

4.1 Polynomials

States are represented as powers and products of state variables.

  • Example for 2D state s = (s₁, s₂): an order-2 basis uses all products s₁ⁱ s₂ʲ with i, j ∈ {0, 1, 2}, i.e., (1, s₁, s₂, s₁s₂, s₁², s₂², s₁²s₂, s₁s₂², s₁²s₂²)
  • Allows modeling interactions between state variables but doesn’t scale well: an order-n basis has (n+1)ᵏ features in k dimensions
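A generic polynomial feature constructor along these lines (a sketch; the function name is mine):

```python
import numpy as np
from itertools import product

def polynomial_features(s, order):
    """All products s_1^c1 * ... * s_k^ck with each exponent c_i in 0..order,
    giving (order+1)^k features for a k-dimensional state."""
    s = np.asarray(s, dtype=float)
    return np.array([np.prod(s ** np.asarray(c))
                     for c in product(range(order + 1), repeat=len(s))])

x = polynomial_features([2.0, 3.0], order=1)   # exponent tuples (0,0),(0,1),(1,0),(1,1)
```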

4.2 Fourier Basis

Uses cosine functions of different frequencies over states normalized to s ∈ [0,1]ᵏ:

  x_i(s) = cos(π cⁱ · s)

where cⁱ = (c₁ⁱ, …, cₖⁱ) is an integer vector specifying the frequency along each dimension.

Step-Size Scaling for Fourier

Konidaris et al. (2011) suggest per-feature step sizes:

  α_i = α / ‖cⁱ‖

(except when all cⱼⁱ = 0, use α_i = α).
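Both the basis and the per-feature step sizes can be sketched as follows (function names are mine):

```python
import numpy as np
from itertools import product

def fourier_basis(s, order):
    """Fourier cosine features x_i(s) = cos(pi * c_i . s) for s in [0,1]^k,
    using every integer frequency vector c_i in {0,...,order}^k."""
    s = np.asarray(s, dtype=float)
    C = np.array(list(product(range(order + 1), repeat=len(s))))  # (order+1)^k rows
    return np.cos(np.pi * C @ s), C

def fourier_step_sizes(C, alpha):
    """Per-feature step sizes alpha_i = alpha / ||c_i||, with the constant
    feature (c_i = 0) keeping the base alpha, per Konidaris et al. (2011)."""
    norms = np.linalg.norm(C, axis=1)
    norms[norms == 0.0] = 1.0
    return alpha / norms

x, C = fourier_basis([0.5, 0.25], order=2)
alphas = fourier_step_sizes(C, alpha=0.1)
```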

4.3 Coarse Coding

Binary features representing overlapping “receptive fields” (e.g., circles in 2D). A state activates a feature if it falls inside the corresponding region.

  • Large receptive fields → broad generalization, low resolution
  • Small receptive fields → narrow generalization, high resolution
  • More features → finer discrimination but slower learning

4.4 Radial Basis Functions (RBF)

Continuous version of coarse coding. The feature value depends on the distance from the state s to the feature’s center c_i:

RBF Feature

  x_i(s) = exp(−‖s − c_i‖² / (2σ_i²))

Provides a smooth, differentiable approximation with continuous-valued features (unlike binary coarse coding).
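A minimal implementation of the RBF feature above (centers and width here are illustrative assumptions):

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """Gaussian RBF features: x_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    s = np.asarray(s, dtype=float)
    d2 = np.sum((centers - s) ** 2, axis=1)   # squared distance to each center
    return np.exp(-d2 / (2.0 * sigma ** 2))

centers = np.array([[0.0], [0.5], [1.0]])     # evenly spaced 1-D centers
x = rbf_features([0.5], centers, sigma=0.25)
```

A state sitting exactly on a center activates that feature at 1.0, with activation falling off smoothly with distance.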


5. Tile Coding

Tile Coding is the most practically important feature construction method for RL.

Tilings and Tiles

The state space is partitioned into a grid called a tiling. Each grid cell is a tile (a binary feature). Multiple overlapping tilings, each offset from the others, are used to achieve both generalization and fine resolution.

How It Works

  1. Define n tilings over the state space, each a regular grid
  2. Each tiling is offset from the others by a fraction of the tile width
  3. For a given state s: exactly one tile per tiling is active → n active features total
  4. v̂(s,w) = sum of the weights of the n active tiles

Key Properties

  • Binary features: Updates are just additions to active tile weights
  • Fixed cost: Always exactly n active features, regardless of state space size
  • Step-size scaling: Use α/n to account for the n tilings each contributing a weight
  • Hashing: Map large tile spaces to smaller arrays using hash function — handles curse of dimensionality

Displacement Vectors

Uniform offsets (equal in all dimensions) create diagonal generalization artifacts. Asymmetric offsets using displacement vectors of the first odd integers, e.g., (1, 3) in 2D, times the fundamental offset unit (tile width / n) produce better, more isotropic generalization.
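The mechanics can be sketched for a 2-D state in [0,1)² as follows. This is a simplified illustration (real implementations, such as Sutton's tiles3, add hashing for large tile spaces):

```python
import numpy as np

def tile_indices(s, n_tilings, tiles_per_dim, displacement=(1, 3)):
    """Active tile index in each tiling for a state s in [0,1)^2.

    Each tiling is a tiles_per_dim x tiles_per_dim grid, shifted per
    dimension by tiling * displacement / (n_tilings * tiles_per_dim)
    (an asymmetric displacement vector, e.g. (1, 3)).
    """
    s = np.asarray(s, dtype=float)
    d = np.asarray(displacement, dtype=float)
    active = []
    for t in range(n_tilings):
        offset = t * d / (n_tilings * tiles_per_dim)
        coords = np.floor((s + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)   # clip at the boundary
        # flatten (tiling, row, col) into one index into the weight vector
        active.append(t * tiles_per_dim ** 2
                      + coords[0] * tiles_per_dim + coords[1])
    return active

def v_hat(s, w, n_tilings=8, tiles_per_dim=10):
    """Value estimate: sum of the weights of the active tiles (binary features)."""
    return sum(w[i] for i in tile_indices(s, n_tilings, tiles_per_dim))

w = np.zeros(8 * 10 * 10)                    # one weight per tile
idx = tile_indices([0.31, 0.7], n_tilings=8, tiles_per_dim=10)
```

An update touches only the n active weights, which is what makes tile coding so cheap in practice.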


6. Least-Squares TD (LSTD)

Instead of iterative updates, LSTD estimates the matrix A and vector b directly from data and solves w = Â⁻¹b̂.

The Algorithm

After t transitions, form the estimates

  Â_t = Σ_{k=0}^{t−1} x_k (x_k − γ x_{k+1})ᵀ + εI   b̂_t = Σ_{k=0}^{t−1} R_{k+1} x_k

and set w_t = Â_t⁻¹ b̂_t. The εI term (ε > 0) guarantees that Â_t is invertible early on.

Sherman-Morrison Update

To avoid an O(d³) matrix inversion every step, update Â⁻¹ directly in O(d²) using the Sherman–Morrison formula:

  Â⁻¹ ← Â⁻¹ − [Â⁻¹ x_t (x_t − γ x_{t+1})ᵀ Â⁻¹] / [1 + (x_t − γ x_{t+1})ᵀ Â⁻¹ x_t]
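An incremental LSTD(0) along these lines can be sketched as (class and method names are mine):

```python
import numpy as np

class LSTD:
    """LSTD(0) maintaining A^{-1} via the Sherman-Morrison identity (O(d^2)/step)."""

    def __init__(self, d, gamma=1.0, epsilon=1.0):
        self.gamma = gamma
        self.A_inv = np.eye(d) / epsilon   # inverse of the initial epsilon * I
        self.b = np.zeros(d)

    def update(self, x, reward, x_next):
        """Incorporate one transition (x, R, x'); pass a zero vector at terminal."""
        u = x - self.gamma * x_next                 # the (x - gamma x') column
        Av = self.A_inv @ x
        self.A_inv -= np.outer(Av, u @ self.A_inv) / (1.0 + u @ Av)
        self.b += reward * x

    @property
    def w(self):
        return self.A_inv @ self.b                  # current TD fixed-point estimate
```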

LSTD Trade-offs

|                      | LSTD                     | Semi-Gradient TD           |
|----------------------|--------------------------|----------------------------|
| Step-size α?         | No (direct solution)     | Yes (sensitive to tuning)  |
| Data efficiency      | Higher (no data wasted)  | Lower (iterative)          |
| Per-step computation | O(d²)                    | O(d)                       |
| Memory               | O(d²) (stores Â⁻¹)       | O(d)                       |

LSTD "Never Forgets"

LSTD uses all past transitions equally — the TD fixed point depends on all data ever seen. This is sample efficient but problematic if the policy or environment changes (non-stationarity).


7. Neural Network Function Approximation

Neural Network Function Approximation allows for nonlinear value functions: v̂(s,w) is a differentiable but nonlinear function of the weights w.

Architecture

A feedforward network maps state features through hidden layers:

  h⁽ˡ⁾ = σ(W⁽ˡ⁾ h⁽ˡ⁻¹⁾ + b⁽ˡ⁾),   h⁽⁰⁾ = x(s)

where σ is a non-linear activation function (ReLU, sigmoid, etc.).

Semi-Gradient Update with Neural Nets

Same update rule as the linear case, but the gradient ∇v̂(S_t,w) is computed via backpropagation:

  w ← w + α [R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)] ∇v̂(S_t,w)
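A minimal sketch with one hidden layer and hand-coded backprop, assuming nothing beyond NumPy (this illustrates the semi-gradient step only; it is not DQN, and the architecture and names are mine):

```python
import numpy as np

class TinyValueNet:
    """One-hidden-layer value net v_hat(s,w) = W2 @ relu(W1 @ x + b1) + b2,
    trained with semi-gradient TD(0) via hand-coded backprop."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, n_hidden)
        self.b2 = 0.0

    def value(self, x):
        h = np.maximum(0.0, self.W1 @ x + self.b1)   # ReLU hidden layer
        return self.W2 @ h + self.b2

    def td_update(self, x, reward, x_next, alpha=0.01, gamma=0.99,
                  terminal=False):
        """Semi-gradient step: the target R + gamma*v(x') is held constant."""
        target = reward if terminal else reward + gamma * self.value(x_next)
        pre = self.W1 @ x + self.b1
        h = np.maximum(0.0, pre)
        delta = target - (self.W2 @ h + self.b2)     # TD error
        grad_h = delta * self.W2 * (pre > 0)         # dv/dh through the ReLU
        # ascend alpha * delta * grad of v_hat (the semi-gradient)
        self.W2 += alpha * delta * h
        self.b2 += alpha * delta
        self.W1 += alpha * np.outer(grad_h, x)
        self.b1 += alpha * grad_h
```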

Challenges of Neural Networks in RL

  1. Non-stationarity: Targets change as the network learns (bootstrapping moves the goal)
  2. Correlated data: Sequential RL data violates i.i.d. assumption of SGD
  3. Catastrophic forgetting: Learning new states can degrade performance on previously learned states
  4. No convergence guarantees: Unlike linear semi-gradient TD, non-linear methods have no guaranteed convergence

These challenges motivate the stabilization techniques in Deep Q-Network (DQN): Experience Replay and Target Network.


8. Summary: Method Comparison

| Feature Type      | Representation             | Key Property                    |
|-------------------|----------------------------|---------------------------------|
| State Aggregation | One-hot over partitions    | Simplest; piecewise constant    |
| Polynomials       | Powers of state variables  | Global; poor scaling            |
| Fourier Basis     | Cosine functions           | Good for smooth functions       |
| Coarse Coding     | Binary overlapping regions | Local generalization            |
| Tile Coding       | Multiple offset grids      | Efficient; tunable; practical   |
| RBF               | Gaussian bumps             | Smooth; computationally expensive |
| Neural Networks   | Learned non-linear features | Most expressive; least stable  |