RL Lecture 6: On-Policy TD Learning with Approximation

Overview

This lecture explores how to extend Temporal Difference Learning to large or continuous state spaces using Function Approximation. We focus on on-policy prediction, where the goal is to estimate the value function for a fixed policy using parameterized functional forms instead of tables.


1. Value Function Approximation

In large state spaces, we cannot store a value for every state. Instead, we represent the value function as a parameterized function of a weight vector w ∈ ℝᵈ:

  v̂(s,w) ≈ v_π(s)

Typically d ≪ |𝒮|, so changing one weight affects the estimated values of many states (generalization).

Mean Squared Value Error (VE)

To evaluate the approximation, we use the weighted mean squared error over the state distribution μ:

  VE(w) = Σ_s μ(s) [v_π(s) − v̂(s,w)]²

where μ is usually the on-policy distribution (the fraction of time spent in state s under π).
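As a sketch, the VE for a small finite MDP can be computed directly from the definition above (the arrays and their values here are illustrative assumptions):

```python
import numpy as np

def value_error(mu, v_pi, v_hat):
    """Weighted mean squared value error: sum_s mu(s) * (v_pi(s) - v_hat(s))^2."""
    return np.sum(mu * (v_pi - v_hat) ** 2)

mu = np.array([0.5, 0.3, 0.2])      # on-policy state distribution (sums to 1)
v_pi = np.array([1.0, 2.0, 3.0])    # true values under pi
v_hat = np.array([1.0, 2.5, 2.0])   # approximate values
err = value_error(mu, v_pi, v_hat)  # ≈ 0.275
```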


2. Linear Function Approximation

A common and tractable case is Linear Function Approximation, where the estimate is a linear combination of features:

Linear Value Function

  v̂(s,w) = wᵀx(s) = Σ_i w_i x_i(s)

where x(s) = (x_1(s), …, x_d(s))ᵀ is a feature vector representing state s.

Gradient Descent Updates

For linear methods, the gradient with respect to w is simply the feature vector:

  ∇v̂(s,w) = x(s)

The general Stochastic Gradient Descent (SGD) update rule, with update target U_t, is:

  w ← w + α [U_t − v̂(S_t,w)] ∇v̂(S_t,w)

For linear methods, this simplifies to:

  w ← w + α [U_t − wᵀx(S_t)] x(S_t)
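One SGD step for the linear case is a few lines of NumPy; the example values below are illustrative assumptions:

```python
import numpy as np

def sgd_update(w, x_s, target, alpha):
    """One SGD step for linear v_hat(s,w) = w @ x(s); the gradient is x(s)."""
    v_hat = w @ x_s
    return w + alpha * (target - v_hat) * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, 0.0])            # feature vector of the visited state
w = sgd_update(w, x_s, target=2.0, alpha=0.1)
```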


3. Semi-Gradient TD(0)

When the target depends on the current weights (e.g., in Bootstrapping), the update does not follow the true gradient of the error. We call these Semi-Gradient Methods.

Semi-Gradient TD(0) Update

  w ← w + α [R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)] ∇v̂(S_t,w)

The TD Fixed Point

In the linear case, TD(0) converges to the TD Fixed Point w_TD, which satisfies the linear system A w_TD = b, i.e., w_TD = A⁻¹b, where:

  A = E[x_t (x_t − γ x_{t+1})ᵀ]   and   b = E[R_{t+1} x_t]

Convergence Bound

While Monte Carlo Methods converge to the global minimum of the VE, linear TD(0) converges to a point whose error is bounded relative to the best possible error:

  VE(w_TD) ≤ (1 / (1 − γ)) min_w VE(w)

The expansion factor 1/(1−γ) can be large if γ is close to 1 (e.g., γ = 0.99 gives a factor of 100).

Pseudocode: Linear Semi-Gradient TD(0)

Algorithm: Linear Semi-Gradient TD(0) for estimating v̂ ≈ v_π
──────────────────────────────────────────────────────────────
Input: policy π, step-size α > 0
Input: differentiable v̂(s,w) = w^T · x(s)
Initialize: w arbitrarily (e.g., w = 0)
 
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A ~ π(·|S)
    Take action A, observe R, S'
    If S' is terminal:
      w ← w + α[R - v̂(S,w)] · x(S)
      Go to next episode
    w ← w + α[R + γ·v̂(S',w) - v̂(S,w)] · x(S)
    S ← S'
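A direct NumPy translation of the pseudocode above might look like this; the `env`, `policy`, and `features` interfaces are assumptions for the sketch, not a fixed API:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, alpha=0.1, gamma=1.0,
                      episodes=100, seed=0):
    """Linear semi-gradient TD(0), following the pseudocode above.

    Assumed (hypothetical) interfaces:
      env.reset() -> initial state
      env.step(state, action) -> (reward, next_state, done)
      policy(state, rng) -> action
      features(state) -> np.ndarray of length d
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s, rng)
            r, s2, done = env.step(s, a)
            x = features(s)
            # bootstrap only from non-terminal successors
            target = r if done else r + gamma * (w @ features(s2))
            w += alpha * (target - w @ x) * x   # semi-gradient update
            s = s2
    return w
```

With one-hot (tabular) features this reduces exactly to tabular TD(0).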

4. Feature Construction

The performance of linear methods depends entirely on the choice of features.

4.1 Polynomials

States are represented as powers and products of state variables.

  • Example for 2D state s = (s₁, s₂): an order-2 basis uses all products s₁ⁱ s₂ʲ with i, j ∈ {0, 1, 2}, i.e., (1, s₁, s₂, s₁s₂, s₁², s₂², s₁²s₂, s₁s₂², s₁²s₂²)
  • Allows modeling interactions between state variables but doesn’t scale well: an order-n basis has (n+1)ᵏ features in k dimensions
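A generic polynomial feature constructor along these lines (a sketch; the function name is mine):

```python
import numpy as np
from itertools import product

def polynomial_features(s, order):
    """All products s_1^c1 * ... * s_k^ck with each exponent c_i in 0..order,
    giving (order+1)^k features for a k-dimensional state."""
    s = np.asarray(s, dtype=float)
    return np.array([np.prod(s ** np.asarray(c))
                     for c in product(range(order + 1), repeat=len(s))])

x = polynomial_features([2.0, 3.0], order=1)   # exponent tuples (0,0),(0,1),(1,0),(1,1)
```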

4.2 Fourier Basis

Uses cosine functions of different frequencies over states normalized to s ∈ [0,1]ᵏ:

  x_i(s) = cos(π cⁱ · s)

where cⁱ = (c₁ⁱ, …, cₖⁱ) is an integer vector specifying the frequency along each dimension.

Step-Size Scaling for Fourier

Konidaris et al. (2011) suggest per-feature step sizes:

  α_i = α / ‖cⁱ‖

(except when all cⱼⁱ = 0, use α_i = α).
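Both the basis and the per-feature step sizes can be sketched as follows (function names are mine):

```python
import numpy as np
from itertools import product

def fourier_basis(s, order):
    """Fourier cosine features x_i(s) = cos(pi * c_i . s) for s in [0,1]^k,
    using every integer frequency vector c_i in {0,...,order}^k."""
    s = np.asarray(s, dtype=float)
    C = np.array(list(product(range(order + 1), repeat=len(s))))  # (order+1)^k rows
    return np.cos(np.pi * C @ s), C

def fourier_step_sizes(C, alpha):
    """Per-feature step sizes alpha_i = alpha / ||c_i||, with the constant
    feature (c_i = 0) keeping the base alpha, per Konidaris et al. (2011)."""
    norms = np.linalg.norm(C, axis=1)
    norms[norms == 0.0] = 1.0
    return alpha / norms

x, C = fourier_basis([0.5, 0.25], order=2)
alphas = fourier_step_sizes(C, alpha=0.1)
```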

4.3 Coarse Coding

Binary features representing overlapping “receptive fields” (e.g., circles in 2D). A state activates a feature if it falls inside the corresponding region.

  • Large receptive fields → broad generalization, low resolution
  • Small receptive fields → narrow generalization, high resolution
  • More features → finer discrimination but slower learning

4.4 Radial Basis Functions (RBF)

Continuous version of coarse coding. The feature value depends on the distance from the state s to the feature’s center c_i:

RBF Feature

  x_i(s) = exp(−‖s − c_i‖² / (2σ_i²))

Provides a smooth, differentiable approximation with continuous-valued features (unlike binary coarse coding).
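A minimal implementation of the RBF feature above (centers and width here are illustrative assumptions):

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """Gaussian RBF features: x_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    s = np.asarray(s, dtype=float)
    d2 = np.sum((centers - s) ** 2, axis=1)   # squared distance to each center
    return np.exp(-d2 / (2.0 * sigma ** 2))

centers = np.array([[0.0], [0.5], [1.0]])     # evenly spaced 1-D centers
x = rbf_features([0.5], centers, sigma=0.25)
```

A state sitting exactly on a center activates that feature at 1.0, with activation falling off smoothly with distance.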


5. Tile Coding

Tile Coding is the most practically important feature construction method for RL.

Tilings and Tiles

The state space is partitioned into a grid called a tiling. Each grid cell is a tile (a binary feature). Multiple overlapping tilings, each offset from the others, are used to achieve both generalization and fine resolution.

How It Works

  1. Define n tilings over the state space, each a regular grid
  2. Each tiling is offset from the others by a fraction of the tile width
  3. For a given state s: exactly one tile per tiling is active → n active features total
  4. v̂(s,w) = sum of the weights of the n active tiles

Key Properties

  • Binary features: Updates are just additions to active tile weights
  • Fixed cost: Always exactly n active features, regardless of state space size
  • Step-size scaling: Use α/n to account for the n tilings each contributing a weight
  • Hashing: Map large tile spaces to smaller arrays using hash function — handles curse of dimensionality

Displacement Vectors

Uniform offsets (equal in all dimensions) create diagonal generalization artifacts. Asymmetric offsets using displacement vectors of the first odd integers, e.g., (1, 3) in 2D, times the fundamental offset unit (tile width / n) produce better, more isotropic generalization.
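The mechanics can be sketched for a 2-D state in [0,1)² as follows. This is a simplified illustration (real implementations, such as Sutton's tiles3, add hashing for large tile spaces):

```python
import numpy as np

def tile_indices(s, n_tilings, tiles_per_dim, displacement=(1, 3)):
    """Active tile index in each tiling for a state s in [0,1)^2.

    Each tiling is a tiles_per_dim x tiles_per_dim grid, shifted per
    dimension by tiling * displacement / (n_tilings * tiles_per_dim)
    (an asymmetric displacement vector, e.g. (1, 3)).
    """
    s = np.asarray(s, dtype=float)
    d = np.asarray(displacement, dtype=float)
    active = []
    for t in range(n_tilings):
        offset = t * d / (n_tilings * tiles_per_dim)
        coords = np.floor((s + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)   # clip at the boundary
        # flatten (tiling, row, col) into one index into the weight vector
        active.append(t * tiles_per_dim ** 2
                      + coords[0] * tiles_per_dim + coords[1])
    return active

def v_hat(s, w, n_tilings=8, tiles_per_dim=10):
    """Value estimate: sum of the weights of the active tiles (binary features)."""
    return sum(w[i] for i in tile_indices(s, n_tilings, tiles_per_dim))

w = np.zeros(8 * 10 * 10)                    # one weight per tile
idx = tile_indices([0.31, 0.7], n_tilings=8, tiles_per_dim=10)
```

An update touches only the n active weights, which is what makes tile coding so cheap in practice.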


6. Least-Squares TD (LSTD)

Instead of iterative updates, LSTD estimates the matrix A and vector b directly from data and solves w = Â⁻¹b̂.

The Algorithm

After t transitions, form the estimates

  Â_t = Σ_{k=0}^{t−1} x_k (x_k − γ x_{k+1})ᵀ + εI   b̂_t = Σ_{k=0}^{t−1} R_{k+1} x_k

and set w_t = Â_t⁻¹ b̂_t. The εI term (ε > 0) guarantees that Â_t is invertible early on.

Sherman-Morrison Update

To avoid an O(d³) matrix inversion every step, update Â⁻¹ directly in O(d²) using the Sherman–Morrison formula:

  Â⁻¹ ← Â⁻¹ − [Â⁻¹ x_t (x_t − γ x_{t+1})ᵀ Â⁻¹] / [1 + (x_t − γ x_{t+1})ᵀ Â⁻¹ x_t]
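An incremental LSTD(0) along these lines can be sketched as (class and method names are mine):

```python
import numpy as np

class LSTD:
    """LSTD(0) maintaining A^{-1} via the Sherman-Morrison identity (O(d^2)/step)."""

    def __init__(self, d, gamma=1.0, epsilon=1.0):
        self.gamma = gamma
        self.A_inv = np.eye(d) / epsilon   # inverse of the initial epsilon * I
        self.b = np.zeros(d)

    def update(self, x, reward, x_next):
        """Incorporate one transition (x, R, x'); pass a zero vector at terminal."""
        u = x - self.gamma * x_next                 # the (x - gamma x') column
        Av = self.A_inv @ x
        self.A_inv -= np.outer(Av, u @ self.A_inv) / (1.0 + u @ Av)
        self.b += reward * x

    @property
    def w(self):
        return self.A_inv @ self.b                  # current TD fixed-point estimate
```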

LSTD Trade-offs

|                      | LSTD                     | Semi-Gradient TD           |
|----------------------|--------------------------|----------------------------|
| Step-size α?         | No (direct solution)     | Yes (sensitive to tuning)  |
| Data efficiency      | Higher (no data wasted)  | Lower (iterative)          |
| Per-step computation | O(d²)                    | O(d)                       |
| Memory               | O(d²) (stores Â⁻¹)       | O(d)                       |

LSTD "Never Forgets"

LSTD uses all past transitions equally — the TD fixed point depends on all data ever seen. This is sample efficient but problematic if the policy or environment changes (non-stationarity).


7. Neural Network Function Approximation

Neural Network Function Approximation allows for nonlinear value functions: v̂(s,w) is a differentiable but nonlinear function of the weights w.

Architecture

A feedforward network maps state features through hidden layers:

  h⁽ˡ⁾ = σ(W⁽ˡ⁾ h⁽ˡ⁻¹⁾ + b⁽ˡ⁾),   h⁽⁰⁾ = x(s)

where σ is a non-linear activation function (ReLU, sigmoid, etc.).

Semi-Gradient Update with Neural Nets

Same update rule as the linear case, but the gradient ∇v̂(S_t,w) is computed via backpropagation:

  w ← w + α [R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)] ∇v̂(S_t,w)
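A minimal sketch with one hidden layer and hand-coded backprop, assuming nothing beyond NumPy (this illustrates the semi-gradient step only; it is not DQN, and the architecture and names are mine):

```python
import numpy as np

class TinyValueNet:
    """One-hidden-layer value net v_hat(s,w) = W2 @ relu(W1 @ x + b1) + b2,
    trained with semi-gradient TD(0) via hand-coded backprop."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, n_hidden)
        self.b2 = 0.0

    def value(self, x):
        h = np.maximum(0.0, self.W1 @ x + self.b1)   # ReLU hidden layer
        return self.W2 @ h + self.b2

    def td_update(self, x, reward, x_next, alpha=0.01, gamma=0.99,
                  terminal=False):
        """Semi-gradient step: the target R + gamma*v(x') is held constant."""
        target = reward if terminal else reward + gamma * self.value(x_next)
        pre = self.W1 @ x + self.b1
        h = np.maximum(0.0, pre)
        delta = target - (self.W2 @ h + self.b2)     # TD error
        grad_h = delta * self.W2 * (pre > 0)         # dv/dh through the ReLU
        # ascend alpha * delta * grad of v_hat (the semi-gradient)
        self.W2 += alpha * delta * h
        self.b2 += alpha * delta
        self.W1 += alpha * np.outer(grad_h, x)
        self.b1 += alpha * grad_h
```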

Challenges of Neural Networks in RL

  1. Non-stationarity: Targets change as the network learns (bootstrapping moves the goal)
  2. Correlated data: Sequential RL data violates i.i.d. assumption of SGD
  3. Catastrophic forgetting: Learning new states can degrade performance on previously learned states
  4. No convergence guarantees: Unlike linear semi-gradient TD, non-linear methods have no guaranteed convergence

These challenges motivate the stabilization techniques in Deep Q-Network (DQN): Experience Replay and Target Network.


8. Summary: Method Comparison

| Feature Type      | Representation             | Key Property                    |
|-------------------|----------------------------|---------------------------------|
| State Aggregation | One-hot over partitions    | Simplest; piecewise constant    |
| Polynomials       | Powers of state variables  | Global; poor scaling            |
| Fourier Basis     | Cosine functions           | Good for smooth functions       |
| Coarse Coding     | Binary overlapping regions | Local generalization            |
| Tile Coding       | Multiple offset grids      | Efficient; tunable; practical   |
| RBF               | Gaussian bumps             | Smooth; computationally expensive |
| Neural Networks   | Learned non-linear features | Most expressive; least stable  |