Chapter 9: On-policy Prediction with Approximation

Overview

This chapter explores how to estimate the state-value function v_π from on-policy data when the state space is too large for tabular representations. We transition from tables to parameterized functional forms v̂(s, w) ≈ v_π(s) with a weight vector w ∈ ℝ^d, where d ≪ |S|.

Intuition

In function approximation, an update to one state affects many others through generalization. This makes learning more powerful but also more complex to manage, as we cannot get the values of all states exactly correct.


9.1 Value-function Approximation

All prediction methods are viewed as updates s ↦ u, where s is the state being updated and u is the update target that s's estimated value is shifted toward.

  • Monte Carlo: S_t ↦ G_t
  • TD(0): S_t ↦ R_{t+1} + γ v̂(S_{t+1}, w_t)
  • n-step TD: S_t ↦ G_{t:t+n}

This is essentially a supervised learning task where we provide training examples to a Function Approximation method. However, RL requires methods that can handle:

  1. Online Learning: Learning while interacting with the environment.
  2. Nonstationary Targets: The target changes as the policy improves (control) or because of Bootstrapping (TD/DP).

9.2 The Prediction Objective (VE)

In the tabular case, we could reach v(s) = v_π(s) exactly for all s. With approximation there are far more states than weights, so we must decide which states to prioritize using a state distribution μ(s) ≥ 0 with Σ_s μ(s) = 1.

Mean Squared Value Error (VE)

VE(w) ≐ Σ_s μ(s) [v_π(s) − v̂(s, w)]²

where μ is typically the on-policy distribution (fraction of time spent in s under policy π).

Under on-policy training, μ for a continuing task is the stationary distribution under π. For episodic tasks, μ(s) normalizes the expected number of visits η(s):

η(s) = h(s) + Σ_{s̄} η(s̄) Σ_a π(a|s̄) p(s | s̄, a),   μ(s) = η(s) / Σ_{s′} η(s′)

where h(s) is the probability that an episode starts in s.

9.3 Stochastic Gradient and Semi-gradient Methods

Stochastic Gradient Descent (SGD) is ideal for online RL. We adjust weights to reduce the error on the current example:

SGD Update Rule

w_{t+1} ≐ w_t + α [U_t − v̂(S_t, w_t)] ∇v̂(S_t, w_t)

where U_t is the update target and ∇v̂(S_t, w_t) is the gradient: the vector of partial derivatives of v̂(S_t, w_t) with respect to the components of w.

Gradient Monte Carlo

Since the return G_t is an unbiased estimate of v_π(S_t), MC using SGD (with U_t = G_t) converges to a local optimum of VE.

# Gradient Monte Carlo Algorithm
Initialize w arbitrarily
Loop for each episode:
    Generate episode S_0, A_0, R_1, ..., S_T using pi
    Loop for each step t = 0, ..., T-1:
        w <- w + alpha * [G_t - v_hat(S_t, w)] * grad_v_hat(S_t, w)

Semi-Gradient Methods

When we use Bootstrapping (TD), the target R_{t+1} + γ v̂(S_{t+1}, w_t) itself depends on the current weights w_t, so the update includes only part of the true gradient. We call these Semi-Gradient Methods.

Semi-gradient TD(0) Update

w_{t+1} ≐ w_t + α [R_{t+1} + γ v̂(S_{t+1}, w_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)

# Semi-gradient TD(0) Algorithm
Initialize w arbitrarily
Loop for each episode:
    Initialize S
    Loop for each step of episode, until S is terminal:
        Choose A ~ pi(.|S)
        Take action A, observe R, S'
        w <- w + alpha * [R + gamma*v_hat(S', w) - v_hat(S, w)] * grad_v_hat(S, w)
        S <- S'
# Note: v_hat(S', w) = 0 by definition when S' is terminal
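A runnable sketch of this algorithm on a toy 5-state random walk (terminals on both ends, reward 1 on the right exit), assuming one-hot features so the gradient picks out a single component; all names are illustrative:

```python
import random

N, ALPHA, GAMMA = 5, 0.1, 1.0

def v_hat(s, w):
    return w[s] if 0 <= s < N else 0.0   # terminal states are worth 0

random.seed(1)
w = [0.0] * N
for _ in range(5000):
    s = N // 2                            # Initialize S
    while 0 <= s < N:                     # until S is terminal
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == N else 0.0
        # one-hot features: grad_v_hat(S, w) selects component s
        w[s] += ALPHA * (r + GAMMA * v_hat(s_next, w) - v_hat(s, w))
        s = s_next
```

The weights approach the true values 1/6, ..., 5/6 for this walk, typically with less per-update variance than the Monte Carlo target.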

Warning

Semi-gradient methods do not converge as robustly as gradient methods but often learn faster and enable online learning.


9.4 Linear Methods

In Linear Function Approximation, the estimate is the inner product of the weight vector and a feature vector x(s): v̂(s, w) ≐ w⊤x(s) = Σ_{i=1}^{d} w_i x_i(s). Here, the gradient is simply the feature vector: ∇v̂(s, w) = x(s).

The TD Fixed Point

Linear semi-gradient TD(0) converges to the TD fixed point:

w_TD = A⁻¹b,   where A ≐ E[x_t (x_t − γ x_{t+1})⊤] and b ≐ E[R_{t+1} x_t]

At this point, the error is bounded relative to the best linear fit: VE(w_TD) ≤ (1 / (1 − γ)) min_w VE(w).
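To see the fixed point concretely, one can assemble A and b by hand for a hypothetical two-state chain (s0 → s1 with reward 0, s1 → terminal with reward 1, γ = 0.9, one-hot features, uniform on-policy distribution) and solve w_TD = A⁻¹b directly:

```python
import numpy as np

GAMMA = 0.9
# Hypothetical two-state chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1)
x0 = np.array([1.0, 0.0])          # one-hot features
x1 = np.array([0.0, 1.0])
x_terminal = np.zeros(2)           # terminal state has zero features

mu = [0.5, 0.5]                    # each state visited once per episode
transitions = [(x0, x1, 0.0), (x1, x_terminal, 1.0)]   # (x_t, x_{t+1}, R_{t+1})

A = sum(p * np.outer(x, x - GAMMA * x_next)
        for p, (x, x_next, r) in zip(mu, transitions))
b = sum(p * r * x for p, (x, x_next, r) in zip(mu, transitions))

w_td = np.linalg.solve(A, b)
```

Because the features here are tabular, the fixed point coincides with the true values: w_TD = (0.9, 1.0).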


9.5 Feature Construction

The choice of features determines how the agent generalizes.

Polynomials and Fourier Basis

  • Polynomials: Features are products of powers of the state dimensions (e.g., 1, s₁, s₂, s₁s₂, s₁²s₂). The number of features grows rapidly with dimension.
  • Fourier Basis: Uses cosine features of varying frequency: x_i(s) = cos(π s⊤cⁱ), where cⁱ is an integer coefficient vector. Often performs better than polynomials in RL.
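A minimal one-dimensional version of the Fourier cosine basis, assuming the state has already been scaled to [0, 1] (function name and parameters are illustrative):

```python
import math

def fourier_features(s, order):
    """1-D Fourier cosine basis: x_i(s) = cos(i * pi * s), for s in [0, 1]."""
    return [math.cos(i * math.pi * s) for i in range(order + 1)]

x = fourier_features(0.5, 4)   # 5 features of increasing frequency
```

The i = 0 term is the constant feature; higher i values capture increasingly fine structure in s.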

Coarse Coding

States are represented by overlapping binary features (e.g., circles in a 2D space).

  • Narrow features: Fine discrimination, slow generalization.
  • Broad features: Broad generalization, coarse initial approximation.

Tile Coding

A computationally efficient form of coarse coding using shifted grids (tilings).

  • Each state falls into exactly one tile per tiling.
  • Total number of active features = number of tilings.
  • Hashing can be used to reduce memory requirements.

Tile Coding Illustration

If you have 8 tilings shifted asymmetrically, a single point in state space activates 1 feature in each tiling. Generalization occurs to any state that shares one or more tiles.
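A toy one-dimensional tile-coding sketch. Real implementations use asymmetric offsets and hashing; this illustrative version simply shifts each tiling by a uniform fraction of a tile width:

```python
def active_tiles(s, num_tilings=8, tiles_per_dim=10):
    """Return the index of the single active tile in each tiling, for s in [0, 1)."""
    tiles = []
    for t in range(num_tilings):
        offset = t / (num_tilings * tiles_per_dim)   # uniform shift per tiling
        idx = int((s + offset) * tiles_per_dim) % tiles_per_dim
        tiles.append(t * tiles_per_dim + idx)        # disjoint index range per tiling
    return tiles
```

Nearby states activate mostly the same tiles, which is exactly where generalization comes from; distant states share none.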

Radial Basis Functions (RBFs)

Continuous-valued generalization of coarse coding. Each feature has a Gaussian response: x_i(s) = exp(−‖s − c_i‖² / (2σ_i²)), where c_i is the feature's center state and σ_i its width.
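A one-dimensional sketch with hand-picked centers and a shared width (all values illustrative):

```python
import math

def rbf_features(s, centers, sigma=0.2):
    """Gaussian response of state s to each center: exp(-(s - c_i)^2 / (2 sigma^2))."""
    return [math.exp(-(s - c) ** 2 / (2 * sigma ** 2)) for c in centers]

x = rbf_features(0.5, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Unlike binary coarse coding, the response varies smoothly: it peaks at the matching center and falls off with distance.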


9.7 Neural Network Function Approximation

Neural Network Function Approximation uses multi-layer ANNs to learn non-linear approximations.

  • Hidden layers: Automatically create features.
  • Backpropagation: Computes gradients of the loss with respect to weights.
  • Deep RL: Successes in Go and Atari rely on deep convolutional networks that extract hierarchical spatial features.

9.8 Least-Squares TD (LSTD)

Instead of iterative updates, LSTD estimates the TD-fixed-point quantities A and b directly from data:

Â_t ≐ Σ_{k=0}^{t−1} x_k (x_k − γ x_{k+1})⊤ + ε I,   b̂_t ≐ Σ_{k=0}^{t−1} R_{k+1} x_k,   w_t ≐ Â_t⁻¹ b̂_t

  • Complexity: O(d²) per step using the Sherman-Morrison formula for incremental matrix inversion.
  • Data Efficiency: The most data-efficient linear TD method; no step size is needed (though it does need the regularization parameter ε).
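A sketch of the O(d²) recursive update, maintaining Â⁻¹ directly via Sherman-Morrison. The data stream here is a hypothetical two-state chain (s0 → s1 with reward 0, s1 → terminal with reward 1, γ = 0.9, one-hot features); all names are illustrative:

```python
import numpy as np

D, GAMMA, EPS = 2, 0.9, 1.0
A_inv = np.eye(D) / EPS            # start from (eps * I)^{-1}
b = np.zeros(D)

x0, x1, x_term = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.zeros(D)
stream = [(x0, x1, 0.0), (x1, x_term, 1.0)] * 100    # (x_t, x_{t+1}, R_{t+1})

for x, x_next, r in stream:
    u = x - GAMMA * x_next                           # the (x_t - gamma x_{t+1}) term
    Ainv_x = A_inv @ x
    u_Ainv = u @ A_inv
    # Sherman-Morrison: inverse of (A + x u^T) from A^{-1} in O(d^2)
    A_inv -= np.outer(Ainv_x, u_Ainv) / (1.0 + u_Ainv @ x)
    b += r * x

w = A_inv @ b      # approaches the TD fixed point as data accumulates
```

As samples accumulate, the ε-regularization washes out and w approaches the TD fixed point, here (0.9, 1.0).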

9.11 Interest and Emphasis

We can focus approximation on specific “interesting” states using:

  1. Interest I_t ∈ [0, 1]: How much we care about accurately valuing the state at time t.
  2. Emphasis M_t: A multiplier on the learning update at time t, set recursively from interest: M_t = I_t + γⁿ M_{t−n} (with M_t = 0 for t < 0).
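Assuming the standard emphasis recursion M_t = I_t + γⁿ M_{t−n}, the one-step (n = 1) case can be sketched as:

```python
def emphasis_sequence(interests, gamma):
    """Compute M_t = I_t + gamma * M_{t-1} (n = 1), with M_t = 0 before time 0."""
    M, out = 0.0, []
    for interest in interests:
        M = interest + gamma * M
        out.append(M)
    return out
```

For example, emphasis_sequence([1, 0, 0], 0.5) yields [1.0, 0.5, 0.25]: a state we declared interest in continues to pass discounted emphasis to later updates through bootstrapping.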

Summary

  • Function Approximation is necessary for large state spaces.
  • SGD provides the theoretical bedrock for updates.
  • Linear Methods with Tile Coding or Fourier Basis are robust and efficient.
  • Deep Learning allows for complex, non-linear feature discovery.
  • There is a trade-off between Local Optimum convergence (Gradient MC) and the faster, biased convergence of Semi-Gradient Methods.