RL-Book Ch10: On-policy Control with Approximation

Overview

In this chapter, we extend Function Approximation from state-value prediction to action-value control. We transition from estimating $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ to estimating $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$. While the extension is straightforward for the episodic case, the continuing case requires a shift from discounting to the average-reward formulation.

10.1 Episodic Semi-gradient Control

The semi-gradient methods developed in Chapter 9 for prediction are extended to control by approximating the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.

Semi-gradient Action-Value Update

The general gradient-descent update for action-value prediction is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$
where $U_t$ is the update target (e.g., the full return $G_t$ or an n-step TD return).

Episodic Semi-gradient Sarsa

For the one-step Sarsa method, the update becomes:
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

Policy Improvement

To perform control, we use an Epsilon-Greedy Policy with respect to the current approximate action values. For a state $S_t$, the greedy action is $A_t^* = \arg\max_a \hat{q}(S_t, a, \mathbf{w})$.
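
As a concrete illustration, here is a minimal epsilon-greedy selector in Python (the function name and array-based interface are illustrative, not from the chapter):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action argmax_a q(S, a, w)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```

Ties in `argmax` are broken by lowest index here; the book breaks them randomly, which matters mainly early in learning when many values are equal.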

Algorithm: Episodic Semi-gradient Sarsa

Input: a differentiable action-value function parameterization q : S x A x R^d -> R
Algorithm parameters: step size alpha > 0, small epsilon > 0
Initialize value-function weights w arbitrarily
 
Loop for each episode:
    S, A <- initial state and action of episode (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        If S' is terminal:
            w <- w + alpha * [R - q(S, A, w)] * grad_q(S, A, w)
            Go to next episode
        Choose A' as a function of q(S', ., w) (e.g., epsilon-greedy)
        w <- w + alpha * [R + gamma * q(S', A', w) - q(S, A, w)] * grad_q(S, A, w)
        S <- S', A <- A'
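
The pseudocode above can be sketched in Python with a linear (one-hot) parameterization. The five-state chain environment below is a made-up toy problem for illustration, not from the chapter; with one-hot features the semi-gradient update reduces to tabular Sarsa, which makes the code easy to check.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # toy chain; actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic chain: -1 reward per step, state 4 is terminal."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, -1.0, s2 == N_STATES - 1

def features(s, a):
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0       # one-hot, so grad_q(s, a, w) = x
    return x

def q_hat(w, s, a):
    return float(w @ features(s, a))

def policy(w, s, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(w, s, a) for a in range(N_ACTIONS)]))

def run(episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = policy(w, s, epsilon, rng)
        while True:
            s2, r, terminal = step(s, a)
            if terminal:                 # terminal target is just R
                w += alpha * (r - q_hat(w, s, a)) * features(s, a)
                break
            a2 = policy(w, s2, epsilon, rng)
            target = r + gamma * q_hat(w, s2, a2)
            w += alpha * (target - q_hat(w, s, a)) * features(s, a)
            s, a = s2, a2
    return w
```

After training, `q_hat(w, 3, 1)` should approach -1 (one step to the goal), and the greedy action in every nonterminal state should be right.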

Example: Mountain Car

The Mountain Car task involves driving an underpowered car up a steep hill. Because gravity is stronger than the engine, the agent must learn to drive away from the goal first to build momentum.

Learning Curves: Typically show that intermediate step sizes work best, and as learning progresses, the “cost-to-go” function (negative of the value function) accurately represents the time needed to reach the goal from different states.
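
The dynamics of the task are simple enough to state in a few lines; this sketch follows the standard Mountain Car equations (position in [-1.2, 0.5], velocity in [-0.07, 0.07], actions 0/1/2 for reverse, coast, and forward throttle):

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the standard Mountain Car dynamics; reward is -1 per step."""
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    position = min(max(position, -1.2), 0.5)
    if position == -1.2:        # hitting the left wall kills the velocity
        velocity = 0.0
    done = position >= 0.5      # goal at the top of the right hill
    return position, velocity, -1.0, done
```

On the steep part of the slope the gravity term (up to 0.0025) exceeds the throttle (0.001), which is why the car cannot drive straight up and must first rock backward to build momentum.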

10.2 n-step Semi-gradient Sarsa

The n-step return generalizes to function approximation by bootstrapping from the approximate action value of the state-action pair $n$ steps ahead:
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$

n-step Semi-gradient Return

The update rule is:
$$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \big[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$$

Performance usually improves with an intermediate $n$ (e.g., $n = 4$ or $n = 8$), balancing the bias of bootstrapping against the variance of long returns.
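
Computed backwards, the n-step return is only a few lines of code; here `q_tail` stands for the bootstrap value $\hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$, and the names are illustrative:

```python
def n_step_return(rewards, q_tail, gamma):
    """G_{t:t+n}: rewards R_{t+1}..R_{t+n} plus a discounted bootstrap."""
    g = q_tail
    for r in reversed(rewards):     # fold in rewards from the far end
        g = r + gamma * g
    return g
```

For example, `n_step_return([1.0, 2.0], 10.0, 0.5)` gives 1 + 0.5·(2 + 0.5·10) = 4.5.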

10.3 Average Reward: A New Problem Setting

In the continuing setting (non-episodic) with function approximation, discounting loses its theoretical grounding. Instead, we use the average reward setting.

Average Reward

The quality of a policy $\pi$ is defined as the average reward per time step:
$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi]$$
Under the steady-state distribution $\mu_\pi$:
$$r(\pi) = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

In this setting, returns and values are defined relative to the average reward. The differential return is
$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots,$$
and the differential value functions are its expectations, e.g., $q_\pi(s, a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a]$.

Differential TD Error

$$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t),$$
where $\bar{R}_t$ is an estimate of the average reward $r(\pi)$.
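
A single differential semi-gradient Sarsa update with a linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ can be sketched as follows; `beta` is the step size for the average-reward estimate, and the function signature is illustrative rather than the book's pseudocode:

```python
import numpy as np

def differential_sarsa_step(w, r_bar, x, x_next, reward, alpha, beta):
    """One update of w and r_bar from a transition (x, reward, x_next)."""
    delta = reward - r_bar + w @ x_next - w @ x   # differential TD error
    r_bar += beta * delta                          # track r(pi)
    w = w + alpha * delta * x                      # semi-gradient step
    return w, r_bar, delta
```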

10.4 Deprecating the Discounted Setting

Sutton & Barto argue that discounting is “futile” in the continuing setting with function approximation.

  • If we optimize the discounted value over the on-policy distribution, the resulting policy ordering is identical to that of the average-reward objective, regardless of the discount factor $\gamma$: the discounted on-policy objective works out to $r(\pi)/(1 - \gamma)$, so $\gamma$ merely rescales it.
  • The Policy Improvement Theorem no longer holds strictly with function approximation.
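
The first point can be checked numerically. On a hypothetical two-state Markov chain, the discounted values weighted by the stationary distribution come out to exactly $r(\pi)/(1-\gamma)$ for every $\gamma$:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # transitions under a fixed policy pi
r = np.array([1.0, 0.0])         # expected one-step reward in each state

# Stationary distribution mu: the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()

avg_reward = mu @ r              # r(pi)
for gamma in (0.5, 0.9, 0.99):
    v = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted values
    assert np.isclose(mu @ v, avg_reward / (1 - gamma))
```

The identity follows from $\mu_\pi P^k = \mu_\pi$: weighting $v_\gamma = \sum_k \gamma^k P^k r$ by $\mu_\pi$ gives $\sum_k \gamma^k\, r(\pi) = r(\pi)/(1-\gamma)$, so maximizing the discounted objective over the on-policy distribution is the same as maximizing $r(\pi)$.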

10.5 Differential Semi-gradient n-step Sarsa

To generalize to n-step bootstrapping in the average reward setting, we use the differential n-step return.

Differential n-step Return

The differential n-step return is
$$G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$
The update for $\mathbf{w}$ remains the same as in the episodic n-step case, but we also update the average-reward estimate $\bar{R}$ with a second step size $\beta > 0$:
$$\bar{R} \leftarrow \bar{R} + \beta\, \delta_t, \qquad \delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$$
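
The differential n-step return itself is straightforward to compute; in this sketch `r_bar` is the current average-reward estimate and `q_tail` the bootstrap value (illustrative names, not the book's notation):

```python
def differential_n_step_return(rewards, r_bar, q_tail):
    """Sum of (R - r_bar) over the n rewards, plus the differential bootstrap."""
    return sum(r - r_bar for r in rewards) + q_tail
```

For example, with rewards [2, 3], `r_bar = 1`, and `q_tail = 5`, the return is (2-1) + (3-1) + 5 = 8.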

Summary

  • Episodic Control: Straightforward extension of semi-gradient methods to action values $\hat{q}(s, a, \mathbf{w})$.
  • Mountain Car: Demonstrates the effectiveness of Sarsa with Tile Coding in continuous-state control.
  • Average Reward: Necessary for continuing tasks with function approximation; replaces discounting with differential value functions.
  • TD Error: Updated to include the average-reward term $\bar{R}_t$, subtracted from the reward.