RL-Book Ch10: On-policy Control with Approximation
Overview
In this chapter, we extend Function Approximation from state-value prediction to action-value control. We transition from estimating $\hat v(s, \mathbf{w}) \approx v_\pi(s)$ to estimating $\hat q(s, a, \mathbf{w}) \approx q_\pi(s, a)$. While the extension is straightforward in the episodic case, the continuing case requires a shift from discounting to the average-reward formulation.
10.1 Episodic Semi-gradient Control
The semi-gradient methods developed in Chapter 9 for prediction are extended to control by approximating the action-value function $\hat q(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
Semi-gradient Action-Value Update
The general gradient-descent update for action-value prediction is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[U_t - \hat q(S_t, A_t, \mathbf{w}_t)\big] \nabla \hat q(S_t, A_t, \mathbf{w}_t),$$
where $U_t$ is the update target (e.g., the Monte Carlo return $G_t$ or a TD return).
Episodic Semi-gradient Sarsa
For the one-step Sarsa target $U_t \doteq R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, \mathbf{w}_t)$, the update becomes
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat q(S_t, A_t, \mathbf{w}_t)\big] \nabla \hat q(S_t, A_t, \mathbf{w}_t).$$
Policy Improvement
To perform control, we use an Epsilon-Greedy Policy with respect to the current approximate action values: for a state $S_t$, the greedy action is $A_t^* = \arg\max_a \hat q(S_t, a, \mathbf{w})$, and with probability $\varepsilon$ a random action is taken instead.
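This selection rule can be sketched in a few lines. A minimal, self-contained sketch; the function name `epsilon_greedy` and the random tie-breaking among maximal actions are illustrative assumptions, not part of the book's pseudocode:

```python
import random

def epsilon_greedy(q_hat, state, actions, w, epsilon):
    """With probability epsilon pick a random action; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    values = [q_hat(state, a, w) for a in actions]
    best = max(values)
    # Break ties among maximal actions at random
    best_actions = [a for a, v in zip(actions, values) if v == best]
    return random.choice(best_actions)
```

With `epsilon = 0` this reduces to the pure greedy action $\arg\max_a \hat q(S_t, a, \mathbf{w})$.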
Algorithm: Episodic Semi-gradient Sarsa
Input: a differentiable action-value function parameterization q : S x A x R^d -> R
Algorithm parameters: step size alpha > 0, small epsilon > 0
Initialize value-function weights w in R^d arbitrarily (e.g., w = 0)
Loop for each episode:
    S, A <- initial state and action of episode (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        If S' is terminal:
            w <- w + alpha * [R - q(S, A, w)] * grad_q(S, A, w)
            Go to next episode
        Choose A' as a function of q(S', ., w) (e.g., epsilon-greedy)
        w <- w + alpha * [R + gamma * q(S', A', w) - q(S, A, w)] * grad_q(S, A, w)
        S <- S', A <- A'

Example: Mountain Car
The Mountain Car task involves driving an underpowered car up a steep hill. Because gravity is stronger than the engine, the agent must learn to drive away from the goal first to build momentum.
- State: Position and Velocity.
- Actions: Forward (+1), Reverse (-1), Zero (0).
- Reward: -1 per step until the goal is reached.
- Function Approximation: Linear Function Approximation with Tile Coding.
Learning Curves: Typically show that intermediate step sizes work best, and as learning progresses, the “cost-to-go” function (negative of the value function) accurately represents the time needed to reach the goal from different states.
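The pieces above can be combined into a compact, runnable sketch of episodic semi-gradient Sarsa on Mountain Car. The tile coder here is a simplified hand-rolled version (the book's implementation uses Tile Coding with an index hash table); the grid sizes, step sizes, and all helper names are illustrative assumptions:

```python
import math
import random

# Standard Mountain Car dynamics (Sutton & Barto, Example 10.1)
POS_MIN, POS_MAX = -1.2, 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07
ACTIONS = (-1, 0, 1)  # reverse, zero, forward throttle

def step(pos, vel, action):
    """One environment transition; reward is -1 per step until the goal."""
    vel = max(VEL_MIN, min(VEL_MAX, vel + 0.001 * action - 0.0025 * math.cos(3 * pos)))
    pos = max(POS_MIN, min(POS_MAX, pos + vel))
    if pos == POS_MIN:
        vel = 0.0  # inelastic collision with the left wall
    return pos, vel, -1.0, pos >= POS_MAX

# Hand-rolled tile coder: NUM_TILINGS offset grids of (TILES+1)^2 cells each,
# replicated per action, giving binary features over (s, a).
NUM_TILINGS, TILES = 8, 8
N_FEATURES = NUM_TILINGS * (TILES + 1) ** 2 * len(ACTIONS)

def active_features(pos, vel, action):
    """Indices of the NUM_TILINGS active binary features for (s, a)."""
    x = (pos - POS_MIN) / (POS_MAX - POS_MIN) * TILES
    y = (vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * TILES
    a = ACTIONS.index(action)
    feats = []
    for t in range(NUM_TILINGS):
        off = t / NUM_TILINGS  # each tiling shifted by a fraction of a tile
        i, j = int(x + off), int(y + off)
        feats.append(((t * (TILES + 1) + i) * (TILES + 1) + j) * len(ACTIONS) + a)
    return feats

def q_hat(pos, vel, action, w):
    """Linear action-value estimate: sum of weights of active features."""
    return sum(w[f] for f in active_features(pos, vel, action))

def select(pos, vel, w, epsilon):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda act: q_hat(pos, vel, act, w))

def sarsa_episode(w, alpha, epsilon=0.1, max_steps=5000):
    """One episode of episodic semi-gradient Sarsa (gamma = 1); returns step count."""
    pos, vel = random.uniform(-0.6, -0.4), 0.0
    a = select(pos, vel, w, epsilon)
    for steps in range(1, max_steps + 1):
        pos2, vel2, r, done = step(pos, vel, a)
        if done:  # at a terminal state the target is just R
            delta = r - q_hat(pos, vel, a, w)
            for f in active_features(pos, vel, a):
                w[f] += alpha * delta
            return steps
        a2 = select(pos2, vel2, w, epsilon)
        delta = r + q_hat(pos2, vel2, a2, w) - q_hat(pos, vel, a, w)
        for f in active_features(pos, vel, a):
            w[f] += alpha * delta
        pos, vel, a = pos2, vel2, a2
    return max_steps
```

Calling `sarsa_episode` repeatedly on a shared weight vector (e.g., with `alpha = 0.5 / NUM_TILINGS`, so the effective step size is per-tiling) typically drives episode lengths down as the cost-to-go estimate sharpens.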
10.2 n-step Semi-gradient Sarsa
The n-step return generalizes to function approximation by bootstrapping from the approximate action value of the state-action pair $n$ steps ahead:
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat q(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}).$$
n-step Semi-gradient Return
The update rule is:
$$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \big[G_{t:t+n} - \hat q(S_t, A_t, \mathbf{w}_{t+n-1})\big] \nabla \hat q(S_t, A_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T.$$
Performance usually improves with an intermediate $n$ (e.g., $n = 8$ rather than $n = 1$ on Mountain Car), balancing bias and variance.
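Under the definitions above, the n-step target can be computed from a buffer of the last $n$ rewards plus the bootstrap value. A minimal sketch; the function name and argument layout are illustrative assumptions:

```python
def n_step_return(rewards, gamma, q_tail):
    """G_{t:t+n}: discounted sum of n rewards plus gamma^n * q(S_{t+n}, A_{t+n}, w).

    rewards : list [R_{t+1}, ..., R_{t+n}]
    q_tail  : bootstrap value q_hat(S_{t+n}, A_{t+n}, w)
    """
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** len(rewards) * q_tail
```

With an empty reward buffer this degenerates to the bootstrap value itself, which is convenient at the start of an episode.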
10.3 Average Reward: A New Problem Setting
In the continuing setting (non-episodic) with function approximation, discounting loses its theoretical grounding. Instead, we use the average reward setting.
Average Reward
The quality of a policy $\pi$ is defined as the average reward per time step:
$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid S_0, A_{0:t-1} \sim \pi\big].$$
Under the steady-state distribution $\mu_\pi$:
$$r(\pi) = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r.$$
In this setting, returns are measured relative to the average reward (the differential return),
$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots,$$
and values are defined as differential values: the expectations of this return, $v_\pi(s)$ and $q_\pi(s, a)$.
Differential TD Error
$$\delta_t \doteq R_{t+1} - \bar R_t + \hat q(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat q(S_t, A_t, \mathbf{w}_t),$$
where $\bar R_t$ is an estimate at time $t$ of the average reward $r(\pi)$.
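For a linear approximator with binary features, one differential semi-gradient Sarsa step can be sketched as follows; the function name, the active-index representation of features, and the parameter `beta` (step size for the average-reward estimate) are illustrative assumptions:

```python
def differential_sarsa_update(w, r_bar, reward, feats, feats_next, alpha, beta):
    """One differential semi-gradient Sarsa step; returns the new r_bar.

    w          : weight vector (mutated in place)
    r_bar      : current average-reward estimate
    feats      : active feature indices of (S, A)
    feats_next : active feature indices of (S', A')
    """
    q = sum(w[f] for f in feats)
    q_next = sum(w[f] for f in feats_next)
    delta = reward - r_bar + q_next - q   # differential TD error
    r_bar += beta * delta                 # update average-reward estimate
    for f in feats:                       # semi-gradient weight update
        w[f] += alpha * delta
    return r_bar
```

Note the order of operations: the TD error is computed first and then used for both the $\bar R$ update and the weight update.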
10.4 Deprecating the Discounted Setting
Sutton & Barto argue that discounting is “futile” in the continuing setting with function approximation.
- If we optimize the discounted value over the on-policy distribution, the resulting objective is proportional to the average reward, so the policy ordering is identical to that of the average-reward objective regardless of the discount rate $\gamma$.
- The Policy Improvement Theorem no longer holds strictly with function approximation.
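The proportionality claim in the first bullet can be sketched in one line of algebra; the derivation assumes $\mu_\pi$ is the steady-state distribution under $\pi$, so that it is preserved by one step of the dynamics:
$$J(\pi) \doteq \sum_s \mu_\pi(s)\, v_\pi^\gamma(s) = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[r + \gamma\, v_\pi^\gamma(s')\big] = r(\pi) + \gamma J(\pi),$$
hence $J(\pi) = r(\pi) / (1 - \gamma)$: the discounted objective is the average reward scaled by a constant, so $\gamma$ does not affect which policies rank higher.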
10.5 Differential Semi-gradient n-step Sarsa
To generalize to n-step bootstrapping in the average reward setting, we use the differential n-step return.
Differential n-step Return
$$G_{t:t+n} \doteq R_{t+1} - \bar R_{t+n-1} + \cdots + R_{t+n} - \bar R_{t+n-1} + \hat q(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}).$$
The update for $\mathbf{w}$ remains the same as in the episodic n-step case, but we also update the average-reward estimate using the TD error: $\bar R \leftarrow \bar R + \beta\,\delta$.
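Both updates can be performed together once the $n$-reward buffer is full. A minimal linear-features sketch; the function name and argument layout are illustrative assumptions:

```python
def differential_n_step_update(w, r_bar, rewards, feats_t, q_tail, alpha, beta):
    """One n-step differential Sarsa update; returns the new r_bar.

    rewards : [R_{t+1}, ..., R_{t+n}]
    feats_t : active feature indices of (S_t, A_t)
    q_tail  : bootstrap value q_hat(S_{t+n}, A_{t+n}, w)
    """
    q_t = sum(w[f] for f in feats_t)
    g = sum(r - r_bar for r in rewards) + q_tail  # differential n-step return
    delta = g - q_t
    r_bar += beta * delta                         # average-reward update
    for f in feats_t:                             # semi-gradient weight update
        w[f] += alpha * delta
    return r_bar
```

With `rewards` of length 1 this reduces to the one-step differential Sarsa update.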
Summary
- Episodic Control: Straightforward extension of semi-gradient methods to action values $\hat q(s, a, \mathbf{w})$.
- Mountain Car: Demonstrates the effectiveness of Sarsa with Tile Coding for control in continuous state spaces.
- Average Reward: Necessary for continuing tasks with function approximation; replaces discounting with differential value functions.
- TD Error: Updated to include the average-reward term $\bar R_t$.