RL-Book Ch10: On-policy Control with Approximation

Overview

In this chapter, we extend Function Approximation from state-value prediction to action-value control. We transition from estimating $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ to estimating $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$. While the extension is straightforward for the episodic case, the continuing case requires a shift from discounting to the average-reward formulation.

10.1 Episodic Semi-gradient Control

The semi-gradient methods developed in Chapter 9 for prediction are extended to control by approximating the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.

Semi-gradient Action-Value Update

The general gradient-descent update for action-value prediction is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$
where $U_t$ is the update target (e.g., the full return $G_t$ or an n-step TD return).

Episodic Semi-gradient Sarsa

For the one-step Sarsa method, the update becomes:
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

Policy Improvement

To perform control, we use an Epsilon-Greedy Policy with respect to the current approximate action values. For a state $S_t$, the greedy action is $A_t^* = \arg\max_a \hat{q}(S_t, a, \mathbf{w})$.
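
As a concrete illustration, here is a minimal epsilon-greedy selector in Python (the function name and array-based interface are illustrative, not from the chapter):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action argmax_a q(S, a, w)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```

Ties in `argmax` are broken by lowest index here; the book breaks them randomly, which matters mainly early in learning when many values are equal.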

Algorithm: Episodic Semi-gradient Sarsa

Input: a differentiable action-value function parameterization q : S x A x R^d -> R
Algorithm parameters: step size alpha > 0, small epsilon > 0
Initialize value-function weights w arbitrarily
 
Loop for each episode:
    S, A <- initial state and action of episode (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        If S' is terminal:
            w <- w + alpha * [R - q(S, A, w)] * grad_q(S, A, w)
            Go to next episode
        Choose A' as a function of q(S', ., w) (e.g., epsilon-greedy)
        w <- w + alpha * [R + gamma * q(S', A', w) - q(S, A, w)] * grad_q(S, A, w)
        S <- S', A <- A'
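
The pseudocode above can be sketched in Python with a linear (one-hot) parameterization. The five-state chain environment below is a made-up toy problem for illustration, not from the chapter; with one-hot features the semi-gradient update reduces to tabular Sarsa, which makes the code easy to check.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # toy chain; actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic chain: -1 reward per step, state 4 is terminal."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, -1.0, s2 == N_STATES - 1

def features(s, a):
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0       # one-hot, so grad_q(s, a, w) = x
    return x

def q_hat(w, s, a):
    return float(w @ features(s, a))

def policy(w, s, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(w, s, a) for a in range(N_ACTIONS)]))

def run(episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = policy(w, s, epsilon, rng)
        while True:
            s2, r, terminal = step(s, a)
            if terminal:                 # terminal target is just R
                w += alpha * (r - q_hat(w, s, a)) * features(s, a)
                break
            a2 = policy(w, s2, epsilon, rng)
            target = r + gamma * q_hat(w, s2, a2)
            w += alpha * (target - q_hat(w, s, a)) * features(s, a)
            s, a = s2, a2
    return w
```

After training, `q_hat(w, 3, 1)` should approach -1 (one step to the goal), and the greedy action in every nonterminal state should be right.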

Example: Mountain Car

The Mountain Car task involves driving an underpowered car up a steep hill. Because gravity is stronger than the engine, the agent must learn to drive away from the goal first to build momentum.

Learning Curves: Typically show that intermediate step sizes work best, and as learning progresses, the “cost-to-go” function (negative of the value function) accurately represents the time needed to reach the goal from different states.
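
The dynamics of the task are simple enough to state in a few lines; this sketch follows the standard Mountain Car equations (position in [-1.2, 0.5], velocity in [-0.07, 0.07], actions 0/1/2 for reverse, coast, and forward throttle):

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the standard Mountain Car dynamics; reward is -1 per step."""
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    position = min(max(position, -1.2), 0.5)
    if position == -1.2:        # hitting the left wall kills the velocity
        velocity = 0.0
    done = position >= 0.5      # goal at the top of the right hill
    return position, velocity, -1.0, done
```

On the steep part of the slope the gravity term (up to 0.0025) exceeds the throttle (0.001), which is why the car cannot drive straight up and must first rock backward to build momentum.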

10.2 n-step Semi-gradient Sarsa

The n-step return generalizes to function approximation by bootstrapping from the approximate action value of the state-action pair $n$ steps ahead:
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$

n-step Semi-gradient Return

The update rule is:
$$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \big[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$$

Performance usually improves with an intermediate $n$ (e.g., $n = 4$ or $n = 8$), balancing the bias of bootstrapping against the variance of long returns.
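
Computed backwards, the n-step return is only a few lines of code; here `q_tail` stands for the bootstrap value $\hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$, and the names are illustrative:

```python
def n_step_return(rewards, q_tail, gamma):
    """G_{t:t+n}: rewards R_{t+1}..R_{t+n} plus a discounted bootstrap."""
    g = q_tail
    for r in reversed(rewards):     # fold in rewards from the far end
        g = r + gamma * g
    return g
```

For example, `n_step_return([1.0, 2.0], 10.0, 0.5)` gives 1 + 0.5·(2 + 0.5·10) = 4.5.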

10.3 Average Reward: A New Problem Setting

In the continuing setting (non-episodic) with function approximation, discounting loses its theoretical grounding. Instead, we use the average reward setting.

Average Reward

The quality of a policy $\pi$ is defined as the average reward per time step:
$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi]$$
Under the steady-state distribution $\mu_\pi$:
$$r(\pi) = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

In this setting, returns and values are defined relative to the average reward. The differential return is
$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots,$$
and the differential value functions are its expectations, e.g., $q_\pi(s, a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a]$.

Differential TD Error

$$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t),$$
where $\bar{R}_t$ is an estimate of the average reward $r(\pi)$.
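
A single differential semi-gradient Sarsa update with a linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ can be sketched as follows; `beta` is the step size for the average-reward estimate, and the function signature is illustrative rather than the book's pseudocode:

```python
import numpy as np

def differential_sarsa_step(w, r_bar, x, x_next, reward, alpha, beta):
    """One update of w and r_bar from a transition (x, reward, x_next)."""
    delta = reward - r_bar + w @ x_next - w @ x   # differential TD error
    r_bar += beta * delta                          # track r(pi)
    w = w + alpha * delta * x                      # semi-gradient step
    return w, r_bar, delta
```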

10.4 Deprecating the Discounted Setting

Sutton & Barto argue that discounting is “futile” in the continuing setting with function approximation.

  • If we optimize the discounted value over the on-policy distribution, the resulting policy ordering is identical to that of the average-reward objective, regardless of the discount factor $\gamma$: the discounted on-policy objective works out to $r(\pi)/(1 - \gamma)$, so $\gamma$ merely rescales it.
  • The Policy Improvement Theorem no longer holds strictly with function approximation.
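
The first point can be checked numerically. On a hypothetical two-state Markov chain, the discounted values weighted by the stationary distribution come out to exactly $r(\pi)/(1-\gamma)$ for every $\gamma$:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # transitions under a fixed policy pi
r = np.array([1.0, 0.0])         # expected one-step reward in each state

# Stationary distribution mu: the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()

avg_reward = mu @ r              # r(pi)
for gamma in (0.5, 0.9, 0.99):
    v = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted values
    assert np.isclose(mu @ v, avg_reward / (1 - gamma))
```

The identity follows from $\mu_\pi P^k = \mu_\pi$: weighting $v_\gamma = \sum_k \gamma^k P^k r$ by $\mu_\pi$ gives $\sum_k \gamma^k\, r(\pi) = r(\pi)/(1-\gamma)$, so maximizing the discounted objective over the on-policy distribution is the same as maximizing $r(\pi)$.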

10.5 Differential Semi-gradient n-step Sarsa

To generalize to n-step bootstrapping in the average reward setting, we use the differential n-step return.

Differential n-step Return

The differential n-step return is
$$G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$
The update for $\mathbf{w}$ remains the same as in the episodic n-step case, but we also update the average-reward estimate $\bar{R}$ with a second step size $\beta > 0$:
$$\bar{R} \leftarrow \bar{R} + \beta\, \delta_t, \qquad \delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$$
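
The differential n-step return itself is straightforward to compute; in this sketch `r_bar` is the current average-reward estimate and `q_tail` the bootstrap value (illustrative names, not the book's notation):

```python
def differential_n_step_return(rewards, r_bar, q_tail):
    """Sum of (R - r_bar) over the n rewards, plus the differential bootstrap."""
    return sum(r - r_bar for r in rewards) + q_tail
```

For example, with rewards [2, 3], `r_bar = 1`, and `q_tail = 5`, the return is (2-1) + (3-1) + 5 = 8.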

Summary

  • Episodic Control: Straightforward extension of semi-gradient methods to action values $\hat{q}(s, a, \mathbf{w})$.
  • Mountain Car: Demonstrates the effectiveness of Sarsa with Tile Coding in continuous-state control.
  • Average Reward: Necessary for continuing tasks with function approximation; replaces discounting with differential value functions.
  • TD Error: Updated to include the average-reward term $\bar{R}_t$, subtracted from the reward.