Chapter 6: Temporal-Difference Learning

Overview

Temporal-Difference (TD) learning is the central and novel idea of Reinforcement Learning. It is a combination of Monte Carlo Methods and Dynamic Programming (DP) ideas:

  • Like Monte Carlo: TD methods can learn directly from raw experience without a model of the environment’s dynamics.
  • Like DP: TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

Intuition

TD learning updates its “guess” based on another “guess” further down the line, rather than waiting for the final reward at the end of an episode.


6.1 TD Prediction

The goal is to estimate the Value Function for a given Policy .

The TD(0) Update Rule

In constant- MC, the update is: where is the actual return. In TD(0) (one-step TD), the update is performed immediately after transitioning to and receiving :

TD(0) Update

TD Error

The quantity in the brackets is the TD Error, denoted by :

Pseudocode: Tabular TD(0)

# Tabular TD(0) for estimating v_pi
Input: policy pi to be evaluated
Algorithm parameter: step size alpha in (0, 1]
Initialize V(s) for all s in S+, arbitrarily, V(terminal) = 0
 
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A = action given by pi for S
        Take action A, observe R, S'
        V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        S = S'
    until S is terminal

Backup Diagrams

  • TD(0): Updates from a single sample transition ().
  • MC: Updates from the entire sequence of rewards until the end of the episode.
  • DP: Updates based on the complete distribution of all possible successors (expected update).

6.2 Advantages of TD Prediction Methods

  1. Model-Free: Does not require .
  2. Online/Incremental: Updates every time step. MC must wait until the end of an episode.
  3. Convergence: TD(0) converges to for a fixed policy .
  4. Efficiency: Empirically, TD methods often converge faster than MC on stochastic tasks (e.g., the Random Walk example).

6.3 Optimality of TD(0)

In Batch Updating, where experience is presented repeatedly until convergence:

  • Batch MC minimizes mean square error on the training set (best fit to observed returns).
  • Batch TD(0) converges to the Certainty-Equivalence Estimate (the value function that would be correct for the maximum-likelihood model of the MDP).

The Predictor's Dilemma

If state A always leads to B (reward 0), and B has a 75% chance of return 1, TD(0) says (correct for the underlying Markov process), while MC might say if the single observed path from A gave return 0.


6.4 SARSA: On-Policy TD Control

To solve the control problem, we switch to an action-value function .

SARSA Update

The name comes from the quintuple .

Pseudocode: SARSA

# Sarsa (on-policy TD control) for estimating Q ~ q*
Algorithm parameters: step size alpha in (0, 1], small epsilon > 0
Initialize Q(s, a) for all s, a, Q(terminal, .) = 0
 
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S = S'; A = A'
    until S is terminal

6.5 Q-Learning: Off-Policy TD Control

Q-Learning directly approximates , the optimal action-value function, independent of the policy being followed.

Q-Learning Update

Pseudocode: Q-Learning

# Q-learning (off-policy TD control) for estimating pi ~ pi*
Algorithm parameters: step size alpha in (0, 1], small epsilon > 0
Initialize Q(s, a) for all s, a, Q(terminal, .) = 0
 
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S = S'
    until S is terminal

Cliff Walking

  • Q-Learning learns the optimal path (along the edge) but suffers lower online rewards because -greedy exploration causes it to fall off the cliff.
  • SARSA learns a “safer” roundabout path because it takes the exploration into account (it is on-policy).

6.6 Expected SARSA

Expected SARSA uses the expected value over next actions instead of the maximum or a single sample: It is more computationally complex but reduces variance compared to SARSA.


6.7 Maximization Bias and Double Learning

Maximization over estimated values can lead to a positive bias (Maximization Bias) because the maximum of estimates is usually greater than the maximum of true values .

Double Q-Learning

To eliminate this, use two independent estimates and :

  • Use to find the maximizing action:
  • Use to estimate the value:

Summary Comparison

FeatureTD(0)SARSAQ-Learning
TypePredictionControl (On-policy)Control (Off-policy)
Target
BootstrappingYesYesYes
Model Req.NoNoNo

Created for RL Course - Obsidian University Vault