RL Lecture 4: Temporal Difference Learning

Temporal-Difference (TD) learning is a central and novel idea in reinforcement learning. It combines Monte Carlo (MC) ideas with Dynamic Programming (DP) ideas.

  • Like Monte Carlo: TD methods can learn directly from raw experience without a model of the environment’s dynamics (model-free).
  • Like DP: TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

1. TD Prediction (TD(0))

The goal is to solve the prediction problem: estimating the value function v_π for a given policy π.

TD(0) Update Rule

The simplest TD method, TD(0) or one-step TD, makes the following update after transitioning from S_t to S_{t+1} and receiving reward R_{t+1}:

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

  • Target: R_{t+1} + γ V(S_{t+1}) (the TD target)
  • Error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t) (the TD error)
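As a minimal sketch, the tabular TD(0) update can be written as a single function. The value table `V`, the step size, and the example transition below are illustrative assumptions, not from the lecture:

```python
# Tabular TD(0) update for a single observed transition (s, r, s_next).
# V, alpha, gamma, and the example transition are illustrative.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[s]; returns the TD error delta_t."""
    td_target = r + gamma * V[s_next]   # bootstrapped target
    td_error = td_target - V[s]         # delta_t
    V[s] += alpha * td_error
    return td_error

V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=1.0, s_next="B")
# target = 1.0 + 0.9 * 0.5 = 1.45, so delta = 1.45 and V["A"] becomes 0.145
```

Note that only V[s] changes; V[s_next] is read but not updated, which is exactly the bootstrapping step.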

Comparison: MC vs DP vs TD

| Method | Target (equation) | Focus |
| --- | --- | --- |
| Monte Carlo | G_t | Uses actual return (full sample) |
| Dynamic Programming | E_π[R_{t+1} + γ V(S_{t+1})] | Uses full model (expectations) |
| TD(0) | R_{t+1} + γ V(S_{t+1}) | Uses sample transitions and bootstraps |

Bootstrapping

TD methods are bootstrapping because they base their update on an existing estimate (V(S_{t+1})), similar to DP. However, they are sampling because they use a single sample transition (S_t, A_t, R_{t+1}, S_{t+1}), similar to MC.

Backup Diagrams

  • TD(0): A single state node S_t leading to a successor state S_{t+1} via a reward R_{t+1}. The update looks only one step ahead.
  • Monte Carlo: A full path from the current state to the terminal state.
  • Dynamic Programming: A complete tree of all possible transitions from s to all possible successor states s′.

2. TD Error (δ_t)

The TD error is the discrepancy between the current estimate and the “better” estimate available one step later:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)

Significance:

  • It is the error in the estimate made at time t, but only available at time t + 1.
  • If V does not change during an episode, the total MC error can be written as a sum of TD errors:

G_t − V(S_t) = Σ_{k=t}^{T−1} γ^{k−t} δ_k
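This identity is easy to verify numerically. The short episode below (states, rewards, and a fixed value table V) is made up for illustration:

```python
# Numerical check of: G_t - V(S_t) = sum_k gamma^(k-t) * delta_k,
# valid when V is held fixed during the episode. All numbers are illustrative.
gamma = 0.9
V = {"s0": 0.2, "s1": 0.5, "s2": -0.1, "T": 0.0}   # V(terminal) = 0
states = ["s0", "s1", "s2", "T"]
rewards = [1.0, 0.0, 2.0]                          # R_1, R_2, R_3

# Monte Carlo return from t = 0
G0 = sum(gamma**k * r for k, r in enumerate(rewards))

# TD errors: delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k)
deltas = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
          for k in range(len(rewards))]

lhs = G0 - V["s0"]
rhs = sum(gamma**k * d for k, d in enumerate(deltas))
assert abs(lhs - rhs) < 1e-12   # telescoping sum: both sides match
```

The equality holds because consecutive V terms cancel in the telescoping sum; with online TD(0), where V changes mid-episode, it holds only approximately.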

3. Advantages of TD Learning

  1. Learns Online: Updates are made at every step, whereas MC must wait until the end of an episode.
  2. Continuous Tasks: TD works naturally on continuing tasks without episodes, where MC cannot be applied.
  3. Efficiency: TD methods usually converge faster than constant-α MC on stochastic tasks (e.g., the Random Walk example).
  4. No Model Required: Like MC, it does not need a model of the environment’s dynamics p(s′, r | s, a).

4. Batch Updating and Optimality

When experience is limited, we can use batch updating—presenting the same finite sequence of experience repeatedly.

  • MC Optimality: Under batch updating, MC converges to values that minimize the mean square error on the training set.
  • TD(0) Optimality: Under batch updating, TD(0) converges to the certainty-equivalence estimate—the value function that would be correct if the maximum-likelihood model of the Markov process were exactly correct.

Intuition

TD takes advantage of the Markov property. It builds a consistent model of the transitions even if it hasn’t seen a particular terminal return yet, often leading to better generalizations than MC.
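The “you are the predictor” illustration from Sutton & Barto (Example 6.4) makes this concrete. The sketch below computes both batch answers for that example directly; the episode data are reproduced from memory of the book, so treat the numbers as a reconstruction:

```python
# Batch of 8 undiscounted episodes: one "A,0,B,0", six "B,1", one "B,0".
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(s) = average observed return from s.
returns = {"A": [], "B": []}
for ep in episodes:
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[t:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}   # V(A) = 0, V(B) = 0.75

# Certainty-equivalence estimate (what batch TD(0) converges to): build the
# maximum-likelihood model and solve it. In the data, A always transitions
# to B with reward 0, and B terminates with expected reward 6/8.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]    # V(A) = E[r] + V(B) = 0.75 under the ML model
```

MC answers V(A) = 0 because the single episode through A returned 0; TD answers V(A) = 0.75 because A led to B, and B is worth 0.75 under the estimated Markov model.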


5. SARSA: On-policy TD Control

SARSA estimates the action-value function q_π(s, a) for the current policy π.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

The name comes from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).

Pseudocode

Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S <- S'; A <- A'
    until S is terminal
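The pseudocode above can be made runnable with a toy environment. The corridor below (states 0–4, reward −1 per step, terminal state 4) is an illustrative assumption, not from the lecture:

```python
import random

# Tabular SARSA on a tiny corridor: states 0..4, actions 0 (left) / 1 (right),
# reward -1 per step, episode ends at state 4. Environment is illustrative.
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 1.0, 0.1

def step(s, a):
    """Deterministic transition; walls clamp the agent inside [0, GOAL]."""
    return -1.0, min(max(s + (1 if a == 1 else -1), 0), GOAL)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice([0, 1])
    return max((0, 1), key=lambda x: Q[(s, x)])

random.seed(0)
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}  # Q(terminal, :) = 0
for _ in range(500):
    s = 0
    a = eps_greedy(Q, s)
    while s != GOAL:
        r, s2 = step(s, a)
        a2 = eps_greedy(Q, s2)   # A' chosen by the same eps-greedy policy
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
```

After training, the greedy action in state 0 is “right”, and Q(0, right) is near the on-policy value of roughly −4 (slightly worse, because SARSA evaluates the exploring ε-greedy policy itself).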

Backup Diagram: From the pair (S_t, A_t), look ahead one step to (S_{t+1}, A_{t+1}) based on the action actually taken by the current policy.


6. Q-Learning: Off-policy TD Control

Q-learning directly approximates the optimal action-value function q_*, independent of the policy being followed.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

Pseudocode

Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S <- S'
    until S is terminal
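A runnable sketch of this pseudocode, on an assumed toy corridor (states 0–4, reward −1 per step, terminal state 4), illustrates the off-policy property: the learned Q approaches q* even though behavior is ε-greedy:

```python
import random

# Tabular Q-learning on a tiny corridor: states 0..4, actions
# 0 (left) / 1 (right), -1 per step, terminal state 4. Illustrative setup.
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 1.0, 0.1

def step(s, a):
    return -1.0, min(max(s + (1 if a == 1 else -1), 0), GOAL)

random.seed(1)
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}  # Q(terminal, :) = 0
for _ in range(500):
    s = 0
    while s != GOAL:
        if random.random() < EPS:                # eps-greedy behavior policy
            a = random.choice([0, 1])
        else:
            a = max((0, 1), key=lambda x: Q[(s, x)])
        r, s2 = step(s, a)
        # off-policy target: max over next actions, not the action taken next
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2

# optimal values here are q*(s, right) = -(GOAL - s), e.g. q*(0, right) = -4
```

Unlike the SARSA sketch, Q(0, right) converges to the optimal value −4 exactly, regardless of how much the behavior policy explores.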

Backup Diagram: From (S_t, A_t), look ahead to all possible actions in S_{t+1} and take the maximum.


7. Expected SARSA

Expected SARSA uses the expected value of the next state-action pair under the policy, rather than a single sample.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]

  • Pros: Eliminates the variance due to random action selection at S_{t+1}. It can safely use α = 1 in deterministic environments like Cliff Walking.
  • Relation: If the target policy π is greedy, Expected SARSA is identical to Q-learning.
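The expectation over next actions is straightforward to compute for an ε-greedy policy. The helper below and its example numbers are illustrative assumptions:

```python
# Expected SARSA target under an eps-greedy policy (illustrative values).

def expected_sarsa_target(q_next, r, gamma=1.0, eps=0.1):
    """r + gamma * sum_a pi(a|s') * Q(s', a) for an eps-greedy pi over q_next."""
    n = len(q_next)
    best = max(range(n), key=lambda a: q_next[a])
    # eps-greedy probabilities: eps/n for every action, plus 1-eps for the best
    probs = [eps / n + (1 - eps if a == best else 0.0) for a in range(n)]
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

t = expected_sarsa_target([1.0, 3.0], r=-1.0, gamma=1.0, eps=0.1)
# expectation = 0.05 * 1.0 + 0.95 * 3.0 = 2.9, so the target is -1 + 2.9 = 1.9
```

Compare with SARSA, which would use whichever single Q(s′, a′) the sampled action happened to pick; Expected SARSA averages that randomness out.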

8. Cliff Walking Example: SARSA vs Q-Learning

A classic gridworld example where the agent must navigate from Start to Goal.

  • Environment: A bottom row labeled “The Cliff”. Stepping into it gives a reward of −100 and resets the agent to the start.
  • Behavior:
    • Q-Learning (Off-policy) learns the optimal path (hugging the cliff), but occasionally falls off during exploration (ε-greedy).
    • SARSA (On-policy) learns the safer path (roundabout) because it accounts for its own exploration and knows it might fall if it gets too close.

9. Maximization Bias and Double Q-Learning

Maximization Bias

Algorithms involving a max operator (like Q-learning) can develop a positive bias because the maximum of noisy estimates is often greater than the maximum of true values.

Double Q-Learning

Maintains two independent estimates, Q_1 and Q_2. One is used to select the maximizing action, and the other is used to estimate its value.

Update (if Q_1 is updated):

Q_1(S, A) ← Q_1(S, A) + α [R + γ Q_2(S′, argmax_a Q_1(S′, a)) − Q_1(S, A)]
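A single Double Q-learning update step can be sketched as follows; the state and action names and all Q values are illustrative:

```python
import random

# One tabular Double Q-learning update: flip a coin to pick which table to
# update, select the argmax with that table, evaluate it with the other.

def double_q_update(Q1, Q2, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    if random.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2
    else:
        Q_sel, Q_eval = Q2, Q1
    a_star = max(actions, key=lambda x: Q_sel[(s2, x)])   # select with one table
    target = r + gamma * Q_eval[(s2, a_star)]             # evaluate with the other
    Q_sel[(s, a)] += alpha * (target - Q_sel[(s, a)])

random.seed(0)
Q1 = {("s", "a"): 0.0, ("s2", "a"): 1.0, ("s2", "b"): 2.0}
Q2 = {("s", "a"): 0.0, ("s2", "a"): 5.0, ("s2", "b"): 0.0}
double_q_update(Q1, Q2, "s", "a", r=1.0, s2="s2", actions=["a", "b"], alpha=1.0)
```

Because the selecting table’s noise is independent of the evaluating table’s noise, the upward bias of the max operator is removed.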


Figure Summaries (Lecture Slides)

Backup Diagram Comparison

  • MC: Long trace of state-action-reward pairs until termination. No branching.
  • DP: Full branching tree from s to all possible actions, then to all possible successor states.
  • TD(0): Short trace from S_t to S_{t+1}.
  • SARSA: Short trace from (S_t, A_t) to (S_{t+1}, A_{t+1}).
  • Q-Learning: Short trace from (S_t, A_t) to S_{t+1}, then branching to all actions with an arc signifying the max.

Performance Examples

  • Random Walk: TD(0) converges much faster than MC, reaching optimal values around 100 episodes compared to MC’s slower progression.
  • Windy Gridworld: Shows the agent learning to reach the goal faster over time (Steps per episode decreasing). The windy grid shifts the next state upward depending on the column.
  • Cliff Walking: Visualizes Reward per Episode. SARSA has higher average reward because it avoids the cliff, while Q-learning has lower average reward due to falling off during exploration, despite finding the shorter path.

Summary of TD Control

| Algorithm | Type | Target |
| --- | --- | --- |
| SARSA | On-policy | R_{t+1} + γ Q(S_{t+1}, A_{t+1}) |
| Q-Learning | Off-policy | R_{t+1} + γ max_a Q(S_{t+1}, a) |
| Expected SARSA | On/Off-policy | R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) |

References:

  • Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 6.1-6.7.
  • Lecture Slides RL L04.