RL Lecture 4: Temporal Difference Learning

Temporal-Difference (TD) learning is a central and novel idea in reinforcement learning. It combines Monte Carlo (MC) ideas with Dynamic Programming (DP) ideas.

  • Like Monte Carlo: TD methods can learn directly from raw experience without a model of the environment’s dynamics (model-free).
  • Like DP: TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

1. TD Prediction (TD(0))

The goal is to solve the prediction problem: estimating the value function v_π for a given policy π.

TD(0) Update Rule

The simplest TD method, TD(0) or one-step TD, makes the following update after transitioning from S_t to S_{t+1} and receiving reward R_{t+1}:

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

  • Target: R_{t+1} + γ V(S_{t+1}) (the TD target)
  • Error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t) (the TD error)
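As a minimal sketch, the tabular TD(0) update can be written as a single function. The value table `V`, the step size, and the example transition below are illustrative assumptions, not from the lecture:

```python
# Tabular TD(0) update for a single observed transition (s, r, s_next).
# V, alpha, gamma, and the example transition are illustrative.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[s]; returns the TD error delta_t."""
    td_target = r + gamma * V[s_next]   # bootstrapped target
    td_error = td_target - V[s]         # delta_t
    V[s] += alpha * td_error
    return td_error

V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=1.0, s_next="B")
# target = 1.0 + 0.9 * 0.5 = 1.45, so delta = 1.45 and V["A"] becomes 0.145
```

Note that only V[s] changes; V[s_next] is read but not updated, which is exactly the bootstrapping step.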

Comparison: MC vs DP vs TD

| Method | Target (equation) | Focus |
| --- | --- | --- |
| Monte Carlo | G_t | Uses actual return (full sample) |
| Dynamic Programming | E_π[R_{t+1} + γ V(S_{t+1})] | Uses full model (expectations) |
| TD(0) | R_{t+1} + γ V(S_{t+1}) | Uses sample transitions and bootstraps |

Bootstrapping

TD methods are bootstrapping because they base their update on an existing estimate (V(S_{t+1})), similar to DP. However, they are sampling because they use a single sample transition (S_t, A_t, R_{t+1}, S_{t+1}), similar to MC.

Backup Diagrams

  • TD(0): A single state node S_t leading to a successor state S_{t+1} via a reward R_{t+1}. The update looks only one step ahead.
  • Monte Carlo: A full path from the current state to the terminal state.
  • Dynamic Programming: A complete tree of all possible transitions from s to all possible successor states s′.

2. TD Error (δ_t)

The TD error is the discrepancy between the current estimate and the “better” estimate available one step later:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)

Significance:

  • It is the error in the estimate made at time t, but only available at time t + 1.
  • If V does not change during an episode, the total MC error can be written as a sum of TD errors:

G_t − V(S_t) = Σ_{k=t}^{T−1} γ^{k−t} δ_k
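This identity is easy to verify numerically. The short episode below (states, rewards, and a fixed value table V) is made up for illustration:

```python
# Numerical check of: G_t - V(S_t) = sum_k gamma^(k-t) * delta_k,
# valid when V is held fixed during the episode. All numbers are illustrative.
gamma = 0.9
V = {"s0": 0.2, "s1": 0.5, "s2": -0.1, "T": 0.0}   # V(terminal) = 0
states = ["s0", "s1", "s2", "T"]
rewards = [1.0, 0.0, 2.0]                          # R_1, R_2, R_3

# Monte Carlo return from t = 0
G0 = sum(gamma**k * r for k, r in enumerate(rewards))

# TD errors: delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k)
deltas = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
          for k in range(len(rewards))]

lhs = G0 - V["s0"]
rhs = sum(gamma**k * d for k, d in enumerate(deltas))
assert abs(lhs - rhs) < 1e-12   # telescoping sum: both sides match
```

The equality holds because consecutive V terms cancel in the telescoping sum; with online TD(0), where V changes mid-episode, it holds only approximately.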

3. Advantages of TD Learning

  1. Learns Online: Updates are made at every step, whereas MC must wait until the end of an episode.
  2. Continuous Tasks: TD works naturally on continuing tasks without episodes, where MC cannot be applied.
  3. Efficiency: TD methods usually converge faster than constant-α MC on stochastic tasks (e.g., the Random Walk example).
  4. No Model Required: Like MC, it does not need a model of the environment’s dynamics p(s′, r | s, a).

4. Batch Updating and Optimality

When experience is limited, we can use batch updating—presenting the same finite sequence of experience repeatedly.

  • MC Optimality: Under batch updating, MC converges to values that minimize the mean square error on the training set.
  • TD(0) Optimality: Under batch updating, TD(0) converges to the certainty-equivalence estimate—the value function that would be correct if the maximum-likelihood model of the Markov process were exactly correct.

Intuition

TD takes advantage of the Markov property. It builds a consistent model of the transitions even if it hasn’t seen a particular terminal return yet, often leading to better generalizations than MC.
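The “you are the predictor” illustration from Sutton & Barto (Example 6.4) makes this concrete. The sketch below computes both batch answers for that example directly; the episode data are reproduced from memory of the book, so treat the numbers as a reconstruction:

```python
# Batch of 8 undiscounted episodes: one "A,0,B,0", six "B,1", one "B,0".
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(s) = average observed return from s.
returns = {"A": [], "B": []}
for ep in episodes:
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[t:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}   # V(A) = 0, V(B) = 0.75

# Certainty-equivalence estimate (what batch TD(0) converges to): build the
# maximum-likelihood model and solve it. In the data, A always transitions
# to B with reward 0, and B terminates with expected reward 6/8.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]    # V(A) = E[r] + V(B) = 0.75 under the ML model
```

MC answers V(A) = 0 because the single episode through A returned 0; TD answers V(A) = 0.75 because A led to B, and B is worth 0.75 under the estimated Markov model.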


5. SARSA: On-policy TD Control

SARSA estimates the action-value function q_π(s, a) for the current policy π.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

The name comes from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).

Pseudocode

Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S <- S'; A <- A'
    until S is terminal
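The pseudocode above can be made runnable with a toy environment. The corridor below (states 0–4, reward −1 per step, terminal state 4) is an illustrative assumption, not from the lecture:

```python
import random

# Tabular SARSA on a tiny corridor: states 0..4, actions 0 (left) / 1 (right),
# reward -1 per step, episode ends at state 4. Environment is illustrative.
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 1.0, 0.1

def step(s, a):
    """Deterministic transition; walls clamp the agent inside [0, GOAL]."""
    return -1.0, min(max(s + (1 if a == 1 else -1), 0), GOAL)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice([0, 1])
    return max((0, 1), key=lambda x: Q[(s, x)])

random.seed(0)
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}  # Q(terminal, :) = 0
for _ in range(500):
    s = 0
    a = eps_greedy(Q, s)
    while s != GOAL:
        r, s2 = step(s, a)
        a2 = eps_greedy(Q, s2)   # A' chosen by the same eps-greedy policy
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
```

After training, the greedy action in state 0 is “right”, and Q(0, right) is near the on-policy value of roughly −4 (slightly worse, because SARSA evaluates the exploring ε-greedy policy itself).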

Backup Diagram: From the pair (S_t, A_t), look ahead one step to (S_{t+1}, A_{t+1}) based on the action actually taken by the current policy.


6. Q-Learning: Off-policy TD Control

Q-learning directly approximates the optimal action-value function q_*, independent of the policy being followed.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

Pseudocode

Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S <- S'
    until S is terminal
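A runnable sketch of this pseudocode, on an assumed toy corridor (states 0–4, reward −1 per step, terminal state 4), illustrates the off-policy property: the learned Q approaches q* even though behavior is ε-greedy:

```python
import random

# Tabular Q-learning on a tiny corridor: states 0..4, actions
# 0 (left) / 1 (right), -1 per step, terminal state 4. Illustrative setup.
GOAL, ALPHA, GAMMA, EPS = 4, 0.5, 1.0, 0.1

def step(s, a):
    return -1.0, min(max(s + (1 if a == 1 else -1), 0), GOAL)

random.seed(1)
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}  # Q(terminal, :) = 0
for _ in range(500):
    s = 0
    while s != GOAL:
        if random.random() < EPS:                # eps-greedy behavior policy
            a = random.choice([0, 1])
        else:
            a = max((0, 1), key=lambda x: Q[(s, x)])
        r, s2 = step(s, a)
        # off-policy target: max over next actions, not the action taken next
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2

# optimal values here are q*(s, right) = -(GOAL - s), e.g. q*(0, right) = -4
```

Unlike the SARSA sketch, Q(0, right) converges to the optimal value −4 exactly, regardless of how much the behavior policy explores.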

Backup Diagram: From (S_t, A_t), look ahead to all possible actions in S_{t+1} and take the maximum.


7. Expected SARSA

Expected SARSA uses the expected value of the next state-action pair under the policy, rather than a single sample.

Update Rule

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]

  • Pros: Eliminates the variance due to random action selection at S_{t+1}. It can safely use α = 1 in deterministic environments like Cliff Walking.
  • Relation: If the target policy π is greedy, Expected SARSA is identical to Q-learning.
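The expectation over next actions is straightforward to compute for an ε-greedy policy. The helper below and its example numbers are illustrative assumptions:

```python
# Expected SARSA target under an eps-greedy policy (illustrative values).

def expected_sarsa_target(q_next, r, gamma=1.0, eps=0.1):
    """r + gamma * sum_a pi(a|s') * Q(s', a) for an eps-greedy pi over q_next."""
    n = len(q_next)
    best = max(range(n), key=lambda a: q_next[a])
    # eps-greedy probabilities: eps/n for every action, plus 1-eps for the best
    probs = [eps / n + (1 - eps if a == best else 0.0) for a in range(n)]
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

t = expected_sarsa_target([1.0, 3.0], r=-1.0, gamma=1.0, eps=0.1)
# expectation = 0.05 * 1.0 + 0.95 * 3.0 = 2.9, so the target is -1 + 2.9 = 1.9
```

Compare with SARSA, which would use whichever single Q(s′, a′) the sampled action happened to pick; Expected SARSA averages that randomness out.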

8. Cliff Walking Example: SARSA vs Q-Learning

A classic gridworld example where the agent must navigate from Start to Goal.

  • Environment: A bottom row labeled “The Cliff”. Stepping into it gives a reward of −100 and resets the agent to the start.
  • Behavior:
    • Q-Learning (Off-policy) learns the optimal path (hugging the cliff), but occasionally falls off during exploration (ε-greedy).
    • SARSA (On-policy) learns the safer path (roundabout) because it accounts for its own exploration and knows it might fall if it gets too close.

9. Maximization Bias and Double Q-Learning

Maximization Bias

Algorithms involving a max operator (like Q-learning) can develop a positive bias because the maximum of noisy estimates is often greater than the maximum of true values.

Double Q-Learning

Maintains two independent estimates, Q_1 and Q_2. One is used to select the maximizing action, and the other is used to estimate its value.

Update (if Q_1 is updated):

Q_1(S, A) ← Q_1(S, A) + α [R + γ Q_2(S′, argmax_a Q_1(S′, a)) − Q_1(S, A)]
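A single Double Q-learning update step can be sketched as follows; the state and action names and all Q values are illustrative:

```python
import random

# One tabular Double Q-learning update: flip a coin to pick which table to
# update, select the argmax with that table, evaluate it with the other.

def double_q_update(Q1, Q2, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    if random.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2
    else:
        Q_sel, Q_eval = Q2, Q1
    a_star = max(actions, key=lambda x: Q_sel[(s2, x)])   # select with one table
    target = r + gamma * Q_eval[(s2, a_star)]             # evaluate with the other
    Q_sel[(s, a)] += alpha * (target - Q_sel[(s, a)])

random.seed(0)
Q1 = {("s", "a"): 0.0, ("s2", "a"): 1.0, ("s2", "b"): 2.0}
Q2 = {("s", "a"): 0.0, ("s2", "a"): 5.0, ("s2", "b"): 0.0}
double_q_update(Q1, Q2, "s", "a", r=1.0, s2="s2", actions=["a", "b"], alpha=1.0)
```

Because the selecting table’s noise is independent of the evaluating table’s noise, the upward bias of the max operator is removed.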


Figure Summaries (Lecture Slides)

Backup Diagram Comparison

  • MC: Long trace of state-action-reward pairs until termination. No branching.
  • DP: Full branching tree from s to all possible actions, then to all possible successor states.
  • TD(0): Short trace from S_t to S_{t+1}.
  • SARSA: Short trace from (S_t, A_t) to (S_{t+1}, A_{t+1}).
  • Q-Learning: Short trace from (S_t, A_t) to S_{t+1}, then branching to all actions with an arc signifying the max.

Performance Examples

  • Random Walk: TD(0) converges much faster than MC, reaching optimal values around 100 episodes compared to MC’s slower progression.
  • Windy Gridworld: Shows the agent learning to reach the goal faster over time (Steps per episode decreasing). The windy grid shifts the next state upward depending on the column.
  • Cliff Walking: Visualizes Reward per Episode. SARSA has higher average reward because it avoids the cliff, while Q-learning has lower average reward due to falling off during exploration, despite finding the shorter path.

Summary of TD Control

| Algorithm | Type | Target |
| --- | --- | --- |
| SARSA | On-policy | R_{t+1} + γ Q(S_{t+1}, A_{t+1}) |
| Q-Learning | Off-policy | R_{t+1} + γ max_a Q(S_{t+1}, a) |
| Expected SARSA | On/Off-policy | R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) |

References:

  • Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 6.1-6.7.
  • Lecture Slides RL L04.