Temporal Difference Learning
Definition
Temporal Difference (TD) Learning
TD learning combines ideas from Monte Carlo Methods and Dynamic Programming. Like MC, it learns from experience without a model. Like DP, it updates estimates based on other estimates (bootstrapping) without waiting for the end of an episode.
TD(0) — The Core Update
TD(0) Update Rule

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$$

The bracketed quantity $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error $\delta_t$.
The Key Idea
Instead of waiting for the actual return $G_t$ (as MC does), TD uses the bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$ and updates immediately after a single transition, with no need to wait for the episode to end.
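As an illustrative sketch, here is tabular TD(0) prediction on a small random-walk chain (the 5-state environment, step size, and episode count are assumptions for the example, not from the text):

```python
import random

random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.0] * 7  # states 0..6; states 0 and 6 are terminal (value stays 0)

for _ in range(5000):
    s = 3                                     # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))   # random-walk policy
        r = 1.0 if s_next == 6 else 0.0       # +1 only for terminating on the right
        # TD(0): bootstrap from the current estimate of the next state
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# Under this policy the true values of states 1..5 are 1/6, 2/6, ..., 5/6,
# so V[1..5] should end up near those numbers.
```

Note that each update happens mid-episode, using only one observed transition.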
Comparison: TD vs MC vs DP
| Property | DP | MC | TD |
|---|---|---|---|
| Requires model? | ✅ | ❌ | ❌ |
| Bootstraps? | ✅ | ❌ | ✅ |
| Learns from experience? | ❌ | ✅ | ✅ |
| Requires complete episodes? | N/A | ✅ | ❌ |
| Online (step-by-step)? | N/A | ❌ | ✅ |
The Best of Both Worlds
TD = sample-based (like MC) + bootstrapping (like DP). It doesn’t need a model AND doesn’t need to wait for episode termination.
TD for Control
SARSA — On-Policy TD Control
SARSA Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$$
The name comes from the quintuple used in the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
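A minimal SARSA sketch on a toy chain task (the environment, ε-greedy parameters, and reward scheme are illustrative assumptions, not from the text):

```python
import random

random.seed(0)
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

def eps_greedy(s):
    # Behavior AND target policy (on-policy): ε-greedy w.r.t. current Q
    if random.random() < eps:
        return random.randrange(2)   # actions: 0 = left, 1 = right
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(2000):
    s = 0
    a = eps_greedy(s)
    while True:
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        if s_next == 4:
            Q[s][a] += alpha * (r - Q[s][a])   # terminal next state: no bootstrap
            break
        a_next = eps_greedy(s_next)            # on-policy: sample the actual A_{t+1}
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next
```

The key on-policy detail is that the update bootstraps from `Q[s_next][a_next]`, where `a_next` is the action the agent will actually take next.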
Q-Learning — Off-Policy TD Control
Q-Learning Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$
Uses $\max_a Q(S_{t+1}, a)$ instead of following the actual next action $A_{t+1}$: it learns about the greedy (optimal) policy regardless of what action the behavior policy takes.
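A sketch of tabular Q-learning on the same style of toy chain. To make the off-policy property vivid, the behavior policy here is uniform random (an illustrative choice, not from the text), yet the learned values approach those of the greedy policy:

```python
import random

random.seed(0)
alpha, gamma = 0.1, 0.9
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

for _ in range(3000):
    s = 0
    while s != 4:
        a = random.randrange(2)                 # behavior: uniform random (off-policy)
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        bootstrap = 0.0 if s_next == 4 else max(Q[s_next])
        # Target uses max over next actions, not the action actually taken
        Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
        s = s_next

# Q[s][1] should approach the optimal values gamma**(3-s): ~0.729, 0.81, 0.9, 1.0
```

Even though the agent never acts greedily, the `max` in the target makes it evaluate the greedy policy.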
Expected SARSA
Expected SARSA Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$
Takes the expectation over next actions under the policy $\pi$ instead of sampling a single $A_{t+1}$. Lower variance than SARSA.
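Expected SARSA can be sketched by replacing the sampled next-action value with an expectation under the ε-greedy policy (the chain environment and parameters are again illustrative assumptions):

```python
import random

random.seed(0)
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

def eps_greedy_probs(s):
    # ε-greedy distribution π(a | s) over the two actions
    greedy = 0 if Q[s][0] > Q[s][1] else 1
    p = [eps / 2, eps / 2]
    p[greedy] += 1 - eps
    return p

for _ in range(2000):
    s = 0
    while s != 4:
        a = random.choices((0, 1), weights=eps_greedy_probs(s))[0]
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        if s_next == 4:
            target = r                         # terminal next state: no bootstrap
        else:
            # Expectation over A_{t+1} under π instead of a single sample
            p_next = eps_greedy_probs(s_next)
            target = r + gamma * sum(p * q for p, q in zip(p_next, Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```

The only change from SARSA is the target: averaging over π's action probabilities removes the sampling noise from choosing one $A_{t+1}$.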
Backup Diagrams
TD(0):
(S_t)
|
[A_t] ← single sampled action
|
(S_{t+1}) ← single sampled next state, uses V(S_{t+1})
Samples one step, bootstraps from the estimate. Contrast with MC (samples to end) and DP (considers all branches).
Key Properties
- Biased but consistent: TD targets are biased (use estimates), but converge to correct values
- Lower variance than MC: Because it doesn’t use the full noisy return
- Can learn online: Updates after every step, no need to wait for episode end
- Works for continuing tasks: Unlike MC which needs episode termination
- TD(0) converges to $v_\pi$ under appropriate step-size conditions (tabular case)
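The lower-variance claim above can be checked empirically: compare the spread of one-step TD targets against full MC returns from the center of a small random walk, using the (known) true values for bootstrapping. This is an illustrative experiment, not from the text:

```python
import random
import statistics

random.seed(0)
# True state values of the 5-state random walk (gamma = 1):
# V(s) = s/6 for interior states 1..5; terminals 0 and 6 have value 0.
V_true = {s: s / 6 for s in range(1, 6)}

def mc_return(start=3):
    # MC target: the full return (1 if we terminate right, else 0)
    s = start
    while s not in (0, 6):
        s += random.choice((-1, 1))
    return 1.0 if s == 6 else 0.0

def td_target(start=3):
    # TD target: R + V(S') after a single transition, bootstrapping
    # from the true values for a clean comparison
    s_next = start + random.choice((-1, 1))
    r = 1.0 if s_next == 6 else 0.0
    return r + V_true.get(s_next, 0.0)

mc = [mc_return() for _ in range(10000)]
td = [td_target() for _ in range(10000)]
print(statistics.variance(mc), statistics.variance(td))  # TD variance is much smaller
```

Both targets are centered on the true value $V(3) = 1/2$, but the TD target only carries the noise of one transition rather than of the whole trajectory.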
Connections
- Combines: Monte Carlo Methods (sampling) + Dynamic Programming (bootstrapping)
- Uses: TD Error, Bootstrapping
- Control algorithms: SARSA, Q-Learning, Expected SARSA
- Extended by: Semi-Gradient Methods, Function Approximation
- Deep version: Deep Q-Network (DQN)