Temporal Difference Learning
Definition
Temporal Difference (TD) Learning
TD learning combines ideas from Monte Carlo Methods and Dynamic Programming. Like MC, it learns from experience without a model. Like DP, it updates estimates based on other estimates (bootstrapping) without waiting for the end of an episode.
TD(0) — The Core Update
TD(0) Update Rule

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$$

The bracketed quantity $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error $\delta_t$.
The Key Idea
Instead of waiting for the actual return $G_t$ (as MC does), TD uses the bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$ and updates immediately after a single transition, with no need to wait for the episode to end.
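As an illustrative sketch, here is tabular TD(0) prediction on a small random-walk chain (the 5-state environment, step size, and episode count are assumptions for the example, not from the text):

```python
import random

random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.0] * 7  # states 0..6; states 0 and 6 are terminal (value stays 0)

for _ in range(5000):
    s = 3                                     # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))   # random-walk policy
        r = 1.0 if s_next == 6 else 0.0       # +1 only for terminating on the right
        # TD(0): bootstrap from the current estimate of the next state
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# Under this policy the true values of states 1..5 are 1/6, 2/6, ..., 5/6,
# so V[1..5] should end up near those numbers.
```

Note that each update happens mid-episode, using only one observed transition.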
Comparison: TD vs MC vs DP
| Property | DP | MC | TD |
|---|---|---|---|
| Requires model? | ✅ | ❌ | ❌ |
| Bootstraps? | ✅ | ❌ | ✅ |
| Learns from experience? | ❌ | ✅ | ✅ |
| Requires complete episodes? | N/A | ✅ | ❌ |
| Online (step-by-step)? | N/A | ❌ | ✅ |
The Best of Both Worlds
TD = sample-based (like MC) + bootstrapping (like DP). It doesn’t need a model AND doesn’t need to wait for episode termination.
TD for Control
SARSA — On-Policy TD Control
SARSA Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$$
The name comes from the quintuple used in the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
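A minimal SARSA sketch on a toy chain task (the environment, ε-greedy parameters, and reward scheme are illustrative assumptions, not from the text):

```python
import random

random.seed(0)
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

def eps_greedy(s):
    # Behavior AND target policy (on-policy): ε-greedy w.r.t. current Q
    if random.random() < eps:
        return random.randrange(2)   # actions: 0 = left, 1 = right
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(2000):
    s = 0
    a = eps_greedy(s)
    while True:
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        if s_next == 4:
            Q[s][a] += alpha * (r - Q[s][a])   # terminal next state: no bootstrap
            break
        a_next = eps_greedy(s_next)            # on-policy: sample the actual A_{t+1}
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next
```

The key on-policy detail is that the update bootstraps from `Q[s_next][a_next]`, where `a_next` is the action the agent will actually take next.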
Q-Learning — Off-Policy TD Control
Q-Learning Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$
Uses $\max_a Q(S_{t+1}, a)$ instead of following the actual next action $A_{t+1}$: it learns about the greedy (optimal) policy regardless of what action the behavior policy takes.
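A sketch of tabular Q-learning on the same style of toy chain. To make the off-policy property vivid, the behavior policy here is uniform random (an illustrative choice, not from the text), yet the learned values approach those of the greedy policy:

```python
import random

random.seed(0)
alpha, gamma = 0.1, 0.9
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

for _ in range(3000):
    s = 0
    while s != 4:
        a = random.randrange(2)                 # behavior: uniform random (off-policy)
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        bootstrap = 0.0 if s_next == 4 else max(Q[s_next])
        # Target uses max over next actions, not the action actually taken
        Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
        s = s_next

# Q[s][1] should approach the optimal values gamma**(3-s): ~0.729, 0.81, 0.9, 1.0
```

Even though the agent never acts greedily, the `max` in the target makes it evaluate the greedy policy.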
Expected SARSA
Expected SARSA Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$
Takes the expectation over next actions under the policy $\pi$ instead of sampling a single $A_{t+1}$. Lower variance than SARSA.
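Expected SARSA can be sketched by replacing the sampled next-action value with an expectation under the ε-greedy policy (the chain environment and parameters are again illustrative assumptions):

```python
import random

random.seed(0)
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4; state 4 is terminal (+1 on entry)

def eps_greedy_probs(s):
    # ε-greedy distribution π(a | s) over the two actions
    greedy = 0 if Q[s][0] > Q[s][1] else 1
    p = [eps / 2, eps / 2]
    p[greedy] += 1 - eps
    return p

for _ in range(2000):
    s = 0
    while s != 4:
        a = random.choices((0, 1), weights=eps_greedy_probs(s))[0]
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        if s_next == 4:
            target = r                         # terminal next state: no bootstrap
        else:
            # Expectation over A_{t+1} under π instead of a single sample
            p_next = eps_greedy_probs(s_next)
            target = r + gamma * sum(p * q for p, q in zip(p_next, Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```

The only change from SARSA is the target: averaging over π's action probabilities removes the sampling noise from choosing one $A_{t+1}$.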
Backup Diagrams
TD(0):
(S_t)
|
[A_t] ← single sampled action
|
(S_{t+1}) ← single sampled next state, uses V(S_{t+1})
Samples one step, bootstraps from the estimate. Contrast with MC (samples to end) and DP (considers all branches).
Key Properties
- Biased but consistent: TD targets are biased (use estimates), but converge to correct values
- Lower variance than MC: Because it doesn’t use the full noisy return
- Can learn online: Updates after every step, no need to wait for episode end
- Works for continuing tasks: Unlike MC which needs episode termination
- TD(0) converges to $v_\pi$ under appropriate step-size conditions (tabular case)
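The lower-variance claim above can be checked empirically: compare the spread of one-step TD targets against full MC returns from the center of a small random walk, using the (known) true values for bootstrapping. This is an illustrative experiment, not from the text:

```python
import random
import statistics

random.seed(0)
# True state values of the 5-state random walk (gamma = 1):
# V(s) = s/6 for interior states 1..5; terminals 0 and 6 have value 0.
V_true = {s: s / 6 for s in range(1, 6)}

def mc_return(start=3):
    # MC target: the full return (1 if we terminate right, else 0)
    s = start
    while s not in (0, 6):
        s += random.choice((-1, 1))
    return 1.0 if s == 6 else 0.0

def td_target(start=3):
    # TD target: R + V(S') after a single transition, bootstrapping
    # from the true values for a clean comparison
    s_next = start + random.choice((-1, 1))
    r = 1.0 if s_next == 6 else 0.0
    return r + V_true.get(s_next, 0.0)

mc = [mc_return() for _ in range(10000)]
td = [td_target() for _ in range(10000)]
print(statistics.variance(mc), statistics.variance(td))  # TD variance is much smaller
```

Both targets are centered on the true value $V(3) = 1/2$, but the TD target only carries the noise of one transition rather than of the whole trajectory.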
Connections
- Combines: Monte Carlo Methods (sampling) + Dynamic Programming (bootstrapping)
- Uses: TD Error, Bootstrapping
- Control algorithms: SARSA, Q-Learning, Expected SARSA
- Extended by: Semi-Gradient Methods, Function Approximation
- Deep version: Deep Q-Network (DQN)