RL Lecture 4: Temporal Difference Learning
Temporal-Difference (TD) learning is a central and novel idea in reinforcement learning. It combines Monte Carlo (MC) ideas with Dynamic Programming (DP) ideas.
- Like Monte Carlo: TD methods can learn directly from raw experience without a model of the environment’s dynamics (model-free).
- Like DP: TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
1. TD Prediction (TD(0))
The goal is to solve the prediction problem: estimating the value function $v_\pi$ for a given policy $\pi$.
TD(0) Update Rule
The simplest TD method, TD(0) or one-step TD, makes the following update after transitioning from $S_t$ to $S_{t+1}$ and receiving reward $R_{t+1}$:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

- Target: $R_{t+1} + \gamma V(S_{t+1})$ (the TD target)
- Error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ (the TD error)
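The update rule above can be sketched on a small random-walk prediction task. This is a minimal illustration with assumed parameters (a 5-state walk with terminals at both ends, reward +1 only at the right terminal), not an example from the slides:

```python
import random

# TD(0) prediction sketch (assumed setup): states 1..5 with terminals at
# 0 and 6; the agent starts at 3 and moves left/right at random.
# Only reaching state 6 gives reward +1, so the true values of states
# 1..5 under the random policy are 1/6, 2/6, ..., 5/6.
random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.0] * 7  # V[0] and V[6] stay 0 (terminal states)

for episode in range(1000):
    s = 3
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:6]])  # approaches [0.17, 0.33, 0.5, 0.67, 0.83]
```

Note the update happens inside the episode loop, at every step, rather than after the episode ends as in MC.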
Comparison: MC vs DP vs TD
| Method | Target | Equation | Focus |
|---|---|---|---|
| Monte Carlo | $G_t$ | $V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$ | Uses actual return (full sample) |
| Dynamic Programming | $\mathbb{E}_\pi\!\left[ R_{t+1} + \gamma V(S_{t+1}) \right]$ | $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]$ | Uses full model (expectations) |
| TD(0) | $R_{t+1} + \gamma V(S_{t+1})$ | $V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$ | Uses sample transitions and bootstraps |
Bootstrapping
TD methods are bootstrapping methods because they base their update on an existing estimate ($V(S_{t+1})$), similar to DP. However, they also sample, because they use a single sample transition $(S_t, R_{t+1}, S_{t+1})$, similar to MC.
Backup Diagrams
- TD(0): A single state node $S_t$ leading to a successor state $S_{t+1}$ via a reward $R_{t+1}$. The update looks only one step ahead.
- Monte Carlo: A full path from the current state $S_t$ to the terminal state.
- Dynamic Programming: A complete tree of all possible transitions from $s$ to all possible successors $s'$.
2. TD Error ($\delta_t$)
The TD error is the discrepancy between the current estimate and the "better" estimate available one step later:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Significance:
- It is the error in the estimate made at time $t$, but only available at $t+1$.
- If $V$ does not change during an episode, the total MC error can be written as a sum of discounted TD errors:

$$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$$
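The identity above can be checked numerically. This is a toy verification with made-up rewards and values (my own numbers, not from the lecture), holding $V$ fixed as the identity requires:

```python
# Numerical check (assumed toy data): with V held fixed during the episode,
# the MC error G_t - V(S_t) equals the discounted sum of TD errors.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]      # R_1, R_2, R_3 for a 3-step episode
values = [0.5, 0.2, 0.7, 0.0]  # V(S_0)..V(S_3); V(terminal) = 0

# Return from t=0: G_0 = R_1 + gamma*R_2 + gamma^2*R_3
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
mc_error = G0 - values[0]

# TD errors: delta_t = R_{t+1} + gamma*V(S_{t+1}) - V(S_t)
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]
td_sum = sum(gamma**t * d for t, d in enumerate(deltas))

print(round(mc_error, 6), round(td_sum, 6))  # the two quantities match
```

If $V$ is updated mid-episode (as in online TD(0)), the identity holds only approximately, with the approximation improving as the step size shrinks.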
3. Advantages of TD Learning
- Learns Online: Updates are made at every step, whereas MC must wait until the end of an episode.
- Continuous Tasks: TD works naturally on continuing tasks without episodes, where MC cannot be applied.
- Efficiency: TD methods usually converge faster than constant-$\alpha$ MC on stochastic tasks (e.g., the Random Walk example).
- No Model Required: Like MC, it does not need the environment's dynamics $p(s', r \mid s, a)$.
4. Batch Updating and Optimality
When experience is limited, we can use batch updating—presenting the same finite sequence of experience repeatedly.
- MC Optimality: Under batch updating, MC converges to values that minimize the mean square error on the training set.
- TD(0) Optimality: Under batch updating, TD(0) converges to the certainty-equivalence estimate—the value function that would be correct if the maximum-likelihood model of the Markov process were exactly correct.
Intuition
TD takes advantage of the Markov property. It builds a consistent model of the transitions even if it hasn’t seen a particular terminal return yet, often leading to better generalizations than MC.
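The batch-updating distinction can be made concrete with the classic two-state A/B batch example from Sutton & Barto (Chapter 6): eight episodes, one "A, 0, B, 0", six "B, 1", and one "B, 0". The sketch below computes the batch-MC values and the certainty-equivalence values that batch TD(0) converges to:

```python
# Certainty-equivalence sketch for the A/B batch example: one episode
# "A,0,B,0", six episodes "B,1", one episode "B,0" (undiscounted).
episodes = ([[("A", 0, "B"), ("B", 0, None)]]
            + [[("B", 1, None)]] * 6
            + [[("B", 0, None)]])

# Batch MC: V(s) = average return actually observed from s.
returns = {"A": [], "B": []}
for ep in episodes:
    for t, (s, _, _) in enumerate(ep):
        returns[s].append(sum(r for _, r, _ in ep[t:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Certainty equivalence: solve the maximum-likelihood Markov model exactly.
# From A, 100% of observed transitions go to B with reward 0, so V(A) = V(B).
# From B, the ML expected reward is 6/8 = 0.75, so V(B) = 0.75.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]

print(V_mc, V_ce)  # MC: V(A)=0.0; certainty equivalence: V(A)=0.75
```

MC answers "what return did I actually see from A?" (zero), while TD answers "what return should I expect from A, given the best Markov model of the data?" (0.75) — which generalizes better if the process really is Markov.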
5. SARSA: On-policy TD Control
SARSA estimates the action-value function $q_\pi(s, a)$ for the current policy $\pi$.
Update Rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

The name comes from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
Pseudocode
```
Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S <- S'; A <- A'
    until S is terminal
```
Backup Diagram: From $(S_t, A_t)$, look ahead to $(S_{t+1}, A_{t+1})$ based on the action actually taken by the current policy.
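The pseudocode can be rendered as runnable Python. The environment here is a hypothetical 1-D corridor of my own (states 0..4, reward −1 per step, terminal at state 4), chosen only to keep the sketch short:

```python
import random

# SARSA sketch on a toy corridor (assumed environment, not from the slides):
# actions 0 = left, 1 = right; reward -1 per step; episode ends at state 4.
random.seed(1)
N_STATES, ACTIONS = 5, (0, 1)
alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return -1.0, s2

def eps_greedy(s):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(500):
    s = 0
    a = eps_greedy(s)
    while s != N_STATES - 1:
        r, s2 = step(s, a)
        a2 = eps_greedy(s2)  # the action actually taken next (on-policy)
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2

print(Q[(0, 1)], Q[(0, 0)])  # moving right should look better from state 0
```

The defining on-policy detail is that `a2` is sampled from the behavior policy and then used both in the target and as the next executed action.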
6. Q-Learning: Off-policy TD Control
Q-learning directly approximates the optimal action-value function $q_*$, independent of the policy being followed.
Update Rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
Pseudocode
```
Initialize Q(s, a) arbitrarily, Q(terminal, :) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S <- S'
    until S is terminal
```
Backup Diagram: From $(S_t, A_t)$, look ahead to all possible actions in $S_{t+1}$ and take the maximum.
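For contrast with SARSA, here is the same hypothetical corridor environment (my own toy setup: states 0..4, reward −1 per step, terminal at state 4) solved with Q-learning:

```python
import random

# Q-learning sketch on a toy corridor (assumed environment): actions
# 0 = left, 1 = right; reward -1 per step; terminal at state 4.
random.seed(2)
N, ACTIONS = 5, (0, 1)
alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for _ in range(500):
    s = 0
    while s != N - 1:
        # behavior policy: eps-greedy over Q
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        r = -1.0
        # off-policy target: max over next actions, NOT the action taken next
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# greedy values approach the optimal steps-to-go: -4, -3, -2, -1
print([round(max(Q[(s, a)] for a in ACTIONS), 2) for s in range(N - 1)])
```

Unlike SARSA, no second action needs to be chosen before updating: the target takes the max over actions, so the learned values track the greedy policy regardless of how the agent actually explores.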
7. Expected SARSA
Expected SARSA uses the expected value of the next state-action pair under the policy, rather than a single sample.
Update Rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

- Pros: Reduces the variance caused by random action selection at $S_{t+1}$. It can safely use $\alpha = 1$ in deterministic environments like Cliff Walking.
- Relation: If the target policy is greedy, Expected SARSA is identical to Q-learning.
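Computing the Expected SARSA target amounts to one expectation over the policy's action probabilities. A small sketch with assumed numbers (an $\epsilon$-greedy policy over four made-up action values):

```python
# Expected SARSA target under an eps-greedy policy (assumed toy numbers):
# the sampled Q(S', A') is replaced by the expectation over pi(.|S').
eps, gamma, r = 0.1, 1.0, -1.0
q_next = {"up": 2.0, "down": 0.0, "left": 1.0, "right": 3.0}

n = len(q_next)
greedy = max(q_next, key=q_next.get)
# eps-greedy probabilities: eps/n on every action, plus (1-eps) on the greedy one
pi = {a: eps / n + (1 - eps if a == greedy else 0.0) for a in q_next}

expected_value = sum(pi[a] * q_next[a] for a in q_next)
target = r + gamma * expected_value
print(round(target, 3))  # -> 1.85
```

Setting `eps = 0` makes `pi` put all mass on the greedy action, and the target reduces to the Q-learning target, which is the relation noted above.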
8. Cliff Walking Example: SARSA vs Q-Learning
A classic gridworld example where the agent must navigate from Start to Goal.
- Environment: A bottom row labeled “The Cliff”. Falling in gives a reward of $-100$ and resets the agent to the start.
- Behavior:
- Q-Learning (Off-policy) learns the optimal path (hugging the cliff), but occasionally falls off during exploration ($\epsilon$-greedy).
- SARSA (On-policy) learns the safer path (roundabout) because it accounts for its own exploration and knows it might fall if it gets too close.
9. Maximization Bias and Double Q-Learning
Maximization Bias
Algorithms involving a max operator (like Q-learning) can develop a positive bias because the maximum of noisy estimates is often greater than the maximum of true values.
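The bias is easy to demonstrate numerically. In this toy simulation (my own assumed setup), every action's true value is exactly 0, yet the max over noisy estimates is clearly positive on average:

```python
import random

# Maximization bias demo (assumed setup): 10 actions, all with true value 0.
# Each action's value is estimated by averaging a few noisy samples; taking
# the max over these estimates yields a positive bias.
random.seed(3)
n_actions, n_samples, trials = 10, 5, 2000

total = 0.0
for _ in range(trials):
    estimates = [
        sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        for _ in range(n_actions)
    ]
    total += max(estimates)

avg_max = total / trials
print(round(avg_max, 2))  # well above the true max of 0
```

This is exactly what happens inside the Q-learning target $\max_a Q(S', a)$ when the estimates are still noisy: the max systematically overestimates.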
Double Q-Learning
Maintains two independent estimates, $Q_1$ and $Q_2$. One is used to select the maximizing action, and the other is used to estimate its value.
Update (if $Q_1$ is updated):

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q_2\!\left(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\right) - Q_1(S_t, A_t) \right]$$
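A single Double Q-learning update step can be sketched as follows (assumed tabular setup with hypothetical state/action names; on each step a coin flip decides which table is updated):

```python
import random

# Double Q-learning update sketch (assumed setup): Q1 selects the argmax
# action while Q2 evaluates it, and vice versa, decoupling selection from
# evaluation to remove the maximization bias.
random.seed(4)
alpha, gamma = 0.1, 0.9
actions = ("a0", "a1")
Q1 = {(s, a): 0.0 for s in ("s", "s2") for a in actions}
Q2 = {k: 0.0 for k in Q1}

def double_q_update(s, a, r, s2):
    if random.random() < 0.5:  # update Q1 half the time
        a_star = max(actions, key=lambda x: Q1[(s2, x)])  # select with Q1
        target = r + gamma * Q2[(s2, a_star)]             # evaluate with Q2
        Q1[(s, a)] += alpha * (target - Q1[(s, a)])
    else:                       # symmetric update for Q2
        a_star = max(actions, key=lambda x: Q2[(s2, x)])
        target = r + gamma * Q1[(s2, a_star)]
        Q2[(s, a)] += alpha * (target - Q2[(s, a)])

double_q_update("s", "a0", 1.0, "s2")
print(Q1[("s", "a0")], Q2[("s", "a0")])  # exactly one table moved toward 1.0
```

For action selection while behaving, the two estimates are typically combined (e.g., acting $\epsilon$-greedily with respect to $Q_1 + Q_2$).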
Figure Summaries (Lecture Slides)
Backup Diagram Comparison
- MC: Long trace of state-action-reward pairs until termination. No branching.
- DP: Full branching tree from $s$ to all possible actions, then to all possible successor states.
- TD(0): Short trace from $S_t$ to $S_{t+1}$.
- SARSA: Short trace from $(S_t, A_t)$ to $(S_{t+1}, A_{t+1})$.
- Q-Learning: Short trace from $(S_t, A_t)$ to $S_{t+1}$, then branching to all actions with an arc signifying the max.
Performance Examples
- Random Walk: TD(0) converges much faster than MC, reaching optimal values around 100 episodes compared to MC’s slower progression.
- Windy Gridworld: Shows the agent learning to reach the goal faster over time (Steps per episode decreasing). The windy grid shifts the next state upward depending on the column.
- Cliff Walking: Visualizes Reward per Episode. SARSA has higher average reward because it avoids the cliff, while Q-learning has lower average reward due to falling off during exploration, despite finding the shorter path.
Summary of TD Control
| Algorithm | Type | Target |
|---|---|---|
| SARSA | On-policy | $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ |
| Q-Learning | Off-policy | $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ |
| Expected SARSA | On/Off-policy | $R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)$ |
References:
- Sutton & Barto, Reinforcement Learning: An Introduction, Sections 6.1-6.7.
- Lecture Slides RL L04.