RL-HW03: Homework 3 — TD Learning & Function Approximation
Exam Relevance
Q3 (minimal value error) and Q4 (semi-gradient derivation) are extremely exam-relevant. Understanding the VE objective, weighted least squares, and the “semi” in semi-gradient is core material.
Part 2: Coding Assignment Questions (SARSA vs Q-Learning)
Q2a: Average Returns Comparison (0.25p)
Q: Which algorithm achieves the higher average return of the behavior policy during training? Is this the same phenomenon as in Cliff Walking (Example 6.6)?
Solution
In the Windy Gridworld:
- SARSA typically achieves higher average return during training
- This is the same phenomenon as Cliff Walking: SARSA (on-policy) learns to avoid risky areas that the ε-greedy exploration might stumble into, leading to better average performance of the behavior policy
- Q-learning finds the optimal path but the ε-greedy behavior policy occasionally deviates from it, leading to lower average returns during training
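The difference between the two algorithms comes down to a single term in the update rule. A minimal sketch of both updates for a tabular `Q` array (the function names and the 2-state toy values are mine, not from the assignment):

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Sample an action epsilon-greedily from Q[s]."""
    if rng.random() < eps:
        return int(rng.integers(len(Q[s])))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from the action a2 actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```

SARSA's target depends on the exploratory action `a2`, so risky states near the ε-greedy trajectory get devalued; Q-learning's `max` ignores exploration entirely.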
Q2b: Return Variance (0.25p)
Q: Which algorithm achieves smaller return variance?
Solution
SARSA achieves smaller variance. Because it learns a policy that accounts for exploration, the returns are more consistent. Q-learning’s greedy policy walks near dangerous regions (optimal but risky under ε-greedy), causing occasional large negative returns.
Q2c: When Are They the Same? (0.25p)
Q: Under which condition do SARSA and Q-learning behave the same?
Solution
SARSA = Q-Learning
When the behavior policy is greedy ($\varepsilon = 0$), the action chosen by SARSA is $A_{t+1} = \arg\max_a Q(S_{t+1}, a)$, so the SARSA target $Q(S_{t+1}, A_{t+1})$ equals the $\max_a Q(S_{t+1}, a)$ used by Q-learning. The updates become identical.
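This equality can be checked numerically. A toy Q-table (values are mine, purely illustrative):

```python
import numpy as np

# With a greedy behavior policy (epsilon = 0), SARSA's sampled next
# action is argmax_a Q(S', a), so its target matches Q-learning's.
Q = np.array([[0.2, 0.5],
              [1.0, 3.0]])
gamma, r, s2 = 0.9, -1.0, 1

a2 = int(np.argmax(Q[s2]))            # greedy next action
sarsa_target = r + gamma * Q[s2, a2]  # SARSA target under eps = 0
q_target = r + gamma * np.max(Q[s2])  # Q-learning target

assert sarsa_target == q_target       # identical updates follow
```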
Q2d: Which Is Off-Policy? (0.25p)
Q: Which is off-policy and why?
Solution
Q-Learning is off-policy. Its update target uses $\max_a Q(S_{t+1}, a)$ — the value of the greedy (target) policy — regardless of which action the behavior policy actually takes next. The behavior policy (ε-greedy) ≠ the target policy (greedy).
SARSA is on-policy: its target uses $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken by the current ($\varepsilon$-greedy) policy.
Part 3: Minimal Value Error
Q3a: Find $v_*$ Using Value Iteration (1.0p)
Q: For the 4-state MDP in Figure 4 (deterministic actions, rewards on transitions), find $v_*$.
Solution
Requires MDP Diagram
The specific MDP diagram (Figure 4) shows 4 states with deterministic actions and rewards. Apply Value Iteration:
$$v_{k+1}(s) = \max_a \big[ r(s,a) + \gamma\, v_k(s') \big]$$
With the discount factor $\gamma$ given in the problem, iterate until convergence. Since actions are deterministic, each backup reduces to evaluating each action's reward plus the discounted next-state value.
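The sweep itself is a few lines. Since Figure 4 is not reproduced here, the transition table below is a hypothetical stand-in (state 3 terminal, reward $-1$ everywhere, $\gamma = 0.9$); swap in the real transitions from the figure:

```python
import numpy as np

# Placeholder MDP: (state, action) -> (reward, next_state).
# Replace with the actual transitions from Figure 4.
P = {
    (0, 0): (-1.0, 1), (0, 1): (-1.0, 2),
    (1, 0): (-1.0, 3), (1, 1): (-1.0, 0),
    (2, 0): (-1.0, 3), (2, 1): (-1.0, 0),
}
gamma, terminal = 0.9, 3

v = np.zeros(4)
while True:
    v_new = v.copy()
    for s in range(4):
        if s == terminal:
            continue  # terminal value stays 0
        # deterministic backup: best of reward + discounted next value
        v_new[s] = max(r + gamma * v[s2]
                       for (st, _), (r, s2) in P.items() if st == s)
    converged = np.max(np.abs(v_new - v)) < 1e-10
    v = v_new
    if converged:
        break
```

For this placeholder MDP the fixed point is $v = (-1.9, -1, -1, 0)$.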
Q3b: On-Policy Distribution (1.0p)
Q: Starting in each state with equal probability and following the optimal policy $\pi_*$, what is the on-policy distribution $\mu(s)$?
Solution
On-Policy Distribution
$\mu(s)$ is the fraction of time spent in each state under the given policy and starting-state distribution.
With uniform starting distribution and deterministic optimal policy:
- Trace the trajectories from each starting state under $\pi_*$
- Count the fraction of total time steps spent in each state
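The counting procedure above can be sketched directly. The deterministic policy below (a next-state map with state 3 terminal) is a hypothetical placeholder; use the actual $\pi_*$ from Q3a:

```python
from collections import Counter

# Placeholder optimal policy as a next-state map; replace with pi* from Q3a.
next_state = {0: 1, 1: 3, 2: 3}
terminal, starts = 3, [0, 1, 2, 3]

visits = Counter()
for s in starts:                 # uniform starting distribution
    while s != terminal:
        visits[s] += 1           # count non-terminal time steps only
        s = next_state[s]

total = sum(visits.values())
mu = {s: visits[s] / total for s in visits}
```

For this placeholder policy, the trajectories visit state 0 once, state 1 twice, and state 2 once, giving $\mu = (0.25, 0.5, 0.25)$; the terminal state is conventionally excluded from $\mu$.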
Q3c: Minimize VE with Linear FA (1.5p)
Q: Given a feature vector $\mathbf{x}(s)$ for each state (the specific features are listed on the problem sheet), find the weights $\mathbf{w}$ minimizing $\overline{VE}$.
Solution
Weighted Least Squares
This is a weighted least squares problem. In matrix form:
$$\mathbf{w}^* = (X^\top D X)^{-1} X^\top D\, \mathbf{v}_*$$
where:
- $X$ is the feature matrix (rows = feature vectors for each state)
- $D$ is the diagonal weight matrix with entries $\mu(s)$
- $\mathbf{v}_*$ is the vector of optimal values from Q3a
Hint from the problem
Use weighted least squares directly. Set up $X$, $D$, and $\mathbf{v}_*$, then solve for $\mathbf{w}^*$. Note: the terminal state has $\mathbf{x}(s) = \mathbf{0}$, so its approximate value is automatically 0 regardless of $\mathbf{w}$.
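The closed form is one linear solve. The numbers below ($X$, $\mu$, $\mathbf{v}_*$) are hypothetical placeholders, not the values from the problem sheet:

```python
import numpy as np

# Hypothetical stand-ins for the quantities from Q3a/Q3b.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # feature matrix, one row per state
D = np.diag([0.25, 0.5, 0.25])      # on-policy distribution mu(s)
v = np.array([-1.9, -1.0, -1.0])    # optimal values v*

# Closed-form weighted least squares: w* = (X^T D X)^{-1} X^T D v*
w = np.linalg.solve(X.T @ D @ X, X.T @ D @ v)
```

At the solution the weighted normal equations hold exactly: $X^\top D (X\mathbf{w}^* - \mathbf{v}_*) = \mathbf{0}$, which is a useful sanity check on any hand computation.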
Q3d: Relationship to Gradient MC (1.0p)
Q: What values does $\hat{v}(\cdot, \mathbf{w}^*)$ assign to each state? What is the relationship to gradient MC?
Solution
$\hat{v}(s, \mathbf{w}^*) = \mathbf{w}^{*\top} \mathbf{x}(s)$ for each state.
Connection to Gradient MC
Gradient MC converges to the same minimum — the weights that minimize $\overline{VE}(\mathbf{w}) = \sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s, \mathbf{w})]^2$. The analytical weighted-least-squares solution gives the exact answer that gradient MC would converge to with infinite data and a proper step-size schedule, because gradient MC performs stochastic gradient descent on the $\overline{VE}$ objective.
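This equivalence can be verified numerically. The sketch below iterates the expected $\overline{VE}$ gradient step (which the gradient MC updates average to) and compares against the closed form; $X$, $\mu$, $\mathbf{v}$ are the same hypothetical placeholders as in Q3c, not values from the problem:

```python
import numpy as np

# Placeholder problem data (see Q3c).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu = np.array([0.25, 0.5, 0.25])
v = np.array([-1.9, -1.0, -1.0])

# Gradient descent on VE(w) = sum_s mu(s) (v(s) - x(s)^T w)^2.
# Gradient MC's stochastic updates average to exactly this step.
w = np.zeros(2)
for _ in range(5000):
    w += 0.1 * X.T @ (mu * (v - X @ w))

# Closed-form weighted least squares minimizer for comparison.
w_exact = np.linalg.solve(X.T @ np.diag(mu) @ X, X.T @ np.diag(mu) @ v)
```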
Part 4: (Semi-) Gradient Descent Methods
Q4a: Unbiased Estimators Analysis (2.5p)
Q: For the update $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,[U_t - \hat{v}(S_t, \mathbf{w}_t)]\,\nabla\hat{v}(S_t, \mathbf{w}_t)$, determine which targets $U_t$ are unbiased estimators of $v_\pi(S_t)$:
Solution
Bias Analysis
1. $U_t = G_t$ — UNBIASED ✅ By definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$: the actual return is an unbiased estimate. → This is Gradient Monte Carlo
2. $U_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t)$ — BIASED ❌ Its expectation equals $v_\pi(S_t)$ only if $\hat{v}(S_{t+1}, \mathbf{w}_t) = v_\pi(S_{t+1})$, which is generally not true during learning. The bootstrapped estimate introduces bias. → This is Semi-Gradient TD(0)
3. $U_t = \mathbb{E}_\pi\big[R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t\big]$ — BIASED ❌ Same issue: it uses $\hat{v}$ instead of $v_\pi$. This is the expected update (no sampling), but still biased. → This is the Expected (DP-like) update
Which guarantees convergence to a local optimum of $\overline{VE}$? Only $U_t = G_t$ (Gradient MC), because it is the only unbiased estimator, making the update a true stochastic gradient step on $\overline{VE}$.
Example step size: $\alpha_t = 1/t$ satisfies the Robbins-Monro conditions $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$.
Why use biased estimators?
- Lower variance → faster learning in practice
- Can learn online (step-by-step) without waiting for episode end
- Example: continuing tasks (no episodes → MC impossible); or environments with very long episodes where MC has extreme variance
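The bias of bootstrapping is easy to see on a tiny example. A two-state deterministic chain (my own toy construction, not from the assignment): $s_0 \xrightarrow{r=0} s_1 \xrightarrow{r=1} \text{terminal}$, with $\gamma = 1$, so $v_\pi(s_0) = 1$:

```python
# Toy chain: s0 -(r=0)-> s1 -(r=1)-> terminal, gamma = 1, so v(s0) = 1.
gamma = 1.0
v_true = {"s0": 1.0, "s1": 1.0}
v_hat = {"s0": 0.0, "s1": 0.0}          # early-learning value estimate

mc_target = 0.0 + gamma * 1.0            # full return G_t from s0
td_target = 0.0 + gamma * v_hat["s1"]    # bootstrapped TD(0) target

mc_bias = mc_target - v_true["s0"]       # 0.0 -> unbiased
td_bias = td_target - v_true["s0"]       # -1.0 -> biased while v_hat is wrong
```

The MC target hits $v_\pi(s_0)$ regardless of the current estimate; the TD target inherits whatever error $\hat{v}(s_1)$ currently has, which is the entire source of its bias.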
Q4b: Derive Mean Squared TD Error Minimization (2.0p)
Q: Derive weight update that minimizes the mean squared TD error. Compare to Semi-Gradient TD.
Solution
Mean Squared TD Error
$$\overline{TDE}(\mathbf{w}) = \sum_s \mu(s)\,\mathbb{E}\big[\delta_t^2 \mid S_t = s\big], \qquad \delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$
Taking the gradient of the per-sample objective $\tfrac{1}{2}\delta_t^2$:
$$\nabla_{\mathbf{w}} \tfrac{1}{2}\delta_t^2 = \delta_t \big(\gamma\,\nabla\hat{v}(S_{t+1}, \mathbf{w}) - \nabla\hat{v}(S_t, \mathbf{w})\big)$$
Full Gradient Update (True Gradient of MSTDE)
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\big(\nabla\hat{v}(S_t, \mathbf{w}_t) - \gamma\,\nabla\hat{v}(S_{t+1}, \mathbf{w}_t)\big)$$
Comparison with Semi-Gradient TD
Semi-Gradient TD uses only:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\nabla\hat{v}(S_t, \mathbf{w}_t)$$
It drops the $-\gamma\,\delta_t\,\nabla\hat{v}(S_{t+1}, \mathbf{w}_t)$ term — the gradient through the bootstrapped target. This is why it is called "semi-gradient": it takes only the part of the gradient that flows through the prediction, not the part that flows through the target.
The missing term is exactly the correction that Gradient-TD Methods (TDC) add back.
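For linear function approximation, $\nabla\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$, and the two updates differ by exactly one term. A minimal sketch (function names are mine):

```python
import numpy as np

def semi_gradient_td(w, x, x_next, r, gamma, alpha):
    """Semi-gradient TD(0): gradient only through the prediction x."""
    delta = r + gamma * x_next @ w - x @ w
    return w + alpha * delta * x

def residual_gradient(w, x, x_next, r, gamma, alpha):
    """Full (naive residual) gradient of the squared TD error:
    also differentiates through the bootstrapped target x_next."""
    delta = r + gamma * x_next @ w - x @ w
    return w + alpha * delta * (x - gamma * x_next)
```

The extra `-gamma * x_next` factor in the second update is the correction term that semi-gradient TD omits and that Gradient-TD methods reintroduce (in a corrected, fixed-point-preserving form).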