RL-HW03: Homework 3 — TD Learning & Function Approximation

Exam Relevance

Q3 (minimal value error) and Q4 (semi-gradient derivation) are extremely exam-relevant. Understanding the VE objective, weighted least squares, and the “semi” in semi-gradient is core material.


Part 2: Coding Assignment Questions (SARSA vs Q-Learning)

Q2a: Average Returns Comparison (0.25p)

Q: Which algorithm achieves higher average return of the behavior policy during training? Same phenomenon as Cliff Walking (Example 6.6)?

Solution

In the Windy Gridworld:

  • SARSA typically achieves higher average return during training
  • This is the same phenomenon as Cliff Walking: SARSA (on-policy) learns to avoid risky areas that the ε-greedy exploration might stumble into, leading to better average performance of the behavior policy
  • Q-learning finds the optimal path but the ε-greedy behavior policy occasionally deviates from it, leading to lower average returns during training
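The two update rules differ only in which action value appears in the target. A minimal tabular sketch (hypothetical indices and values, not the Windy Gridworld code itself):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy target: uses the action a_next actually taken by the behavior policy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy target: uses the greedy action, whatever the behavior policy does next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]                         # hypothetical next-state action values
sarsa_update(Q, 0, 0, -1.0, 1, a_next=0)  # exploratory a_next, not the greedy one
q_learning_update(Q, 0, 1, -1.0, 1)       # always backs up max_a Q[1, a] = 3.0
```

Note how SARSA's backup reflects the exploratory action (value 1.0) while Q-learning's reflects the greedy maximum (value 3.0): this is exactly why the on-policy returns differ during training.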

Q2b: Return Variance (0.25p)

Q: Which algorithm achieves smaller return variance?

Solution

SARSA achieves smaller variance. Because it learns a policy that accounts for exploration, the returns are more consistent. Q-learning’s greedy policy walks near dangerous regions (optimal but risky under ε-greedy), causing occasional large negative returns.

Q2c: When Are They the Same? (0.25p)

Q: Under which condition do SARSA and Q-learning behave the same?

Solution

SARSA = Q-Learning when $\varepsilon = 0$

When the behavior policy is greedy ($\varepsilon = 0$), the action $A'$ chosen by SARSA satisfies $A' = \arg\max_a Q(S', a)$, so its target term $Q(S', A')$ is exactly the $\max_a Q(S', a)$ used by Q-learning. The updates become identical.
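A quick numeric check of this equivalence, with hypothetical Q-values and $\gamma = 1$:

```python
import numpy as np

Q = np.array([[0.0, 0.0],
              [2.0, 5.0]])   # hypothetical action values
s_next, r = 1, -1.0

# With epsilon = 0 the behavior policy is greedy, so SARSA's A' is the argmax:
a_next = int(np.argmax(Q[s_next]))
sarsa_target = r + Q[s_next, a_next]   # uses the action actually taken
q_target = r + Q[s_next].max()         # uses max_a Q(S', a)
```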

Q2d: Which Is Off-Policy? (0.25p)

Q: Which is off-policy and why?

Solution

Q-Learning is off-policy. Its update target uses $\max_a Q(S_{t+1}, a)$, the value of the greedy (target) policy, regardless of what action the behavior policy actually takes next. The behavior policy (ε-greedy) ≠ the target policy (greedy).

SARSA is on-policy: its target uses $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken by the current (ε-greedy) policy.


Part 3: Minimal Value Error

Q3a: Find $v_*$ Using Value Iteration (1.0p)

Q: For the 4-state MDP in Figure 4 (deterministic actions, rewards on transitions), find $v_*(s)$ for every state.

Solution

Requires MDP Diagram

The specific MDP diagram (Figure 4) shows 4 states with deterministic actions and rewards. Apply Value Iteration:

With the discount factor $\gamma$ from the problem, iterate $v_{k+1}(s) = \max_a \left[ r(s,a) + \gamma\, v_k(s') \right]$ until convergence. Since actions are deterministic, each backup reduces to evaluating reward + discounted next-state value for every action and taking the max.
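Figure 4 is not reproduced here, so the sketch below runs value iteration on a made-up 4-state deterministic MDP; the transitions, rewards, and $\gamma = 1$ are all assumptions standing in for the problem's data:

```python
import numpy as np

# Hypothetical stand-in for Figure 4: states 0-3, state 3 terminal.
# trans[s][action] = (next_state, reward)
trans = {
    0: {"advance": (1, -1.0), "jump": (2, -3.0)},
    1: {"advance": (2, -1.0), "jump": (3, -5.0)},
    2: {"advance": (3, -1.0)},
}
gamma = 1.0  # assumed; use the problem's discount factor

v = np.zeros(4)  # v[3] stays 0 (terminal state)
for _ in range(50):  # deterministic MDP: a few sweeps suffice
    for s, actions in trans.items():
        v[s] = max(r + gamma * v[s_next] for s_next, r in actions.values())
```

On this toy MDP the optimal path from each state is to "advance", giving $v_* = (-3, -2, -1, 0)$.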

Q3b: On-Policy Distribution (1.0p)

Q: Starting in each state with equal probability and following the optimal policy $\pi_*$, what is the on-policy distribution $\mu(s)$?

Solution

On-Policy Distribution

$\mu(s)$ is the fraction of time spent in each state under the given policy and starting-state distribution.

With uniform starting distribution and deterministic optimal policy:

  • Trace the trajectory from each starting state under $\pi_*$ until termination
  • Count the fraction of total time steps spent in each state
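The counting procedure can be sketched on a hypothetical deterministic policy (the real trajectories come from the Figure 4 MDP; the "always advance" policy here is an assumption):

```python
from collections import Counter

# Hypothetical optimal policy on states 0-3 (3 terminal): always move to s+1.
next_state = {0: 1, 1: 2, 2: 3}
terminal = 3

counts = Counter()
for start in (0, 1, 2):          # uniform distribution over non-terminal starts
    s = start
    while s != terminal:
        counts[s] += 1           # one time step spent in s
        s = next_state[s]

total = sum(counts.values())
mu = {s: counts[s] / total for s in sorted(counts)}
```

State 2 is visited on every trajectory while state 0 is visited only when it is the start, so later states accumulate a larger share of $\mu$.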

Q3c: Minimize VE with Linear FA (1.5p)

Q: Given the feature vectors $\mathbf{x}(s)$ for each state as specified in the problem, find the weights $\mathbf{w}$ minimizing $\overline{VE}(\mathbf{w})$.

Solution

Weighted Least Squares

The Mean Squared Value Error:

$$\overline{VE}(\mathbf{w}) = \sum_s \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$

This is a weighted least squares problem. In matrix form:

$$\mathbf{w}^* = (X^\top D X)^{-1} X^\top D\, \mathbf{v}$$

where:

  • $X$ is the feature matrix (rows = feature vectors for each state)
  • $D = \operatorname{diag}(\mu(s))$ is the diagonal weight matrix
  • $\mathbf{v}$ is the vector of optimal values from Q3a

Hint from the problem

Use weighted least squares directly. Set up $X$, $D$, and $\mathbf{v}$, then solve the normal equations $X^\top D X\, \mathbf{w} = X^\top D\, \mathbf{v}$. Note: the terminal state has $\mathbf{x}(\text{terminal}) = \mathbf{0}$, so its value is automatically 0 regardless of $\mathbf{w}$.
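A sketch of the solve with NumPy; the numbers below are placeholders, since the real $X$, $\mu$, and $\mathbf{v}$ come from the problem and from Q3a/Q3b:

```python
import numpy as np

# Placeholder data for illustration only:
X = np.array([[1.0, 0.0],        # rows are feature vectors x(s)
              [1.0, 1.0],
              [0.0, 1.0]])
mu = np.array([1/6, 2/6, 3/6])   # on-policy distribution (Q3b)
v = np.array([-3.0, -2.0, -1.0]) # optimal values (Q3a)

D = np.diag(mu)
# Weighted least squares via the normal equations: (X^T D X) w = X^T D v
w = np.linalg.solve(X.T @ D @ X, X.T @ D @ v)
```

`np.linalg.solve` is preferred over forming the inverse explicitly; it is both faster and numerically more stable.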

Q3d: Relationship to Gradient MC (1.0p)

Q: What values does $\hat{v}(\cdot, \mathbf{w}^*)$ assign to each state? What is the relationship to gradient MC?

Solution

$\hat{v}(s, \mathbf{w}^*) = \mathbf{w}^{*\top} \mathbf{x}(s)$ for each state: plug the weights from Q3c into the linear approximator.

Connection to Gradient MC

Gradient MC converges to the same minimum: the weights that minimize $\overline{VE}$. The analytical solution (weighted least squares) gives the exact answer that gradient MC would reach in the limit of infinite data with an appropriate step-size schedule, since gradient MC performs stochastic gradient descent on the $\overline{VE}$ objective.
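To see the convergence claim concretely, the sketch below runs SGD on the VE objective with the same made-up features and values as before. For simplicity the per-state target is $v_\pi(s)$ itself, which is what the sampled return $G_t$ estimates without bias in gradient MC (so this is an idealized, noise-free stand-in for the MC target):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder problem: features, on-policy distribution, true values.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
mu = np.array([1/6, 2/6, 3/6])
v = np.array([-3.0, -2.0, -1.0])

w = np.zeros(2)
for t in range(200_000):
    s = rng.choice(3, p=mu)                 # sample a state from mu
    alpha = 0.1 / (1 + t / 100)             # decaying step size
    w += alpha * (v[s] - X[s] @ w) * X[s]   # SGD step on the VE objective

# Analytical weighted-least-squares minimizer for comparison:
D = np.diag(mu)
w_star = np.linalg.solve(X.T @ D @ X, X.T @ D @ v)
```

After enough samples, `w` settles near `w_star`, illustrating that the SGD iterate and the closed-form solution agree.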


Part 4: (Semi-) Gradient Descent Methods

Q4a: Unbiased Estimators Analysis (2.5p)

Q: For the update $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$, determine which targets $U_t$ are unbiased estimators of $v_\pi(S_t)$:

Solution

Bias Analysis

1. $U_t = G_t$: UNBIASED ✅. By definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, so the actual return is an unbiased estimate. → This is Gradient Monte Carlo

2. $U_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t)$: BIASED ❌. This equals $v_\pi(S_t)$ in expectation only if $\hat{v}(S_{t+1}, \mathbf{w}_t) = v_\pi(S_{t+1})$, which is generally not true during learning. The bootstrapped estimate introduces bias. → This is Semi-Gradient TD(0)

3. $U_t = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t \right]$: BIASED ❌. Same issue: it uses $\hat{v}$ instead of $v_\pi$. This is the expected update (no sampling), but still biased. → This is the Expected (DP-like) update

Which target guarantees convergence to a local optimum of $\overline{VE}$? Only $U_t = G_t$ (Gradient MC), because it is the only unbiased estimator, making the update a true stochastic gradient of $\overline{VE}$.

Example step size: $\alpha_t = 1/t$ satisfies the Robbins-Monro conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Why use biased estimators?

  • Lower variance → faster learning in practice
  • Can learn online (step-by-step) without waiting for episode end
  • Example: continuing tasks (no episodes → MC impossible); or environments with very long episodes where MC has extreme variance

Q4b: Derive Mean Squared TD Error Minimization (2.0p)

Q: Derive weight update that minimizes the mean squared TD error. Compare to Semi-Gradient TD.

Solution

Mean Squared TD Error

$$\overline{TDE}(\mathbf{w}) = \mathbb{E}\!\left[ \delta_t^2 \right], \qquad \delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$

Taking the gradient of the sample objective $\tfrac{1}{2}\delta_t^2$:

$$\nabla_{\mathbf{w}}\, \tfrac{1}{2}\delta_t^2 = \delta_t \left( \gamma\, \nabla \hat{v}(S_{t+1}, \mathbf{w}) - \nabla \hat{v}(S_t, \mathbf{w}) \right)$$

Full Gradient Update (True Gradient of MSTDE)

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \left( \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma\, \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \right)$$

Comparison with Semi-Gradient TD

Semi-Gradient TD uses only:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \nabla \hat{v}(S_t, \mathbf{w}_t)$$

It drops the $-\gamma\, \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)$ term, the gradient through the bootstrapped target. This is why it is called "semi-gradient": it takes only part of the gradient, the part through the prediction, not through the target.

The missing term is exactly the correction that Gradient-TD Methods (TDC) add back.
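For linear function approximation, where $\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$, the two updates differ in exactly one term. A sketch with hypothetical numbers:

```python
import numpy as np

def semi_gradient_td(w, x, r, x_next, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0): differentiates only the prediction v_hat(S_t, w)."""
    delta = r + gamma * (x_next @ w) - (x @ w)
    return w + alpha * delta * x

def full_gradient_mstde(w, x, r, x_next, alpha=0.1, gamma=0.9):
    """True gradient of (1/2) delta^2: also differentiates the bootstrapped target."""
    delta = r + gamma * (x_next @ w) - (x @ w)
    return w + alpha * delta * (x - gamma * x_next)

w = np.array([1.0, 2.0])
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_semi = semi_gradient_td(w, x, 1.0, x_next)
w_full = full_gradient_mstde(w, x, 1.0, x_next)
```

With these numbers $\delta_t = 1.8$; the semi-gradient update touches only the first weight (through $\mathbf{x}(S_t)$), while the full-gradient update also pulls the second weight via the $-\gamma\,\mathbf{x}(S_{t+1})$ correction term.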