RL-HW03: Homework 3 — TD Learning & Function Approximation
Exam Relevance
Q3 (minimal value error) and Q4 (semi-gradient derivation) are extremely exam-relevant. Understanding the VE objective, weighted least squares, and the “semi” in semi-gradient is core material.
Part 2: Coding Assignment Questions (SARSA vs Q-Learning)
Q2a: Average Returns Comparison (0.25p)
Q: Which algorithm achieves the higher average return of the behavior policy during training? Is this the same phenomenon as in Cliff Walking (Example 6.6)?
Solution
In the Windy Gridworld:
- SARSA typically achieves higher average return during training
- This is the same phenomenon as Cliff Walking: SARSA (on-policy) learns to avoid risky areas that the ε-greedy exploration might stumble into, leading to better average performance of the behavior policy
- Q-learning finds the optimal path but the ε-greedy behavior policy occasionally deviates from it, leading to lower average returns during training
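The difference between the two algorithms comes down to a single term in the update rule. A minimal sketch of both updates for a tabular `Q` array (the function names and the 2-state toy values are mine, not from the assignment):

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Sample an action epsilon-greedily from Q[s]."""
    if rng.random() < eps:
        return int(rng.integers(len(Q[s])))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from the action a2 actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```

SARSA's target depends on the exploratory action `a2`, so risky states near the ε-greedy trajectory get devalued; Q-learning's `max` ignores exploration entirely.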
Q2b: Return Variance (0.25p)
Q: Which algorithm achieves smaller return variance?
Solution
SARSA achieves smaller variance. Because it learns a policy that accounts for exploration, the returns are more consistent. Q-learning’s greedy policy walks near dangerous regions (optimal but risky under ε-greedy), causing occasional large negative returns.
Q2c: When Are They the Same? (0.25p)
Q: Under which condition do SARSA and Q-learning behave the same?
Solution
SARSA = Q-Learning
When the behavior policy is greedy ($\varepsilon = 0$), the action chosen by SARSA is $A_{t+1} = \arg\max_a Q(S_{t+1}, a)$, so the SARSA target $Q(S_{t+1}, A_{t+1})$ equals the $\max_a Q(S_{t+1}, a)$ used by Q-learning. The updates become identical.
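This equality can be checked numerically. A toy Q-table (values are mine, purely illustrative):

```python
import numpy as np

# With a greedy behavior policy (epsilon = 0), SARSA's sampled next
# action is argmax_a Q(S', a), so its target matches Q-learning's.
Q = np.array([[0.2, 0.5],
              [1.0, 3.0]])
gamma, r, s2 = 0.9, -1.0, 1

a2 = int(np.argmax(Q[s2]))            # greedy next action
sarsa_target = r + gamma * Q[s2, a2]  # SARSA target under eps = 0
q_target = r + gamma * np.max(Q[s2])  # Q-learning target

assert sarsa_target == q_target       # identical updates follow
```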
Q2d: Which Is Off-Policy? (0.25p)
Q: Which is off-policy and why?
Solution
Q-Learning is off-policy. Its update target uses $\max_a Q(S_{t+1}, a)$ — the value of the greedy (target) policy — regardless of which action the behavior policy actually takes next. The behavior policy (ε-greedy) ≠ the target policy (greedy).
SARSA is on-policy: its target uses $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken by the current ($\varepsilon$-greedy) policy.
Part 3: Minimal Value Error
Q3a: Find $v_*$ Using Value Iteration (1.0p)
Q: For the 4-state MDP in Figure 4 (deterministic actions, rewards on transitions), find $v_*$.
Solution
Requires MDP Diagram
The specific MDP diagram (Figure 4) shows 4 states with deterministic actions and rewards. Apply Value Iteration:
$$v_{k+1}(s) = \max_a \big[ r(s,a) + \gamma\, v_k(s') \big]$$
With the discount factor $\gamma$ given in the problem, iterate until convergence. Since actions are deterministic, each backup reduces to evaluating each action's reward plus the discounted next-state value.
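The sweep itself is a few lines. Since Figure 4 is not reproduced here, the transition table below is a hypothetical stand-in (state 3 terminal, reward $-1$ everywhere, $\gamma = 0.9$); swap in the real transitions from the figure:

```python
import numpy as np

# Placeholder MDP: (state, action) -> (reward, next_state).
# Replace with the actual transitions from Figure 4.
P = {
    (0, 0): (-1.0, 1), (0, 1): (-1.0, 2),
    (1, 0): (-1.0, 3), (1, 1): (-1.0, 0),
    (2, 0): (-1.0, 3), (2, 1): (-1.0, 0),
}
gamma, terminal = 0.9, 3

v = np.zeros(4)
while True:
    v_new = v.copy()
    for s in range(4):
        if s == terminal:
            continue  # terminal value stays 0
        # deterministic backup: best of reward + discounted next value
        v_new[s] = max(r + gamma * v[s2]
                       for (st, _), (r, s2) in P.items() if st == s)
    converged = np.max(np.abs(v_new - v)) < 1e-10
    v = v_new
    if converged:
        break
```

For this placeholder MDP the fixed point is $v = (-1.9, -1, -1, 0)$.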
Q3b: On-Policy Distribution (1.0p)
Q: Starting in each state with equal probability and following the optimal policy $\pi_*$, what is the on-policy distribution $\mu(s)$?
Solution
On-Policy Distribution
$\mu(s)$ is the fraction of time spent in each state under the given policy and starting-state distribution.
With uniform starting distribution and deterministic optimal policy:
- Trace the trajectories from each starting state under $\pi_*$
- Count the fraction of total time steps spent in each state
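The counting procedure above can be sketched directly. The deterministic policy below (a next-state map with state 3 terminal) is a hypothetical placeholder; use the actual $\pi_*$ from Q3a:

```python
from collections import Counter

# Placeholder optimal policy as a next-state map; replace with pi* from Q3a.
next_state = {0: 1, 1: 3, 2: 3}
terminal, starts = 3, [0, 1, 2, 3]

visits = Counter()
for s in starts:                 # uniform starting distribution
    while s != terminal:
        visits[s] += 1           # count non-terminal time steps only
        s = next_state[s]

total = sum(visits.values())
mu = {s: visits[s] / total for s in visits}
```

For this placeholder policy, the trajectories visit state 0 once, state 1 twice, and state 2 once, giving $\mu = (0.25, 0.5, 0.25)$; the terminal state is conventionally excluded from $\mu$.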
Q3c: Minimize VE with Linear FA (1.5p)
Q: Given a feature vector $\mathbf{x}(s)$ for each state (the specific features are listed on the problem sheet), find the weights $\mathbf{w}$ minimizing $\overline{VE}$.
Solution
Weighted Least Squares
This is a weighted least squares problem. In matrix form:
$$\mathbf{w}^* = (X^\top D X)^{-1} X^\top D\, \mathbf{v}_*$$
where:
- $X$ is the feature matrix (rows = feature vectors for each state)
- $D$ is the diagonal weight matrix with entries $\mu(s)$
- $\mathbf{v}_*$ is the vector of optimal values from Q3a
Hint from the problem
Use weighted least squares directly. Set up $X$, $D$, and $\mathbf{v}_*$, then solve for $\mathbf{w}^*$. Note: the terminal state has $\mathbf{x}(s) = \mathbf{0}$, so its approximate value is automatically 0 regardless of $\mathbf{w}$.
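The closed form is one linear solve. The numbers below ($X$, $\mu$, $\mathbf{v}_*$) are hypothetical placeholders, not the values from the problem sheet:

```python
import numpy as np

# Hypothetical stand-ins for the quantities from Q3a/Q3b.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # feature matrix, one row per state
D = np.diag([0.25, 0.5, 0.25])      # on-policy distribution mu(s)
v = np.array([-1.9, -1.0, -1.0])    # optimal values v*

# Closed-form weighted least squares: w* = (X^T D X)^{-1} X^T D v*
w = np.linalg.solve(X.T @ D @ X, X.T @ D @ v)
```

At the solution the weighted normal equations hold exactly: $X^\top D (X\mathbf{w}^* - \mathbf{v}_*) = \mathbf{0}$, which is a useful sanity check on any hand computation.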
Q3d: Relationship to Gradient MC (1.0p)
Q: What values does $\hat{v}(\cdot, \mathbf{w}^*)$ assign to each state? What is the relationship to gradient MC?
Solution
$\hat{v}(s, \mathbf{w}^*) = \mathbf{w}^{*\top} \mathbf{x}(s)$ for each state.
Connection to Gradient MC
Gradient MC converges to the same minimum — the weights that minimize $\overline{VE}(\mathbf{w}) = \sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s, \mathbf{w})]^2$. The analytical weighted-least-squares solution gives the exact answer that gradient MC would converge to with infinite data and a proper step-size schedule, because gradient MC performs stochastic gradient descent on the $\overline{VE}$ objective.
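This equivalence can be verified numerically. The sketch below iterates the expected $\overline{VE}$ gradient step (which the gradient MC updates average to) and compares against the closed form; $X$, $\mu$, $\mathbf{v}$ are the same hypothetical placeholders as in Q3c, not values from the problem:

```python
import numpy as np

# Placeholder problem data (see Q3c).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu = np.array([0.25, 0.5, 0.25])
v = np.array([-1.9, -1.0, -1.0])

# Gradient descent on VE(w) = sum_s mu(s) (v(s) - x(s)^T w)^2.
# Gradient MC's stochastic updates average to exactly this step.
w = np.zeros(2)
for _ in range(5000):
    w += 0.1 * X.T @ (mu * (v - X @ w))

# Closed-form weighted least squares minimizer for comparison.
w_exact = np.linalg.solve(X.T @ np.diag(mu) @ X, X.T @ np.diag(mu) @ v)
```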
Part 4: (Semi-) Gradient Descent Methods
Q4a: Unbiased Estimators Analysis (2.5p)
Q: For the update $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,[U_t - \hat{v}(S_t, \mathbf{w}_t)]\,\nabla\hat{v}(S_t, \mathbf{w}_t)$, determine which targets $U_t$ are unbiased estimators of $v_\pi(S_t)$:
Solution
Bias Analysis
1. $U_t = G_t$ — UNBIASED ✅ By definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$: the actual return is an unbiased estimate. → This is Gradient Monte Carlo
2. $U_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t)$ — BIASED ❌ Its expectation equals $v_\pi(S_t)$ only if $\hat{v}(S_{t+1}, \mathbf{w}_t) = v_\pi(S_{t+1})$, which is generally not true during learning. The bootstrapped estimate introduces bias. → This is Semi-Gradient TD(0)
3. $U_t = \mathbb{E}_\pi\big[R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t\big]$ — BIASED ❌ Same issue: it uses $\hat{v}$ instead of $v_\pi$. This is the expected update (no sampling), but still biased. → This is the Expected (DP-like) update
Which guarantees convergence to a local optimum of $\overline{VE}$? Only $U_t = G_t$ (Gradient MC), because it is the only unbiased estimator, making the update a true stochastic gradient step on $\overline{VE}$.
Example step size: $\alpha_t = 1/t$ satisfies the Robbins-Monro conditions $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$.
Why use biased estimators?
- Lower variance → faster learning in practice
- Can learn online (step-by-step) without waiting for episode end
- Example: continuing tasks (no episodes → MC impossible); or environments with very long episodes where MC has extreme variance
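The bias of bootstrapping is easy to see on a tiny example. A two-state deterministic chain (my own toy construction, not from the assignment): $s_0 \xrightarrow{r=0} s_1 \xrightarrow{r=1} \text{terminal}$, with $\gamma = 1$, so $v_\pi(s_0) = 1$:

```python
# Toy chain: s0 -(r=0)-> s1 -(r=1)-> terminal, gamma = 1, so v(s0) = 1.
gamma = 1.0
v_true = {"s0": 1.0, "s1": 1.0}
v_hat = {"s0": 0.0, "s1": 0.0}          # early-learning value estimate

mc_target = 0.0 + gamma * 1.0            # full return G_t from s0
td_target = 0.0 + gamma * v_hat["s1"]    # bootstrapped TD(0) target

mc_bias = mc_target - v_true["s0"]       # 0.0 -> unbiased
td_bias = td_target - v_true["s0"]       # -1.0 -> biased while v_hat is wrong
```

The MC target hits $v_\pi(s_0)$ regardless of the current estimate; the TD target inherits whatever error $\hat{v}(s_1)$ currently has, which is the entire source of its bias.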
Q4b: Derive Mean Squared TD Error Minimization (2.0p)
Q: Derive weight update that minimizes the mean squared TD error. Compare to Semi-Gradient TD.
Solution
Mean Squared TD Error
$$\overline{TDE}(\mathbf{w}) = \sum_s \mu(s)\,\mathbb{E}\big[\delta_t^2 \mid S_t = s\big], \qquad \delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$
Taking the gradient of the per-sample objective $\tfrac{1}{2}\delta_t^2$:
$$\nabla_{\mathbf{w}} \tfrac{1}{2}\delta_t^2 = \delta_t \big(\gamma\,\nabla\hat{v}(S_{t+1}, \mathbf{w}) - \nabla\hat{v}(S_t, \mathbf{w})\big)$$
Full Gradient Update (True Gradient of MSTDE)
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\big(\nabla\hat{v}(S_t, \mathbf{w}_t) - \gamma\,\nabla\hat{v}(S_{t+1}, \mathbf{w}_t)\big)$$
Comparison with Semi-Gradient TD
Semi-Gradient TD uses only:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\nabla\hat{v}(S_t, \mathbf{w}_t)$$
It drops the $-\gamma\,\delta_t\,\nabla\hat{v}(S_{t+1}, \mathbf{w}_t)$ term — the gradient through the bootstrapped target. This is why it is called "semi-gradient": it takes only the part of the gradient that flows through the prediction, not the part that flows through the target.
The missing term is exactly the correction that Gradient-TD Methods (TDC) add back.
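For linear function approximation, $\nabla\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$, and the two updates differ by exactly one term. A minimal sketch (function names are mine):

```python
import numpy as np

def semi_gradient_td(w, x, x_next, r, gamma, alpha):
    """Semi-gradient TD(0): gradient only through the prediction x."""
    delta = r + gamma * x_next @ w - x @ w
    return w + alpha * delta * x

def residual_gradient(w, x, x_next, r, gamma, alpha):
    """Full (naive residual) gradient of the squared TD error:
    also differentiates through the bootstrapped target x_next."""
    delta = r + gamma * x_next @ w - x @ w
    return w + alpha * delta * (x - gamma * x_next)
```

The extra `-gamma * x_next` factor in the second update is the correction term that semi-gradient TD omits and that Gradient-TD methods reintroduce (in a corrected, fixed-point-preserving form).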