RL-ES03: Exercise Set Week 3 — Advanced TD & Approximation
Chapter 5: From Tabular Learning to Approximation
5.1 Off-Policy TD
Setup: an MDP with a small set of states and two actions. Behavior policy $b$: uniform (0.5/0.5). Target policy $\pi$: as specified in the exercise. Undiscounted ($\gamma = 1$).
Q5.1.1: Calculate the Value Functions for Both Policies
Solution
Using value iteration (start from terminal, work backwards):
- , , (same for both policies)
For :
Q5.1.2: One Pass of SARSA ()
Data: ,
Initial Q-table:
| | | |
|---|---|---|
| -1 | 0.5 | |
| -1 | +1 | |
Key Insight
Only changes — all other Q-values already equal their target values.
First transition :
- Target:
- Update:
Second transition :
- Target:
- Update:
On-policy SARSA moves Q toward $Q^b$
The update pushes from 0.5 toward 0, the behavior-policy value. Repeated passes would converge to $Q^b$.
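Since the exercise's concrete numbers are elided above, here is a minimal tabular SARSA sketch with made-up Q-values and rewards ($\gamma = 1$, matching the undiscounted setup):

```python
# Minimal tabular SARSA update; states, actions, and values are placeholders,
# not the exercise's actual numbers.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma=1.0):
    """Move Q[s,a] toward the on-policy target r + gamma * Q[s',a']."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {("s1", "a"): 0.5, ("s2", "a"): -1.0}
sarsa_update(Q, "s1", "a", r=1.0, s_next="s2", a_next="a", alpha=0.5)
print(Q[("s1", "a")])  # target = 1 + (-1) = 0, so 0.5 + 0.5*(0 - 0.5) = 0.25
```

With a learning rate below 1, each pass moves the Q-value only part of the way toward the target, which is why repeated passes are needed for convergence.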
Q5.1.3: SARSA with Importance Weights
First transition: IS ratio
- Update:
Second transition: IS ratio
- Update:
Off-policy IS-SARSA moves Q toward $Q^\pi$
Now the update pushes toward the target-policy value $Q^\pi$. The importance weights correct for the mismatch between the behavior and target action distributions.
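A sketch of the importance-weighted variant, in one common formulation where the ratio $\rho = \pi(a' \mid s')/b(a' \mid s')$ scales the TD update; the numbers are placeholders, not the exercise's:

```python
# SARSA with an importance-sampling correction: rho reweights the update so
# that, in expectation under the behavior policy b, Q moves toward Q^pi.
def is_sarsa_update(Q, s, a, r, s_next, a_next, alpha, rho, gamma=1.0):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * rho * (target - Q[(s, a)])

Q = {("s1", "a"): 0.5, ("s2", "a"): -1.0}
# rho = 0: the target policy would never take this continuation -> no update.
is_sarsa_update(Q, "s1", "a", r=1.0, s_next="s2", a_next="a", alpha=0.5, rho=0.0)
print(Q[("s1", "a")])  # unchanged: 0.5
# rho = 2 (e.g. pi deterministic, b uniform over two actions): doubled step.
is_sarsa_update(Q, "s1", "a", r=1.0, s_next="s2", a_next="a", alpha=0.25, rho=2.0)
print(Q[("s1", "a")])  # 0.5 + 0.25*2*(0 - 0.5) = 0.25
```

The two extremes shown (ratio 0 and ratio 2) are exactly the wasted-sample and variance issues discussed in Q5.1.5 below.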
Q5.1.4: Why is Q-Learning Off-Policy?
Answer
In Q-Learning, the target policy (greedy: $\pi(s) = \arg\max_a Q(s, a)$) differs from the behavior policy (e.g., ε-greedy). The update uses $\max_{a'} Q(s', a')$ regardless of which action was actually taken — learning about the optimal policy while following an exploratory one.
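A minimal illustration with a made-up Q-table of how the max-based target ignores whatever action the behavior policy takes next:

```python
# Tabular Q-learning: the target maxes over next actions (greedy target
# policy), independent of the behavior policy's next action choice.
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma=1.0):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {("s1", "L"): 0.0, ("s1", "R"): 0.0,
     ("s2", "L"): 1.0, ("s2", "R"): 3.0}
# Even if an epsilon-greedy behavior policy explores "L" next,
# the target still uses max(1.0, 3.0) = 3.0.
q_learning_update(Q, "s1", "L", r=0.0, s_next="s2", actions=["L", "R"], alpha=0.5)
print(Q[("s1", "L")])  # 0 + 0.5 * (0 + 3.0 - 0) = 1.5
```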
Q5.1.5: Q-Learning vs IS-SARSA (Greedy Target)
Both converge to the same fixed point, but IS-SARSA wastes samples when $\pi$ and $b$ disagree (the ratio is 0 for off-greedy actions) and has variance issues when the ratio is large. Q-learning is preferred: it implicitly handles the off-policy correction through the max operator.
Q5.1.6: Why Not Off-Policy V-Learning?
Off-policy V-learning (TD(0) with IS) is possible but less useful because:
- In prediction: we usually want to evaluate the behavior policy itself (i.e., $V^b$), so off-policy learning isn’t needed
- In control: off-policy is important, but V-functions require a model for policy improvement: $\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma V(s')\big]$. Q-functions don’t need the model: $\pi'(s) = \arg\max_a Q(s, a)$.
Q5.1.7: Q-Learning for V Functions?
No. Q-learning works by taking a $\max$ over actions in the target. For $V$, we’d need $\max_a \big(r(s, a) + \gamma V(s')\big)$ — but we only observe the reward and next state for the action actually taken; we don’t have data for all actions from each state. In Q-learning, each $Q(s, a)$ is stored separately and the $\max$ is taken over stored estimates, so this isn’t an issue.
5.2 Function Approximation and State Distribution
Q5.2.1-3: Dependence on Parameters
- The state distribution $\mu$ depends on the policy, which depends on the value-function approximator’s parameters $w$. Changing $w$ → changes the policy → changes which states are visited → changes $\mu$.
- In supervised learning, the data distribution is fixed and independent of the model parameters. In RL, the data distribution changes as the agent learns.
- This means the weighting $\mu(s)$ in the objective ($\overline{VE}(w) = \sum_s \mu(s)\,[v_\pi(s) - \hat v(s, w)]^2$) is itself non-stationary — the states we care most about change as we learn.
Chapter 6: On-Policy TD with Approximation
6.1 On-Policy Distributions and LSTD
Setup: 2-state MDP with , features , . Initial distribution . Transitions: from each state, to , to terminal/other. Rewards: from (one transition), from (one transition).
Q6.1.1: On-Policy Distribution
Solution
Solve the on-policy visitation equations $\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s)\, p(s \mid \bar s)$:
Solution:
Normalize: $\mu(s) = \eta(s) \big/ \sum_{s'} \eta(s')$
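The same computation can be sketched numerically; the start distribution and substochastic transition matrix below are illustrative placeholders, not the exercise's numbers:

```python
import numpy as np

# Episodic on-policy distribution: eta(s) = h(s) + sum_sbar eta(sbar) p(s|sbar),
# then mu = eta / sum(eta). P is substochastic (mass can exit to terminal).
h = np.array([1.0, 0.0])       # episodes start in state 1
P = np.array([[0.0, 0.5],      # s1 -> s2 w.p. 0.5, terminates w.p. 0.5
              [0.0, 0.0]])     # s2 always terminates
eta = np.linalg.solve(np.eye(2) - P.T, h)  # (I - P^T) eta = h
mu = eta / eta.sum()
print(eta, mu)  # eta = [1.0, 0.5], mu = [2/3, 1/3]
```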
Q6.1.2: Transition Frequencies
- From : each transition occurs with frequency
- From : each transition occurs with frequency
Q6.1.3: LSTD Solution
LSTD Computation
Weight each transition by its on-policy frequency $\mu$: $A = \sum \mu\,\phi\,(\phi - \gamma\phi')^\top$, $b = \sum \mu\,\phi\, r$, and solve $w_{TD} = A^{-1} b$.
Computing:
Solution:
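A generic LSTD sketch under these definitions; the features, rewards, and frequencies below are placeholders (with one-hot features LSTD simply recovers the tabular values):

```python
import numpy as np

def lstd(transitions, gamma=1.0):
    """Closed-form TD fixed point from weighted (mu, phi, r, phi_next) tuples:
    A = sum mu*phi(phi - gamma*phi')^T,  b = sum mu*phi*r,  w = A^-1 b."""
    d = len(transitions[0][1])
    A, b = np.zeros((d, d)), np.zeros(d)
    for mu, phi, r, phi_next in transitions:
        phi, phi_next = np.asarray(phi, float), np.asarray(phi_next, float)
        A += mu * np.outer(phi, phi - gamma * phi_next)
        b += mu * r * phi
    return np.linalg.solve(A, b)

# Placeholder data with one-hot features (terminal state has phi' = 0):
w = lstd([(1.0, [1, 0], 1.0, [0, 1]),   # s1 --r=1--> s2
          (1.0, [0, 1], 2.0, [0, 0])])  # s2 --r=2--> terminal
print(w)  # [3. 2.]  (v(s2) = 2, v(s1) = 1 + v(s2) = 3)
```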
6.2 Basis Functions
Q6.2.1: Tabular as Special Case of Linear FA
Answer
Use one-hot feature vectors. For state $s_i$ in an $n$-state MDP: $\phi(s_i) = e_i$ (standard basis vector, 1 at position $i$, 0 elsewhere).
Then: $\hat v(s_i, w) = w^\top \phi(s_i) = w_i$.
Each state has its own independent weight — exactly tabular RL.
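A two-line check of the equivalence, with a hypothetical 4-state weight vector:

```python
import numpy as np

n = 4  # hypothetical number of states
def phi(i):
    """One-hot feature vector for state index i."""
    e = np.zeros(n)
    e[i] = 1.0
    return e

w = np.array([1.5, -2.0, 0.0, 3.0])  # the weight vector IS the value table
print(w @ phi(2))  # 0.0, i.e. exactly w[2]
```

A TD update with these features touches only the weight of the visited state, which is precisely the tabular TD update.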
Q6.2.2: Linear vs Non-Linear FA Advantages
Linear:
- Easier gradient ($\nabla_w \hat v(s, w) = \phi(s)$)
- LSTD: closed-form TD fixed point
- Strong convergence guarantees
Non-Linear:
- More expressive (better performance with enough data)
- Automatic feature learning (no manual design)
- Flexible architectures (CNNs, Transformers, etc.)
6.3 Semi-Gradient TD and the TD Fixed Point
Setup: 4-state MDP (travel costs), . Linear approximation with features: , , , , .
Q6.3.1: Semi-Gradient Update
Given , transition , learning rate :
Solution
- Target: $r + \gamma\,\hat v(s', w)$
- TD error: $\delta = r + \gamma\,\hat v(s', w) - \hat v(s, w)$
- Update: $w \leftarrow w + \alpha\,\delta\,\nabla_w \hat v(s, w)$ (for linear FA, $\nabla_w \hat v(s, w) = \phi(s)$)
Note: the answer key arrives at its numbers via a slightly different interpretation of the features. Check the feature computation carefully against the specific MDP rewards.
Q6.3.2: LSTD vs Semi-Gradient TD Relationship
Key Result
LSTD finds the TD Fixed Point directly. Semi-gradient TD, if it converges, converges to the same TD fixed point. They target the same solution — LSTD computes it in closed form, semi-gradient TD converges to it iteratively.
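This relationship can be verified numerically on a tiny illustrative batch (placeholder features and rewards, $\gamma = 1$): both methods land on the same TD fixed point.

```python
import numpy as np

# Fixed batch of (phi, r, phi_next) transitions; phi' = 0 at terminal.
batch = [(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
         (np.array([0.0, 1.0]), 2.0, np.array([0.0, 0.0]))]

# LSTD: TD fixed point in closed form, w = A^{-1} b.
A = sum(np.outer(p, p - pn) for p, _, pn in batch)
b = sum(p * r for p, r, _ in batch)
w_lstd = np.linalg.solve(A, b)

# Semi-gradient TD(0): iterate w += alpha * delta * phi over the same data.
w = np.zeros(2)
for _ in range(500):
    for p, r, pn in batch:
        delta = r + w @ pn - w @ p  # TD error (target not differentiated)
        w += 0.1 * delta * p

print(w_lstd, w)  # both approach the same TD fixed point, here [3, 2]
```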
Q6.3.3: LSTD on Given Trajectories
Trajectories:
Full LSTD Computation
Compute each term as an outer product of the features, weighted by how often each transition appears in the trajectories, then solve $w = A^{-1} b$.
Per the answer key, this gives the approximate values , , , .
Q6.3.4: Quality of the Solution
Where It Fails
The “top route” (): features capture the value well (true values: , ).
The “bottom route”: the features struggle here. The true values of the two states can’t both be matched, because the features are shared — one state’s approximate value is tied to a weight that also affects the other state.
The TD fixed point makes a trade-off, weighted by the on-policy distribution .
Q6.3.5: “Never Forgetting” (LSTD)
The TD fixed point is a function of all data ever seen (via the accumulated $A$ matrix and $b$ vector). LSTD weighs all past transitions equally — it “never forgets.”
Advantage: more sample-efficient, since no data is thrown away. Disadvantage: if the MDP or policy changes (non-stationarity), old data becomes misleading, so we would want to gradually forget old experience in order to adapt.
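One standard remedy, sketched here as an assumption rather than anything from the exercise: accumulate $A$ and $b$ with an exponential forgetting factor.

```python
import numpy as np

def lstd_stats(transitions, gamma=1.0, decay=1.0, d=2):
    """Accumulate LSTD's A and b over (phi, r, phi_next) tuples.
    decay = 1.0 is plain LSTD (never forgets); decay < 1 exponentially
    down-weights old transitions so the solution can track change."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for phi, r, phi_next in transitions:
        A = decay * A + np.outer(phi, phi - gamma * phi_next)
        b = decay * b + r * phi
    return A, b

data = [(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
        (np.array([0.0, 1.0]), 2.0, np.array([0.0, 0.0]))]
A1, b1 = lstd_stats(data, decay=1.0)  # equal weight on all transitions
A9, b9 = lstd_stats(data, decay=0.9)  # older transition down-weighted by 0.9
print(np.linalg.solve(A1, b1))        # [3. 2.]
```

On stationary data the forgetting changes nothing in expectation; its benefit appears only when the transition stream drifts.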
Q6.3.6: Neural Network Semi-Gradient Update
```python
# For transition (s, a, r, s', a'):
val = NN_w(s)               # forward pass
val_prime = NN_w(s_prime)   # forward pass (no grad needed)
val.backward()              # backward pass: computes ∂NN/∂w → w.grad
# Semi-gradient update:
w = w + alpha * (r + gamma * val_prime - val) * w.grad
```

Semi-Gradient: No Gradient Through Target
`val_prime` is treated as a constant (no `.backward()` through it). This is what makes it “semi-gradient”: the gradient only flows through the prediction $\hat v(s, w)$, not the target $r + \gamma\,\hat v(s', w)$.
6.4 Preparatory Question: Off-Policy Approximation
Baird's Counterexample
A notebook exercise on Canvas demonstrates the Deadly Triad — semi-gradient TD with linear FA diverges under off-policy updates. See RL-L07 - Off-Policy RL with Approximation and Off-Policy Divergence for theory.