RL-HW02: Homework 2 — Monte Carlo & TD

Exam Relevance

Questions on IS derivations (1a), first-visit vs every-visit calculations (2a), and SARSA vs Q-learning convergence analysis (3a-3c) are classic exam material.


Part 1: Importance Sampling Derivations

Q1a: Incremental Update Rule for Ordinary IS (1.0p)

Q: Given the ordinary importance-sampling estimate $V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{n-1}$, derive the incremental update rule of the form $V_{n+1} = V_n + \alpha_n(\text{Target}_n - V_n)$.

Solution

Starting from the definition of the ordinary IS estimate after $n$ returns:

$$V_{n+1} = \frac{1}{n} \sum_{k=1}^{n} W_k G_k$$

Derivation:

$$V_{n+1} = \frac{1}{n}\left(W_n G_n + \sum_{k=1}^{n-1} W_k G_k\right) = \frac{1}{n}\Big(W_n G_n + (n-1)\,V_n\Big) = V_n + \frac{1}{n}\Big(W_n G_n - V_n\Big)$$

So: $\alpha_n = \frac{1}{n}$ and $\text{Target}_n = W_n G_n$.

Contrast with Weighted IS

For weighted IS (equations 5.7–5.8 in the book), the step size is $\frac{W_n}{C_n}$, where $C_{n+1} = C_n + W_{n+1}$ is the cumulative sum of weights ($C_0 = 0$), and the target is the plain return $G_n$. For ordinary IS, the step size is simply $\frac{1}{n}$ — just a standard running average — but the target is $W_n G_n$ (the importance-weighted return).
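The two incremental updates can be sketched in Python. This is a minimal illustration with hypothetical $(W_n, G_n)$ samples, not values from the homework:

```python
# Sketch: incremental ordinary IS vs weighted IS, using the update rules above.
# The (W_n, G_n) pairs below are made-up illustration values.

def ordinary_is(samples):
    """Ordinary IS: V_{n+1} = V_n + (1/n) * (W_n * G_n - V_n)."""
    v = 0.0
    for n, (w, g) in enumerate(samples, start=1):
        v += (w * g - v) / n          # running average of the weighted returns
    return v

def weighted_is(samples):
    """Weighted IS (book eqs 5.7-5.8): V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)."""
    v, c = 0.0, 0.0
    for w, g in samples:
        if w == 0:
            continue                  # zero-weight episodes leave the estimate unchanged
        c += w                        # C_{n+1} = C_n + W_{n+1}
        v += (w / c) * (g - v)
    return v

samples = [(2.0, 1.0), (0.5, 3.0), (1.0, 2.0)]
print(ordinary_is(samples))   # mean of W*G = (2.0 + 1.5 + 2.0) / 3
print(weighted_is(samples))   # sum(W*G) / sum(W) = 5.5 / 3.5
```

The incremental forms reproduce the batch estimators exactly, which is a quick way to sanity-check the derivation.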

Q1b: Advantages of MC over DP (1.5p)

Q: Two advantages of MC over DP? When to use each?

Solution

  1. Model-free: MC doesn’t need a model of the environment’s dynamics $p(s', r \mid s, a)$ — it learns directly from episodes of experience
  2. Computational focus: MC can estimate value for a single state without computing all states (estimates are independent)

Use DP when: You have a complete model and a manageable state space. Use MC when: No model available, or state space is too large for full sweeps.
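The “computational focus” point can be made concrete with a small sketch: the agent below estimates the value of a single state purely from sampled episodes, never touching a transition model or sweeping any other state. The episode sampler (success probability 0.7) is a made-up example standing in for an unknown environment:

```python
# Sketch: MC estimates V(s) for one state from samples alone.
# The environment's dynamics are hidden inside sample_episode() --
# the agent never sees p(s', r | s, a), unlike DP.

import random

def sample_episode():
    """One episode from state s: reward 1 w.p. 0.7, else 0, then terminate."""
    return [1.0 if random.random() < 0.7 else 0.0]

random.seed(0)
returns = [sum(sample_episode()) for _ in range(10_000)]
v_s = sum(returns) / len(returns)
print(round(v_s, 2))  # close to the true value 0.7
```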


Part 2: On-policy and Off-policy MC

Q2a: First-Visit vs Every-Visit Calculation (1.0p)

Q: Using policy $\pi$ and the trajectory given in the homework, calculate $V(s)$ for first-visit and every-visit MC, using the stated discount factor $\gamma$.

Solution

Parse the trajectory as $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_T$ and note every time step $t$ at which the target state $s$ is visited.

Compute the return from each visit, working backwards from the end of the episode:

$$G_t = R_{t+1} + \gamma G_{t+1}, \qquad G_T = 0$$

First-Visit MC

Use only the return from the first visit to $s$:

$$V(s) = G_{t_{\text{first}}}$$

Every-Visit MC

Average the returns over all $k$ visits to $s$:

$$V(s) = \frac{1}{k} \sum_{i=1}^{k} G_{t_i}$$

(The specific trajectory, rewards, and $\gamma$ are given in the homework PDF; plug them into the formulas above.)
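A minimal sketch of both estimators, using a made-up trajectory (the homework’s actual rewards and $\gamma$ are not reproduced here):

```python
# Sketch: first-visit vs every-visit MC for a single state, on a
# hypothetical episode where the state is visited at t = 0 and t = 2.

def returns_from_visits(rewards, visit_times, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed backwards; rewards[t] = R_{t+1}."""
    g, returns = 0.0, {}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        if t in visit_times:
            returns[t] = g
    return returns

rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical R_1..R_4
visits = {0, 2}                  # hypothetical visit times of state s
gamma = 0.5

rets = returns_from_visits(rewards, visits, gamma)
first_visit_v = rets[min(visits)]                  # only the first visit's return
every_visit_v = sum(rets.values()) / len(rets)     # average over all visits
print(first_visit_v, every_visit_v)                # 1.625 and 2.0625
```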

Q2b: Off-Policy Advantages (1.0p)

Solution:

  • Advantage of off-policy: Can learn the optimal (greedy) policy while following an exploratory behavior policy. Separates exploration from the policy being learned.
  • Why use a soft version of $\pi$ as behavior: Ensures coverage — every action that $\pi$ might take has non-zero probability under $b$. A soft (e.g., ε-greedy) version of $\pi$ guarantees this while staying close to $\pi$’s behavior.
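A quick sketch of the coverage idea, with a hypothetical 3-action state: mixing the target distribution with uniform exploration guarantees that every action $\pi$ could take has non-zero probability under $b$:

```python
# Sketch: epsilon-soft behavior policy b derived from a target policy pi
# (hypothetical 3-action state), checking coverage: pi(a|s) > 0 => b(a|s) > 0.

def epsilon_soft(pi_probs, eps):
    """Mix the target distribution with uniform exploration over |A| actions."""
    n = len(pi_probs)
    return [(1 - eps) * p + eps / n for p in pi_probs]

pi = [1.0, 0.0, 0.0]            # greedy target: always action 0
b = epsilon_soft(pi, eps=0.1)
print(b)                         # greedy action keeps most mass; all actions > 0
assert all(bp > 0 for p, bp in zip(pi, b) if p > 0)   # coverage holds
```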

Q2c: Weighted IS Calculation (1.0p)

Q: A behavior policy $b$ sampled a given trajectory (specified in the homework PDF). Compute the two requested state values using first-visit MC with weighted IS.

Solution

The first requested value: same as plain first-visit MC — compute the return $G$ from the first visit, exactly as in Q2a.

The second value using weighted IS: form the importance-sampling ratio over the trajectory from the first visit,

$$\rho = \prod_{t} \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

With only one episode, weighted IS gives:

$$V(s) = \frac{\rho\, G}{\rho} = G$$

Single-Episode Weighted IS

With only one episode, weighted IS always returns the observed return $G$ regardless of the importance ratio (the ratio cancels: $\rho G / \rho = G$). This is a known property — with a single episode the estimate’s expectation is $v_b(s)$ rather than $v_\pi(s)$, so weighted IS is biased for small sample sizes, but it has lower variance overall.
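A one-liner makes the cancellation concrete (toy numbers, not the homework’s):

```python
# Sketch: single-episode weighted IS is rho*G / rho = G for any rho > 0.

def weighted_is_single(rho, g):
    """Weighted-IS estimate from one episode with ratio rho and return g."""
    return (rho * g) / rho

g = 2.0                          # hypothetical observed return
for rho in (0.1, 1.0, 7.5):
    print(weighted_is_single(rho, g))   # always 2.0
```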

Q2d: Ordinary vs Weighted IS from Graph (0.5p)

Q: Line A shows ordinary IS, Line B shows weighted IS. How to distinguish?

Solution: Ordinary IS (Line A) has higher variance — larger fluctuations, potentially spiking high or low. Weighted IS (Line B) is smoother with lower variance. Ordinary IS is unbiased but can have extreme values; weighted IS is slightly biased but much more stable.


Part 3: SARSA and Q-Learning Analysis

Q3a: SARSA Convergence with Parameter n (1.5p)

Q: An MDP with a start state, intermediate states, and a terminal state (Figure 2 in the homework PDF); ε-greedy behavior with the given $\varepsilon$, step size $\alpha$, and discount $\gamma$. Find the converged Q-values under SARSA and determine which action the policy prefers as a function of the parameter $n$.

Solution

Key Insight

SARSA learns the value of the ε-greedy policy (on-policy). The converged Q-values reflect the expected return including random exploration actions.

The actual Q-values depend on the specific transition structure of the MDP (which includes rewards on edges). Since SARSA is on-policy, the Q-values account for the probability of taking random actions.

The policy in the start state prefers one action over the other when that action’s converged Q-value is larger; the crossover point depends on $n$ (the reward parameter on one of the actions). The exact threshold depends on the MDP structure.

Without the Exact MDP Diagram

This problem requires the specific MDP diagram (Figure 2 from the homework PDF) to compute exact values. The key principle: SARSA’s converged Q-values include the effect of ε-greedy exploration, so the threshold on $n$ for preferring one action over the other will differ from Q-learning’s.

Q3b: Q-Learning Convergence (1.5p)

Q: Same MDP but with Q-learning.

Solution

Q-Learning converges to $q_*$ (the optimal action-values) regardless of the behavior policy, given sufficient exploration and appropriate step sizes. Its Q-values reflect the greedy (optimal) policy, not the ε-greedy behavior.

The threshold on $n$ at which the preferred action flips will differ from SARSA’s, because Q-learning’s values don’t penalize actions for exploration randomness.
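The difference between the two updates can be seen on a single toy transition (made-up states and Q-values, not the homework MDP): SARSA bootstraps on the action actually taken, while Q-learning bootstraps on the greedy action:

```python
# Sketch: SARSA vs Q-learning on one transition where the agent took an
# exploratory (bad) next action. Values and states are hypothetical.

def sarsa_update(q, s, a, r, s2, a2, alpha, gamma):
    """On-policy target: bootstrap on the action a2 actually taken in s2."""
    q[(s, a)] += alpha * (r + gamma * q[(s2, a2)] - q[(s, a)])

def q_learning_update(q, s, a, r, s2, actions, alpha, gamma):
    """Off-policy target: bootstrap on the greedy action in s2."""
    best = max(q[(s2, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

actions = [0, 1]
q = {("s0", 0): 0.0, ("s1", 0): 1.0, ("s1", 1): -1.0}
q2 = dict(q)

# Exploratory next action a2=1 (value -1) was taken; greedy would be a2=0 (value 1).
sarsa_update(q, "s0", 0, r=0.0, s2="s1", a2=1, alpha=0.5, gamma=1.0)
q_learning_update(q2, "s0", 0, r=0.0, s2="s1", actions=actions, alpha=0.5, gamma=1.0)
print(q[("s0", 0)])   # -0.5 : SARSA's value pays for the exploratory action
print(q2[("s0", 0)])  #  0.5 : Q-learning bootstraps on the greedy action
```

The same transition pulls the two estimates in opposite directions, which is exactly why their converged values (and action-preference thresholds) differ.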

Q3c: Adding Action a₂ (1.0p)

Q: With the same parameter settings, add a new action $a_2$ in one of the states (transitioning to another state with reward 1). Would final performance differ for SARSA and Q-learning across the two MDPs?

Solution

  • Q-Learning: The final performance under the target policy (greedy w.r.t. ) could change — the new action might be optimal if its Q-value is highest. Q-learning would discover this.
  • SARSA: The final performance could also change — SARSA learns the ε-greedy policy’s value, and adding a new action changes the ε-greedy exploration probabilities (now $\varepsilon/3$ per random action instead of $\varepsilon/2$, with three actions available), affecting the learned values.

The Key Point

Q-learning’s greedy policy will pick the best of the (now three) actions. SARSA’s policy values change because the exploration distribution changes. Both algorithms are affected, but for different reasons.