RL-L03: Monte Carlo Methods

Overview

Monte Carlo (MC) methods learn value functions and optimal policies from experience in the form of sample episodes.

Key Characteristics

  • Model-free: Unlike Dynamic Programming, MC does not require knowledge of the MDP dynamics p(s', r | s, a).
  • Averages Returns: Estimates are based on averaging sample returns for each state-action pair.
  • Episodic Tasks: Defined only for episodic tasks, since an episode must terminate before the return G_t can be computed.
  • No Bootstrapping: MC methods do not update estimates based on other estimates; they use actual sampled returns.

MC vs. DP Comparison

Feature       | Dynamic Programming                      | Monte Carlo Methods
Model         | Needs full model p(s', r | s, a)         | Model-free (sample experience)
Bootstrapping | Yes (updates based on next-state values) | No (updates based on returns)
Width         | Full (expectation over all transitions)  | Single (sample trajectory)
Depth         | 1-step lookahead                         | Full (until end of episode)

1. Monte Carlo Prediction

The goal of prediction is to estimate the state-value function v_π under a fixed policy π.

The Return

For a trajectory S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, ..., R_T, S_T, the return is

  G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_T

By the law of large numbers, the average of sampled returns converges to the expected value:

  v_π(s) = E_π[G_t | S_t = s] ≈ average(Returns(s))

First-Visit vs. Every-Visit MC

  • First-Visit MC: Averages returns only from the first time a state s is visited in each episode.
    • Each return is an i.i.d. estimate of v_π(s).
    • The estimate's standard error falls as 1/√n, where n is the number of first visits.
  • Every-Visit MC: Averages returns from all visits to s in each episode.
    • Estimates are not independent, but still converge quadratically to v_π(s).

First-Visit MC Prediction Algorithm

Algorithm: First-visit MC prediction

Input: a policy π to be evaluated
Initialize: V(s) arbitrarily and Returns(s) ← empty list, for all s
Loop forever (for each episode):

  1. Generate an episode following π: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
  2. G ← 0
  3. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • Unless S_t appears in S_0, S_1, ..., S_{t-1}:
      • Append G to Returns(S_t)
      • V(S_t) ← average(Returns(S_t))
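The algorithm above can be sketched in Python. `generate_episode` is assumed to return (state, action, reward) triples, and `toy_episode` below is a hypothetical two-state chain used only to exercise the code:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, gamma=1.0, num_episodes=1000):
    """First-visit MC prediction (sketch).

    generate_episode() must return a list of (S_t, A_t, R_{t+1}) triples,
    where R_{t+1} is the reward received after taking A_t in S_t.
    """
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:  # first visit to s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

def toy_episode():
    # Hypothetical deterministic chain: A -> B (reward 0), B -> terminal (reward 1).
    return [('A', 'go', 0.0), ('B', 'go', 1.0)]

V = first_visit_mc_prediction(toy_episode, gamma=0.5, num_episodes=100)
# With gamma = 0.5: v(B) = 1 and v(A) = 0 + 0.5 * 1 = 0.5
```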

2. Blackjack Example

Blackjack is a classic episodic MDP used to illustrate MC prediction.

  • Objective: Maximize the card sum without exceeding 21.
  • State Space:
    • Current sum (12-21)
    • Dealer’s showing card (Ace-10)
    • Usable Ace (Yes/No)
    • Total: 200 states.
  • Rewards: +1 for win, -1 for loss, 0 for draw.
  • Action: Hit or Stick.
  • Policy Evaluation: Average returns over thousands of simulated games (episodes).
  • Observation: States with usable aces are less frequent and thus have higher variance in the value function estimate.
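A heavily simplified simulator illustrates the evaluation procedure. Everything below is an assumption made for brevity (infinite deck, no usable-ace logic, dealer plays out from the showing card alone), so the numbers only demonstrate return-averaging, not the true Blackjack values:

```python
import random

def draw_card():
    # Simplified infinite deck: 1-10, with face cards counting as 10.
    return min(random.randint(1, 13), 10)

def play_dealer(showing):
    # Dealer hits until reaching 17 (hidden card and aces ignored for simplicity).
    total = showing
    while total < 17:
        total += draw_card()
    return total

def episode_return(player_sum, dealer_showing, stick_at=20):
    """One episode under the 'stick on stick_at or higher' policy."""
    while player_sum < stick_at:
        player_sum += draw_card()
        if player_sum > 21:
            return -1  # player busts
    dealer = play_dealer(dealer_showing)
    if dealer > 21 or player_sum > dealer:
        return 1
    return 0 if player_sum == dealer else -1

# MC estimate of the value of one state (sum 13, dealer shows 2): average returns.
random.seed(0)
returns = [episode_return(13, 2) for _ in range(10_000)]
v_hat = sum(returns) / len(returns)
```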

3. Monte Carlo Control

Control aims to approximate optimal policies using Generalized Policy Iteration (GPI).

Action Values q(s, a)

Without a model, state values alone are insufficient for control: choosing an action would require a one-step lookahead through the (unknown) dynamics. We must instead estimate action-value functions q_π(s, a).

The Exploration-Exploitation Dilemma

Many state-action pairs might never be visited if π is deterministic. Three solutions:

  1. Exploring Starts: Start every episode at a random state-action pair, with every pair having non-zero probability of being selected.
  2. On-Policy: Use ε-greedy (soft) policies.
  3. Off-Policy: Use a separate behavior policy to explore.

Algorithm: Monte Carlo ES (Exploring Starts)

This algorithm alternates between evaluation and improvement episode-by-episode.

Algorithm: Monte Carlo ES

Initialize: π(s) arbitrarily, Q(s, a) arbitrarily, Returns(s, a) ← empty list, for all s, a
Loop forever:

  1. Choose S_0, A_0 such that all pairs have probability > 0 (Exploring Starts)
  2. Generate an episode from S_0, A_0 following π: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
  3. G ← 0
  4. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • Unless the pair (S_t, A_t) appeared earlier in the episode:
      • Append G to Returns(S_t, A_t)
      • Q(S_t, A_t) ← average(Returns(S_t, A_t))
      • π(S_t) ← argmax_a Q(S_t, a)
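A Python sketch of Monte Carlo ES, written against a generic `step(s, a) -> (reward, next_state)` interface; the one-state `step` environment in the usage example is hypothetical:

```python
import random
from collections import defaultdict

def mc_es(states, actions, step, gamma=0.9, num_episodes=2000, max_steps=50):
    """Monte Carlo ES (sketch). step(s, a) returns (reward, next_state),
    with next_state = None signalling a terminal transition."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(actions) for s in states}
    for _ in range(num_episodes):
        # Exploring start: a uniformly random state-action pair.
        s, a = random.choice(states), random.choice(actions)
        episode = []
        for _ in range(max_steps):
            r, s_next = step(s, a)
            episode.append((s, a, r))
            if s_next is None:
                break
            s, a = s_next, pi[s_next]  # thereafter, follow pi
        G = 0.0
        pairs = [(s_, a_) for (s_, a_, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:  # first visit to (s, a)
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions, key=lambda a2: Q[(s, a2)])  # greedy improvement
    return Q, pi

# Hypothetical one-state MDP: 'right' pays +1, 'left' pays 0; both terminate.
def step(s, a):
    return (1.0 if a == 'right' else 0.0), None

random.seed(0)
Q, pi = mc_es(states=[0], actions=['left', 'right'], step=step)
```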

4. On-Policy MC Control (ε-greedy)

Avoids exploring starts by using a soft policy, i.e., one with π(a|s) > 0 for all actions (e.g., ε-greedy).

ε-greedy Improvement

For an ε-soft policy π, the ε-greedy policy π' with respect to q_π is an improvement: v_{π'}(s) ≥ v_π(s) for all s.

Proof Idea (Policy Improvement Theorem): show q_π(s, π'(s)) ≥ v_π(s) for all s:

  q_π(s, π'(s)) = Σ_a π'(a|s) q_π(s, a)
                = (ε/|A(s)|) Σ_a q_π(s, a) + (1 − ε) max_a q_π(s, a)
                ≥ (ε/|A(s)|) Σ_a q_π(s, a) + (1 − ε) Σ_a [(π(a|s) − ε/|A(s)|) / (1 − ε)] q_π(s, a)
                = Σ_a π(a|s) q_π(s, a) = v_π(s)

The inequality holds because the max dominates any convex combination, and the weights (π(a|s) − ε/|A(s)|)/(1 − ε) are non-negative and sum to 1 precisely because π is ε-soft.

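The ε-greedy action selection itself is one small function. A minimal sketch (the `Q` dictionary keyed by (state, action) is an assumed representation):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from the ε-greedy policy w.r.t. Q.

    With probability ε, pick uniformly at random; otherwise pick the greedy
    action. Each non-greedy action thus has probability ε/|A|, and the
    greedy action 1 − ε + ε/|A|.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# With ε = 0 the choice is purely greedy:
Q = {('s0', 'hit'): 1.0, ('s0', 'stick'): 0.0}
greedy = epsilon_greedy_action(Q, 's0', ['hit', 'stick'], epsilon=0.0)
```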
5. Off-Policy Prediction and Control

Learn about a target policy π while following a different behavior policy b (b ≠ π).

Coverage Assumption

The behavior policy b must be able to take any action that π might take: π(a|s) > 0 implies b(a|s) > 0.

Importance Sampling (IS) Ratio

To transform expectations under b into expectations under π, we weight returns by the relative probability of the trajectory occurring under π vs. b:

  ρ_{t:T-1} = ∏_{k=t}^{T-1} [π(A_k | S_k) p(S_{k+1} | S_k, A_k)] / [b(A_k | S_k) p(S_{k+1} | S_k, A_k)] = ∏_{k=t}^{T-1} π(A_k | S_k) / b(A_k | S_k)

Note: the transition dynamics p cancel out, so the ratio is computable without a model!

Types of Importance Sampling

  1. Ordinary IS: Simple average of the scaled returns ρ_{t:T-1} G_t.
    • Unbiased, but can have infinite variance (the ratios are unbounded).
  2. Weighted IS: Weighted average of the returns, normalizing by the sum of the ratios.
    • Biased (though the bias → 0 as n → ∞), but with finite, typically much lower variance.
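The two estimators differ only in the denominator, which a one-step example makes concrete. The uniform behavior policy, deterministic target policy, and rewards below are all illustrative assumptions (the true value under π is 1):

```python
import random

def is_estimates(num_samples=10_000, seed=0):
    """Ordinary vs. weighted importance sampling on a one-step problem.

    Behavior b picks 'x'/'y' uniformly; target pi always picks 'x'.
    Reward: 1 for 'x', 0 for 'y', so v_pi = 1.
    """
    rng = random.Random(seed)
    b = {'x': 0.5, 'y': 0.5}
    pi = {'x': 1.0, 'y': 0.0}
    num = 0.0  # sum of rho * G (shared numerator)
    den = 0.0  # sum of rho (weighted-IS denominator)
    for _ in range(num_samples):
        a = 'x' if rng.random() < 0.5 else 'y'
        G = 1.0 if a == 'x' else 0.0
        rho = pi[a] / b[a]
        num += rho * G
        den += rho
    ordinary = num / num_samples          # unbiased, noisy
    weighted = num / den if den else 0.0  # biased, low variance
    return ordinary, weighted

ordinary, weighted = is_estimates()
```

Here weighted IS recovers exactly 1.0 (every surviving sample has G = 1), while ordinary IS only fluctuates around 1.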

Algorithm: Off-policy MC Control

Algorithm: Off-policy MC Control

Initialize: Q(s, a) arbitrarily, C(s, a) ← 0, π(s) ← argmax_a Q(s, a), for all s, a
Loop forever:

  1. Select a soft behavior policy b; generate an episode following b: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
  2. G ← 0, W ← 1
  3. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • C(S_t, A_t) ← C(S_t, A_t) + W
    • Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
    • π(S_t) ← argmax_a Q(S_t, a)
    • If A_t ≠ π(S_t) then exit inner loop
    • W ← W / b(A_t | S_t)

6. Incremental Implementation

Weighted IS can be implemented incrementally to avoid storing all returns. Given returns G_1, ..., G_{n-1}, each with a corresponding IS weight W_k, the estimate is

  V_n = (Σ_{k=1}^{n-1} W_k G_k) / (Σ_{k=1}^{n-1} W_k)

It can be maintained online with the update rule (C_0 = 0):

  C_n ← C_{n-1} + W_n
  V_{n+1} ← V_n + (W_n / C_n) (G_n − V_n)

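The update rule above is a few lines of code; a minimal sketch over an in-memory list of (return, weight) pairs:

```python
def incremental_weighted_average(pairs):
    """Incrementally compute a weighted average of (G, W) pairs.

    Implements C_n = C_{n-1} + W_n and V_{n+1} = V_n + (W_n / C_n)(G_n - V_n),
    so no returns need to be stored.
    """
    V, C = 0.0, 0.0
    for G, W in pairs:
        if W == 0:  # zero-weight samples contribute nothing
            continue
        C += W
        V += (W / C) * (G - V)
    return V

# Matches the batch weighted average: (2*1 + 1*3) / (2 + 1) = 5/3
v = incremental_weighted_average([(1.0, 2.0), (3.0, 1.0)])
```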

7. Diagrams

Backup Diagram: MC Prediction

      (S_t)       <-- Root (state to update)
        |
     [A_t, R_{t+1}]
        |
      (S_{t+1})
        |
     [A_{t+1}, R_{t+2}]
        |
       ...
        |
      ((T))       <-- Terminal (end of episode)

Contrast with DP: MC looks at a single, full trajectory.


Summary Key Points

  • MC learns from experience, avoiding the need for environment models.
  • Goal: Average returns to estimate expectations.
  • GPI applies: use evaluation (averaging returns) and improvement (greedy/ε-greedy).
  • Off-policy requires Importance Sampling to account for different behavior.
  • Variance is the main challenge in MC, especially in Off-policy IS.