Chapter 5: Monte Carlo Methods

Overview

Monte Carlo (MC) methods are ways of solving the reinforcement learning problem based on averaging sample returns. Unlike Dynamic Programming (DP), MC methods do not assume complete knowledge of the environment's dynamics (the dynamics function p(s', r | s, a)). They require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction.

Key Characteristics

  • Experience-based: Learns from sample trajectories.
  • No Model Required: Only requires samples, not transition probabilities.
  • Episodic Tasks Only: Returns must be well-defined at the end of an episode.
  • No Bootstrapping: Estimates for one state do not depend on estimates for other states (unlike DP).

5.1 Monte Carlo Prediction

MC Prediction aims to estimate the Value Function v_pi for a fixed policy pi.

First-Visit vs. Every-Visit MC

  • First-Visit MC: Averages the returns following only the first visit to state s in each episode.
  • Every-Visit MC: Averages the returns following every visit to state s in an episode.

Both converge to v_pi(s) as the number of visits (or first visits) to s approaches infinity.

Algorithm: First-visit MC Prediction

Initialize:
    pi = policy to be evaluated
    V(s) = arbitrary real numbers, for all s in S
    Returns(s) = empty list, for all s in S
 
Loop forever (for each episode):
    Generate an episode following pi: S0, A0, R1, ..., ST-1, AT-1, RT
    G = 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G = gamma * G + Rt+1
        Unless St appears in S0, S1, ..., St-1:
            Append G to Returns(St)
            V(St) = average(Returns(St))
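The pseudocode above translates almost line-for-line into Python. A minimal sketch, using a made-up deterministic two-state chain as the episode source (the chain and function names are illustrative, not from the text):

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns.

    generate_episode() returns a list of (state, reward) pairs, where
    reward is the reward received on leaving that state.
    """
    returns = defaultdict(list)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # walk backwards
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only the earliest occurrence counts.
            if all(s != state for s, _ in episode[:t]):
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V

# Deterministic toy chain: A --0--> B --1--> terminal.
V = first_visit_mc_prediction(lambda: [("A", 0.0), ("B", 1.0)], num_episodes=10)
```

With gamma = 1 every episode returns G = 1 from both A and B, so the averages converge immediately.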

Blackjack

In Blackjack, the state includes the player’s sum (12-21), the dealer’s showing card (Ace-10), and whether the player has a “usable ace”.

  • Goal: Obtain a card sum as close to 21 as possible without exceeding it, and higher than the dealer's sum.
  • Rewards: +1 for win, -1 for loss, 0 for draw.
  • MC Advantage: Obtaining the exact probabilities for dealer card transitions is complex, but simulating games is easy.
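To illustrate the "simulating games is easy" point, here is a much-simplified blackjack episode generator (our own sketch, not the book's exact formulation: infinite deck, no naturals bonus, no doubling or splitting):

```python
import random

def draw(rng):
    # Infinite deck: face cards count as 10, the ace as 1 (usable as 11).
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # Best total, plus whether an ace is currently usable as 11.
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy, rng):
    """Play one simplified hand; returns (visited states, reward).

    A state is (player_sum, dealer_showing, usable_ace). The player
    hits automatically below 12, where hitting can never bust.
    """
    dealer = [draw(rng), draw(rng)]
    player = [draw(rng), draw(rng)]
    states = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return states, -1            # player busts
        if total < 12:
            player.append(draw(rng))
            continue
        state = (total, dealer[0], usable)
        states.append(state)
        if policy(state) == "hit":
            player.append(draw(rng))
        else:
            break
    while hand_value(dealer)[0] < 17:    # dealer hits until 17 or more
        dealer.append(draw(rng))
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return states, 1
    return states, 0 if p == d else -1

rng = random.Random(0)
stick_on_20 = lambda s: "stick" if s[0] >= 20 else "hit"
episodes = [play_episode(stick_on_20, rng) for _ in range(200)]
```

Feeding these episodes into first-visit MC prediction yields value estimates without ever writing down the dealer's transition probabilities.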

Backup Diagrams

For MC estimation of v_pi:

  • Root: A state node.
  • Path: A single entire trajectory ending at a terminal state.
  • Contrast: DP diagrams show all possible transitions for a single step; MC diagrams show a single sampled trajectory running all the way to the terminal state.

5.2 Monte Carlo Estimation of Action Values

Without a model, estimating v_pi is insufficient for control: state values alone cannot rank actions without the transition dynamics. We must instead estimate the Action-Value Function q_pi(s, a).

The Exploration Problem

If pi is deterministic, many state-action pairs may never be visited. To evaluate all actions, we need to maintain exploration.

One way to ensure all state-action pairs are visited is the assumption of Exploring Starts: episodes begin at a state-action pair, and every pair has a non-zero probability of being selected as the start.


5.3 Monte Carlo Control

Control follows the Generalized Policy Iteration (GPI) pattern:

  1. Policy Evaluation: Estimate q_pi.
  2. Policy Improvement: Make pi greedy with respect to Q: pi(s) = argmax_a Q(s, a).

Algorithm: Monte Carlo ES (Exploring Starts)

Initialize:
    pi(s) in A(s) (arbitrarily)
    Q(s, a) in R (arbitrarily)
    Returns(s, a) = empty list
 
Loop forever (for each episode):
    Choose S0 in S, A0 in A(S0) randomly (Exploring Starts)
    Generate an episode from S0, A0, following pi
    G = 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G = gamma * G + Rt+1
        Unless St, At appears in S0, A0, ..., St-1, At-1:
            Append G to Returns(St, At)
            Q(St, At) = average(Returns(St, At))
            pi(St) = argmax_a Q(St, a)
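A compact Python rendering of Monte Carlo ES, applied to a hypothetical one-state toy problem (the toy MDP and all names here are ours, for illustration only):

```python
import random
from collections import defaultdict

def mc_es(step, states, actions, num_episodes, gamma=1.0, seed=0):
    """Monte Carlo ES (Exploring Starts) for an episodic MDP.

    step(s, a) -> (next_state or None, reward); None ends the episode.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: actions[0] for s in states}
    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair can begin an episode.
        s, a = rng.choice(states), rng.choice(actions)
        episode = []
        while s is not None:
            s_next, r = step(s, a)
            episode.append((s, a, r))
            s = s_next
            a = pi[s] if s is not None else None
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check on the (state, action) pair.
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, pi

# Toy one-state problem: "good" pays 1, "bad" pays 0, then terminate.
toy_step = lambda s, a: (None, 1.0 if a == "good" else 0.0)
Q, pi = mc_es(toy_step, states=["s0"], actions=["bad", "good"], num_episodes=50)
```

Because the exploring starts force both actions to be tried, the greedy policy reliably settles on the rewarding action.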

5.4 On-Policy MC Control

To avoid the unrealistic "Exploring Starts" assumption, on-policy methods instead use an Epsilon-Greedy Policy to ensure continuous exploration.

Epsilon-Greedy Policy

For a state s with |A(s)| actions, the greedy action is selected with probability 1 - eps + eps/|A(s)|, and each non-greedy action with probability eps/|A(s)|.
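As a small check, the epsilon-greedy action probabilities (greedy action: 1 - eps + eps/|A|; each other action: eps/|A|) can be computed explicitly. A minimal sketch with an illustrative function name:

```python
def epsilon_greedy_probs(q_values, eps):
    """Action probabilities for an epsilon-greedy policy.

    q_values: list of Q(s, a) values, one per action in A(s).
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    # Greedy action gets the leftover mass; everyone gets eps/n.
    return [1 - eps + eps / n if a == greedy else eps / n
            for a in range(n)]

probs = epsilon_greedy_probs([0.0, 2.0, 1.0], eps=0.1)
```

Note the probabilities always sum to one, and every action keeps at least eps/|A| probability, which is exactly what keeps exploration alive.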

Algorithm: On-policy first-visit MC control

Initialize:
    pi = an arbitrary epsilon-soft policy
    Q(s, a) = arbitrary
    Returns(s, a) = empty list
 
Repeat forever:
    Generate an episode following pi
    G = 0
    Loop for each step t = T-1, ..., 0:
        G = gamma * G + Rt+1
        Unless St, At appears in earlier steps:
            Append G to Returns(St, At)
            Q(St, At) = average(Returns(St, At))
            A_star = argmax_a Q(St, a)
            For all a in A(St):
                pi(a|St) = (1 - eps + eps/|A(St)|) if a == A_star else (eps/|A(St)|)

5.5 Off-Policy Prediction via Importance Sampling

Off-Policy Learning evaluates or improves a target policy pi while following a different behavior policy b.

Assumption of Coverage

Every action taken under pi must also be taken, at least occasionally, under b: pi(a|s) > 0 implies b(a|s) > 0.

Importance Sampling Ratio

The relative probability of a trajectory occurring under pi vs. b:

    rho(t:T-1) = product for k = t to T-1 of pi(Ak|Sk) / b(Ak|Sk)
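Over a trajectory, the ratio rho is just a product of per-step probability ratios pi(Ak|Sk)/b(Ak|Sk). A minimal sketch (the policy functions here are illustrative):

```python
def importance_ratio(trajectory, pi, b):
    """Product of pi(a|s) / b(a|s) over the (state, action) trajectory.

    pi(a, s) and b(a, s) return the probability of action a in state s.
    """
    rho = 1.0
    for s, a in trajectory:
        rho *= pi(a, s) / b(a, s)
    return rho

# Deterministic target policy vs. a uniform two-action behavior policy:
pi = lambda a, s: 1.0 if a == "left" else 0.0
b = lambda a, s: 0.5
rho = importance_ratio([("s0", "left"), ("s1", "left")], pi, b)
rho_zero = importance_ratio([("s0", "right")], pi, b)
```

Trajectories the target policy would never generate get ratio 0 and contribute nothing, which is how coverage (b must allow everything pi does) matters in practice.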

Two Types of IS

Let T(s) be the set of time steps in which state s was visited.

  1. Ordinary Importance Sampling: V(s) = [sum over t in T(s) of rho(t:T-1) * Gt] / |T(s)|
    • Unbiased, but can have infinite variance.
  2. Weighted Importance Sampling: V(s) = [sum over t in T(s) of rho(t:T-1) * Gt] / [sum over t in T(s) of rho(t:T-1)]
    • Biased (though the bias converges to zero asymptotically), but much lower variance. Strongly preferred in practice.

5.6 Incremental Implementation

To avoid storing all returns, we update estimates incrementally. For Weighted IS, we maintain for each pair a cumulative sum of the weights, C(s, a).

Incremental Update

For each return G arriving with weight W = rho:

    C(s, a) <- C(s, a) + W
    Q(s, a) <- Q(s, a) + (W / C(s, a)) * [G - Q(s, a)]

Algorithm: Off-policy MC Prediction

Initialize: Q(s, a) arbitrarily, C(s, a) = 0
Repeat forever:
    b = any policy with coverage of pi
    Generate episode following b
    G = 0, W = 1
    Loop for each step t = T-1, ..., 0, while W != 0:
        G = gamma * G + Rt+1
        C(St, At) = C(St, At) + W
        Q(St, At) = Q(St, At) + (W / C(St, At)) * [G - Q(St, At)]
        W = W * (pi(At|St) / b(At|St))
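The incremental rule can be checked against the batch weighted average sum(W*G)/sum(W). A quick sketch with made-up weight/return pairs:

```python
def incremental_weighted_avg(pairs, q0=0.0):
    """Apply Q <- Q + (W/C)(G - Q), with C accumulating the weights W."""
    Q, C = q0, 0.0
    for G, W in pairs:
        if W == 0.0:
            continue           # zero-weight returns change nothing
        C += W
        Q += (W / C) * (G - Q)
    return Q

# (G, W) pairs; batch answer is (2*1 + 0.5*3 + 1.5*2) / (2 + 0.5 + 1.5).
pairs = [(1.0, 2.0), (3.0, 0.5), (2.0, 1.5)]
Q = incremental_weighted_avg(pairs, q0=99.0)   # initial Q is irrelevant
```

The first nonzero-weight update overwrites the arbitrary initial Q entirely (since W/C = 1 then), which is why the initialization does not matter.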

5.7 Off-Policy Monte Carlo Control

Separates the behavior policy b (exploratory, e.g. epsilon-soft) from the target policy pi (greedy with respect to Q, the policy being learned about and driven toward optimality).

Algorithm: Off-policy MC Control

Initialize: Q(s, a) arbitrarily, C(s, a) = 0, pi(s) = argmax Q(s, a)
Repeat forever:
    b = any epsilon-soft policy
    Generate episode following b
    G = 0, W = 1
    Loop for each step t = T-1, ..., 0:
        G = gamma * G + Rt+1
        C(St, At) = C(St, At) + W
        Q(St, At) = Q(St, At) + (W / C(St, At)) * [G - Q(St, At)]
        pi(St) = argmax_a Q(St, a)
        If At != pi(St): break Loop
        W = W * (1 / b(At|St))

Efficiency Issue

Off-policy MC only learns from the tails of episodes (once behavior matches the greedy policy). This can significantly slow down learning in long episodes.


5.8 & 5.9 Improving Importance Sampling*

  • Discounting-aware IS: Uses the structure of discounted returns to associate importance-sampling ratios only with the relevant rewards, reducing variance when gamma < 1.
  • Per-decision IS: Even for gamma = 1, helps by observing that each reward depends only on the actions taken before it, e.g. E[rho(t:T-1) * R(t+1)] = E[rho(t:t) * R(t+1)], so the ratios for later actions—which cannot affect earlier rewards—can be dropped, removing noise.

Summary: MC vs. DP

Feature             | Dynamic Programming (DP)              | Monte Carlo Methods (MC)
Model               | Requires the dynamics p(s', r | s, a) | Requires only samples
Bootstrapping       | Yes (estimates from estimates)        | No (uses complete returns)
Independence        | Estimates are interdependent          | Estimates for one state are independent
Applicability       | Full state sweeps                     | Can focus on subsets of states
Markov Requirement  | Strongly sensitive                    | Less harmed by Markov violations