RL-L03: Monte Carlo Methods

Overview

Monte Carlo (MC) methods learn value functions and optimal policies from experience in the form of sample episodes.

Key Characteristics

  • Model-free: Unlike Dynamic Programming, MC does not require knowledge of the MDP dynamics p(s', r | s, a).
  • Averages Returns: Estimates are based on averaging sample returns for each state-action pair.
  • Episodic Tasks: Defined only for episodic tasks, since an episode must terminate before the return G_t can be computed.
  • No Bootstrapping: MC methods do not update estimates based on other estimates; they use actual sampled returns.

MC vs. DP Comparison

Feature       | Dynamic Programming                      | Monte Carlo Methods
Model         | Needs full model p(s', r | s, a)         | Model-free (sample experience)
Bootstrapping | Yes (updates based on next-state values) | No (updates based on returns)
Width         | Full (expectation over all transitions)  | Single (sample trajectory)
Depth         | 1-step lookahead                         | Full (until end of episode)

1. Monte Carlo Prediction

The goal of prediction is to estimate the state-value function v_π under a fixed policy π.

The Return

For a trajectory S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, ..., R_T, S_T, the return is

  G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_T

By the law of large numbers, the average of sampled returns converges to the expected value:

  v_π(s) = E_π[G_t | S_t = s] ≈ average(Returns(s))

First-Visit vs. Every-Visit MC

  • First-Visit MC: Averages returns only from the first time a state s is visited in each episode.
    • Each return is an i.i.d. estimate of v_π(s).
    • The estimate's standard error falls as 1/√n, where n is the number of first visits.
  • Every-Visit MC: Averages returns from all visits to s in each episode.
    • Estimates are not independent, but still converge quadratically to v_π(s).

First-Visit MC Prediction Algorithm

Algorithm: First-visit MC prediction

Input: a policy π to be evaluated
Initialize: V(s) arbitrarily and Returns(s) ← empty list, for all s
Loop forever (for each episode):

  1. Generate an episode following π: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
  2. G ← 0
  3. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • Unless S_t appears in S_0, S_1, ..., S_{t-1}:
      • Append G to Returns(S_t)
      • V(S_t) ← average(Returns(S_t))
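The algorithm above can be sketched in Python. `generate_episode` is assumed to return (state, action, reward) triples, and `toy_episode` below is a hypothetical two-state chain used only to exercise the code:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, gamma=1.0, num_episodes=1000):
    """First-visit MC prediction (sketch).

    generate_episode() must return a list of (S_t, A_t, R_{t+1}) triples,
    where R_{t+1} is the reward received after taking A_t in S_t.
    """
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:  # first visit to s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

def toy_episode():
    # Hypothetical deterministic chain: A -> B (reward 0), B -> terminal (reward 1).
    return [('A', 'go', 0.0), ('B', 'go', 1.0)]

V = first_visit_mc_prediction(toy_episode, gamma=0.5, num_episodes=100)
# With gamma = 0.5: v(B) = 1 and v(A) = 0 + 0.5 * 1 = 0.5
```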

2. Blackjack Example

Blackjack is a classic episodic MDP used to illustrate MC prediction.

  • Objective: Maximize the card sum without exceeding 21.
  • State Space:
    • Current sum (12-21)
    • Dealer’s showing card (Ace-10)
    • Usable Ace (Yes/No)
    • Total: 200 states.
  • Rewards: +1 for win, -1 for loss, 0 for draw.
  • Action: Hit or Stick.
  • Policy Evaluation: Average returns over thousands of simulated games (episodes).
  • Observation: States with usable aces are less frequent and thus have higher variance in the value function estimate.
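A heavily simplified simulator illustrates the evaluation procedure. Everything below is an assumption made for brevity (infinite deck, no usable-ace logic, dealer plays out from the showing card alone), so the numbers only demonstrate return-averaging, not the true Blackjack values:

```python
import random

def draw_card():
    # Simplified infinite deck: 1-10, with face cards counting as 10.
    return min(random.randint(1, 13), 10)

def play_dealer(showing):
    # Dealer hits until reaching 17 (hidden card and aces ignored for simplicity).
    total = showing
    while total < 17:
        total += draw_card()
    return total

def episode_return(player_sum, dealer_showing, stick_at=20):
    """One episode under the 'stick on stick_at or higher' policy."""
    while player_sum < stick_at:
        player_sum += draw_card()
        if player_sum > 21:
            return -1  # player busts
    dealer = play_dealer(dealer_showing)
    if dealer > 21 or player_sum > dealer:
        return 1
    return 0 if player_sum == dealer else -1

# MC estimate of the value of one state (sum 13, dealer shows 2): average returns.
random.seed(0)
returns = [episode_return(13, 2) for _ in range(10_000)]
v_hat = sum(returns) / len(returns)
```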

3. Monte Carlo Control

Control aims to approximate optimal policies using Generalized Policy Iteration (GPI).

Action Values q(s, a)

Without a model, state values alone are insufficient for control: choosing an action would require a one-step lookahead through the (unknown) dynamics. We must instead estimate action-value functions q_π(s, a).

The Exploration-Exploitation Dilemma

Many state-action pairs might never be visited if π is deterministic. Three solutions:

  1. Exploring Starts: Start every episode at a random state-action pair, with every pair having non-zero probability of being selected.
  2. On-Policy: Use ε-greedy (soft) policies.
  3. Off-Policy: Use a separate behavior policy to explore.

Algorithm: Monte Carlo ES (Exploring Starts)

This algorithm alternates between evaluation and improvement episode-by-episode.

Algorithm: Monte Carlo ES

Initialize: π(s) arbitrarily, Q(s, a) arbitrarily, Returns(s, a) ← empty list, for all s, a
Loop forever:

  1. Choose S_0, A_0 such that all pairs have probability > 0 (Exploring Starts)
  2. Generate an episode from S_0, A_0 following π: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
  3. G ← 0
  4. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • Unless the pair (S_t, A_t) appeared earlier in the episode:
      • Append G to Returns(S_t, A_t)
      • Q(S_t, A_t) ← average(Returns(S_t, A_t))
      • π(S_t) ← argmax_a Q(S_t, a)
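A Python sketch of Monte Carlo ES, written against a generic `step(s, a) -> (reward, next_state)` interface; the one-state `step` environment in the usage example is hypothetical:

```python
import random
from collections import defaultdict

def mc_es(states, actions, step, gamma=0.9, num_episodes=2000, max_steps=50):
    """Monte Carlo ES (sketch). step(s, a) returns (reward, next_state),
    with next_state = None signalling a terminal transition."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(actions) for s in states}
    for _ in range(num_episodes):
        # Exploring start: a uniformly random state-action pair.
        s, a = random.choice(states), random.choice(actions)
        episode = []
        for _ in range(max_steps):
            r, s_next = step(s, a)
            episode.append((s, a, r))
            if s_next is None:
                break
            s, a = s_next, pi[s_next]  # thereafter, follow pi
        G = 0.0
        pairs = [(s_, a_) for (s_, a_, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:  # first visit to (s, a)
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions, key=lambda a2: Q[(s, a2)])  # greedy improvement
    return Q, pi

# Hypothetical one-state MDP: 'right' pays +1, 'left' pays 0; both terminate.
def step(s, a):
    return (1.0 if a == 'right' else 0.0), None

random.seed(0)
Q, pi = mc_es(states=[0], actions=['left', 'right'], step=step)
```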

4. On-Policy MC Control (ε-greedy)

Avoids exploring starts by using a soft policy, i.e., one with π(a|s) > 0 for all actions (e.g., ε-greedy).

ε-greedy Improvement

For an ε-soft policy π, the ε-greedy policy π' with respect to q_π is an improvement: v_{π'}(s) ≥ v_π(s) for all s.

Proof Idea (Policy Improvement Theorem): show q_π(s, π'(s)) ≥ v_π(s) for all s:

  q_π(s, π'(s)) = Σ_a π'(a|s) q_π(s, a)
                = (ε/|A(s)|) Σ_a q_π(s, a) + (1 − ε) max_a q_π(s, a)
                ≥ (ε/|A(s)|) Σ_a q_π(s, a) + (1 − ε) Σ_a [(π(a|s) − ε/|A(s)|) / (1 − ε)] q_π(s, a)
                = Σ_a π(a|s) q_π(s, a) = v_π(s)

The inequality holds because the max dominates any convex combination, and the weights (π(a|s) − ε/|A(s)|)/(1 − ε) are non-negative and sum to 1 precisely because π is ε-soft.

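The ε-greedy action selection itself is one small function. A minimal sketch (the `Q` dictionary keyed by (state, action) is an assumed representation):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from the ε-greedy policy w.r.t. Q.

    With probability ε, pick uniformly at random; otherwise pick the greedy
    action. Each non-greedy action thus has probability ε/|A|, and the
    greedy action 1 − ε + ε/|A|.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# With ε = 0 the choice is purely greedy:
Q = {('s0', 'hit'): 1.0, ('s0', 'stick'): 0.0}
greedy = epsilon_greedy_action(Q, 's0', ['hit', 'stick'], epsilon=0.0)
```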
5. Off-Policy Prediction and Control

Learn about a target policy π while following a different behavior policy b (b ≠ π).

Coverage Assumption

The behavior policy b must be able to take any action that π might take: π(a|s) > 0 implies b(a|s) > 0.

Importance Sampling (IS) Ratio

To transform expectations under b into expectations under π, we weight returns by the relative probability of the trajectory occurring under π vs. b:

  ρ_{t:T-1} = ∏_{k=t}^{T-1} [π(A_k | S_k) p(S_{k+1} | S_k, A_k)] / [b(A_k | S_k) p(S_{k+1} | S_k, A_k)] = ∏_{k=t}^{T-1} π(A_k | S_k) / b(A_k | S_k)

Note: the transition dynamics p cancel out, so the ratio is computable without a model!

Types of Importance Sampling

  1. Ordinary IS: Simple average of the scaled returns ρ_{t:T-1} G_t.
    • Unbiased, but can have infinite variance (the ratios are unbounded).
  2. Weighted IS: Weighted average of the returns, normalizing by the sum of the ratios.
    • Biased (though the bias → 0 as n → ∞), but with finite, typically much lower variance.
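The two estimators differ only in the denominator, which a one-step example makes concrete. The uniform behavior policy, deterministic target policy, and rewards below are all illustrative assumptions (the true value under π is 1):

```python
import random

def is_estimates(num_samples=10_000, seed=0):
    """Ordinary vs. weighted importance sampling on a one-step problem.

    Behavior b picks 'x'/'y' uniformly; target pi always picks 'x'.
    Reward: 1 for 'x', 0 for 'y', so v_pi = 1.
    """
    rng = random.Random(seed)
    b = {'x': 0.5, 'y': 0.5}
    pi = {'x': 1.0, 'y': 0.0}
    num = 0.0  # sum of rho * G (shared numerator)
    den = 0.0  # sum of rho (weighted-IS denominator)
    for _ in range(num_samples):
        a = 'x' if rng.random() < 0.5 else 'y'
        G = 1.0 if a == 'x' else 0.0
        rho = pi[a] / b[a]
        num += rho * G
        den += rho
    ordinary = num / num_samples          # unbiased, noisy
    weighted = num / den if den else 0.0  # biased, low variance
    return ordinary, weighted

ordinary, weighted = is_estimates()
```

Here weighted IS recovers exactly 1.0 (every surviving sample has G = 1), while ordinary IS only fluctuates around 1.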

Algorithm: Off-policy MC Control

Algorithm: Off-policy MC Control

Initialize: Q(s, a) arbitrarily, C(s, a) ← 0, π(s) ← argmax_a Q(s, a), for all s, a
Loop forever:

  1. Select a soft behavior policy b; generate an episode following b: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
  2. G ← 0, W ← 1
  3. Loop backwards over t = T-1, T-2, ..., 0:
    • G ← γG + R_{t+1}
    • C(S_t, A_t) ← C(S_t, A_t) + W
    • Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
    • π(S_t) ← argmax_a Q(S_t, a)
    • If A_t ≠ π(S_t) then exit inner loop
    • W ← W / b(A_t | S_t)

6. Incremental Implementation

Weighted IS can be implemented incrementally to avoid storing all returns. Given returns G_1, ..., G_{n-1}, each with a corresponding IS weight W_k, the estimate is

  V_n = (Σ_{k=1}^{n-1} W_k G_k) / (Σ_{k=1}^{n-1} W_k)

It can be maintained online with the update rule (C_0 = 0):

  C_n ← C_{n-1} + W_n
  V_{n+1} ← V_n + (W_n / C_n) (G_n − V_n)

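The update rule above is a few lines of code; a minimal sketch over an in-memory list of (return, weight) pairs:

```python
def incremental_weighted_average(pairs):
    """Incrementally compute a weighted average of (G, W) pairs.

    Implements C_n = C_{n-1} + W_n and V_{n+1} = V_n + (W_n / C_n)(G_n - V_n),
    so no returns need to be stored.
    """
    V, C = 0.0, 0.0
    for G, W in pairs:
        if W == 0:  # zero-weight samples contribute nothing
            continue
        C += W
        V += (W / C) * (G - V)
    return V

# Matches the batch weighted average: (2*1 + 1*3) / (2 + 1) = 5/3
v = incremental_weighted_average([(1.0, 2.0), (3.0, 1.0)])
```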

7. Diagrams

Backup Diagram: MC Prediction

      (S_t)       <-- Root (state to update)
        |
     [A_t, R_{t+1}]
        |
      (S_{t+1})
        |
     [A_{t+1}, R_{t+2}]
        |
       ...
        |
      ((T))       <-- Terminal (end of episode)

Contrast with DP: MC looks at a single, full trajectory.


Summary Key Points

  • MC learns from experience, avoiding the need for environment models.
  • Goal: Average returns to estimate expectations.
  • GPI applies: use evaluation (averaging returns) and improvement (greedy/ε-greedy).
  • Off-policy requires Importance Sampling to account for different behavior.
  • Variance is the main challenge in MC, especially in Off-policy IS.