RL-CA02: Coding Assignment 2 — Monte Carlo Methods
Overview
Implementation of Monte Carlo prediction and control methods on a Blackjack environment.
Files:
- `MC_lab.ipynb` — Main notebook
- `blackjack.py` — Blackjack environment
- `mc_autograde.py` — Autograding tests
What You Implement
- On-policy MC prediction: First-visit MC to estimate the state-value function $V_\pi$ and action-value function $Q_\pi$
- On-policy MC control: With ε-greedy policy improvement
- Off-policy MC prediction: Using ordinary importance sampling
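The ε-greedy improvement step in MC control can be sketched as follows. This is a minimal illustration, not the assignment's required interface; the function name `epsilon_greedy_action` and the dict-of-arrays layout for `Q` are my own assumptions.

```python
import numpy as np

def epsilon_greedy_action(Q, s, n_actions, epsilon, rng):
    """Epsilon-greedy selection: explore uniformly with prob. epsilon,
    otherwise act greedily with respect to the current Q estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[s]))              # exploit: greedy action

# Tiny usage example with a single state and two actions
rng = np.random.default_rng(0)
Q = {"s0": np.array([0.1, 0.9])}
a = epsilon_greedy_action(Q, "s0", n_actions=2, epsilon=0.1, rng=rng)
```

With `epsilon=0` this reduces to the greedy policy; in control, ε keeps every action's probability positive so exploration never stops.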
Key Implementation Details
First-Visit MC Prediction
```python
# For each episode:
# 1. Generate an episode following pi
# 2. Walk backwards through the episode
# 3. Compute the return G, updating V(s) with a running average
G = 0
for t in reversed(range(len(episode))):
    s, a, r = episode[t]
    G = gamma * G + r
    if s not in [x[0] for x in episode[:t]]:  # first-visit check
        N[s] += 1
        V[s] += (G - V[s]) / N[s]  # incremental average
```
Off-policy with Ordinary IS
Key: the importance sampling ratio for an episode generated from behavior policy $b$, evaluating target policy $\pi$:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
Incremental update (from HW2 Q1a), with $N(s)$ counting visits to $s$:

$$V(s) \leftarrow V(s) + \frac{\rho\, G - V(s)}{N(s)}$$
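The ratio and the update can be combined in one backward pass, accumulating ρ as you walk back through the episode. A minimal sketch, assuming `pi` and `b` are dicts mapping `(s, a)` to action probabilities (the function name and this every-visit variant are my own simplifications):

```python
def ordinary_is_update(V, N, episode, pi, b, gamma=1.0):
    """Ordinary importance-sampling update of V from one episode
    generated by behavior policy b, evaluating target policy pi.
    episode: list of (s, a, r) tuples."""
    G, rho = 0.0, 1.0
    for (s, a, r) in reversed(episode):
        G = gamma * G + r
        rho *= pi[(s, a)] / b[(s, a)]  # accumulates rho_{t:T-1} backwards
        N[s] = N.get(s, 0) + 1
        V[s] = V.get(s, 0.0) + (rho * G - V.get(s, 0.0)) / N[s]
    return V, N

# One-step episode: rho = 1.0 / 0.5 = 2, G = 1, so V["s"] = 2.0
V, N = ordinary_is_update({}, {}, [("s", 0, 1.0)],
                          {("s", 0): 1.0}, {("s", 0): 0.5})
```

Walking backwards means ρ at step `t` already contains exactly the factors for actions `t` through `T-1`, matching the ratio above.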
Key Takeaways
- MC is model-free — doesn’t need transition probabilities
- First-visit MC: simpler, unbiased
- Ordinary IS: unbiased but high variance (visible in the plots)
- Weighted IS: biased but much lower variance (smoother convergence)
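To illustrate the last bullet, weighted IS normalizes by the cumulative sum of ratios rather than the visit count, which is what damps the variance. A minimal sketch (the names `weighted_is_update` and `C` are my own; `C[s]` accumulates the ratios `W` seen at `s`):

```python
def weighted_is_update(V, C, s, G, W):
    """Weighted-IS incremental update: V(s) += (W / C(s)) * (G - V(s))."""
    C[s] = C.get(s, 0.0) + W
    V[s] = V.get(s, 0.0) + (W / C[s]) * (G - V.get(s, 0.0))

V, C = {}, {}
weighted_is_update(V, C, "s", G=1.0, W=1.0)
weighted_is_update(V, C, "s", G=3.0, W=1.0)
# V["s"] is now the weighted average (1 + 3) / 2 = 2.0
```

Because the estimate is a ratio-weighted average of returns, a single huge ρ cannot blow it up the way it can with ordinary IS, at the cost of a bias that vanishes as episodes accumulate.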
Related Homework
See RL-HW02 - Homework 2 for theoretical questions.