Importance Sampling
Definition
Importance Sampling
Importance sampling is a technique for estimating expected values under one distribution using samples from another distribution. In RL, it enables off-policy learning: learning about a target policy $\pi$ from data generated by a different behavior policy $b$.
The Problem
We want to estimate $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, but our data comes from following behavior policy $b \neq \pi$. The returns observed under $b$ have the wrong distribution: their expectation is $v_b(s)$, not $v_\pi(s)$.
Importance Sampling Ratio
Importance Sampling Ratio
$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
where:
- $\pi(A_k \mid S_k)$ — probability of taking action $A_k$ in state $S_k$ under target policy $\pi$
- $b(A_k \mid S_k)$ — the same probability under behavior policy $b$
- The product is over all steps from $t$ to $T-1$, the end of the episode
What the Ratio Does
It reweights each trajectory to correct for the fact that it was generated under $b$ instead of $\pi$. If $\pi$ would have been more likely to take the observed actions, $\rho > 1$ (upweight). If less likely, $\rho < 1$ (downweight). If $\pi$ would never take an observed action ($\pi(A_k \mid S_k) = 0$), then $\rho = 0$ (ignore this trajectory entirely). The key property is that $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$.
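The reweighting above can be sketched in a few lines. This is a minimal illustration, not code from the source; the policy representation (callables mapping a state–action pair to a probability) and the toy two-action policies are assumptions for the example.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], float]  # (state, action) -> probability

def importance_ratio(trajectory: List[Tuple[int, int]],
                     target_policy: Policy,
                     behavior_policy: Policy) -> float:
    """Product of pi(a|s) / b(a|s) over every (state, action) pair.

    Returns 0.0 as soon as the target policy would never take an
    observed action, so the trajectory is effectively ignored.
    """
    rho = 1.0
    for s, a in trajectory:
        p_pi = target_policy(s, a)
        if p_pi == 0.0:
            return 0.0                      # pi would never do this
        rho *= p_pi / behavior_policy(s, a)  # coverage: b(a|s) > 0 here
    return rho

# Toy example: target is greedy on action 0, behavior is uniform over 2 actions.
pi = lambda s, a: 1.0 if a == 0 else 0.0
b = lambda s, a: 0.5

print(importance_ratio([(0, 0), (1, 0)], pi, b))  # (1.0 / 0.5)**2 = 4.0 (upweight)
print(importance_ratio([(0, 0), (1, 1)], pi, b))  # 0.0 — pi never takes action 1
```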
Coverage Requirement
Coverage Assumption
For importance sampling to work, the behavior policy must cover the target policy:
$$\pi(a \mid s) > 0 \implies b(a \mid s) > 0 \quad \text{for all } s, a$$
If there’s an action $\pi$ might take that $b$ would never take, we can never estimate its value from $b$’s data. ε-soft behavior policies satisfy coverage automatically, since they give every action nonzero probability.
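For a finite state and action space, the coverage condition can be checked by brute force. A small sketch (the function name and toy policies are hypothetical):

```python
from typing import Callable, Iterable

Policy = Callable[[int, int], float]  # (state, action) -> probability

def satisfies_coverage(states: Iterable[int], actions: Iterable[int],
                       target_policy: Policy, behavior_policy: Policy) -> bool:
    """True iff b(a|s) > 0 wherever pi(a|s) > 0."""
    return all(behavior_policy(s, a) > 0
               for s in states for a in actions
               if target_policy(s, a) > 0)

greedy = lambda s, a: 1.0 if a == 0 else 0.0
uniform = lambda s, a: 0.5  # epsilon-soft: every action has nonzero probability

print(satisfies_coverage([0, 1], [0, 1], greedy, uniform))  # True
print(satisfies_coverage([0, 1], [0, 1], uniform, greedy))  # False: greedy b never takes action 1
```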
Two Variants
Ordinary Importance Sampling
Ordinary IS
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
Simple average of importance-weighted returns, where $\mathcal{T}(s)$ is the set of time steps at which $s$ was visited.
- ✅ Unbiased
- ❌ High (potentially infinite) variance — a single large $\rho$ can dominate the average
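A minimal sketch of the ordinary-IS estimator over a batch of episodes (function name and the toy numbers are illustrative, not from the source):

```python
from typing import Sequence

def ordinary_is(returns: Sequence[float], ratios: Sequence[float]) -> float:
    """Ordinary IS: simple average of rho * G over all observed episodes."""
    assert len(returns) == len(ratios)
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

# Three episodes from the behavior policy: their returns and trajectory ratios.
# Note how the episode with rho = 4.0 dominates the estimate.
print(ordinary_is([10.0, 2.0, 5.0], [4.0, 0.0, 1.0]))  # (40 + 0 + 5) / 3 = 15.0
```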
Weighted Importance Sampling
Weighted IS
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
Weighted average of returns, where the denominator normalizes by the sum of ratios.
- ❌ Biased (for finite samples, bias → 0 asymptotically)
- ✅ Much lower variance — preferred in practice
- First-episode estimate is always $G_t$ itself (the ratio cancels between numerator and denominator), which keeps the error bounded
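The same toy batch under weighted IS shows the normalization in action; again the function name and numbers are illustrative:

```python
from typing import Sequence

def weighted_is(returns: Sequence[float], ratios: Sequence[float]) -> float:
    """Weighted IS: ratio-weighted average; denominator is the sum of ratios."""
    total_weight = sum(ratios)
    if total_weight == 0.0:
        return 0.0  # no observed episode is consistent with the target policy yet
    return sum(rho * g for rho, g in zip(ratios, returns)) / total_weight

print(weighted_is([10.0, 2.0, 5.0], [4.0, 0.0, 1.0]))  # (40 + 0 + 5) / 5 = 9.0
print(weighted_is([10.0], [4.0]))  # first episode: ratio cancels -> 10.0 exactly
```

Compare with ordinary IS on the same data (15.0): the normalization keeps the large-ratio episode from blowing up the estimate.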
Incremental Implementation
For weighted IS with per-episode updates:
$$C_n \doteq \sum_{k=1}^{n} W_k, \qquad V_{n+1} \doteq V_n + \frac{W_n}{C_n}\left(G_n - V_n\right)$$
where $W_n$ is the importance ratio of episode $n$ and $C_n$ is the cumulative sum of weights ($C_0 = 0$).
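The incremental rule above can be sketched as a small accumulator class — a minimal illustration (class and method names are assumptions), so no past episodes need to be stored:

```python
class IncrementalWeightedIS:
    """Maintains a weighted-IS value estimate incrementally.

    Per episode:  C <- C + W;  V <- V + (W / C) * (G - V)
    """

    def __init__(self) -> None:
        self.value = 0.0
        self.cumulative_weight = 0.0  # C: running sum of importance ratios

    def update(self, episode_return: float, weight: float) -> float:
        if weight == 0.0:
            return self.value  # episode inconsistent with the target policy
        self.cumulative_weight += weight
        self.value += (weight / self.cumulative_weight) * (episode_return - self.value)
        return self.value

est = IncrementalWeightedIS()
est.update(10.0, 4.0)  # first episode: estimate is G itself -> 10.0
est.update(2.0, 0.0)   # rho = 0: trajectory ignored
est.update(5.0, 1.0)   # (4*10 + 1*5) / 5 = 9.0
print(est.value)
```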
In Off-Policy Methods
- Monte Carlo Methods: multiply the entire episode return $G_t$ by $\rho_{t:T-1}$
- Temporal Difference Learning: per-step importance ratio $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$
- With Function Approximation: semi-gradient off-policy TD scales each update by the per-step $\rho_t$
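The per-step variant can be sketched as a single off-policy TD(0) update — a minimal tabular illustration under assumed names, not code from the source:

```python
from typing import Callable, Dict

Policy = Callable[[int, int], float]  # (state, action) -> probability

def off_policy_td0_update(V: Dict[int, float], s: int, a: int, r: float,
                          s_next: int, alpha: float, gamma: float,
                          pi: Policy, b: Policy) -> None:
    """One off-policy TD(0) update with a per-step importance ratio:

    V(s) <- V(s) + alpha * rho_t * (r + gamma * V(s') - V(s))
    """
    rho = pi(s, a) / b(s, a)            # single-step ratio, not a product
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error

V = {0: 0.0, 1: 0.0}
pi = lambda s, a: 1.0 if a == 0 else 0.0  # target: greedy on action 0
b = lambda s, a: 0.5                       # behavior: uniform over 2 actions

off_policy_td0_update(V, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9, pi=pi, b=b)
print(V[0])  # 0 + 0.1 * 2.0 * (1.0 + 0.9 * 0 - 0) = 0.2
```

Because only one step's ratio appears, TD-style updates avoid the long products that plague Monte Carlo importance sampling.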
Variance Problem
Variance Explosion
The product $\rho_{t:T-1} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) / b(A_k \mid S_k)$ can be astronomically large or exactly zero, making ordinary IS highly unstable. This is a fundamental challenge of off-policy MC methods. Weighted IS helps, but variance remains an issue for long episodes with very different policies.
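A toy simulation makes the explosion concrete (the setup is an assumption for illustration: target always takes action 0, behavior is uniform over two actions, episodes are 10 steps long). Every trajectory ratio is then either $0$ or $2^{10} = 1024$, so the quantities ordinary IS averages are extreme:

```python
import random

random.seed(0)
T = 10  # episode length

def episode_ratio(T: int) -> float:
    """Ratio for one episode: 2**T if behavior happened to match the
    always-action-0 target on every step, else exactly 0."""
    actions = [random.randrange(2) for _ in range(T)]
    return 2.0 ** T if all(a == 0 for a in actions) else 0.0

ratios = [episode_ratio(T) for _ in range(10_000)]
nonzero = sum(1 for rho in ratios if rho > 0)

# Almost every ratio is 0; the rare survivors are each 1024, so a single
# episode dominates any ordinary-IS average (whose expectation is still 1).
print(f"nonzero ratios: {nonzero} / 10000, each equal to {2.0 ** T:.0f}")
print(f"empirical mean of rho: {sum(ratios) / len(ratios):.3f}")
```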
Connections
- Enables: Off-Policy learning
- Used in: Monte Carlo Methods (off-policy MC control), Semi-Gradient Methods (off-policy TD)
- Problem: variance → motivates alternatives like Q-Learning (which avoids IS entirely for control)
- Extended by: Per-decision IS, discounting-aware IS