Importance Sampling
Definition
Importance Sampling
Importance sampling is a technique for estimating expected values under one distribution using samples from another distribution. In RL, it enables off-policy learning: learning about a target policy $\pi$ from data generated by a different behavior policy $b$.
The Problem
We want to estimate $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, but our data comes from following behavior policy $b \neq \pi$. The returns observed under $b$ have the wrong distribution: their expectation is $v_b(s)$, not $v_\pi(s)$.
Importance Sampling Ratio
Importance Sampling Ratio
$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
where:
- $\pi(A_k \mid S_k)$ — probability of taking action $A_k$ in state $S_k$ under target policy $\pi$
- $b(A_k \mid S_k)$ — the same probability under behavior policy $b$
- The product is over all steps from $t$ to $T-1$, the end of the episode
What the Ratio Does
It reweights each trajectory to correct for the fact that it was generated under $b$ instead of $\pi$. If $\pi$ would have been more likely to take the observed actions, $\rho > 1$ (upweight). If less likely, $\rho < 1$ (downweight). If $\pi$ would never take an observed action ($\pi(A_k \mid S_k) = 0$), then $\rho = 0$ (ignore this trajectory entirely). The key property is that $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$.
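The reweighting above can be sketched in a few lines. This is a minimal illustration, not code from the source; the policy representation (callables mapping a state–action pair to a probability) and the toy two-action policies are assumptions for the example.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], float]  # (state, action) -> probability

def importance_ratio(trajectory: List[Tuple[int, int]],
                     target_policy: Policy,
                     behavior_policy: Policy) -> float:
    """Product of pi(a|s) / b(a|s) over every (state, action) pair.

    Returns 0.0 as soon as the target policy would never take an
    observed action, so the trajectory is effectively ignored.
    """
    rho = 1.0
    for s, a in trajectory:
        p_pi = target_policy(s, a)
        if p_pi == 0.0:
            return 0.0                      # pi would never do this
        rho *= p_pi / behavior_policy(s, a)  # coverage: b(a|s) > 0 here
    return rho

# Toy example: target is greedy on action 0, behavior is uniform over 2 actions.
pi = lambda s, a: 1.0 if a == 0 else 0.0
b = lambda s, a: 0.5

print(importance_ratio([(0, 0), (1, 0)], pi, b))  # (1.0 / 0.5)**2 = 4.0 (upweight)
print(importance_ratio([(0, 0), (1, 1)], pi, b))  # 0.0 — pi never takes action 1
```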
Coverage Requirement
Coverage Assumption
For importance sampling to work, the behavior policy must cover the target policy:
$$\pi(a \mid s) > 0 \implies b(a \mid s) > 0 \quad \text{for all } s, a$$
If there’s an action $\pi$ might take that $b$ would never take, we can never estimate its value from $b$’s data. ε-soft behavior policies satisfy coverage automatically, since they give every action nonzero probability.
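For a finite state and action space, the coverage condition can be checked by brute force. A small sketch (the function name and toy policies are hypothetical):

```python
from typing import Callable, Iterable

Policy = Callable[[int, int], float]  # (state, action) -> probability

def satisfies_coverage(states: Iterable[int], actions: Iterable[int],
                       target_policy: Policy, behavior_policy: Policy) -> bool:
    """True iff b(a|s) > 0 wherever pi(a|s) > 0."""
    return all(behavior_policy(s, a) > 0
               for s in states for a in actions
               if target_policy(s, a) > 0)

greedy = lambda s, a: 1.0 if a == 0 else 0.0
uniform = lambda s, a: 0.5  # epsilon-soft: every action has nonzero probability

print(satisfies_coverage([0, 1], [0, 1], greedy, uniform))  # True
print(satisfies_coverage([0, 1], [0, 1], uniform, greedy))  # False: greedy b never takes action 1
```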
Two Variants
Ordinary Importance Sampling
Ordinary IS
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
Simple average of importance-weighted returns, where $\mathcal{T}(s)$ is the set of time steps at which $s$ was visited.
- ✅ Unbiased
- ❌ High (potentially infinite) variance — a single large $\rho$ can dominate the average
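A minimal sketch of the ordinary-IS estimator over a batch of episodes (function name and the toy numbers are illustrative, not from the source):

```python
from typing import Sequence

def ordinary_is(returns: Sequence[float], ratios: Sequence[float]) -> float:
    """Ordinary IS: simple average of rho * G over all observed episodes."""
    assert len(returns) == len(ratios)
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

# Three episodes from the behavior policy: their returns and trajectory ratios.
# Note how the episode with rho = 4.0 dominates the estimate.
print(ordinary_is([10.0, 2.0, 5.0], [4.0, 0.0, 1.0]))  # (40 + 0 + 5) / 3 = 15.0
```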
Weighted Importance Sampling
Weighted IS
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
Weighted average of returns, where the denominator normalizes by the sum of ratios.
- ❌ Biased (for finite samples, bias → 0 asymptotically)
- ✅ Much lower variance — preferred in practice
- First-episode estimate is always $G_t$ itself (the ratio cancels between numerator and denominator), which keeps the error bounded
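The same toy batch under weighted IS shows the normalization in action; again the function name and numbers are illustrative:

```python
from typing import Sequence

def weighted_is(returns: Sequence[float], ratios: Sequence[float]) -> float:
    """Weighted IS: ratio-weighted average; denominator is the sum of ratios."""
    total_weight = sum(ratios)
    if total_weight == 0.0:
        return 0.0  # no observed episode is consistent with the target policy yet
    return sum(rho * g for rho, g in zip(ratios, returns)) / total_weight

print(weighted_is([10.0, 2.0, 5.0], [4.0, 0.0, 1.0]))  # (40 + 0 + 5) / 5 = 9.0
print(weighted_is([10.0], [4.0]))  # first episode: ratio cancels -> 10.0 exactly
```

Compare with ordinary IS on the same data (15.0): the normalization keeps the large-ratio episode from blowing up the estimate.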
Incremental Implementation
For weighted IS with per-episode updates:
$$C_n \doteq \sum_{k=1}^{n} W_k, \qquad V_{n+1} \doteq V_n + \frac{W_n}{C_n}\left(G_n - V_n\right)$$
where $W_n$ is the importance ratio of episode $n$ and $C_n$ is the cumulative sum of weights ($C_0 = 0$).
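The incremental rule above can be sketched as a small accumulator class — a minimal illustration (class and method names are assumptions), so no past episodes need to be stored:

```python
class IncrementalWeightedIS:
    """Maintains a weighted-IS value estimate incrementally.

    Per episode:  C <- C + W;  V <- V + (W / C) * (G - V)
    """

    def __init__(self) -> None:
        self.value = 0.0
        self.cumulative_weight = 0.0  # C: running sum of importance ratios

    def update(self, episode_return: float, weight: float) -> float:
        if weight == 0.0:
            return self.value  # episode inconsistent with the target policy
        self.cumulative_weight += weight
        self.value += (weight / self.cumulative_weight) * (episode_return - self.value)
        return self.value

est = IncrementalWeightedIS()
est.update(10.0, 4.0)  # first episode: estimate is G itself -> 10.0
est.update(2.0, 0.0)   # rho = 0: trajectory ignored
est.update(5.0, 1.0)   # (4*10 + 1*5) / 5 = 9.0
print(est.value)
```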
In Off-Policy Methods
- Monte Carlo Methods: multiply the entire episode return $G_t$ by $\rho_{t:T-1}$
- Temporal Difference Learning: per-step importance ratio $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$
- With Function Approximation: semi-gradient off-policy TD scales each update by the per-step $\rho_t$
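The per-step variant can be sketched as a single off-policy TD(0) update — a minimal tabular illustration under assumed names, not code from the source:

```python
from typing import Callable, Dict

Policy = Callable[[int, int], float]  # (state, action) -> probability

def off_policy_td0_update(V: Dict[int, float], s: int, a: int, r: float,
                          s_next: int, alpha: float, gamma: float,
                          pi: Policy, b: Policy) -> None:
    """One off-policy TD(0) update with a per-step importance ratio:

    V(s) <- V(s) + alpha * rho_t * (r + gamma * V(s') - V(s))
    """
    rho = pi(s, a) / b(s, a)            # single-step ratio, not a product
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error

V = {0: 0.0, 1: 0.0}
pi = lambda s, a: 1.0 if a == 0 else 0.0  # target: greedy on action 0
b = lambda s, a: 0.5                       # behavior: uniform over 2 actions

off_policy_td0_update(V, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9, pi=pi, b=b)
print(V[0])  # 0 + 0.1 * 2.0 * (1.0 + 0.9 * 0 - 0) = 0.2
```

Because only one step's ratio appears, TD-style updates avoid the long products that plague Monte Carlo importance sampling.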
Variance Problem
Variance Explosion
The product $\rho_{t:T-1} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) / b(A_k \mid S_k)$ can be astronomically large or exactly zero, making ordinary IS highly unstable. This is a fundamental challenge of off-policy MC methods. Weighted IS helps, but variance remains an issue for long episodes with very different policies.
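A toy simulation makes the explosion concrete (the setup is an assumption for illustration: target always takes action 0, behavior is uniform over two actions, episodes are 10 steps long). Every trajectory ratio is then either $0$ or $2^{10} = 1024$, so the quantities ordinary IS averages are extreme:

```python
import random

random.seed(0)
T = 10  # episode length

def episode_ratio(T: int) -> float:
    """Ratio for one episode: 2**T if behavior happened to match the
    always-action-0 target on every step, else exactly 0."""
    actions = [random.randrange(2) for _ in range(T)]
    return 2.0 ** T if all(a == 0 for a in actions) else 0.0

ratios = [episode_ratio(T) for _ in range(10_000)]
nonzero = sum(1 for rho in ratios if rho > 0)

# Almost every ratio is 0; the rare survivors are each 1024, so a single
# episode dominates any ordinary-IS average (whose expectation is still 1).
print(f"nonzero ratios: {nonzero} / 10000, each equal to {2.0 ** T:.0f}")
print(f"empirical mean of rho: {sum(ratios) / len(ratios):.3f}")
```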
Connections
- Enables: Off-Policy learning
- Used in: Monte Carlo Methods (off-policy MC control), Semi-Gradient Methods (off-policy TD)
- Problem: variance → motivates alternatives like Q-Learning (which avoids IS entirely for control)
- Extended by: Per-decision IS, discounting-aware IS