Importance Sampling

Definition

Importance sampling is a technique for estimating expected values under one distribution using samples from another distribution. In RL, it enables off-policy learning: learning about a target policy $\pi$ from data generated by a different behavior policy $b$.

The Problem

We want to estimate $v_\pi$ (or $q_\pi$), but our data comes from following the behavior policy $b$. The returns observed under $b$ have the wrong distribution.

Importance Sampling Ratio

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

where:

  • $\pi(A_k \mid S_k)$ — probability of taking action $A_k$ in state $S_k$ under the target policy
  • $b(A_k \mid S_k)$ — the same probability under the behavior policy
  • The product is over all steps from $k = t$ to the end of the episode at time $T$

What the Ratio Does

It reweights each trajectory to correct for the fact that it was generated under $b$ instead of $\pi$. If $\pi$ would have been more likely to take the observed actions, $\rho > 1$ (upweight). If less likely, $\rho < 1$ (downweight). If $\pi$ would never take an observed action ($\pi(A_k \mid S_k) = 0$), then $\rho = 0$ (ignore this trajectory entirely).
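A minimal sketch of the reweighting, using two hypothetical three-action policies over a single state (the probability values are made up for illustration):

```python
import numpy as np

# Hypothetical action probabilities in one state.
target = np.array([0.1, 0.2, 0.7])    # pi(a|s)
behavior = np.array([0.4, 0.3, 0.3])  # b(a|s), gives every action nonzero probability

def importance_ratio(actions, target, behavior):
    """Product of pi(A_k|S_k) / b(A_k|S_k) over the observed actions."""
    per_step = target[actions] / behavior[actions]
    return per_step.prod()

# A trajectory pi favors is upweighted (rho > 1)...
print(importance_ratio(np.array([2, 2, 2]), target, behavior))  # (0.7/0.3)^3, about 12.7
# ...one pi disfavors is downweighted (rho < 1).
print(importance_ratio(np.array([0, 0, 0]), target, behavior))  # (0.1/0.4)^3 = 0.015625
```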

Coverage Requirement

Coverage Assumption

For importance sampling to work, the behavior policy must cover the target policy:

$$\pi(a \mid s) > 0 \implies b(a \mid s) > 0 \quad \text{for all } s, a$$

If there’s an action $\pi$ might take that $b$ would never take, we can never estimate its value. $\varepsilon$-soft behavior policies satisfy this automatically, since they give every action nonzero probability.
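The coverage condition is easy to check for tabular policies. A sketch with hypothetical distributions (the arrays are illustrative, not from the text):

```python
import numpy as np

def covers(behavior, target):
    """True if b(a|s) > 0 wherever pi(a|s) > 0 (the coverage assumption)."""
    return bool(np.all((target == 0.0) | (behavior > 0.0)))

pi = np.array([0.0, 0.0, 1.0])      # deterministic target policy
b_soft = np.array([0.1, 0.1, 0.8])  # epsilon-soft: covers any target
b_bad = np.array([0.5, 0.5, 0.0])   # never takes action 2: coverage fails

print(covers(b_soft, pi))  # True
print(covers(b_bad, pi))   # False
```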

Two Variants

Ordinary Importance Sampling

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

Simple average of importance-weighted returns, where $\mathcal{T}(s)$ is the set of time steps at which $s$ was visited.

  • Unbiased
  • High (potentially infinite) variance — a single large $\rho$ can dominate

Weighted Importance Sampling

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

Weighted average where the denominator normalizes by the sum of the ratios.

  • Biased (for finite samples, bias → 0 asymptotically)
  • Much lower variance — preferred in practice
  • First-episode estimate is always $G_t$ (the ratio cancels), which has bounded error
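The two estimators can be compared on a toy one-step problem where the target value is known exactly. This is an illustrative sketch (the reward and policy values are assumptions, not from the text); here $\rho$ is just $\pi(a)/b(a)$ since episodes are one step long:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step problem: 2 actions, deterministic rewards.
rewards = np.array([1.0, 10.0])
target = np.array([0.1, 0.9])    # pi
behavior = np.array([0.9, 0.1])  # b, deliberately very different from pi

true_value = float((target * rewards).sum())  # expected reward under pi

# Sample episodes from the behavior policy.
actions = rng.choice(2, size=10_000, p=behavior)
returns = rewards[actions]
rho = target[actions] / behavior[actions]

ordinary = (rho * returns).mean()              # unbiased, high variance
weighted = (rho * returns).sum() / rho.sum()   # biased, much lower variance

print(true_value, ordinary, weighted)  # both estimates near the true value
```

Both estimators converge to the true value, but over repeated runs the ordinary estimate fluctuates far more, since rare behavior-policy actions carry a ratio of $0.9/0.1 = 9$.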

Incremental Implementation

For weighted IS with per-episode updates:

$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\bigl(G_n - V_n\bigr), \qquad C_{n+1} \doteq C_n + W_{n+1}$$

where $W_n$ is the importance ratio for episode $n$ and $C_n$ is the cumulative sum of weights ($C_0 = 0$).
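A minimal sketch of the incremental rule (class and method names are my own, not from the text):

```python
class WeightedISEstimate:
    """Incremental weighted importance sampling, one update per episode."""

    def __init__(self):
        self.value = 0.0       # V_n, current estimate
        self.cum_weight = 0.0  # C_n, cumulative sum of weights

    def update(self, episode_return, weight):
        """weight W_n = importance ratio rho for the episode."""
        if weight == 0.0:
            return  # trajectory impossible under pi: ignored entirely
        self.cum_weight += weight
        self.value += (weight / self.cum_weight) * (episode_return - self.value)

est = WeightedISEstimate()
est.update(5.0, 2.0)
print(est.value)  # first episode: ratio cancels, estimate equals the return, 5.0
est.update(1.0, 2.0)
print(est.value)  # (2*5 + 2*1) / (2 + 2) = 3.0
```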

In Off-Policy Methods

Variance Problem

Variance Explosion

The product $\prod_{k=t}^{T-1} \pi(A_k \mid S_k) / b(A_k \mid S_k)$ can be astronomically large or zero, making ordinary IS highly unstable. This is a fundamental challenge of off-policy MC methods. Weighted IS helps, but variance remains an issue for long episodes with very different policies.
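A quick numeric sketch of why long episodes are the problem: per-step ratios multiply, so even a modest constant mismatch compounds exponentially with episode length (the 1.5 per-step ratio is an arbitrary illustration):

```python
# Per-step ratios pi(A|S)/b(A|S) multiply across the episode, so a constant
# mismatch of 1.5 per step explodes with episode length T (and a constant
# ratio below 1 collapses to zero just as fast).
per_step_ratio = 1.5
for T in (10, 50, 100):
    print(T, per_step_ratio ** T)  # ~57.7, ~6.4e8, ~4.1e17
```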
