Baseline

Definition

A baseline is a reference value (typically depending on state only) that is subtracted from returns in policy gradient methods to reduce variance without introducing bias.

In policy gradient updates, we use:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $b(s_t)$ is the baseline, commonly a learned Value Function estimate $V_\phi(s_t)$.
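As a minimal sketch (all names are hypothetical), each timestep contributes a score-function term scaled by the baseline-shifted return:

```python
# Hypothetical sketch: per-timestep REINFORCE-with-baseline gradient terms.
# log_prob_grads[t] stands in for the scalar score d/dtheta log pi(a_t|s_t).
def pg_terms(log_prob_grads, returns, baselines):
    # Each score is scaled by the centered return G_t - b(s_t).
    return [g * (G - b) for g, G, b in zip(log_prob_grads, returns, baselines)]

print(pg_terms([0.5, -0.2], [3.0, 2.0], [2.5, 2.5]))
```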

Intuition

The Problem

In vanilla REINFORCE, all actions in a trajectory share credit/blame for the total return:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G\right]$$

where $G = \sum_{t=0}^{T} r_t$ is the return from the start of the episode. This means:

  • Good early actions get blamed for bad later actions
  • Bad early actions get credit for good later rewards
  • High variance: Lots of noise in the gradient estimates

The Solution

Subtract a baseline $b(s_t)$ that represents “what was expected from this state”:

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)$$

The baseline:

  • Reduces variance: Returns are centered around expected value
  • Doesn’t change expectation: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = 0$ for actions sampled from $\pi_\theta$
  • Helps credit assignment: Actions are compared to state-dependent baseline

Mathematical Formulation

Why Baselines Don’t Introduce Bias

The key insight:

$$\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)$$

The gradients of the action probabilities sum to zero (since the probabilities sum to 1):

$$\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$$

Therefore, subtracting any state-dependent baseline maintains unbiasedness.
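This identity can be checked exactly for a small softmax policy; a sketch (the logits and the baseline value are arbitrary choices):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# For a softmax policy, d/d(theta_k) log pi(a) = 1[k == a] - pi(k).
logits = [0.5, -1.2, 2.0]   # arbitrary
pi = softmax(logits)
b = 3.7                     # arbitrary constant baseline

# Expectation over a ~ pi of (grad log pi(a)) * b, component by component.
expected_grad = [
    sum(pi[a] * ((1.0 if k == a else 0.0) - pi[k]) * b for a in range(len(pi)))
    for k in range(len(pi))
]
print(expected_grad)  # every component is (numerically) zero
```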

Causality-Aware Baselines

In practice, we use causality: action $a_t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $G_t = \sum_{t'=t}^{T} r_{t'}$ is the return (reward-to-go) from step $t$ onward.
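The reward-to-go $G_t$ can be computed with a single backward pass over the episode; a minimal sketch (function name is illustrative):

```python
def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum over t' >= t of gamma**(t' - t) * r_t', via a backward pass."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(rewards_to_go([1.0, 0.0, 2.0]))        # [3.0, 2.0, 2.0]
print(rewards_to_go([1.0, 1.0], gamma=0.5))  # [1.5, 1.0]
```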

Advantage Function

When $b(s_t) = V^\pi(s_t)$, the difference is the advantage:

$$A^\pi(s_t, a_t) = G_t - V^\pi(s_t)$$

This is a core concept in modern RL (Advantage function).

Key Properties/Variants

Choice of Baseline

  1. Constant baseline: $b = \frac{1}{N}\sum_{i=1}^{N} G^{(i)}$ (average return)

    • Simplest, provides some variance reduction
    • Not state-dependent
  2. Linear value function: $V_{\mathbf{w}}(s) = \mathbf{w}^\top \phi(s)$

    • Parametric, simple to learn
    • Good for linear relationships
  3. Neural network value: $V_\phi(s)$

    • Highly expressive
    • Standard in modern deep RL
  4. Temporal difference targets: $r_t + \gamma V(s_{t+1})$

    • One-step lookahead
    • Reduces variance further but introduces bias

Learning the Baseline

Typically minimize MSE on observed returns:

$$\mathcal{L}(\phi) = \mathbb{E}\big[(V_\phi(s_t) - G_t)^2\big]$$

Update:

$$\phi \leftarrow \phi - \alpha \nabla_\phi \big(V_\phi(s_t) - G_t\big)^2$$

Or TD-style:

$$\phi \leftarrow \phi - \alpha \nabla_\phi \big(V_\phi(s_t) - (r_t + \gamma V_\phi(s_{t+1}))\big)^2$$
Variance Reduction Effectiveness

The amount of variance reduction depends on how well the baseline correlates with returns:

  • Bad baseline: Little variance reduction
  • Good baseline (close to the actual $V^\pi(s)$): Significant variance reduction
  • Perfect baseline (the true $V^\pi(s)$): Minimal variance remains

In practice, a learned value function usually provides substantial variance reduction even if imperfect.
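This effect shows up even in a toy one-state bandit (a sketch; the setup, two actions under a sigmoid policy, is made up): with the perfect baseline $b = V^\pi = 0.5$, the gradient estimate keeps the same mean but its variance collapses.

```python
import math
import random

random.seed(0)

def grad_sample(theta, baseline):
    """One REINFORCE gradient sample for a two-action sigmoid policy."""
    p1 = 1.0 / (1.0 + math.exp(-theta))    # P(action 1)
    a = 1 if random.random() < p1 else 0
    r = 1.0 if a == 1 else 0.0             # action 1 pays 1, action 0 pays 0
    dlogp = (1.0 - p1) if a == 1 else -p1  # d/dtheta log pi(a)
    return dlogp * (r - baseline)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

theta = 0.0                                # p1 = 0.5, so V = 0.5 here
no_base = [grad_sample(theta, 0.0) for _ in range(20000)]
with_base = [grad_sample(theta, 0.5) for _ in range(20000)]

print(variance(no_base), variance(with_base))        # baseline shrinks the variance
print(sum(no_base) / 20000, sum(with_base) / 20000)  # sample means stay close (unbiased)
```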

Connections

Appears In