Baseline
Definition
A baseline is a reference value (typically depending on state only) that is subtracted from returns in policy gradient methods to reduce variance without introducing bias.
In policy gradient updates, we use:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $b(s_t)$ is the baseline, commonly a learned Value Function estimate $\hat{V}(s_t)$.
Intuition
The Problem
In vanilla REINFORCE, all actions in a trajectory share credit/blame for the total return:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G\right]$$

where $G = \sum_t \gamma^t r_t$ is the return from the start of the episode. This means:
- Good early actions get blamed for bad later actions
- Bad early actions get credit for good later rewards
- High variance: Lots of noise in the gradient estimates
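To make the credit-assignment problem concrete, here is a minimal sketch (hypothetical two-action softmax policy, made-up rewards) of the vanilla REINFORCE gradient, where every step's score is scaled by the same total return:

```python
import numpy as np

# Toy REINFORCE gradient for a 2-action softmax policy (a hypothetical
# setup): every action in the trajectory is weighted by the SAME total
# return G, so early actions absorb credit/blame for later rewards.
theta = np.zeros(2)  # logits for actions 0 and 1

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - pi(theta)
    g = -policy(theta)
    g[a] += 1.0
    return g

actions = [0, 1, 0]            # actions taken in one episode
rewards = [1.0, -2.0, 0.5]     # per-step rewards
G = sum(rewards)               # total episode return, shared by every step

grad = sum(grad_log_pi(theta, a) for a in actions) * G
print(grad)
```

Note that the first (good, reward 1.0) action is pushed down purely because the episode's total return happens to be negative.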
The Solution
Subtract a baseline $b(s_t)$ that represents “what was expected from this state”:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$
The baseline:
- Reduces variance: Returns are centered around expected value
- Doesn’t change expectation: $\mathbb{E}\left[b(s_t)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$ w.r.t. actions sampled from $\pi_\theta$
- Helps credit assignment: Actions are compared to state-dependent baseline
Mathematical Formulation
Why Baselines Don’t Introduce Bias
The key insight:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)$$

The gradient of log probabilities sums to zero (since probabilities sum to 1):

$$\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$$

Therefore: subtracting any state-dependent baseline maintains unbiasedness.
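The zero-expectation identity can be checked numerically. The sketch below (hypothetical three-action softmax policy, arbitrary baseline value) computes the exact expectation of the baseline term over actions:

```python
import numpy as np

# Numerical check (hypothetical 3-action softmax policy): the baseline term
# b * grad log pi(a|s) has zero expectation under actions sampled from pi,
# because sum_a pi(a|s) grad log pi(a|s) = grad sum_a pi(a|s) = grad 1 = 0.
theta = np.array([0.2, -0.1, 0.4])
z = np.exp(theta - theta.max())
pi = z / z.sum()

b = 3.7  # any value that doesn't depend on the action works
expected = np.zeros_like(theta)
for a in range(3):
    grad_log = -pi.copy()
    grad_log[a] += 1.0                # grad log softmax at action a
    expected += pi[a] * b * grad_log  # exact expectation over actions

print(expected)  # ~ [0, 0, 0]
```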
Causality-Aware Baselines
In practice, we use causality: action $a_t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $G_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ is the return from step $t$ onward.
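The reward-to-go returns $G_t$ are typically computed with a single reverse pass over the episode's rewards. A minimal sketch (reward values are made up):

```python
# Reward-to-go returns G_t = sum_{t'>=t} gamma^(t'-t) r_{t'},
# computed with a reverse scan over one episode's rewards.
def rewards_to_go(rewards, gamma=0.99):
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```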
Advantage Function
When $b(s_t) = V^\pi(s_t)$, the difference is (an estimate of) the advantage:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) \approx G_t - V^\pi(s_t)$$
This is a core concept in modern RL (Advantage function).
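In code, advantage estimates are just the elementwise difference between reward-to-go returns and value predictions (the numbers below are made up for illustration):

```python
# Advantage estimates A_t = G_t - V(s_t) for one episode.
returns = [1.5, 1.0, 2.0]   # reward-to-go returns G_t
values  = [1.2, 1.1, 1.8]   # learned value estimates V(s_t)
advantages = [g - v for g, v in zip(returns, values)]
print(advantages)
```

A positive advantage means the action did better than expected from that state, so its probability is increased; a negative advantage decreases it.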
Key Properties/Variants
Choice of Baseline
- Constant baseline: $b = \mathbb{E}[G]$ (average return)
  - Simplest, provides some variance reduction
  - Not state-dependent
- Linear value function: $b(s) = w^\top \phi(s)$
  - Parametric, simple to learn
  - Good for linear relationships
- Neural network value: $b(s) = V_\phi(s)$
  - Highly expressive
  - Standard in modern deep RL
- Temporal difference targets: $r_t + \gamma V(s_{t+1})$ in place of the Monte Carlo return
  - One-step lookahead
  - Reduces variance further but introduces bias
Learning the Baseline
Typically minimize MSE on observed returns:

$$L(\phi) = \mathbb{E}\left[\big(G_t - V_\phi(s_t)\big)^2\right]$$

Update:

$$\phi \leftarrow \phi + \alpha\,\big(G_t - V_\phi(s_t)\big)\,\nabla_\phi V_\phi(s_t)$$

Or TD-style:

$$\phi \leftarrow \phi + \alpha\,\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)\,\nabla_\phi V_\phi(s_t)$$
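The Monte Carlo fit can be sketched with a linear value function and plain gradient descent on the MSE (all features, weights, and returns below are synthetic assumptions, not part of any particular algorithm):

```python
import numpy as np

# Monte Carlo fit of a linear value baseline V_w(s) = w . phi(s) by
# gradient descent on the MSE to observed returns (synthetic data).
rng = np.random.default_rng(1)
phi = rng.normal(size=(200, 4))                 # state features
true_w = np.array([1.0, -2.0, 0.5, 0.0])        # ground-truth weights
G = phi @ true_w + 0.1 * rng.normal(size=200)   # noisy observed returns

w = np.zeros(4)
alpha = 0.05
for _ in range(500):
    pred = phi @ w
    grad = -2.0 * phi.T @ (G - pred) / len(G)   # gradient of mean squared error
    w -= alpha * grad

print(w)  # close to true_w
```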
Variance Reduction Effectiveness
The amount of variance reduction depends on how well the baseline correlates with returns:
- Bad baseline: Little variance reduction
- Good baseline (close to the actual $V^\pi(s)$): Significant variance reduction
- Perfect baseline (true $V^\pi(s)$): Near-minimal variance; only the inherent randomness of the advantage remains
In practice, a learned value function usually provides substantial variance reduction even if imperfect.
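This effect is easy to measure empirically. The sketch below (a hypothetical two-armed bandit with made-up rewards) compares the sampled gradient estimates with and without a mean-reward baseline; the means agree but the variance collapses:

```python
import numpy as np

# Empirical check on a 2-armed bandit (hypothetical rewards): subtracting the
# mean reward as a baseline shrinks the variance of the score-function
# gradient estimate without changing its mean.
rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()
reward = np.array([10.0, 11.0])   # both arms good; only the gap matters
baseline = 10.5                   # roughly E[reward] under pi

def grad_estimates(b, n=20000):
    a = rng.choice(2, size=n, p=pi)
    glog = np.eye(2)[a] - pi      # grad log softmax, one row per sample
    return glog * (reward[a] - b)[:, None]

g_no_b = grad_estimates(0.0)
g_with = grad_estimates(baseline)
print(g_no_b.mean(0), g_with.mean(0))            # similar means (unbiased)
print(g_no_b.var(0).sum(), g_with.var(0).sum())  # variance drops sharply
```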
Connections
- Versus: Full trajectory return (high variance)
- Related to: Advantage function ($A(s, a) = Q(s, a) - V(s)$)
- Learned via: Temporal Difference Learning or Monte Carlo Methods
- Core in: Actor-Critic (separate value baseline) and A2C algorithms
- Implies: Value Function is useful even in policy-gradient methods
Appears In
- Policy Gradient Methods — Variance reduction technique
- REINFORCE — Common improvement (REINFORCE with baseline)
- Actor-Critic — Separates actor (policy) from critic (baseline/value)
- Advantage Actor-Critic (A2C) — Uses value baseline
- PPO — Relies on baseline-subtracted advantage estimates
- Deep Reinforcement Learning — Essential for sample efficiency