Baseline
Definition
A baseline is a reference value (typically depending on state only) that is subtracted from returns in policy gradient methods to reduce variance without introducing bias.
In policy gradient updates, we use:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $b(s_t)$ is the baseline, commonly a learned Value Function estimate $\hat{V}(s_t)$.
Intuition
The Problem
In vanilla REINFORCE, all actions in a trajectory share credit/blame for the total return:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G\right]$$

where $G = \sum_t \gamma^t r_t$ is the return from the start of the episode. This means:
- Good early actions get blamed for bad later actions
- Bad early actions get credit for good later rewards
- High variance: Lots of noise in the gradient estimates
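To make the credit-assignment problem concrete, here is a minimal sketch (hypothetical two-action softmax policy, made-up rewards) of the vanilla REINFORCE gradient, where every step's score is scaled by the same total return:

```python
import numpy as np

# Toy REINFORCE gradient for a 2-action softmax policy (a hypothetical
# setup): every action in the trajectory is weighted by the SAME total
# return G, so early actions absorb credit/blame for later rewards.
theta = np.zeros(2)  # logits for actions 0 and 1

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - pi(theta)
    g = -policy(theta)
    g[a] += 1.0
    return g

actions = [0, 1, 0]            # actions taken in one episode
rewards = [1.0, -2.0, 0.5]     # per-step rewards
G = sum(rewards)               # total episode return, shared by every step

grad = sum(grad_log_pi(theta, a) for a in actions) * G
print(grad)
```

Note that the first (good, reward 1.0) action is pushed down purely because the episode's total return happens to be negative.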
The Solution
Subtract a baseline $b(s_t)$ that represents “what was expected from this state”:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$
The baseline:
- Reduces variance: Returns are centered around expected value
- Doesn’t change expectation: $\mathbb{E}\left[b(s_t)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$ w.r.t. actions sampled from $\pi_\theta$
- Helps credit assignment: Actions are compared to state-dependent baseline
Mathematical Formulation
Why Baselines Don’t Introduce Bias
The key insight:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)$$

The gradient of log probabilities sums to zero (since probabilities sum to 1):

$$\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$$

Therefore: subtracting any state-dependent baseline maintains unbiasedness.
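The zero-expectation identity can be checked numerically. The sketch below (hypothetical three-action softmax policy, arbitrary baseline value) computes the exact expectation of the baseline term over actions:

```python
import numpy as np

# Numerical check (hypothetical 3-action softmax policy): the baseline term
# b * grad log pi(a|s) has zero expectation under actions sampled from pi,
# because sum_a pi(a|s) grad log pi(a|s) = grad sum_a pi(a|s) = grad 1 = 0.
theta = np.array([0.2, -0.1, 0.4])
z = np.exp(theta - theta.max())
pi = z / z.sum()

b = 3.7  # any value that doesn't depend on the action works
expected = np.zeros_like(theta)
for a in range(3):
    grad_log = -pi.copy()
    grad_log[a] += 1.0                # grad log softmax at action a
    expected += pi[a] * b * grad_log  # exact expectation over actions

print(expected)  # ~ [0, 0, 0]
```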
Causality-Aware Baselines
In practice, we use causality: action $a_t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

where $G_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ is the return from step $t$ onward.
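The reward-to-go returns $G_t$ are typically computed with a single reverse pass over the episode's rewards. A minimal sketch (reward values are made up):

```python
# Reward-to-go returns G_t = sum_{t'>=t} gamma^(t'-t) r_{t'},
# computed with a reverse scan over one episode's rewards.
def rewards_to_go(rewards, gamma=0.99):
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```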
Advantage Function
When $b(s_t) = V^\pi(s_t)$, the difference is (an estimate of) the advantage:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) \approx G_t - V^\pi(s_t)$$
This is a core concept in modern RL (Advantage function).
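In code, advantage estimates are just the elementwise difference between reward-to-go returns and value predictions (the numbers below are made up for illustration):

```python
# Advantage estimates A_t = G_t - V(s_t) for one episode.
returns = [1.5, 1.0, 2.0]   # reward-to-go returns G_t
values  = [1.2, 1.1, 1.8]   # learned value estimates V(s_t)
advantages = [g - v for g, v in zip(returns, values)]
print(advantages)
```

A positive advantage means the action did better than expected from that state, so its probability is increased; a negative advantage decreases it.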
Key Properties/Variants
Choice of Baseline
- Constant baseline: $b = \mathbb{E}[G]$ (average return)
  - Simplest, provides some variance reduction
  - Not state-dependent
- Linear value function: $b(s) = w^\top \phi(s)$
  - Parametric, simple to learn
  - Good for linear relationships
- Neural network value: $b(s) = V_\phi(s)$
  - Highly expressive
  - Standard in modern deep RL
- Temporal difference targets: $r_t + \gamma V(s_{t+1})$ in place of the Monte Carlo return
  - One-step lookahead
  - Reduces variance further but introduces bias
Learning the Baseline
Typically minimize MSE on observed returns:

$$L(\phi) = \mathbb{E}\left[\big(G_t - V_\phi(s_t)\big)^2\right]$$

Update:

$$\phi \leftarrow \phi + \alpha\,\big(G_t - V_\phi(s_t)\big)\,\nabla_\phi V_\phi(s_t)$$

Or TD-style:

$$\phi \leftarrow \phi + \alpha\,\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)\,\nabla_\phi V_\phi(s_t)$$
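The Monte Carlo fit can be sketched with a linear value function and plain gradient descent on the MSE (all features, weights, and returns below are synthetic assumptions, not part of any particular algorithm):

```python
import numpy as np

# Monte Carlo fit of a linear value baseline V_w(s) = w . phi(s) by
# gradient descent on the MSE to observed returns (synthetic data).
rng = np.random.default_rng(1)
phi = rng.normal(size=(200, 4))                 # state features
true_w = np.array([1.0, -2.0, 0.5, 0.0])        # ground-truth weights
G = phi @ true_w + 0.1 * rng.normal(size=200)   # noisy observed returns

w = np.zeros(4)
alpha = 0.05
for _ in range(500):
    pred = phi @ w
    grad = -2.0 * phi.T @ (G - pred) / len(G)   # gradient of mean squared error
    w -= alpha * grad

print(w)  # close to true_w
```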
Variance Reduction Effectiveness
The amount of variance reduction depends on how well the baseline correlates with returns:
- Bad baseline: Little variance reduction
- Good baseline (close to the actual $V^\pi(s)$): Significant variance reduction
- Perfect baseline (true $V^\pi(s)$): Near-minimal variance; only the inherent randomness of the advantage remains
In practice, a learned value function usually provides substantial variance reduction even if imperfect.
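This effect is easy to measure empirically. The sketch below (a hypothetical two-armed bandit with made-up rewards) compares the sampled gradient estimates with and without a mean-reward baseline; the means agree but the variance collapses:

```python
import numpy as np

# Empirical check on a 2-armed bandit (hypothetical rewards): subtracting the
# mean reward as a baseline shrinks the variance of the score-function
# gradient estimate without changing its mean.
rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()
reward = np.array([10.0, 11.0])   # both arms good; only the gap matters
baseline = 10.5                   # roughly E[reward] under pi

def grad_estimates(b, n=20000):
    a = rng.choice(2, size=n, p=pi)
    glog = np.eye(2)[a] - pi      # grad log softmax, one row per sample
    return glog * (reward[a] - b)[:, None]

g_no_b = grad_estimates(0.0)
g_with = grad_estimates(baseline)
print(g_no_b.mean(0), g_with.mean(0))            # similar means (unbiased)
print(g_no_b.var(0).sum(), g_with.var(0).sum())  # variance drops sharply
```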
Connections
- Versus: Full trajectory return (high variance)
- Related to: Advantage function ($A(s, a) = Q(s, a) - V(s)$)
- Learned via: Temporal Difference Learning or Monte Carlo Methods
- Core in: Actor-Critic (separate value baseline) and A2C algorithms
- Implies: Value Function is useful even in policy-gradient methods
Appears In
- Policy Gradient Methods — Variance reduction technique
- REINFORCE — Common improvement (REINFORCE with baseline)
- Actor-Critic — Separates actor (policy) from critic (baseline/value)
- Advantage Actor-Critic (A2C) — Uses value baseline
- PPO — Relies on baseline-subtracted advantage estimates
- Deep Reinforcement Learning — Essential for sample efficiency