Advantage Function

Definition

The Advantage Function measures how much better taking a particular action is compared to the value of the state:

A^π(s, a) = Q^π(s, a) − V^π(s)

It answers: “How much does this action improve upon the average (state value)?”


Intuition

  • Positive advantage: The action is better than the state baseline
  • Zero advantage: The action is average (equal to state value)
  • Negative advantage: The action is worse than baseline
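The three cases follow directly from the definition A(s, a) = Q(s, a) − V(s). A toy illustration, with made-up Q and V numbers:

```python
# Hypothetical action values for one state (numbers are illustrative only)
q_values = {"left": 1.0, "stay": 2.0, "right": 3.0}  # Q(s, a)
state_value = 2.0                                     # V(s), the baseline

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q - state_value for a, q in q_values.items()}
print(advantages)  # {'left': -1.0, 'stay': 0.0, 'right': 1.0}
```

The "stay" action is exactly average (zero advantage), while "right" beats the baseline and "left" falls below it.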

Why It Matters

The advantage captures the relative quality of an action, not the absolute value. In policy gradient methods, the gradient is:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) Q^π(s, a)]

Using advantage instead:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) A^π(s, a)]

Since Q^π(s, a) = V^π(s) + A^π(s, a), we can rewrite:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) V^π(s)] + E[∇_θ log π_θ(a|s) A^π(s, a)]

The first term is zero: V^π(s) does not depend on the action, and the expected score vanishes because Σ_a ∇_θ π_θ(a|s) = ∇_θ 1 = 0. This leaves us with the advantage term.


Empirical Estimation

In practice, we don’t have access to Q^π(s, a) and V^π(s) directly. Common estimators:

1. Single-Step TD Advantage

Â_t = r_t + γ V(s_{t+1}) − V(s_t)

Also called the temporal difference error or TD residual.

Bias: High (looks only one step into the future, so it underestimates long-term effects)
Variance: Low (short horizon)
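As a sketch (the function name and arguments are illustrative), the one-step estimator is just the TD error computed from one reward and two value predictions:

```python
def td_advantage(reward, value, next_value, gamma=0.99):
    """One-step TD advantage: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * next_value - value

# r_t = 1.0, V(s_t) = 0.5, terminal next state so V(s_{t+1}) = 0.0
print(td_advantage(1.0, 0.5, 0.0))  # 0.5
```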

2. Multi-Step Advantage

Â_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n}) − V(s_t)

Bias: Decreases with more steps
Variance: Increases with more steps

3. Monte Carlo Advantage

Â_t = G_t − V(s_t)

where G_t = Σ_{k=0}^{T−t} γ^k r_{t+k} (full episode return)

Bias: None (unbiased)
Variance: High (full trajectory variance)
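A minimal sketch of the Monte Carlo estimator, assuming a completed episode of rewards and per-step value predictions (names are illustrative; γ = 1 in the example to keep the arithmetic obvious):

```python
def monte_carlo_advantage(rewards, values, gamma=0.99):
    """MC advantage per step: A_t = G_t - V(s_t), with G_t the discounted return."""
    advantages = []
    g = 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * g          # G_t = r_t + gamma * G_{t+1}
        advantages.append(g - v)   # A_t = G_t - V(s_t)
    return advantages[::-1]

# Returns-to-go are [3, 2, 1]; values are [2, 1, 0]; advantages are [1, 1, 1]
print(monte_carlo_advantage([1.0, 1.0, 1.0], [2.0, 1.0, 0.0], gamma=1.0))
```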


Advantage in Actor-Critic

The Actor-Critic Update

θ ← θ + α ∇_θ log π_θ(a_t|s_t) Â_t

where Â_t is the advantage estimate.
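As a self-contained sketch of one such update on a two-armed bandit with a softmax policy (the bandit setting, function name, and learning rates are illustrative, not from the note):

```python
import math

def actor_critic_step(theta, value, action, reward, lr_actor=0.1, lr_critic=0.1):
    """One actor-critic update for a 2-armed bandit with a softmax policy.

    theta: list of 2 action preferences; value: scalar baseline V.
    """
    # Softmax policy: pi(a) = exp(theta_a) / sum_b exp(theta_b)
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    probs = [e / z for e in exps]

    # One-step advantage (no next state in a bandit): A = r - V
    advantage = reward - value

    # Policy gradient for softmax: grad_{theta_b} log pi(a) = 1{a=b} - pi(b)
    new_theta = [
        t + lr_actor * advantage * ((1.0 if b == action else 0.0) - probs[b])
        for b, t in enumerate(theta)
    ]
    # Critic moves V toward the observed reward
    new_value = value + lr_critic * advantage
    return new_theta, new_value
```

Calling this repeatedly with rewards that favor one arm raises that arm's preference while the critic tracks the average reward, shrinking the advantage signal over time.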

With Baseline

The baseline b(s_t) (typically the value function V(s_t)) is subtracted from the return:

∇_θ J(θ) = E[∇_θ log π_θ(a_t|s_t) (G_t − b(s_t))]

Effect of baseline:

  • ✓ Reduces variance (centering the signal)
  • ✓ Remains unbiased for any state-dependent baseline (subtracting b(s_t) does not change the expected gradient)
  • ✗ Bootstrapped estimates (TD, GAE) do introduce bias when the value function is inaccurate

Generalized Advantage Estimation (GAE)

Rather than choosing one specific advantage estimator, GAE interpolates between all of them:

Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

or equivalently, as a backward recursion:

Â_t = δ_t + γλ Â_{t+1}

where δ_t = r_t + γ V(s_{t+1}) − V(s_t) is the TD error.

Hyperparameter λ

  • λ = 0: Single-step TD advantage (high bias, low variance)
  • λ = 1: Monte Carlo advantage (unbiased, high variance)
  • 0 < λ < 1: Interpolation (tunable bias-variance tradeoff)

Common choice: λ = 0.95 or λ = 0.97
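The two endpoints can be checked numerically with a tiny recursive implementation (a hedged sketch with toy numbers and γ = 1, not the note's reference code): λ = 0 reproduces the one-step TD errors, and λ = 1 reproduces the Monte Carlo advantages G_t − V(s_t).

```python
def gae(rewards, values, gamma, lam):
    """Recursive GAE: A_t = delta_t + gamma * lam * A_{t+1} (terminal next value 0)."""
    advantages, a = [], 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        a = delta + gamma * lam * a
        advantages.append(a)
    return advantages[::-1]

rewards, values = [1.0, 1.0], [0.5, 0.5]
print(gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 0.5] -> the TD errors
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [1.5, 0.5] -> G_t - V(s_t)
```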


Properties

Unbiased Property

If the critic is perfect (V = V^π), then:

E[Â_t] = A^π(s_t, a_t)

The advantage estimate is unbiased.

Variance Reduction

Subtracting the value function baseline reduces variance without introducing bias, since the baseline term has zero expectation:

E_{a∼π}[∇_θ log π_θ(a|s) V(s)] = V(s) Σ_a ∇_θ π_θ(a|s) = 0

Causality in TD

The TD advantage only uses rewards and values from time t onward:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

This respects causality: the action at time t cannot affect rewards received before time t.


Implementations

PyTorch Example

import torch

def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages.

    Note: `lambda` is a reserved word in Python, so the parameter is named `lam`.
    """
    advantages = []
    gae = 0.0

    # Backward pass through trajectory
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_value = 0.0  # Terminal state: no bootstrap value
        else:
            next_value = values[t + 1]

        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]

        # Accumulate GAE: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)

    return torch.tensor(advantages)

Comparison: Advantage Estimators

Estimator   Formula                                          Bias      Variance
1-step TD   r_t + γV(s_{t+1}) − V(s_t)                       High      Low
n-step      Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n}) − V(s_t)   Medium    Medium
MC          G_t − V(s_t)                                     None      High
GAE         Σ_l (γλ)^l δ_{t+l}                               Tunable   Tunable


Key References

  1. Schulman, J., et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
  2. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning (A3C). ICML.
  3. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv.