Advantage Function

Definition

The Advantage Function measures how much better taking a particular action is compared to the value of the state:

A^π(s, a) = Q^π(s, a) − V^π(s)

It answers: “How much does this action improve upon the average (state value)?”


Intuition

  • Positive advantage: The action is better than the state baseline
  • Zero advantage: The action is average (equal to state value)
  • Negative advantage: The action is worse than baseline
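The three cases follow directly from the definition A(s, a) = Q(s, a) − V(s). A toy illustration, with made-up Q and V numbers:

```python
# Hypothetical action values for one state (numbers are illustrative only)
q_values = {"left": 1.0, "stay": 2.0, "right": 3.0}  # Q(s, a)
state_value = 2.0                                     # V(s), the baseline

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q - state_value for a, q in q_values.items()}
print(advantages)  # {'left': -1.0, 'stay': 0.0, 'right': 1.0}
```

The "stay" action is exactly average (zero advantage), while "right" beats the baseline and "left" falls below it.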

Why It Matters

The advantage captures the relative quality of an action, not the absolute value. In policy gradient methods, the gradient is:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) Q^π(s, a)]

Using advantage instead:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) A^π(s, a)]

Since Q^π(s, a) = V^π(s) + A^π(s, a), we can rewrite:

∇_θ J(θ) = E[∇_θ log π_θ(a|s) V^π(s)] + E[∇_θ log π_θ(a|s) A^π(s, a)]

The first term is zero: V^π(s) does not depend on the action, and the expected score vanishes because Σ_a ∇_θ π_θ(a|s) = ∇_θ 1 = 0. This leaves us with the advantage term.


Empirical Estimation

In practice, we don’t have access to Q^π(s, a) and V^π(s) directly. Common estimators:

1. Single-Step TD Advantage

Â_t = r_t + γ V(s_{t+1}) − V(s_t)

Also called the temporal difference error or TD residual.

Bias: High (looks only one step into the future, so it underestimates long-term effects)
Variance: Low (short horizon)
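As a sketch (the function name and arguments are illustrative), the one-step estimator is just the TD error computed from one reward and two value predictions:

```python
def td_advantage(reward, value, next_value, gamma=0.99):
    """One-step TD advantage: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * next_value - value

# r_t = 1.0, V(s_t) = 0.5, terminal next state so V(s_{t+1}) = 0.0
print(td_advantage(1.0, 0.5, 0.0))  # 0.5
```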

2. Multi-Step Advantage

Â_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n}) − V(s_t)

Bias: Decreases with more steps
Variance: Increases with more steps

3. Monte Carlo Advantage

Â_t = G_t − V(s_t)

where G_t = Σ_{k=0}^{T−t} γ^k r_{t+k} (full episode return)

Bias: None (unbiased)
Variance: High (full trajectory variance)
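A minimal sketch of the Monte Carlo estimator, assuming a completed episode of rewards and per-step value predictions (names are illustrative; γ = 1 in the example to keep the arithmetic obvious):

```python
def monte_carlo_advantage(rewards, values, gamma=0.99):
    """MC advantage per step: A_t = G_t - V(s_t), with G_t the discounted return."""
    advantages = []
    g = 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * g          # G_t = r_t + gamma * G_{t+1}
        advantages.append(g - v)   # A_t = G_t - V(s_t)
    return advantages[::-1]

# Returns-to-go are [3, 2, 1]; values are [2, 1, 0]; advantages are [1, 1, 1]
print(monte_carlo_advantage([1.0, 1.0, 1.0], [2.0, 1.0, 0.0], gamma=1.0))
```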


Advantage in Actor-Critic

The Actor-Critic Update

θ ← θ + α ∇_θ log π_θ(a_t|s_t) Â_t

where Â_t is the advantage estimate.
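As a self-contained sketch of one such update on a two-armed bandit with a softmax policy (the bandit setting, function name, and learning rates are illustrative, not from the note):

```python
import math

def actor_critic_step(theta, value, action, reward, lr_actor=0.1, lr_critic=0.1):
    """One actor-critic update for a 2-armed bandit with a softmax policy.

    theta: list of 2 action preferences; value: scalar baseline V.
    """
    # Softmax policy: pi(a) = exp(theta_a) / sum_b exp(theta_b)
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    probs = [e / z for e in exps]

    # One-step advantage (no next state in a bandit): A = r - V
    advantage = reward - value

    # Policy gradient for softmax: grad_{theta_b} log pi(a) = 1{a=b} - pi(b)
    new_theta = [
        t + lr_actor * advantage * ((1.0 if b == action else 0.0) - probs[b])
        for b, t in enumerate(theta)
    ]
    # Critic moves V toward the observed reward
    new_value = value + lr_critic * advantage
    return new_theta, new_value
```

Calling this repeatedly with rewards that favor one arm raises that arm's preference while the critic tracks the average reward, shrinking the advantage signal over time.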

With Baseline

The baseline b(s_t) (typically the value function V(s_t)) is subtracted from the return:

∇_θ J(θ) = E[∇_θ log π_θ(a_t|s_t) (G_t − b(s_t))]

Effect of baseline:

  • ✓ Reduces variance (centering the signal)
  • ✓ Remains unbiased for any state-dependent baseline (subtracting b(s_t) does not change the expected gradient)
  • ✗ Bootstrapped estimates (TD, GAE) do introduce bias when the value function is inaccurate

Generalized Advantage Estimation (GAE)

Rather than choosing one specific advantage estimator, GAE interpolates between all of them:

Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

or equivalently, as a backward recursion:

Â_t = δ_t + γλ Â_{t+1}

where δ_t = r_t + γ V(s_{t+1}) − V(s_t) is the TD error.

Hyperparameter λ

  • λ = 0: Single-step TD advantage (high bias, low variance)
  • λ = 1: Monte Carlo advantage (unbiased, high variance)
  • 0 < λ < 1: Interpolation (tunable bias-variance tradeoff)

Common choice: λ = 0.95 or λ = 0.97
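The two endpoints can be checked numerically with a tiny recursive implementation (a hedged sketch with toy numbers and γ = 1, not the note's reference code): λ = 0 reproduces the one-step TD errors, and λ = 1 reproduces the Monte Carlo advantages G_t − V(s_t).

```python
def gae(rewards, values, gamma, lam):
    """Recursive GAE: A_t = delta_t + gamma * lam * A_{t+1} (terminal next value 0)."""
    advantages, a = [], 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        a = delta + gamma * lam * a
        advantages.append(a)
    return advantages[::-1]

rewards, values = [1.0, 1.0], [0.5, 0.5]
print(gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 0.5] -> the TD errors
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [1.5, 0.5] -> G_t - V(s_t)
```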


Properties

Unbiased Property

If the critic is perfect (V = V^π), then:

E[Â_t] = A^π(s_t, a_t)

The advantage estimate is unbiased.

Variance Reduction

Subtracting the value function baseline reduces variance without introducing bias, since the baseline term has zero expectation:

E_{a∼π}[∇_θ log π_θ(a|s) V(s)] = V(s) Σ_a ∇_θ π_θ(a|s) = 0

Causality in TD

The TD advantage only uses rewards and values from time t onward:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

This respects causality: the action at time t cannot affect rewards received before time t.


Implementations

PyTorch Example

import torch

def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages.

    Note: `lambda` is a reserved word in Python, so the parameter is named `lam`.
    """
    advantages = []
    gae = 0.0

    # Backward pass through trajectory
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_value = 0.0  # Terminal state: no bootstrap value
        else:
            next_value = values[t + 1]

        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]

        # Accumulate GAE: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)

    return torch.tensor(advantages)

Comparison: Advantage Estimators

Estimator   Formula                                          Bias      Variance
1-step TD   r_t + γV(s_{t+1}) − V(s_t)                       High      Low
n-step      Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n}) − V(s_t)   Medium    Medium
MC          G_t − V(s_t)                                     None      High
GAE         Σ_l (γλ)^l δ_{t+l}                               Tunable   Tunable


Key References

  1. Schulman, J., et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
  2. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning (A3C). ICML.
  3. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv.