Generalized Advantage Estimation (GAE)
Definition
Generalized Advantage Estimation is a method for estimating the advantage function that interpolates between the single-step temporal difference estimate (low variance, high bias) and the Monte Carlo estimate (unbiased, high variance) using a single hyperparameter $\lambda \in [0, 1]$.
Problem It Solves
The Bias-Variance Tradeoff
Different advantage estimators have different properties:
- 1-step TD advantage: $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
  - ✓ Low variance
  - ✗ High bias (only one step of actual reward)
- $n$-step advantage: $\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) - V(s_t)$
  - Medium bias and variance
- Monte Carlo advantage: $\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$
  - ✓ Unbiased
  - ✗ High variance (depends on the entire trajectory)
GAE's solution: use an exponentially weighted combination of all these $n$-step estimators, controlled by a single parameter $\lambda$.
Mathematical Formulation
GAE Definition
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, \hat{A}_t^{(n)}$$
where $\hat{A}_t^{(n)}$ is the $n$-step advantage.
Equivalent Form (TD Residual)
GAE can be expressed as an exponential sum of temporal difference errors:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual (equal to the 1-step advantage).
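The equivalence of the two forms follows from expanding each $n$-step advantage as a telescoping sum of TD residuals:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l \delta_{t+l},$$

so that

$$(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}
= \delta_t\,(1-\lambda)(1 + \lambda + \lambda^2 + \cdots)
+ \gamma\,\delta_{t+1}\,(1-\lambda)(\lambda + \lambda^2 + \cdots) + \cdots
= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l},$$

since the geometric coefficient of each $\gamma^l \delta_{t+l}$ sums to $\lambda^l$.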
Recursive Computation
For efficient computation, compute advantages backwards through the trajectory:
$$\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$$
with $\hat{A}_{t+1} = 0$ at the terminal state.
Pseudocode:
```
gae = 0
for t in reversed(range(T)):
    delta_t = rewards[t] + gamma * V(s_{t+1}) - V(s_t)
    gae = delta_t + gamma * lambda * gae
    advantages[t] = gae
```
Hyperparameter $\lambda$
The parameter $\lambda \in [0, 1]$ controls the bias-variance tradeoff:
$\lambda = 0$ (TD)
- Only uses immediate reward and next state value
- Bias: High (only 1-step lookahead)
- Variance: Low
$\lambda = 1$ (Monte Carlo)
- Uses the entire trajectory (full return $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$)
- Bias: Zero (unbiased)
- Variance: High (depends on full episode)
$0 < \lambda < 1$ (Interpolation)
Smooth tradeoff between bias and variance.
Common choices:
- $\lambda = 0.95$ (slightly favors MC, good for most domains)
- $\lambda = 0.98$ (more MC-like, less bias)
- $\lambda = 0.9$ (balanced)
Intuition: Why the Weighted Combination?
The weights $(1-\lambda)\lambda^{n-1}$ give exponentially decaying importance to longer TD chains:
- 1-step: weight $(1-\lambda)$
- 2-step: weight $(1-\lambda)\lambda$
- 3-step: weight $(1-\lambda)\lambda^2$
- …
Lower $\lambda$: weight concentrated on short chains (low variance)
Higher $\lambda$: weight spread across longer chains (less bias)
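A quick numerical illustration (the value $\lambda = 0.9$ is chosen arbitrarily here): the weights $(1-\lambda)\lambda^{n-1}$ form a geometric distribution over the $n$-step estimators, so they decay exponentially and sum to 1.

```python
# The GAE weights (1 - lam) * lam**(n-1) over n-step estimators form a
# geometric distribution: exponentially decaying, summing to 1.
lam = 0.9

weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]

for n, w in enumerate(weights[:3], start=1):
    print(f"{n}-step weight: {w:.4f}")

print("total weight:", round(sum(weights), 4))  # → 1.0
```

Lower $\lambda$ makes this distribution concentrate on the first few terms; higher $\lambda$ spreads it over longer chains.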
Properties
1. Exponential Weighting
The decay factor $\gamma\lambda < 1$ ensures:
- Distant future terms contribute exponentially less
- Numerical stability (sum converges)
2. Consistency
- At $\lambda = 0$: consistent with 1-step TD
- At $\lambda = 1$: consistent with the MC return
- For $0 < \lambda < 1$: smooth interpolation
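These limiting cases are easy to check numerically. The sketch below (toy rewards and values, not from the source) runs the backward recursion and verifies that $\lambda = 0$ recovers the 1-step TD residual and $\lambda = 1$ recovers the Monte Carlo advantage:

```python
# Verify the two limiting cases of GAE on a short toy trajectory
# with a bootstrap value of 0 at the terminal state.
def gae(rewards, values, gamma, lam):
    # values has length len(rewards) + 1; values[-1] is the terminal bootstrap
    adv, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        out[t] = adv
    return out

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, 0.1, 0.0]  # V(s_0..s_2) plus terminal bootstrap
gamma = 0.99

# lambda = 0 reduces to the 1-step TD residual delta_t
td = gae(rewards, values, gamma, lam=0.0)
assert abs(td[0] - (1.0 + 0.99 * 0.2 - 0.5)) < 1e-9

# lambda = 1 reduces to the MC advantage: discounted return - V(s_t)
mc = gae(rewards, values, gamma, lam=1.0)
ret0 = 1.0 + 0.99 * 0.0 + 0.99 ** 2 * 2.0
assert abs(mc[0] - (ret0 - 0.5)) < 1e-9
print("limiting cases check out")
```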
3. Causality
Each $\delta_t$ depends only on the current transition $(s_t, a_t, r_t, s_{t+1})$:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
This respects causality: the action at time $t$ only affects rewards from time $t$ onward.
4. Off-Policy Extension
GAE can be extended to off-policy learning using importance sampling, e.g. by weighting each TD residual with the ratio $\rho_t = \pi(a_t \mid s_t) / \mu(a_t \mid s_t)$ between the target policy $\pi$ and the behavior policy $\mu$, typically truncated for stability (as in V-trace-style estimators).
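One way this correction can look in code (a hedged sketch, not the source's method: the truncation-at-1 and the placement of $\rho_t$ inside the recursion follow V-trace-style estimators):

```python
# V-trace-style off-policy sketch: clip the importance ratio
# rho_t = pi(a|s) / mu(a|s) and apply it inside the backward recursion.
def off_policy_gae(rewards, values, ratios, gamma=0.99, lam=0.95, rho_max=1.0):
    adv, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        rho = min(ratios[t], rho_max)  # truncated importance weight
        delta = rho * (rewards[t] + gamma * values[t + 1] - values[t])
        adv = delta + gamma * lam * rho * adv
        out[t] = adv
    return out

# On-policy data (all ratios = 1) recovers ordinary GAE.
adv = off_policy_gae([1.0, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0],
                     gamma=1.0, lam=1.0)
print(adv)  # → [3.0, 2.0]
```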
Comparison with Alternatives
| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| TD(0) | $\delta_t$ | High | Low |
| TD($\lambda$) | trace-based | Medium | Medium |
| GAE($\lambda$) | $\sum_l (\gamma\lambda)^l \delta_{t+l}$ | Tunable | Tunable |
| MC | $\sum_l \gamma^l r_{t+l} - V(s_t)$ | None | High |
Algorithm: A3C with GAE
```python
import torch

def compute_gae_advantages(trajectory, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a trajectory.

    Note: `lambda` is a reserved word in Python, so the parameter is `lam`.
    """
    advantages = []
    gae = 0.0
    for t in reversed(range(len(trajectory))):
        state, action, reward, next_state, done = trajectory[t]
        value = values[state]
        next_value = 0.0 if done else values[next_state]
        # TD residual
        delta = reward + gamma * next_value - value
        # Exponential sum: A_t = delta_t + (gamma * lambda) * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages

def policy_update(advantages, log_probs):
    """Update policy using GAE advantages (policy-gradient loss)."""
    advantages = torch.as_tensor(advantages)
    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    return policy_loss
```
Empirical Performance
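As a self-contained sanity check (toy numbers assumed, not from the source), the backward recursion can be compared against the direct exponential sum $\sum_l (\gamma\lambda)^l \delta_{t+l}$; on a finite trajectory with a zero terminal bootstrap the two agree exactly:

```python
# Check that the backward recursion matches the direct exponential
# sum of TD residuals (toy values; names are illustrative).
rewards = [1.0, -0.5, 2.0]
values = [0.3, 0.1, 0.4, 0.0]  # V(s_0..s_2) plus a terminal bootstrap of 0
gamma, lam = 0.99, 0.95

deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]

# Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
adv, recursive = 0.0, []
for t in reversed(range(3)):
    adv = deltas[t] + gamma * lam * adv
    recursive.insert(0, adv)

# Direct sum: A_t = sum_l (gamma * lam)^l * delta_{t+l}
direct = [sum((gamma * lam) ** l * deltas[t + l] for l in range(3 - t))
          for t in range(3)]

assert all(abs(a - b) < 1e-9 for a, b in zip(recursive, direct))
print("recursion matches direct sum")
```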
GAE is widely used because:
✓ Practical: Single hyperparameter controls tradeoff
✓ Efficient: The backward pass is $O(T)$ per trajectory
✓ Flexible: Works with different value function approximators
✓ Empirically strong: Consistently outperforms pure TD or MC
Typical results: with $\lambda \approx 0.95$, GAE achieves:
- Lower sample complexity than MC
- Lower bias than TD
- Faster convergence than either alone
Related Hyperparameters
When using GAE, also tune:
$\gamma$ (Discount Factor)
- Affects TD residual magnitude
- Typically $\gamma = 0.99$ for long horizons
Value Function Learning Rate
- GAE quality depends on accurate value estimates $V(s)$
- Needs sufficient critic updates per policy update
Entropy Coefficient (for entropy regularization)
- Can be paired with GAE in policy methods
- Encourages exploration
Connections
- Extends: Advantage Function, Temporal Difference Learning
- Used in: A3C, A2C, TRPO, PPO
- Related to: Bias-Variance Trade-off, Value Function
- Appears in: Actor-Critic, Policy Gradient Methods
Key References
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
  - Original paper introducing GAE
- Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
  - Asynchronous policy gradient method (A3C), often combined with GAE
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
  - Standard method using GAE for advantage estimation