Generalized Advantage Estimation (GAE)

Definition

Generalized Advantage Estimation is a method for estimating the advantage function that interpolates between single-step temporal difference estimates (high bias, low variance) and Monte Carlo estimates (unbiased, high variance) using a single hyperparameter λ ∈ [0, 1].


Problem It Solves

The Bias-Variance Tradeoff

Different advantage estimators have different properties:

  1. 1-step TD advantage: Â_t^(1) = r_t + γV(s_{t+1}) − V(s_t)

    • ✓ Low variance
    • ✗ High bias (only one step of real reward; the rest comes from the approximate V)
  2. n-step advantage: Â_t^(n) = r_t + γr_{t+1} + … + γ^(n−1) r_{t+n−1} + γ^n V(s_{t+n}) − V(s_t)

    • Medium bias and variance
  3. Monte Carlo advantage: Â_t^(∞) = Σ_{l≥0} γ^l r_{t+l} − V(s_t)

    • ✓ Unbiased
    • ✗ High variance (depends on the entire trajectory)

GAE’s solution: Use an exponentially weighted combination of all of these n-step estimators, controlled by a single parameter λ.
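As a concrete illustration, the three estimators can be computed by hand on a short trajectory. All numbers below (rewards, value estimates, γ) are made up for illustration:

```python
# Toy comparison of advantage estimators at t = 0 on a 3-step episode.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]        # r_0, r_1, r_2 (episode ends after r_2)
values  = [0.5, 1.0, 1.5, 0.0]   # V(s_0..s_3); V(s_3) = 0 at the terminal state

# 1-step TD advantage: r_0 + gamma * V(s_1) - V(s_0)
a_td = rewards[0] + gamma * values[1] - values[0]

# 2-step advantage: r_0 + gamma * r_1 + gamma^2 * V(s_2) - V(s_0)
a_2step = rewards[0] + gamma * rewards[1] + gamma**2 * values[2] - values[0]

# Monte Carlo advantage: full discounted return minus V(s_0)
ret = sum(gamma**k * r for k, r in enumerate(rewards))
a_mc = ret - values[0]

print(a_td, a_2step, a_mc)
```

The three estimates differ because each trusts the learned value function for a different portion of the future.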


Mathematical Formulation

GAE Definition

    Â_t^GAE(γ,λ) = (1 − λ) Σ_{n=1}^∞ λ^(n−1) Â_t^(n)

where Â_t^(n) is the n-step advantage estimate.

Equivalent Form (TD Residual)

GAE can be expressed as an exponentially weighted sum of temporal difference errors:

    Â_t^GAE(γ,λ) = Σ_{l=0}^∞ (γλ)^l δ_{t+l}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD residual (the 1-step advantage).

Recursive Computation

For efficient computation, compute advantages backwards through the trajectory:

    Â_t = δ_t + γλ Â_{t+1}

with Â_T = 0 at the terminal state.

Pseudocode:

gae = 0
for t in reversed(range(T)):
    delta_t = rewards[t] + gamma * V(s_{t+1}) - V(s_t)
    gae = delta_t + gamma * lambda * gae
    advantages[t] = gae
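The backward recursion can be checked against the direct exponential-sum definition; they should agree exactly. The numbers below are arbitrary illustration values:

```python
# Verify: backward recursion A_t = delta_t + gamma*lam*A_{t+1}
# equals the direct sum A_t = sum_l (gamma*lam)^l * delta_{t+l}.
gamma, lam, T = 0.99, 0.95, 5
rewards = [1.0, 0.5, -0.2, 2.0, 0.0]
V = [0.3, 0.8, 0.1, 1.2, 0.4, 0.0]   # V(s_0..s_T); V(s_T) = 0 (terminal)

deltas = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(T)]

# Backward recursion (one pass, O(T))
adv_rec, gae = [0.0] * T, 0.0
for t in reversed(range(T)):
    gae = deltas[t] + gamma * lam * gae
    adv_rec[t] = gae

# Direct exponential sum (O(T^2), for comparison only)
adv_direct = [sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
              for t in range(T)]

assert all(abs(a - b) < 1e-12 for a, b in zip(adv_rec, adv_direct))
```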

Hyperparameter λ

The parameter λ ∈ [0, 1] controls the bias-variance tradeoff:

λ = 0 (TD)

  • Only uses immediate reward and next state value
  • Bias: High (only 1-step lookahead)
  • Variance: Low

λ = 1 (Monte Carlo)

  • Uses the entire trajectory (the full discounted return Σ_l γ^l r_{t+l})
  • Bias: Zero (unbiased)
  • Variance: High (depends on full episode)

0 < λ < 1 (Interpolation)

Smooth tradeoff between bias and variance.

Common choices:

  • λ = 0.95 (slightly favors MC, a good default for most domains)
  • λ = 0.97–0.99 (more MC-like, less bias)
  • λ = 0.90 (balanced, lower variance)

Intuition: Why the Weighted Combination?

The weights (1 − λ)λ^(n−1) give exponentially decaying importance to longer TD chains:

  • 1-step: weight (1 − λ)
  • 2-step: weight (1 − λ)λ
  • 3-step: weight (1 − λ)λ²

Lower λ: Weight concentrated on short chains (low variance)
Higher λ: Weight spread across longer chains (less bias)
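These geometric weights can be inspected directly; they form a probability distribution over chain lengths (λ = 0.95 here is just an illustrative choice):

```python
# Weights (1 - lam) * lam**(n-1) on the n-step advantage estimators.
lam = 0.95
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 1001)]

print(weights[:3])   # weights on the 1-, 2-, and 3-step advantages
print(sum(weights))  # converges to 1 as the horizon grows
```

With a smaller λ (say 0.5) the first few weights dominate, which is exactly the low-variance, high-bias end of the tradeoff.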


Properties

1. Exponential Weighting

The decay factor γλ < 1 ensures:

  • Distant future terms contribute exponentially less
  • Numerical stability (sum converges)

2. Consistency

  • At λ = 0: Consistent with 1-step TD
  • At λ = 1: Consistent with the MC return
  • At 0 < λ < 1: Smooth interpolation
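The two limiting cases can be verified numerically: at λ = 0 the GAE recursion returns exactly the TD residuals, and at λ = 1 it returns the discounted return minus the baseline V(s_t). The trajectory numbers are illustrative:

```python
# Sanity-check the limiting cases of GAE on a toy 4-step episode.
gamma, T = 0.99, 4
rewards = [1.0, -0.5, 0.3, 2.0]
V = [0.2, 0.7, -0.1, 0.9, 0.0]       # V(s_0..s_T); V(s_T) = 0 (terminal)

def gae_advantages(lam):
    adv, gae = [0.0] * T, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * V[t + 1] - V[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

deltas = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(T)]
mc = [sum(gamma**k * rewards[t + k] for k in range(T - t)) - V[t]
      for t in range(T)]

# lam = 0 reduces to 1-step TD; lam = 1 reduces to MC (telescoping sum).
assert all(abs(a - d) < 1e-12 for a, d in zip(gae_advantages(0.0), deltas))
assert all(abs(a - m) < 1e-12 for a, m in zip(gae_advantages(1.0), mc))
```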

3. Causality

Each δ_t depends only on the transition (s_t, a_t, r_t, s_{t+1}):

    δ_t = r_t + γV(s_{t+1}) − V(s_t)

This respects causality: the advantage at time t weights only the current and future TD residuals, so the action a_t is credited only for rewards from time t onward.

4. Off-Policy Extension

GAE can be extended to off-policy learning using importance sampling, e.g. by scaling each TD residual with the ratio ρ_t = π(a_t|s_t) / μ(a_t|s_t) between the target policy π and the behavior policy μ (in practice clipped for stability, as in V-trace-style estimators).
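A minimal sketch of one assumed form of this extension, in the spirit of V-trace: each TD residual is scaled by a clipped importance ratio. All probabilities and trajectory numbers are hypothetical illustration values, not a definitive implementation:

```python
# Sketch: importance-weighted GAE-style recursion (assumed form).
gamma, lam, T = 0.99, 0.95, 3
rewards = [1.0, 0.0, 0.5]
V = [0.2, 0.6, 0.1, 0.0]        # V(s_0..s_T); V(s_T) = 0 (terminal)
pi_probs = [0.4, 0.5, 0.9]      # target-policy probs pi(a_t|s_t), illustrative
mu_probs = [0.5, 0.5, 0.3]      # behavior-policy probs mu(a_t|s_t), illustrative

gae, advantages = 0.0, [0.0] * T
for t in reversed(range(T)):
    rho = min(pi_probs[t] / mu_probs[t], 1.0)   # clipped importance ratio
    delta = rho * (rewards[t] + gamma * V[t + 1] - V[t])
    gae = delta + gamma * lam * rho * gae       # ratio also damps the trace
    advantages[t] = gae
print(advantages)
```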


Comparison with Alternatives

  Estimator   Formula                            Bias      Variance
  TD(0)       δ_t (λ = 0)                        High      Low
  TD(λ)       trace-based                        Medium    Medium
  GAE(λ)      Σ_l (γλ)^l δ_{t+l}                 Tunable   Tunable
  MC          Σ_l γ^l r_{t+l} − V(s_t) (λ = 1)   None      High

Algorithm: A3C with GAE

import torch

def compute_gae_advantages(trajectory, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a trajectory.

    trajectory: list of (state, action, reward, next_state, done) tuples
    values: mapping from state to its estimated value V(s)
    """
    advantages = []
    gae = 0.0

    for t in reversed(range(len(trajectory))):
        state, action, reward, next_state, done = trajectory[t]
        value = values[state]
        next_value = 0.0 if done else values[next_state]

        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = reward + gamma * next_value - value

        # Recursion: A_t = delta_t + (gamma * lam) * A_{t+1}
        # ("lam" rather than "lambda", which is a reserved keyword in Python)
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)

    return advantages

def policy_update(advantages, log_probs):
    """Compute the policy-gradient loss from GAE advantages."""
    advantages = torch.tensor(advantages)
    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    return policy_loss

Empirical Performance

GAE is widely used because:

  • Practical: A single hyperparameter λ controls the tradeoff
  • Efficient: The backward pass is O(T) per trajectory
  • Flexible: Works with different value function approximators
  • Empirically strong: Consistently outperforms pure TD or MC

Typical results: With λ ≈ 0.95, GAE achieves:

  • Lower sample complexity than MC
  • Lower bias than TD
  • Faster convergence than either alone

When using GAE, also tune:

γ (Discount Factor)

  • Affects the magnitude of the TD residuals
  • Typically: γ = 0.99 for long horizons

Value Function Learning Rate

  • GAE quality depends on accurate V(s) estimates
  • Needs sufficient critic updates per policy update

Entropy Coefficient (for entropy regularization)

  • Can be paired with GAE in policy gradient methods
  • Encourages exploration


Key References

  1. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.

    • Original paper introducing GAE
  2. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning (A3C). ICML.

    • Asynchronous policy gradient method, frequently combined with GAE-style advantage estimates
  3. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

    • Standard method using GAE for advantage estimation