Generalized Advantage Estimation (GAE)
Definition
Generalized Advantage Estimation is a method for estimating the advantage function that interpolates between the single-step temporal difference estimate (low variance, high bias) and the Monte Carlo estimate (unbiased, high variance) using a single hyperparameter $\lambda \in [0, 1]$.
Problem It Solves
The Bias-Variance Tradeoff
Different advantage estimators have different properties:
- 1-step TD advantage: $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
  - ✓ Low variance
  - ✗ High bias (only one step of actual reward)
- $n$-step advantage: $\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) - V(s_t)$
  - Medium bias and variance
- Monte Carlo advantage: $\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$
  - ✓ Unbiased
  - ✗ High variance (depends on the entire trajectory)
GAE's solution: use an exponentially weighted combination of all these $n$-step estimators, controlled by a single parameter $\lambda$.
Mathematical Formulation
GAE Definition
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, \hat{A}_t^{(n)}$$
where $\hat{A}_t^{(n)}$ is the $n$-step advantage.
Equivalent Form (TD Residual)
GAE can be expressed as an exponential sum of temporal difference errors:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual (equal to the 1-step advantage).
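The equivalence of the two forms follows from expanding each $n$-step advantage as a telescoping sum of TD residuals:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l \delta_{t+l},$$

so that

$$(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}
= \delta_t\,(1-\lambda)(1 + \lambda + \lambda^2 + \cdots)
+ \gamma\,\delta_{t+1}\,(1-\lambda)(\lambda + \lambda^2 + \cdots) + \cdots
= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l},$$

since the geometric coefficient of each $\gamma^l \delta_{t+l}$ sums to $\lambda^l$.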
Recursive Computation
For efficient computation, compute advantages backwards through the trajectory:
$$\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$$
with $\hat{A}_{t+1} = 0$ at the terminal state.
Pseudocode:
```
gae = 0
for t in reversed(range(T)):
    delta_t = rewards[t] + gamma * V(s_{t+1}) - V(s_t)
    gae = delta_t + gamma * lambda * gae
    advantages[t] = gae
```
Hyperparameter $\lambda$
The parameter $\lambda \in [0, 1]$ controls the bias-variance tradeoff:
$\lambda = 0$ (TD)
- Only uses immediate reward and next state value
- Bias: High (only 1-step lookahead)
- Variance: Low
$\lambda = 1$ (Monte Carlo)
- Uses the entire trajectory (full return $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$)
- Bias: Zero (unbiased)
- Variance: High (depends on full episode)
$0 < \lambda < 1$ (Interpolation)
Smooth tradeoff between bias and variance.
Common choices:
- $\lambda = 0.95$ (slightly favors MC, good for most domains)
- $\lambda = 0.98$ (more MC-like, less bias)
- $\lambda = 0.9$ (balanced)
Intuition: Why the Weighted Combination?
The weights $(1-\lambda)\lambda^{n-1}$ give exponentially decaying importance to longer TD chains:
- 1-step: weight $(1-\lambda)$
- 2-step: weight $(1-\lambda)\lambda$
- 3-step: weight $(1-\lambda)\lambda^2$
- …
Lower $\lambda$: weight concentrated on short chains (low variance)
Higher $\lambda$: weight spread across longer chains (less bias)
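A quick numerical illustration (the value $\lambda = 0.9$ is chosen arbitrarily here): the weights $(1-\lambda)\lambda^{n-1}$ form a geometric distribution over the $n$-step estimators, so they decay exponentially and sum to 1.

```python
# The GAE weights (1 - lam) * lam**(n-1) over n-step estimators form a
# geometric distribution: exponentially decaying, summing to 1.
lam = 0.9

weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]

for n, w in enumerate(weights[:3], start=1):
    print(f"{n}-step weight: {w:.4f}")

print("total weight:", round(sum(weights), 4))  # → 1.0
```

Lower $\lambda$ makes this distribution concentrate on the first few terms; higher $\lambda$ spreads it over longer chains.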
Properties
1. Exponential Weighting
The decay factor $\gamma\lambda < 1$ ensures:
- Distant future terms contribute exponentially less
- Numerical stability (sum converges)
2. Consistency
- At $\lambda = 0$: consistent with 1-step TD
- At $\lambda = 1$: consistent with the MC return
- For $0 < \lambda < 1$: smooth interpolation
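These limiting cases are easy to check numerically. The sketch below (toy rewards and values, not from the source) runs the backward recursion and verifies that $\lambda = 0$ recovers the 1-step TD residual and $\lambda = 1$ recovers the Monte Carlo advantage:

```python
# Verify the two limiting cases of GAE on a short toy trajectory
# with a bootstrap value of 0 at the terminal state.
def gae(rewards, values, gamma, lam):
    # values has length len(rewards) + 1; values[-1] is the terminal bootstrap
    adv, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        out[t] = adv
    return out

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, 0.1, 0.0]  # V(s_0..s_2) plus terminal bootstrap
gamma = 0.99

# lambda = 0 reduces to the 1-step TD residual delta_t
td = gae(rewards, values, gamma, lam=0.0)
assert abs(td[0] - (1.0 + 0.99 * 0.2 - 0.5)) < 1e-9

# lambda = 1 reduces to the MC advantage: discounted return - V(s_t)
mc = gae(rewards, values, gamma, lam=1.0)
ret0 = 1.0 + 0.99 * 0.0 + 0.99 ** 2 * 2.0
assert abs(mc[0] - (ret0 - 0.5)) < 1e-9
print("limiting cases check out")
```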
3. Causality
Each $\delta_t$ depends only on the current transition $(s_t, a_t, r_t, s_{t+1})$:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
This respects causality: the action at time $t$ only affects rewards from time $t$ onward.
4. Off-Policy Extension
GAE can be extended to off-policy learning using importance sampling, e.g. by weighting each TD residual with the ratio $\rho_t = \pi(a_t \mid s_t) / \mu(a_t \mid s_t)$ between the target policy $\pi$ and the behavior policy $\mu$, typically truncated for stability (as in V-trace-style estimators).
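One way this correction can look in code (a hedged sketch, not the source's method: the truncation-at-1 and the placement of $\rho_t$ inside the recursion follow V-trace-style estimators):

```python
# V-trace-style off-policy sketch: clip the importance ratio
# rho_t = pi(a|s) / mu(a|s) and apply it inside the backward recursion.
def off_policy_gae(rewards, values, ratios, gamma=0.99, lam=0.95, rho_max=1.0):
    adv, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        rho = min(ratios[t], rho_max)  # truncated importance weight
        delta = rho * (rewards[t] + gamma * values[t + 1] - values[t])
        adv = delta + gamma * lam * rho * adv
        out[t] = adv
    return out

# On-policy data (all ratios = 1) recovers ordinary GAE.
adv = off_policy_gae([1.0, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0],
                     gamma=1.0, lam=1.0)
print(adv)  # → [3.0, 2.0]
```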
Comparison with Alternatives
| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| TD(0) | $\delta_t$ | High | Low |
| TD($\lambda$) | trace-based | Medium | Medium |
| GAE($\lambda$) | $\sum_l (\gamma\lambda)^l \delta_{t+l}$ | Tunable | Tunable |
| MC | $\sum_l \gamma^l r_{t+l} - V(s_t)$ | None | High |
Algorithm: A3C with GAE
```python
import torch

def compute_gae_advantages(trajectory, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a trajectory.

    Note: `lambda` is a reserved word in Python, so the parameter is `lam`.
    """
    advantages = []
    gae = 0.0
    for t in reversed(range(len(trajectory))):
        state, action, reward, next_state, done = trajectory[t]
        value = values[state]
        next_value = 0.0 if done else values[next_state]
        # TD residual
        delta = reward + gamma * next_value - value
        # Exponential sum: A_t = delta_t + (gamma * lambda) * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages

def policy_update(advantages, log_probs):
    """Update policy using GAE advantages (policy-gradient loss)."""
    advantages = torch.as_tensor(advantages)
    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    return policy_loss
```
Empirical Performance
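As a self-contained sanity check (toy numbers assumed, not from the source), the backward recursion can be compared against the direct exponential sum $\sum_l (\gamma\lambda)^l \delta_{t+l}$; on a finite trajectory with a zero terminal bootstrap the two agree exactly:

```python
# Check that the backward recursion matches the direct exponential
# sum of TD residuals (toy values; names are illustrative).
rewards = [1.0, -0.5, 2.0]
values = [0.3, 0.1, 0.4, 0.0]  # V(s_0..s_2) plus a terminal bootstrap of 0
gamma, lam = 0.99, 0.95

deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]

# Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
adv, recursive = 0.0, []
for t in reversed(range(3)):
    adv = deltas[t] + gamma * lam * adv
    recursive.insert(0, adv)

# Direct sum: A_t = sum_l (gamma * lam)^l * delta_{t+l}
direct = [sum((gamma * lam) ** l * deltas[t + l] for l in range(3 - t))
          for t in range(3)]

assert all(abs(a - b) < 1e-9 for a, b in zip(recursive, direct))
print("recursion matches direct sum")
```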
GAE is widely used because:
✓ Practical: Single hyperparameter controls tradeoff
✓ Efficient: The backward pass is $O(T)$ per trajectory
✓ Flexible: Works with different value function approximators
✓ Empirically strong: Consistently outperforms pure TD or MC
Typical results: with $\lambda \approx 0.95$, GAE achieves:
- Lower sample complexity than MC
- Lower bias than TD
- Faster convergence than either alone
Related Hyperparameters
When using GAE, also tune:
$\gamma$ (Discount Factor)
- Affects TD residual magnitude
- Typically $\gamma = 0.99$ for long horizons
Value Function Learning Rate
- GAE quality depends on accurate value estimates $V(s)$
- Needs sufficient critic updates per policy update
Entropy Coefficient (for entropy regularization)
- Can be paired with GAE in policy methods
- Encourages exploration
Connections
- Extends: Advantage Function, Temporal Difference Learning
- Used in: A3C, A2C, TRPO, PPO
- Related to: Bias-Variance Trade-off, Value Function
- Appears in: Actor-Critic, Policy Gradient Methods
Key References
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
  - Original paper introducing GAE
- Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
  - Asynchronous policy gradient method (A3C), often combined with GAE
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
  - Standard method using GAE for advantage estimation