Natural Policy Gradient (NPG)
Definition
Natural Policy Gradient is an improvement over vanilla policy gradient that accounts for the geometry of the policy distribution space. Instead of moving in the direction of steepest parameter change, it moves in the direction of steepest policy change (as measured by KL divergence).
Motivation: The Problem with Vanilla Gradients
Vanilla (Steepest Ascent in Parameter Space)
The standard policy gradient update is steepest ascent in parameter space:

θ ← θ + α ∇_θ J(θ), where ∇_θ J(θ) = 𝔼[∇_θ log π_θ(a|s) · A(s,a)]
Problem: The step size depends on how much a parameter change affects the policy itself.
Example: For different parameterizations of the same policy (e.g., a Gaussian whose scale is parameterized as σ vs. log σ), the same parameter change can cause vastly different policy changes.
Result: Training is unstable, hyperparameter-sensitive, and slow to converge.
Natural Gradient (Steepest Ascent in Policy Space)
Instead, move in the direction that achieves the steepest policy improvement per unit of KL divergence:

θ ← θ + α F(θ)⁻¹ ∇_θ J(θ)

where F(θ) is the Fisher Information Matrix (defined below).
The Fisher Information Matrix
Definition
The Fisher Information Matrix (or Fisher Information in the policy gradient context) is:

F(θ) = 𝔼_{s,a∼π_θ}[∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)ᵀ]

Dimensions: n × n matrix, where n = number of policy parameters.
Interpretation
- Positive semi-definite: Measures the curvature of the log-likelihood
- Large eigenvalues: Directions where small parameter changes cause large policy changes
- Small eigenvalues: Directions where large parameter changes cause small policy changes
Connection to KL Divergence
The Fisher Information Matrix is the Hessian of the KL divergence at the current parameters:

D_KL(π_θ ‖ π_{θ+Δθ}) ≈ ½ Δθᵀ F(θ) Δθ

This is the local quadratic approximation to the KL divergence near θ.
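A quick numerical check of this quadratic approximation, for a 1-D Gaussian where both the KL divergence and the Fisher matrix, F = diag(1/σ², 2/σ²) in (μ, σ) coordinates, are known in closed form (the numbers here are arbitrary illustration values):

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, sigma = 0.0, 0.5
d_mu, d_sigma = 1e-3, 1e-3

# Quadratic form 0.5 * dtheta^T F dtheta with F = diag(1/sigma^2, 2/sigma^2)
quad = 0.5 * (d_mu**2 / sigma**2 + 2 * d_sigma**2 / sigma**2)

exact = kl_gaussian(mu, sigma, mu + d_mu, sigma + d_sigma)
print(quad, exact)  # agree to leading order for small perturbations
```

The agreement degrades for large Δθ, which is exactly why trust region methods keep the step inside a small KL ball.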
Natural Gradient Update Rule
Standard Form

θ_{k+1} = θ_k + α F(θ_k)⁻¹ ∇_θ J(θ_k)
Interpretation
- ∇_θ J(θ): Direction of steepest ascent in parameter space
- F(θ)⁻¹: Rescaling matrix that converts the parameter-space gradient into a policy-space direction
- Effect: Large steps in directions that barely change the policy, small steps in directions that dramatically change the policy
Practical Advantages
1. Parameterization Invariance
The natural gradient is invariant to how you parameterize the policy. Whether you use:
- σ vs. log σ vs. other parameterizations of the same distribution
- Different neural network architectures
The natural gradient produces the same policy update (though different parameter updates).
Vanilla gradient: Different for each parameterization
Natural gradient: Same policy update regardless of parameterization
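A toy check of this invariance, sketched for a 1-D Gaussian with a hypothetical objective J(σ) = −(σ − 1)² and the known scale Fisher F_σ = 2/σ² (all numbers here are illustration values, not a real policy):

```python
import math

def dJ_dsigma(sigma):
    # Gradient of the toy objective J(sigma) = -(sigma - 1)^2
    return -2.0 * (sigma - 1.0)

sigma, alpha = 0.5, 1e-3

# Parameterization A: theta = sigma.  Fisher for the Gaussian scale: F = 2 / sigma^2
g_sigma = dJ_dsigma(sigma)
vanilla_A = alpha * g_sigma
natural_A = alpha * (sigma**2 / 2.0) * g_sigma

# Parameterization B: theta = rho = log(sigma).  Chain rule: dJ/drho = sigma * dJ/dsigma
# Fisher transforms as F_rho = (dsigma/drho)^2 * F_sigma = 2
g_rho = sigma * g_sigma
vanilla_B_in_sigma = sigma * math.exp(alpha * g_rho) - sigma        # induced sigma change
natural_B_in_sigma = sigma * math.exp(alpha * g_rho / 2.0) - sigma

print(vanilla_A, vanilla_B_in_sigma)   # vanilla: the two updates disagree
print(natural_A, natural_B_in_sigma)   # natural: same sigma change (to first order)
```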
2. Better Convergence
- Converges faster (fewer iterations)
- More stable (less sensitive to hyperparameters)
- Larger effective step sizes
3. Automatic Step Size Scaling
The Fisher matrix automatically scales step sizes based on:
- How sensitive the policy is to parameter changes
- The variance in the gradient estimates
Computational Considerations
Computing the Fisher Matrix
Full approach (expensive):
- Compute gᵢ = ∇_θ log π_θ(aᵢ|sᵢ) for each sample
- Compute outer products gᵢgᵢᵀ, where gᵢ ∈ ℝⁿ
- Average: F ≈ (1/N) Σᵢ gᵢgᵢᵀ
- Invert: F⁻¹ (requires O(n³) operations for n-dimensional parameters)
Computational cost: O(n³), which is prohibitive for high-dimensional policies (neural networks).
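A minimal numpy sketch of this construction, using random stand-in vectors in place of real score gradients ∇_θ log π_θ(a|s):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 10_000  # n parameters, N samples

# Stand-in score vectors; in practice row i is grad_theta log pi(a_i | s_i)
G = rng.normal(size=(N, n))

# Empirical Fisher: average of outer products -> n x n PSD matrix
F = G.T @ G / N               # O(N n^2) to build
F += 1e-3 * np.eye(n)         # small damping keeps it invertible

# Natural gradient direction: solve F x = mean gradient (O(n^3) direct solve)
nat_grad = np.linalg.solve(F, G.mean(axis=0))
print(F.shape, nat_grad.shape)
```

For neural-network-sized n, both storing F and the O(n³) solve are infeasible, which motivates the approximations below.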
Efficient Approximations
1. Conjugate Gradient Method
Instead of explicitly computing F⁻¹, solve the linear system

F x = ∇_θ J(θ)

using conjugate gradient (20–50 iterations typically suffice); each iteration needs only a Fisher-vector product, never the full matrix.
Cost: O(kn) per update for k iterations (much better than O(n³))
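A self-contained sketch of conjugate gradient driven purely by matrix-vector products, demonstrated on a small random SPD matrix standing in for the Fisher:

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=20, tol=1e-10):
    """Solve A x = b for SPD A, using only matrix-vector products."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - A x (x starts at 0)
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 5 * np.eye(5)   # SPD stand-in for the (damped) Fisher matrix
g = rng.normal(size=5)

x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g, atol=1e-6))  # True
```

In an NPG implementation, `matvec` would be a KL Hessian-vector product computed with double backprop, so F is never materialized.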
2. Diagonal Fisher
Approximate F as diagonal:

F_diag ≈ diag(𝔼[∇_θ log π_θ(a|s) ⊙ ∇_θ log π_θ(a|s)])

where ⊙ is element-wise multiplication.
Cost: O(n) (trivial to invert)
Trade-off: Less accurate but much faster
3. K-FAC (Kronecker-Factored Approximation)
For neural networks, factor the Fisher as a Kronecker product of smaller matrices.
Cost: Between O(n) (diagonal) and O(n³) (full Fisher)
Accuracy: Good approximation for neural networks
Natural Policy Gradient in Practice
Common Implementation: Trust Region Methods
TRPO (Trust Region Policy Optimization) uses the natural gradient with a trust region constraint:

maximize_θ 𝔼[ (π_θ(a|s) / π_{θ_old}(a|s)) · A(s,a) ]  subject to  D_KL(π_{θ_old} ‖ π_θ) ≤ δ

The trust region constraint ensures the policy doesn't change too much (it bounds the KL divergence).
Solution: Use natural gradient with step size chosen to maintain the KL constraint.
Pseudocode
def natural_policy_gradient_update(trajectories, policy, value_fn,
                                   alpha=0.01, damping=0.01):
    """Compute a natural policy gradient step using conjugate gradient."""
    # Vanilla policy gradient (advantage-weighted score function)
    grad = compute_policy_gradient(trajectories, policy, value_fn)

    # Fisher-vector products via KL Hessian-vector products
    # (never forms the Fisher matrix explicitly)
    def fisher_vector_product(v):
        # Double backprop: H v, where H is the Hessian of the KL divergence
        return compute_hessian_vector_product(policy, v) + damping * v

    # Solve F x = grad using conjugate gradient
    natural_grad = conjugate_gradient(fisher_vector_product, grad,
                                      iterations=20)

    # Gradient ascent step along the natural direction
    theta_new = policy.parameters() + alpha * natural_grad
    return theta_new

Comparison: Vanilla vs Natural Gradient
| Aspect | Vanilla Gradient | Natural Gradient |
|---|---|---|
| Update | θ ← θ + α∇_θJ(θ) | θ ← θ + αF(θ)⁻¹∇_θJ(θ) |
| Convergence | Slower | Faster |
| Stability | Parameterization-dependent | Parameterization-invariant |
| Computation | O(n) | O(n) to O(n³) (depends on approximation) |
| Hyperparameter sensitivity | High | Lower |
| Implementation | Simple | More complex |
Example: Gaussian Policy
Vanilla Gradient
For a diagonal Gaussian policy with mean μ and standard deviation σ:

∇_μ log π(a|s) = (a − μ)/σ²,   ∇_σ log π(a|s) = ((a − μ)² − σ²)/σ³

Update each parameter separately with the same learning rate.
Natural Gradient
The Fisher matrix for a diagonal Gaussian is itself diagonal, with entries 1/σ² for each mean and 2/σ² for each standard deviation. The natural gradient therefore:
- Scales μ updates by σ² (policy uncertainty matters!)
- Scales σ updates by σ²/2
Effect: Automatically adjusts step sizes based on policy uncertainty.
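Because the Gaussian Fisher is diagonal, the natural gradient reduces to a per-coordinate rescaling. A sketch with a hypothetical 2-D action and arbitrary gradient values:

```python
import numpy as np

# Diagonal Gaussian policy in (mu, sigma) coordinates:
# F = diag(1/sigma^2, ..., 2/sigma^2, ...), so F^{-1} rescales per coordinate.
sigma = np.array([0.1, 2.0])       # one confident dim, one uncertain dim
grad_mu = np.array([1.0, 1.0])     # identical vanilla gradients
grad_sigma = np.array([1.0, 1.0])

nat_grad_mu = sigma**2 * grad_mu              # F_mu^{-1} = sigma^2
nat_grad_sigma = (sigma**2 / 2) * grad_sigma  # F_sigma^{-1} = sigma^2 / 2

print(nat_grad_mu)     # [0.01 4.  ] : cautious where sigma is small
print(nat_grad_sigma)  # [0.005 2.   ]
```

The same vanilla gradient becomes a tiny step where the policy is confident (σ = 0.1) and a large step where it is uncertain (σ = 2.0).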
Variants & Extensions
1. Trust Region Policy Optimization (TRPO)
- Constrains KL divergence between old and new policy
- Uses natural gradient to respect the constraint
- Sample-efficient, but slower to compute
2. PPO (Proximal Policy Optimization)
- Simpler approximation to TRPO
- Uses clipped objective instead of explicit trust region
- Much easier to implement, nearly as effective
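The clipped objective is small enough to sketch directly; a minimal per-sample version (function name and the ε = 0.2 default are illustrative, matching the common PPO setting):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate (to be maximized), per sample."""
    ratio = np.exp(logp_new - logp_old)  # probability ratio pi_new / pi_old
    # Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# A ratio of 1.5 with positive advantage is clipped down to 1.2 * adv
print(ppo_clip_objective(np.log(1.5), 0.0, 1.0))  # 1.2
```

Unlike TRPO, no Fisher-vector products or conjugate gradient are needed; the clip plays the role of the trust region.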
3. Damped Natural Gradient
- Adds a damping term: F + λI (Tikhonov regularization)
- Improves stability and condition number
- Commonly used in practice
When to Use Natural Gradient
✓ Use when:
- Sample efficiency is critical
- Policy changes are large and destabilizing
- Working with trust region methods (TRPO)
- Parameter initialization is poor
✗ Don’t use when:
- Computational cost is prohibitive (without approximations)
- Standard gradient descent works well (simpler is better)
- Data is very noisy (first-order methods already struggle)
Connections
- Foundation for: Trust Region Policy Optimization (TRPO), PPO
- Related to: Policy Gradient Methods, Fisher Information
- Used in: Actor-Critic, Policy Gradient Theorem
- Appears in: Deep Reinforcement Learning, Optimization
Key References
- Kakade, S. (2001). A Natural Policy Gradient. NIPS. (Original paper on natural policy gradients)
- Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML. (TRPO: natural gradients with a trust region)
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv. (PPO: simpler approximation to natural gradient methods)
- Martens, J., & Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML. (K-FAC: efficient Fisher approximation for neural networks)