Natural Policy Gradient (NPG)

Definition

Natural Policy Gradient is an improvement over vanilla policy gradient that accounts for the geometry of the policy distribution space. Instead of moving in the direction of steepest parameter change, it moves in the direction of steepest policy change (as measured by KL divergence).


Motivation: The Problem with Vanilla Gradients

Vanilla (Steepest Ascent in Parameter Space)

The standard policy gradient update is:

$$\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k)$$

Problem: The step size depends on how much a parameter change affects the policy itself.

Example: For two different parameterizations of the same policy (e.g., an action probability written as $\sigma(\theta)$ versus a rescaled $\sigma(10\theta)$), the same parameter change $\Delta\theta$ can cause vastly different policy changes.

Result: Training is unstable, hyperparameter-sensitive, and slow to converge.
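
A tiny numeric sketch of this sensitivity, using a hypothetical Bernoulli action probability under two parameterizations, $\sigma(\theta)$ and $\sigma(10\theta)$ (the specific parameterizations are illustrative assumptions):

```python
import math

def kl_bernoulli(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Two parameterizations of the same Bernoulli policy:
#   A: p = sigmoid(theta)        B: p = sigmoid(10 * theta)
# The same parameter step d_theta moves the policy very differently.
theta, d_theta = 0.0, 0.1
kl_a = kl_bernoulli(sigmoid(theta), sigmoid(theta + d_theta))
kl_b = kl_bernoulli(sigmoid(10 * theta), sigmoid(10 * (theta + d_theta)))
print(kl_a, kl_b)  # kl_b is roughly 100x larger: same step, far bigger policy change
```

A fixed learning rate that is safe for parameterization A can be wildly destabilizing for B, even though both describe the same family of policies.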

Natural Gradient (Steepest Ascent in Policy Space)

Instead, move in the direction that achieves the steepest policy improvement per unit of KL divergence:

$$\theta_{k+1} = \theta_k + \alpha \, F(\theta_k)^{-1} \nabla_\theta J(\theta_k)$$

where $F(\theta)$ is the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top\right]$$


The Fisher Information Matrix

Definition

The Fisher Information Matrix (in the policy gradient context) is:

$$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top\right]$$

Dimensions: $n \times n$ matrix, where $n$ = number of policy parameters.

Interpretation

  • Positive semi-definite: Measures the curvature of the log-likelihood
  • Large eigenvalues: Directions where small parameter changes cause large policy changes
  • Small eigenvalues: Directions where large parameter changes cause small policy changes

Connection to KL Divergence

The Fisher Information Matrix is the Hessian of the KL divergence, evaluated at the current parameters:

$$F(\theta) = \nabla^2_{\theta'} \, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta'}\big)\Big|_{\theta' = \theta}
\qquad\Longrightarrow\qquad
D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta + \Delta\theta}\big) \approx \tfrac{1}{2}\, \Delta\theta^\top F(\theta)\, \Delta\theta$$

This is the local quadratic approximation to KL divergence near $\theta$.
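
A quick numeric check of this quadratic approximation, using the standard Fisher information of a univariate Gaussian's std parameter, $F_\sigma = 2/\sigma^2$ (the shift size is an illustrative choice):

```python
import math

# Check that 0.5 * d^T F d matches the exact KL divergence for a small
# shift d of the std parameter of N(0, sigma^2).
def kl_gauss_std(sigma, sigma2):
    """KL( N(0, sigma^2) || N(0, sigma2^2) ), closed form."""
    return math.log(sigma2 / sigma) + sigma**2 / (2 * sigma2**2) - 0.5

sigma, d = 1.0, 0.01
exact = kl_gauss_std(sigma, sigma + d)
quadratic = 0.5 * (2 / sigma**2) * d**2   # 0.5 * F_sigma * d^2
print(exact, quadratic)  # agree to ~1e-6 for this small step
```

For larger shifts the quadratic model and the exact KL diverge, which is exactly why trust-region methods keep steps small.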


Natural Gradient Update Rule

Standard Form

$$\theta_{k+1} = \theta_k + \alpha \, F(\theta_k)^{-1} \nabla_\theta J(\theta_k)$$

Interpretation

  • $\nabla_\theta J(\theta)$: Direction of steepest ascent in parameter space
  • $F(\theta)^{-1}$: Rescaling matrix that converts parameter-space steps into policy-space steps
  • Effect: Large steps in directions that barely change the policy, small steps in directions that dramatically change the policy

Practical Advantages

1. Parameterization Invariance

The natural gradient is invariant to how you parameterize the policy. Whether you use:

  • Different distribution parameterizations (e.g., std $\sigma$ vs. log-std $\log \sigma$ for a Gaussian)
  • Different neural network architectures

The natural gradient produces the same policy update (though different parameter updates).

Vanilla gradient: Different for each parameterization
Natural gradient: Same policy update regardless of parameterization
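
A one-parameter sketch of this invariance, using a hypothetical Bernoulli policy that maximizes $J = \mathbb{E}[a] = p$; the Fisher information of a Bernoulli is $1/(p(1-p))$ in $p$-space and $p(1-p)$ in logit-space:

```python
# Parameterization A uses p directly; B uses the logit theta = log(p/(1-p)).
alpha, p = 0.1, 0.8

# A: grad J = 1, Fisher = 1/(p(1-p)), natural grad = p(1-p).
dp_natural_A = alpha * (p * (1 - p))

# B: grad J = p(1-p), Fisher = p(1-p), natural grad = 1, so d_theta = alpha;
# the induced first-order policy change is sigmoid'(theta) * alpha = p(1-p) * alpha.
dp_natural_B = alpha * (p * (1 - p))

# Vanilla gradient, by contrast, induces different policy changes:
dp_vanilla_A = alpha * 1.0                    # step directly in p
dp_vanilla_B = alpha * (p * (1 - p)) ** 2     # logit step mapped back to p

print(dp_natural_A == dp_natural_B)  # True: identical policy update
print(dp_vanilla_A, dp_vanilla_B)    # different policy updates
```

To first order, the natural gradient moves the policy identically in both parameterizations, while the vanilla gradient does not.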

2. Better Convergence

  • Converges faster (fewer iterations)
  • More stable (less sensitive to hyperparameters)
  • Larger effective step sizes

3. Automatic Step Size Scaling

The Fisher matrix automatically scales step sizes based on:

  • How sensitive the policy is to parameter changes
  • The variance in the gradient estimates

Computational Considerations

Computing the Fisher Matrix

Full approach (expensive):

  1. Compute $g_i = \nabla_\theta \log \pi_\theta(a_i \mid s_i)$ for each sample
  2. Compute the outer products $g_i g_i^\top$
  3. Average: $\hat{F} = \frac{1}{N} \sum_{i=1}^{N} g_i g_i^\top$
  4. Invert: $\hat{F}^{-1}$ (requires $O(n^3)$ operations for $n$-dimensional parameters)

Computational cost: $O(n^3)$ time and $O(n^2)$ memory, which is prohibitive for high-dimensional policies (neural networks).
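
The steps above, sketched on a toy problem; the random `scores` matrix stands in for real per-sample gradients $\nabla_\theta \log \pi_\theta(a_i \mid s_i)$, and the damping constant is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 1000
scores = rng.normal(size=(N, n))      # stand-in: one grad-log-prob per sample

# Steps 2-3: outer products, averaged: F_hat = (1/N) sum_i g_i g_i^T
F_hat = scores.T @ scores / N         # n x n, costs O(N n^2)

# Step 4: invert / solve, O(n^3); a small damping term keeps F_hat
# well-conditioned when it is rank-deficient
g = rng.normal(size=n)                # stand-in vanilla policy gradient
natural_grad = np.linalg.solve(F_hat + 1e-3 * np.eye(n), g)
print(natural_grad.shape)             # (5,)
```

For $n$ in the millions (neural network policies), neither the $n \times n$ matrix nor the solve fits in a practical budget, which motivates the approximations below.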

Efficient Approximations

1. Conjugate Gradient Method

Instead of explicitly computing $F^{-1}$, solve the linear system:

$$F x = \nabla_\theta J(\theta)$$

using Conjugate Gradient (20-50 iterations typically suffice), where each iteration needs only a Fisher-vector product $F v$.

Cost: $O(kn)$ per update for $k$ CG iterations (much better than $O(n^3)$)
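
An illustrative matrix-free CG solver: it touches the Fisher only through a product function `fvp(v)`. Here an explicit SPD matrix stands in for the autodiff-based Fisher-vector product used in practice:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=20, tol=1e-14):
    """Solve F x = g given only Fisher-vector products fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                # residual g - F x (x starts at 0)
    p = r.copy()                # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rr / (p @ Fp)
        x += step * p
        r -= step * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Toy check with a stand-in SPD "Fisher"
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
F = A @ A.T + 1e-2 * np.eye(6)
g = rng.normal(size=6)
x = conjugate_gradient(lambda v: F @ v, g, iters=50)
print(np.allclose(F @ x, g, atol=1e-5))
```

The key point: nowhere is $F$ inverted or even stored; only products $Fv$ are needed, which autodiff provides in roughly the cost of a few backward passes.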

2. Diagonal Fisher

Approximate $F$ as diagonal:

$$F_{\mathrm{diag}} = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta \odot \nabla_\theta \log \pi_\theta\right]$$

where $\odot$ is element-wise multiplication.

Cost: $O(n)$ (trivial to invert)
Trade-off: Less accurate but much faster
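
A minimal sketch of the diagonal variant, again with random vectors standing in for per-sample score gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 5))         # stand-in per-sample grad log pi

# Keep only E[g ⊙ g]; all cross-terms of the Fisher are dropped
F_diag = (scores * scores).mean(axis=0)     # O(n) storage

g = rng.normal(size=5)
natural_grad = g / (F_diag + 1e-3)          # "inversion" is elementwise division
print(natural_grad.shape)                   # (5,)
```

Dropping the off-diagonal terms ignores correlations between parameters, which is exactly the accuracy that K-FAC (below) tries to recover cheaply.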

3. K-FAC (Kronecker-Factored Approximation)

For neural networks, factor the Fisher as a Kronecker product of smaller matrices.

Cost: Between $O(n)$ and $O(n^3)$
Accuracy: Good approximation for neural networks


Natural Policy Gradient in Practice

Common Implementation: Trust Region Methods

TRPO (Trust Region Policy Optimization) uses the natural gradient with a trust region constraint:

$$\max_\theta \; \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a)\right]
\quad \text{s.t.} \quad \mathbb{E}\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big)\right] \le \delta$$

The trust region constraint ensures the policy doesn’t change too much (limits KL divergence).

Solution: Use natural gradient with step size chosen to maintain the KL constraint.
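
Under the quadratic KL model, the step $\Delta\theta = \beta F^{-1} g$ hits the trust-region radius exactly when $\beta = \sqrt{2\delta / (g^\top F^{-1} g)}$. A numeric check, with a stand-in SPD matrix for the Fisher:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
F = A @ A.T + 1e-2 * np.eye(4)    # stand-in Fisher (symmetric positive definite)
g = rng.normal(size=4)            # stand-in policy gradient
delta = 0.01                      # KL trust-region radius

x = np.linalg.solve(F, g)         # F^{-1} g (CG in practice)
beta = np.sqrt(2 * delta / (g @ x))
d = beta * x                      # natural gradient step at the KL boundary
print(0.5 * d @ F @ d)            # quadratic KL model = delta = 0.01
```

TRPO follows this with a backtracking line search, since the quadratic model only approximates the true KL.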

Pseudocode

def natural_policy_gradient_update(trajectories, policy, value_fn,
                                   theta, alpha, damping=0.01):
    """Compute a natural policy gradient step using conjugate gradient."""

    # Vanilla policy gradient estimate
    grad = compute_policy_gradient(trajectories, policy, value_fn)

    # Fisher-vector products via Hessian-vector products of the KL
    # (without explicitly forming the Fisher matrix)
    def fisher_vector_product(v):
        # Double backward through the KL gives F @ v; the damping term
        # keeps the linear system well-conditioned
        return compute_hessian_vector_product(policy, v) + damping * v

    # Solve F x = grad using conjugate gradient
    natural_grad = conjugate_gradient(fisher_vector_product, grad,
                                      iterations=20)

    # Gradient ascent step along the natural direction
    theta_new = theta + alpha * natural_grad
    return theta_new

Comparison: Vanilla vs Natural Gradient

| Aspect | Vanilla Gradient | Natural Gradient |
|---|---|---|
| Update | $\theta \leftarrow \theta + \alpha \nabla_\theta J$ | $\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J$ |
| Convergence | Slower | Faster |
| Stability | Parameterization-dependent | Parameterization-invariant |
| Computation | $O(n)$ | $O(n)$ to $O(n^3)$ (depends on approximation) |
| Hyperparameter sensitivity | High | Lower |
| Implementation | Simple | More complex |

Example: Gaussian Policy

Vanilla Gradient

For a diagonal Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and std $\sigma$:

$$\nabla_\mu \log \pi = \frac{a - \mu}{\sigma^2}, \qquad \nabla_\sigma \log \pi = \frac{(a - \mu)^2 - \sigma^2}{\sigma^3}$$

The vanilla gradient updates each parameter with the same learning rate, regardless of $\sigma$.

Natural Gradient

The Fisher matrix for a Gaussian has a natural block-diagonal structure, with $F_\mu = 1/\sigma^2$ and $F_\sigma = 2/\sigma^2$ per dimension. The natural gradient:

  • Scales $\mu$ updates by $\sigma^2$ (the covariance matters!)
  • Scales $\sigma$ updates by $\sigma^2 / 2$

Effect: Automatically adjusts step sizes based on policy uncertainty.
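
A tiny sketch of this rescaling using those standard univariate Gaussian Fisher entries (the gradient values are illustrative):

```python
# Fisher entries for N(mu, sigma^2): F_mu = 1/sigma^2, F_sigma = 2/sigma^2.
sigma = 0.1
grad_mu, grad_sigma = 1.0, 1.0           # illustrative vanilla gradients

nat_mu = sigma**2 * grad_mu              # F_mu^{-1} * grad_mu
nat_sigma = (sigma**2 / 2) * grad_sigma  # F_sigma^{-1} * grad_sigma
print(nat_mu, nat_sigma)                 # tiny steps when sigma is small
```

When the policy is nearly deterministic (small $\sigma$), the natural gradient automatically shrinks the mean update, because even small mean shifts are large policy changes in KL terms.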


Variants & Extensions

1. Trust Region Policy Optimization (TRPO)

  • Constrains KL divergence between old and new policy
  • Uses natural gradient to respect the constraint
  • Sample-efficient, but slower to compute

2. PPO (Proximal Policy Optimization)

  • Simpler approximation to TRPO
  • Uses clipped objective instead of explicit trust region
  • Much easier to implement, nearly as effective

3. Damped Fisher

  • Adds a damping term before inversion: $(F + \lambda I)^{-1}$ (Tikhonov regularization)
  • Improves stability and condition number
  • Commonly used in practice

When to Use Natural Gradient

✓ Use when:

  • Sample efficiency is critical
  • Policy changes are large and destabilizing
  • Working with trust region methods (TRPO)
  • Parameter initialization is poor

✗ Don’t use when:

  • Computational cost is prohibitive (without approximations)
  • Standard gradient descent works well (simpler is better)
  • Data is very noisy (first-order methods already struggle)

Connections


Key References

  1. Kakade, S. (2001). A Natural Policy Gradient. NIPS.

    • Original paper on natural policy gradients
  2. Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML.

    • TRPO: Uses natural gradients with trust regions
  3. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv.

    • PPO: Simpler approximation to natural gradient methods
  4. Martens, J., & Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML.

    • K-FAC: Efficient Fisher approximation for neural networks