Natural Policy Gradient (NPG)
Definition
Natural Policy Gradient is an improvement over vanilla policy gradient that accounts for the geometry of the policy distribution space. Instead of moving in the direction of steepest parameter change, it moves in the direction of steepest policy change (as measured by KL divergence).
Motivation: The Problem with Vanilla Gradients
Vanilla (Steepest Ascent in Parameter Space)
The standard policy gradient update is steepest ascent in parameter space:

θ ← θ + α ∇_θ J(θ), where ∇_θ J(θ) = 𝔼[∇_θ log π_θ(a|s) · A(s,a)]
Problem: The step size depends on how much a parameter change affects the policy itself.
Example: For different parameterizations of the same policy (e.g., a Gaussian whose scale is parameterized as σ vs. log σ), the same parameter change can cause vastly different policy changes.
Result: Training is unstable, hyperparameter-sensitive, and slow to converge.
Natural Gradient (Steepest Ascent in Policy Space)
Instead, move in the direction that achieves the steepest policy improvement per unit of KL divergence:

θ ← θ + α F(θ)⁻¹ ∇_θ J(θ)

where F(θ) is the Fisher Information Matrix (defined below).
The Fisher Information Matrix
Definition
The Fisher Information Matrix (or Fisher Information in the policy gradient context) is:

F(θ) = 𝔼_{s,a∼π_θ}[∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)ᵀ]

Dimensions: n × n matrix, where n = number of policy parameters.
Interpretation
- Positive semi-definite: Measures the curvature of the log-likelihood
- Large eigenvalues: Directions where small parameter changes cause large policy changes
- Small eigenvalues: Directions where large parameter changes cause small policy changes
Connection to KL Divergence
The Fisher Information Matrix is the Hessian of the KL divergence at the current parameters:

D_KL(π_θ ‖ π_{θ+Δθ}) ≈ ½ Δθᵀ F(θ) Δθ

This is the local quadratic approximation to the KL divergence near θ.
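A quick numerical check of this quadratic approximation, for a 1-D Gaussian where both the KL divergence and the Fisher matrix, F = diag(1/σ², 2/σ²) in (μ, σ) coordinates, are known in closed form (the numbers here are arbitrary illustration values):

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, sigma = 0.0, 0.5
d_mu, d_sigma = 1e-3, 1e-3

# Quadratic form 0.5 * dtheta^T F dtheta with F = diag(1/sigma^2, 2/sigma^2)
quad = 0.5 * (d_mu**2 / sigma**2 + 2 * d_sigma**2 / sigma**2)

exact = kl_gaussian(mu, sigma, mu + d_mu, sigma + d_sigma)
print(quad, exact)  # agree to leading order for small perturbations
```

The agreement degrades for large Δθ, which is exactly why trust region methods keep the step inside a small KL ball.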
Natural Gradient Update Rule
Standard Form

θ_{k+1} = θ_k + α F(θ_k)⁻¹ ∇_θ J(θ_k)
Interpretation
- ∇_θ J(θ): Direction of steepest ascent in parameter space
- F(θ)⁻¹: Rescaling matrix that converts the parameter-space gradient into a policy-space direction
- Effect: Large steps in directions that barely change the policy, small steps in directions that dramatically change the policy
Practical Advantages
1. Parameterization Invariance
The natural gradient is invariant to how you parameterize the policy. Whether you use:
- σ vs. log σ vs. other parameterizations of the same distribution
- Different neural network architectures
The natural gradient produces the same policy update (though different parameter updates).
Vanilla gradient: Different for each parameterization
Natural gradient: Same policy update regardless of parameterization
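A toy check of this invariance, sketched for a 1-D Gaussian with a hypothetical objective J(σ) = −(σ − 1)² and the known scale Fisher F_σ = 2/σ² (all numbers here are illustration values, not a real policy):

```python
import math

def dJ_dsigma(sigma):
    # Gradient of the toy objective J(sigma) = -(sigma - 1)^2
    return -2.0 * (sigma - 1.0)

sigma, alpha = 0.5, 1e-3

# Parameterization A: theta = sigma.  Fisher for the Gaussian scale: F = 2 / sigma^2
g_sigma = dJ_dsigma(sigma)
vanilla_A = alpha * g_sigma
natural_A = alpha * (sigma**2 / 2.0) * g_sigma

# Parameterization B: theta = rho = log(sigma).  Chain rule: dJ/drho = sigma * dJ/dsigma
# Fisher transforms as F_rho = (dsigma/drho)^2 * F_sigma = 2
g_rho = sigma * g_sigma
vanilla_B_in_sigma = sigma * math.exp(alpha * g_rho) - sigma        # induced sigma change
natural_B_in_sigma = sigma * math.exp(alpha * g_rho / 2.0) - sigma

print(vanilla_A, vanilla_B_in_sigma)   # vanilla: the two updates disagree
print(natural_A, natural_B_in_sigma)   # natural: same sigma change (to first order)
```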
2. Better Convergence
- Converges faster (fewer iterations)
- More stable (less sensitive to hyperparameters)
- Larger effective step sizes
3. Automatic Step Size Scaling
The Fisher matrix automatically scales step sizes based on:
- How sensitive the policy is to parameter changes
- The variance in the gradient estimates
Computational Considerations
Computing the Fisher Matrix
Full approach (expensive):
- Compute gᵢ = ∇_θ log π_θ(aᵢ|sᵢ) for each sample
- Compute outer products gᵢgᵢᵀ, where gᵢ ∈ ℝⁿ
- Average: F ≈ (1/N) Σᵢ gᵢgᵢᵀ
- Invert: F⁻¹ (requires O(n³) operations for n-dimensional parameters)
Computational cost: O(n³), which is prohibitive for high-dimensional policies (neural networks).
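A minimal numpy sketch of this construction, using random stand-in vectors in place of real score gradients ∇_θ log π_θ(a|s):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 10_000  # n parameters, N samples

# Stand-in score vectors; in practice row i is grad_theta log pi(a_i | s_i)
G = rng.normal(size=(N, n))

# Empirical Fisher: average of outer products -> n x n PSD matrix
F = G.T @ G / N               # O(N n^2) to build
F += 1e-3 * np.eye(n)         # small damping keeps it invertible

# Natural gradient direction: solve F x = mean gradient (O(n^3) direct solve)
nat_grad = np.linalg.solve(F, G.mean(axis=0))
print(F.shape, nat_grad.shape)
```

For neural-network-sized n, both storing F and the O(n³) solve are infeasible, which motivates the approximations below.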
Efficient Approximations
1. Conjugate Gradient Method
Instead of explicitly computing F⁻¹, solve the linear system

F x = ∇_θ J(θ)

using conjugate gradient (20–50 iterations typically suffice); each iteration needs only a Fisher-vector product, never the full matrix.
Cost: O(kn) per update for k iterations (much better than O(n³))
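A self-contained sketch of conjugate gradient driven purely by matrix-vector products, demonstrated on a small random SPD matrix standing in for the Fisher:

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=20, tol=1e-10):
    """Solve A x = b for SPD A, using only matrix-vector products."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - A x (x starts at 0)
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 5 * np.eye(5)   # SPD stand-in for the (damped) Fisher matrix
g = rng.normal(size=5)

x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g, atol=1e-6))  # True
```

In an NPG implementation, `matvec` would be a KL Hessian-vector product computed with double backprop, so F is never materialized.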
2. Diagonal Fisher
Approximate F as diagonal:

F_diag ≈ diag(𝔼[∇_θ log π_θ(a|s) ⊙ ∇_θ log π_θ(a|s)])

where ⊙ is element-wise multiplication.
Cost: O(n) (trivial to invert)
Trade-off: Less accurate but much faster
3. K-FAC (Kronecker-Factored Approximation)
For neural networks, factor the Fisher as a Kronecker product of smaller matrices.
Cost: Between O(n) (diagonal) and O(n³) (full Fisher)
Accuracy: Good approximation for neural networks
Natural Policy Gradient in Practice
Common Implementation: Trust Region Methods
TRPO (Trust Region Policy Optimization) uses the natural gradient with a trust region constraint:

maximize_θ 𝔼[ (π_θ(a|s) / π_{θ_old}(a|s)) · A(s,a) ]  subject to  D_KL(π_{θ_old} ‖ π_θ) ≤ δ

The trust region constraint ensures the policy doesn't change too much (it bounds the KL divergence).
Solution: Use natural gradient with step size chosen to maintain the KL constraint.
Pseudocode
def natural_policy_gradient_update(trajectories, policy, value_fn,
                                   alpha=0.01, damping=0.01):
    """Compute a natural policy gradient step using conjugate gradient."""
    # Vanilla policy gradient (advantage-weighted score function)
    grad = compute_policy_gradient(trajectories, policy, value_fn)

    # Fisher-vector products via KL Hessian-vector products
    # (never forms the Fisher matrix explicitly)
    def fisher_vector_product(v):
        # Double backprop: H v, where H is the Hessian of the KL divergence
        return compute_hessian_vector_product(policy, v) + damping * v

    # Solve F x = grad using conjugate gradient
    natural_grad = conjugate_gradient(fisher_vector_product, grad,
                                      iterations=20)

    # Gradient ascent step along the natural direction
    theta_new = policy.parameters() + alpha * natural_grad
    return theta_new

Comparison: Vanilla vs Natural Gradient
| Aspect | Vanilla Gradient | Natural Gradient |
|---|---|---|
| Update | θ ← θ + α∇_θJ(θ) | θ ← θ + αF(θ)⁻¹∇_θJ(θ) |
| Convergence | Slower | Faster |
| Stability | Parameterization-dependent | Parameterization-invariant |
| Computation | O(n) | O(n) to O(n³) (depends on approximation) |
| Hyperparameter sensitivity | High | Lower |
| Implementation | Simple | More complex |
Example: Gaussian Policy
Vanilla Gradient
For a diagonal Gaussian policy with mean μ and standard deviation σ:

∇_μ log π(a|s) = (a − μ)/σ²,   ∇_σ log π(a|s) = ((a − μ)² − σ²)/σ³

Update each parameter separately with the same learning rate.
Natural Gradient
The Fisher matrix for a diagonal Gaussian is itself diagonal, with entries 1/σ² for each mean and 2/σ² for each standard deviation. The natural gradient therefore:
- Scales μ updates by σ² (policy uncertainty matters!)
- Scales σ updates by σ²/2
Effect: Automatically adjusts step sizes based on policy uncertainty.
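Because the Gaussian Fisher is diagonal, the natural gradient reduces to a per-coordinate rescaling. A sketch with a hypothetical 2-D action and arbitrary gradient values:

```python
import numpy as np

# Diagonal Gaussian policy in (mu, sigma) coordinates:
# F = diag(1/sigma^2, ..., 2/sigma^2, ...), so F^{-1} rescales per coordinate.
sigma = np.array([0.1, 2.0])       # one confident dim, one uncertain dim
grad_mu = np.array([1.0, 1.0])     # identical vanilla gradients
grad_sigma = np.array([1.0, 1.0])

nat_grad_mu = sigma**2 * grad_mu              # F_mu^{-1} = sigma^2
nat_grad_sigma = (sigma**2 / 2) * grad_sigma  # F_sigma^{-1} = sigma^2 / 2

print(nat_grad_mu)     # [0.01 4.  ] : cautious where sigma is small
print(nat_grad_sigma)  # [0.005 2.   ]
```

The same vanilla gradient becomes a tiny step where the policy is confident (σ = 0.1) and a large step where it is uncertain (σ = 2.0).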
Variants & Extensions
1. Trust Region Policy Optimization (TRPO)
- Constrains KL divergence between old and new policy
- Uses natural gradient to respect the constraint
- Sample-efficient, but slower to compute
2. PPO (Proximal Policy Optimization)
- Simpler approximation to TRPO
- Uses clipped objective instead of explicit trust region
- Much easier to implement, nearly as effective
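The clipped objective is small enough to sketch directly; a minimal per-sample version (function name and the ε = 0.2 default are illustrative, matching the common PPO setting):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate (to be maximized), per sample."""
    ratio = np.exp(logp_new - logp_old)  # probability ratio pi_new / pi_old
    # Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# A ratio of 1.5 with positive advantage is clipped down to 1.2 * adv
print(ppo_clip_objective(np.log(1.5), 0.0, 1.0))  # 1.2
```

Unlike TRPO, no Fisher-vector products or conjugate gradient are needed; the clip plays the role of the trust region.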
3. Damped Natural Gradient
- Adds a damping term: F + λI (Tikhonov regularization)
- Improves stability and condition number
- Commonly used in practice
When to Use Natural Gradient
✓ Use when:
- Sample efficiency is critical
- Policy changes are large and destabilizing
- Working with trust region methods (TRPO)
- Parameter initialization is poor
✗ Don’t use when:
- Computational cost is prohibitive (without approximations)
- Standard gradient descent works well (simpler is better)
- Data is very noisy (first-order methods already struggle)
Connections
- Foundation for: Trust Region Policy Optimization (TRPO), PPO
- Related to: Policy Gradient Methods, Fisher Information
- Used in: Actor-Critic, Policy Gradient Theorem
- Appears in: Deep Reinforcement Learning, Optimization
Key References
- Kakade, S. (2001). A Natural Policy Gradient. NIPS. (Original paper on natural policy gradients)
- Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML. (TRPO: natural gradients with a trust region)
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv. (PPO: simpler approximation to natural gradient methods)
- Martens, J., & Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML. (K-FAC: efficient Fisher approximation for neural networks)