Fisher Information
Definition
Fisher Information (Matrix)
The Fisher Information Matrix $F(\theta)$ measures how much information an observable sample $x$ carries about the parameter $\theta$ of a probability distribution $p(x;\theta)$. In RL, the distribution is the policy $\pi_\theta(a|s)$, and $F(\theta)$ quantifies the sensitivity of the policy distribution to changes in $\theta$. It is defined as the expected outer product of the score (the gradient of the log-likelihood), and equivalently as the local curvature (Hessian) of the KL divergence around $\theta$. It is the Riemannian metric tensor that turns vanilla gradients into natural gradients.
Intuition
A Ruler for Distribution Space
The vanilla gradient asks “which way moves the parameters fastest?”, but the answer depends entirely on how you happened to parameterize the policy: different parameterizations of the same policy family give wildly different steps. The Fisher matrix re-scales the geometry so the question becomes “which way moves the policy distribution fastest?”, measured by KL divergence.
Think of $F$ as a curvature-aware ruler: along a direction where a tiny parameter nudge swings the action distribution a lot (high information), $F$ has a large eigenvalue, so the natural gradient takes a small step there; along a flat direction where parameters barely matter, the eigenvalue of $F$ is small and the step is large. This makes the resulting update invariant to reparameterization (to first order in the step size).
Mathematical Formulation
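The score function is the gradient of the log-likelihood, $\nabla_\theta \log \pi_\theta(a|s)$. The Fisher Information Matrix is the covariance of the score (its mean is zero):
$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top \right]$$
where:
- $F(\theta)$: Fisher Information Matrix, a $d \times d$ matrix ($d$ = number of policy parameters), symmetric and positive semi-definite
- $\nabla_\theta \log \pi_\theta(a|s)$: the score vector (also the log-derivative used in policy gradients)
- $d^{\pi_\theta}$: the state distribution induced by following $\pi_\theta$
- $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}$: expectation under the policy’s own samples (the outer product averages over actions drawn from $\pi_\theta$)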
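As a concrete illustration of the definition, the following minimal sketch computes the Fisher matrix of a softmax policy in a single state, taking the parameters to be the action logits themselves. That parameterization, the logit values, and the helper names are assumptions made for the example. In this special case the expected outer product of the score has the known closed form $\mathrm{diag}(p) - p p^\top$, which the code uses as a cross-check.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, action):
    """Score of a softmax policy w.r.t. its logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def fisher_matrix(logits):
    """Exact Fisher: sum over actions of pi(a) * score(a) score(a)^T."""
    probs = softmax(logits)
    n = len(logits)
    F = [[0.0] * n for _ in range(n)]
    for a, p_a in enumerate(probs):
        s = score(logits, a)
        for i in range(n):
            for j in range(n):
                F[i][j] += p_a * s[i] * s[j]
    return F

theta = [0.5, -1.0, 2.0]  # hypothetical logits for three actions
probs = softmax(theta)
F = fisher_matrix(theta)

# Closed form for softmax logits: F = diag(p) - p p^T
closed_form = [[(probs[i] if i == j else 0.0) - probs[i] * probs[j]
                for j in range(3)] for i in range(3)]
```

For a neural policy the expectation would instead be estimated from sampled states and actions, with the score obtained by automatic differentiation.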
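Under regularity conditions, the score has zero mean, $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\right] = 0$, so $F(\theta)$ is exactly the covariance of the score. There is also an equivalent negative-Hessian form:
$$F(\theta) = -\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot|s)}\left[ \nabla_\theta^2 \log \pi_\theta(a|s) \right]$$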
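Both facts follow from a short calculation. The sketch below is written for a discrete action space and assumes that differentiation and summation can be interchanged (the regularity condition mentioned above). It abbreviates $\pi_\theta(a|s)$ as $\pi_\theta$ and works in a fixed state $s$; averaging over $s \sim d^{\pi_\theta}$ then gives the full matrix.

$$
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta\right]
  &= \sum_a \pi_\theta \,\frac{\nabla_\theta \pi_\theta}{\pi_\theta}
   = \nabla_\theta \sum_a \pi_\theta
   = \nabla_\theta 1 = 0, \\
\nabla_\theta^2 \log \pi_\theta
  &= \frac{\nabla_\theta^2 \pi_\theta}{\pi_\theta}
   - \nabla_\theta \log \pi_\theta \,\nabla_\theta \log \pi_\theta^\top, \\
\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta^2 \log \pi_\theta\right]
  &= \underbrace{\nabla_\theta^2 \sum_a \pi_\theta}_{=\,0}
   - \mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \,\nabla_\theta \log \pi_\theta^\top\right].
\end{aligned}
$$

The last line says that the expected Hessian of the log-likelihood is the negative of the expected score outer product, which is the negative-Hessian form of the Fisher matrix.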
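Connection to KL divergence: $F(\theta)$ is the second-order (Hessian) term of the KL divergence between nearby policies:
$$\mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[ D_{\mathrm{KL}}\big(\pi_\theta(\cdot|s)\,\|\,\pi_{\theta+\delta}(\cdot|s)\big) \right] = \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta + O(\|\delta\|^3)$$
where:
- $\delta$: a small parameter displacement
- the first-order term vanishes because KL is minimized (zero) at $\delta = 0$, leaving $F(\theta)$ as the local quadratic metric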
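The quadratic approximation can be checked numerically. The sketch below again assumes a softmax policy parameterized directly by its logits in a single state (an assumption for illustration, as are the logit values and the perturbation direction). It compares the exact KL divergence with $\tfrac{1}{2}\delta^\top F \delta$ for shrinking perturbations; the ratio should approach 1 as $\delta \to 0$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def half_quadratic_form(probs, delta):
    """0.5 * delta^T F delta with F = diag(p) - p p^T (softmax-logit Fisher).

    delta^T F delta equals the variance of delta under p, so F is never formed.
    """
    mean = sum(p * d for p, d in zip(probs, delta))
    second_moment = sum(p * d * d for p, d in zip(probs, delta))
    return 0.5 * (second_moment - mean * mean)

theta = [0.5, -1.0, 2.0]       # hypothetical logits
direction = [1.0, -2.0, 0.5]   # hypothetical perturbation direction
p = softmax(theta)

ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    delta = [eps * d for d in direction]
    q = softmax([t + d for t, d in zip(theta, delta)])
    exact = kl(p, q)
    approx = half_quadratic_form(p, delta)
    ratios.append(exact / approx)
    print(f"eps={eps:g}  KL={exact:.3e}  quadratic={approx:.3e}  ratio={ratios[-1]:.4f}")
```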
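This is precisely why $F(\theta)$ appears in the Natural Policy Gradient: preconditioning by $F(\theta)^{-1}$ yields the steepest-ascent direction in distribution (KL) space rather than raw parameter space:
$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$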
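A small worked example may make the effect of the preconditioner concrete. The sketch below uses a hypothetical two-action bandit whose policy is a Bernoulli distribution, parameterized either by the probability $p$ directly or by its logit. The rewards, the current $p$, and the learning rate are made-up values. The vanilla gradient step changes $p$ by very different amounts under the two parameterizations, while the natural gradient step changes it by the same amount up to second-order terms in the step size.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

r1, r0 = 1.0, 0.0   # hypothetical rewards for actions 1 and 0
p = 0.9             # current probability of action 1
lr = 0.01           # hypothetical learning rate
gap = r1 - r0       # dJ/dp for J(p) = p*r1 + (1-p)*r0

# Parameterization A: the parameter is p itself.
grad_A = gap
fisher_A = 1.0 / (p * (1.0 - p))   # Fisher of Bernoulli(p) w.r.t. p

# Parameterization B: the parameter is the logit, p = sigmoid(logit).
dp_dlogit = p * (1.0 - p)
grad_B = gap * dp_dlogit           # chain rule
fisher_B = p * (1.0 - p)           # Fisher of Bernoulli w.r.t. the logit

def delta_p_A(step):
    """Change in p after moving the parameter p by `step`."""
    return step

def delta_p_B(step):
    """Change in p after moving the logit by `step`."""
    logit = math.log(p / (1.0 - p))
    return sigmoid(logit + step) - p

# Resulting change in the policy (in p) under each update rule.
vanilla = (delta_p_A(lr * grad_A), delta_p_B(lr * grad_B))
natural = (delta_p_A(lr * grad_A / fisher_A), delta_p_B(lr * grad_B / fisher_B))

print("vanilla change in p (A, B):", vanilla)
print("natural change in p (A, B):", natural)
```

The agreement between the two natural-gradient steps is exact only in the limit of small steps; with a finite learning rate a small discrepancy remains.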
Key Properties / Variants
- Positive semi-definite & symmetric. Eigenvalues $\lambda_i \ge 0$; large eigenvalues mark “informative” directions where the policy is sensitive to $\theta$.
- Reparameterization invariance. The natural gradient produces the same policy update under any smooth reparameterization, unlike the vanilla gradient.
- Curvature = information. $F$ is simultaneously (i) the score covariance, (ii) the negative expected Hessian of the log-likelihood, and (iii) the Hessian of KL divergence: three identities for the same object.
- Cramér–Rao link. In statistics, $F^{-1}$ lower-bounds the covariance (the variance, in the scalar case) of any unbiased estimator of $\theta$; high information ⇒ tighter achievable estimates.
- Computational cost. Forming and inverting $F$ explicitly costs $O(d^2)$ memory and $O(d^3)$ time for $d$ parameters, which is prohibitive for neural policies. Practical schemes avoid this:
Algorithm: Fisher-Vector Product via Conjugate Gradient
─────────────────────────────────────────────────────────
Goal:  compute the natural gradient x = F^{-1} g without forming F
Input: vanilla gradient g = ∇_θ J(θ), damping λ
Define the Fisher-vector product (no explicit F):
  FVP(v):
    # use the KL-Hessian identity: F v = ∇_θ ( (∇_θ D_KL)^T v )
    kl   ← mean KL( π_old || π_θ ) over sampled states
    grad ← ∇_θ kl
    return ∇_θ ( grad · v ) + λ·v    # add damping λI for stability
# Solve (F + λI) x = g iteratively (≈20 CG steps), never inverting F
x ← ConjugateGradient(FVP, b = g, iters = 20)
return x    # x ≈ (F + λI)^{-1} g, the damped natural gradient direction
- Approximations. Diagonal Fisher is $O(d)$ in the number of parameters $d$ but crude; K-FAC (Kronecker-factored) gives a structured, more accurate approximation for neural nets; adding damping $\lambda I$ improves conditioning.
- Empirical vs true Fisher. The “true” Fisher samples actions $a \sim \pi_\theta(\cdot|s)$ from the current policy; the empirical Fisher uses the actions actually taken in the data. They coincide only when the model fits the data well, and they lead to different optimization behavior.
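The conjugate-gradient scheme in the algorithm box can be sketched in runnable form. Note the swap: instead of the KL-Hessian product via automatic differentiation, this stdlib-only sketch uses the equivalent score-based identity $F v = \mathbb{E}[s\,(s^\top v)]$ over sampled scores $s$, which also never forms $F$. The softmax-over-logits policy, the gradient vector `g`, the sample count, and the damping value are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_scores(logits, n_samples, rng):
    """Draw actions a ~ pi_theta and return their scores one_hot(a) - probs."""
    probs = softmax(logits)
    actions = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return [[(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]
            for a in actions]

def make_fvp(scores, damping):
    """Return v -> (F + damping*I) v using F v = mean_i s_i (s_i . v)."""
    n = len(scores)
    def fvp(v):
        out = [damping * x for x in v]
        for s in scores:
            c = dot(s, v) / n
            for k in range(len(out)):
                out[k] += c * s[k]
        return out
    return fvp

def conjugate_gradient(fvp, b, iters=20, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)   # residual b - A x for the initial guess x = 0
    p = list(b)   # first search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = fvp(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

rng = random.Random(0)
theta = [0.5, -1.0, 2.0]        # hypothetical logits
scores = sample_scores(theta, 2000, rng)
g = [0.3, -0.1, -0.2]           # hypothetical vanilla gradient
fvp = make_fvp(scores, damping=1e-2)
x = conjugate_gradient(fvp, g)  # x ~ (F + damping*I)^{-1} g
residual = [a - b for a, b in zip(fvp(x), g)]
```

Because the actions here are sampled from the current policy, this estimates the true Fisher; feeding in scores of logged actions instead would give the empirical Fisher. The damping term matters in this example: the softmax-logit Fisher is singular along the all-ones direction, so the undamped system would not be positive definite.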
Connections
- Preconditioner for: Natural Policy Gradient, Natural Gradient
- Built from: Log-derivative trick (the score function), Policy Gradient Theorem
- Geometry of: Steepest Descent in distribution space (vs parameter space)
- Underlies: Trust Region Policy Optimization (TRPO) (KL-constrained step), approximated away by PPO
- Contrast with: vanilla Gradient Ascent / Stochastic Gradient Descent (identity metric)
- Adaptive optimizers (Adam, AdaGrad) loosely mimic diagonal curvature scaling