Compatible Function Approximation

Definition

Compatible Function Approximation

In an Actor-Critic method, the critic $\hat{q}_w(s,a)$ is only an approximation of the true action-value function $q^\pi(s,a)$, so naively plugging it into the Policy Gradient Theorem generally introduces bias. Compatible function approximation specifies the conditions under which a learned critic can replace the true action-value function with no bias in the policy gradient. The critic must (1) be linear in the policy’s score features and (2) be fit to minimise mean-squared error against $q^\pi$. When both hold, the gradient $\nabla_\theta J(\theta)$ computed with $\hat{q}_w$ exactly equals the one computed with $q^\pi$.

Intuition

Why bias can vanish

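The policy gradient only ever “sees” the critic through the inner product $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \hat{q}_w(s,a)\big]$. The critic does not need to be globally correct; it only needs to be correct in the subspace spanned by the score functions $\nabla_\theta \log \pi_\theta(a \mid s)$.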

If the critic is linear in exactly those score features and its weights minimise the mean-squared error, then the remaining approximation error is orthogonal to the score functions. Orthogonal error contributes nothing to the inner product with the score functions, so it vanishes in expectation. This is the same orthogonality logic that makes least-squares residuals perpendicular to the regression features; here the “features” are the policy’s own score functions.
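
The least-squares analogy can be checked numerically. The NumPy sketch below uses made-up data (the feature matrix `X`, the target `y`, and the sample sizes are illustrative assumptions, not taken from this note): the fit is deliberately poor, yet the residual still has zero inner product with every feature column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, and a target that is NOT linear in them.
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Ordinary least-squares fit of y on the columns of X.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w

# The fit is far from perfect (large residual norm) ...
residual_norm = np.linalg.norm(residual)
# ... yet the residual is numerically orthogonal to every feature column.
feature_dot_residual = X.T @ residual

print("residual norm:", residual_norm)
print("features . residual:", feature_dot_residual)
```

In the compatible-critic setting the columns of `X` play the role of the score functions and `y` plays the role of the true action values.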

Mathematical Formulation

A critic $\hat{q}_w$ is compatible with policy $\pi_\theta$ if it satisfies two conditions.

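Condition 1 (compatible features). The critic’s gradient must match the score function:

$$\nabla_w \hat{q}_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$$

This forces a linear critic in the score features:

$$\hat{q}_w(s,a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s)$$

Condition 2 (MSE-optimal critic). The critic minimises mean-squared error against the true value:

$$w^* = \arg\min_w \; \mathbb{E}_{s \sim \mu^\pi,\, a \sim \pi_\theta}\Big[\big(q^\pi(s,a) - \hat{q}_w(s,a)\big)^2\Big]$$

Result (unbiased policy gradient). If both hold, then

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu^\pi,\, a \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a \mid s)\, \hat{q}_{w^*}(s,a)\Big]$$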

where:

  • $\hat{q}_w(s,a)$: the parametrised critic, with parameters $w$
  • $\nabla_\theta \log \pi_\theta(a \mid s)$: the score function of the policy (the Log-derivative trick term that appears in every policy gradient)
  • $\mu^\pi$: the On-Policy Distribution of states under $\pi_\theta$
  • $q^\pi(s,a)$: the true action-value function being approximated
  • $w^*$: critic weights at the MSE minimum

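Why it works (proof sketch). At the minimiser $w^*$, the gradient of the MSE with respect to $w$ is zero:

$$\mathbb{E}_{s \sim \mu^\pi,\, a \sim \pi_\theta}\Big[\big(q^\pi(s,a) - \hat{q}_{w^*}(s,a)\big)\, \nabla_w \hat{q}_{w^*}(s,a)\Big] = 0.$$

Substituting Condition 1, $\nabla_w \hat{q}_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, gives

$$\mathbb{E}_{s \sim \mu^\pi,\, a \sim \pi_\theta}\Big[\big(q^\pi(s,a) - \hat{q}_{w^*}(s,a)\big)\, \nabla_\theta \log \pi_\theta(a \mid s)\Big] = 0,$$

i.e. the approximation error is orthogonal to the score functions. Subtracting this zero from the policy gradient $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, q^\pi(s,a)\big]$ swaps $q^\pi$ for $\hat{q}_{w^*}$ with no change.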
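
The proof sketch can be verified exactly on a tiny problem where expectations are finite sums. The following NumPy sketch assumes a single state with `K` actions and a softmax policy whose logits are `A @ theta`; the matrix `A`, the sizes, and the random stand-in `q_true` for the true action values are all hypothetical choices made for illustration. Because the policy has fewer parameters than actions, the critic cannot represent `q_true`, yet the two gradients agree.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 2                      # number of actions, number of policy parameters
A = rng.normal(size=(K, d))      # hypothetical action features; logits = A @ theta
theta = rng.normal(size=d)
q_true = rng.normal(size=K)      # arbitrary stand-in for q^pi(s, a) in one state

# Softmax policy over the K actions.
logits = A @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# Score functions: row a is grad_theta log pi(a) = A[a] - sum_b pi[b] * A[b].
scores = A - pi @ A

# Exact policy gradient using the true action values.
grad_true = scores.T @ (pi * q_true)

# Condition 2: pi-weighted least squares for w (weights enter as sqrt(pi)).
sqrt_pi = np.sqrt(pi)
w_star, *_ = np.linalg.lstsq(scores * sqrt_pi[:, None], q_true * sqrt_pi, rcond=None)

# Condition 1: the critic is linear in the score features.
q_hat = scores @ w_star

# Policy gradient using the compatible critic, and the projected error.
grad_critic = scores.T @ (pi * q_hat)
error_dot_scores = scores.T @ (pi * (q_true - q_hat))

print("critic equals q_true:", np.allclose(q_hat, q_true))
print("gradients agree:     ", np.allclose(grad_true, grad_critic))
```

The critic values `q_hat` differ from `q_true`, but `error_dot_scores` is zero up to floating-point error, which is the orthogonality condition from the proof sketch.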

Key Properties / Variants

  • Two conditions, both required: linear-in-score-features critic and MSE-optimal weights. Drop either and the substitution is biased.
  • Subspace, not global, accuracy: the critic need not approximate $q^\pi$ well everywhere; only its projection onto the score-function subspace matters.
  • Baselines are free under compatibility: subtracting any state-dependent Baseline $b(s)$ (e.g. $v^\pi(s)$) leaves the gradient unbiased because $\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = 0$. Combining a compatible critic with a value baseline yields an Advantage Function estimate $\hat{q}_{w^*}(s,a) - v^\pi(s)$ that keeps the policy gradient unbiased.
  • Limited expressiveness: a critic linear in score features is weak. In practice (deep actor-critic, A2C/A3C/PPO) the compatibility conditions are relaxed — a nonlinear neural critic is used, trading exact unbiasedness for representational power. Compatible FA is mainly the theoretical guarantee that an unbiased actor-critic can exist.
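  • Connection to Natural Policy Gradient: with a compatible critic $\hat{q}_w(s,a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s)$, the MSE-optimal weights are exactly the Natural Policy Gradient direction, $w^* = F(\theta)^{-1} \nabla_\theta J(\theta)$, where $F(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big]$ is the Fisher Information Matrix. So the compatible critic’s parameters are the natural gradient.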
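
Two of the properties above, the zero-expectation baseline term and the natural-gradient identity, can be checked on a small exact example. This NumPy sketch again assumes a single state and a softmax policy with logits `A @ theta`; `A`, the sizes, the constant `baseline`, and the random `q_true` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 6, 3                          # number of actions, number of policy parameters
A = rng.normal(size=(K, d))          # hypothetical action features
theta = 0.5 * rng.normal(size=d)
q_true = rng.normal(size=K)          # arbitrary stand-in for q^pi(s, a)

logits = A @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()
scores = A - pi @ A                  # row a is grad_theta log pi(a)

# Baseline property: E_a[score(a) * b] = b * E_a[score(a)] = 0 for any constant b.
baseline = 3.7
baseline_term = scores.T @ (pi * baseline)

# Fisher information matrix and the natural gradient direction F^{-1} grad J.
fisher = scores.T @ (pi[:, None] * scores)
grad_J = scores.T @ (pi * q_true)
natural_grad = np.linalg.solve(fisher, grad_J)

# MSE-optimal compatible-critic weights via pi-weighted least squares.
sqrt_pi = np.sqrt(pi)
w_star, *_ = np.linalg.lstsq(scores * sqrt_pi[:, None], q_true * sqrt_pi, rcond=None)

print("baseline term:", baseline_term)
print("w* matches natural gradient:", np.allclose(w_star, natural_grad))
```

The identity follows from the normal equations of the least-squares problem, which read `fisher @ w_star == grad_J`.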

Algorithm: Compatible Actor-Critic (one update)
────────────────────────────────────────────────
Given policy π_θ, compatible critic q̂_w(s,a) = wᵀ ∇_θ log π_θ(a|s)
 
Loop:
  Sample s ~ μ^π, a ~ π_θ(·|s); observe target for q^π(s,a)
  # Critic step: drive MSE to its minimum
  feat ← ∇_θ log π_θ(a|s)          # compatible features
  err  ← q^π_target(s,a) − wᵀ feat
  w    ← w + β · err · feat         # ⇒ at convergence, error ⟂ feat
  # Actor step: unbiased once w is at the MSE minimum (compatible features + MSE-optimal w)
  θ    ← θ + α · feat · (wᵀ feat)   # = ∇_θ log π_θ · q̂_w
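
The loop above can be turned into a small runnable sketch. The version below is an illustration under simplifying assumptions rather than a reference implementation: it uses a single-state (bandit) problem, a tabular softmax policy with one logit per arm, and a noisy reward as the target for $q^\pi(s,a)$. The function name `run_compatible_actor_critic`, the arm means, the noise level, and the step sizes are all hypothetical choices.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def run_compatible_actor_critic(arm_means, steps=20000, alpha=0.05, beta=0.1, seed=0):
    """Bandit sketch: tabular softmax actor with a compatible linear critic."""
    rng = np.random.default_rng(seed)
    n_arms = len(arm_means)
    theta = np.zeros(n_arms)   # actor parameters (one logit per arm)
    w = np.zeros(n_arms)       # critic weights, same shape as theta
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(n_arms, p=pi)
        # Noisy reward stands in for the target for q^pi(s, a) (an assumption).
        target = arm_means[a] + 0.1 * rng.normal()
        # Compatible features: grad_theta log pi(a) = one_hot(a) - pi for a softmax.
        feat = -pi
        feat[a] += 1.0
        # Critic step: stochastic gradient step on the squared error.
        err = target - w @ feat
        w = w + beta * err * feat
        # Actor step: score function times the critic's estimate.
        theta = theta + alpha * feat * (w @ feat)
    return softmax(theta), w


arm_means = np.array([0.2, 0.5, 0.9])
final_pi, final_w = run_compatible_actor_critic(arm_means)
print("final policy:", final_pi)
```

The critic step size `beta` is set larger than the actor step size `alpha` so that the critic stays close to its MSE minimum while the policy changes slowly; the learned policy should concentrate on the arm with the highest mean.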

Connections

Appears In