Compatible Function Approximation

Definition

In an Actor-Critic method, the critic $\hat{q}_{w}(s, a)$ is an approximation of the true $q^{π}(s, a)$, so naively plugging it into the Policy Gradient Theorem introduces bias. Compatible function approximation specifies the conditions under which a learned critic can replace the true action-value function with no bias in the policy gradient. The critic must (1) be linear in the policy’s score features $\nabla_{θ} \log π_{θ}(a \mid s)$ and (2) be fit to minimise mean-squared error against $q^{π}$. When both hold, $\nabla_{θ} J$ computed with $\hat{q}_{w}$ exactly equals the one computed with $q^{π}$.

Intuition

Why bias can vanish

The policy gradient only ever “sees” the critic through the expectation $E[\nabla_{θ} \log π_{θ}(a \mid s)\, \hat{q}_{w}(s, a)]$, that is, through the critic’s inner products with the components of the score function. The critic does not need to be globally correct: it only needs to be correct in the subspace spanned by the score functions $\nabla_{θ} \log π_{θ}$.

If the critic is linear in exactly those score features and its weights minimise the mean-squared error, then the remaining approximation error is orthogonal to the score functions. Orthogonal error contributes nothing when projected onto the scores, so it cancels in expectation. This is the same orthogonality logic that makes least-squares residuals perpendicular to the regression features; here the “features” are the policy’s own score functions.
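The orthogonality argument can be checked numerically. The sketch below is an illustration under assumptions that are not part of the note: a single state, a softmax policy over hypothetical action features `phi`, and random numbers standing in for $q^{π}$. It fits the linear-in-scores critic by weighted least squares and shows that the residual is far from zero yet orthogonal to the scores under $π_{θ}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, dim = 5, 2

# Assumptions for illustration only: random action features, random
# policy parameters, and random values standing in for q^pi(s, .).
phi = rng.normal(size=(n_actions, dim))
theta = rng.normal(size=dim)
q_true = rng.normal(size=n_actions)

# Softmax policy pi(a) proportional to exp(theta^T phi(a)).
logits = phi @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# Score features: grad_theta log pi(a) = phi(a) - E_pi[phi].
score = phi - pi @ phi                      # shape (n_actions, dim)

# Fit w by least squares weighted by pi (the MSE under a ~ pi).
sw = np.sqrt(pi)
w, *_ = np.linalg.lstsq(sw[:, None] * score, sw * q_true, rcond=None)
residual = q_true - score @ w               # what the critic gets wrong

# E_pi[score * residual]: the error as seen by the policy gradient.
orthogonality = score.T @ (pi * residual)

print("largest residual:", np.abs(residual).max())
print("E[score * residual]:", orthogonality)
```

With only two weights for five actions the critic cannot match `q_true`, but the printed expectation is zero up to floating-point error, which is all the policy gradient needs.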

Mathematical Formulation

A critic $\hat{q}_{w}(s, a)$ is compatible with policy $π_{θ}$ if it satisfies two conditions.

Condition 1 (compatible features; the critic's gradient must match the score): $\nabla_{w} \hat{q}_{w}(s, a) = \nabla_{θ} \log π_{θ}(a \mid s)$

Integrating in $w$, this forces the critic to be linear in the score features (up to a term that does not depend on $w$, taken to be zero here): $\hat{q}_{w}(s, a) = w^{⊤} \nabla_{θ} \log π_{θ}(a \mid s)$

Condition 2 (the critic minimises mean-squared error against the true value): $ε(w) = E_{s \sim μ^{π}, a \sim π_{θ}}[(q^{π}(s, a) - \hat{q}_{w}(s, a))^{2}], \quad w^{⋆} = \arg\min_{w} ε(w)$

Result (unbiased policy gradient). If both hold, then $\nabla_{θ} J(θ) = E_{s \sim μ^{π}, a \sim π_{θ}}[\nabla_{θ} \log π_{θ}(a \mid s)\, \hat{q}_{w^{⋆}}(s, a)] = E_{s \sim μ^{π}, a \sim π_{θ}}[\nabla_{θ} \log π_{θ}(a \mid s)\, q^{π}(s, a)]$

where:

$\hat{q}_{w}(s, a)$: the parametrised critic, with parameters $w$
$\nabla_{θ} \log π_{θ}(a \mid s)$: the score function of the policy (the Log-derivative trick term that appears in every policy gradient)
$μ^{π}$: the On-Policy Distribution of states under $π_{θ}$
$q^{π}(s, a)$: the true action-value function being approximated
$w^{⋆}$: critic weights at the MSE minimum

Why it works (proof sketch). At the minimiser $w^{⋆}$, the gradient of the MSE is zero: $\nabla_{w} ε(w^{⋆}) = -2\, E[(q^{π}(s, a) - \hat{q}_{w^{⋆}}(s, a))\, \nabla_{w} \hat{q}_{w^{⋆}}(s, a)] = 0$. Substituting Condition 1, $\nabla_{w} \hat{q}_{w^{⋆}} = \nabla_{θ} \log π_{θ}$, gives $E[(q^{π}(s, a) - \hat{q}_{w^{⋆}}(s, a))\, \nabla_{θ} \log π_{θ}(a \mid s)] = 0$, i.e. the approximation error is orthogonal to the score functions. Subtracting this zero from the Policy Gradient Theorem expression $E[\nabla_{θ} \log π_{θ}(a \mid s)\, q^{π}(s, a)]$ leaves $E[\nabla_{θ} \log π_{θ}(a \mid s)\, \hat{q}_{w^{⋆}}(s, a)]$, so we can swap $q^{π} \to \hat{q}_{w^{⋆}}$ with no change. Every expectation here is over $s \sim μ^{π}, a \sim π_{θ}$.
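The proof sketch can be checked end to end on a small problem where everything is computable exactly. The following is a minimal sketch under assumptions of its own: a hypothetical random 3-state, 2-action MDP, a tabular softmax policy, and the unnormalised discounted state visitation `d` used as the state weighting (normalising it would only rescale the gradient). It compares the gradient computed with the true $q^{π}$, the gradient computed with the fitted compatible critic, and a finite-difference gradient of $J$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9

# Hypothetical random MDP: transitions P[s, a, s'], rewards R[s, a].
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(S, A))
rho = np.full(S, 1.0 / S)                   # start-state distribution

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Exact J, q^pi, discounted visitation d, and pi for tabular softmax."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    q = R + gamma * P @ v
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return rho @ v, q, d, pi

def score_features(pi):
    """Row (s, a) holds grad_theta log pi(a|s), flattened to length S*A."""
    feats = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            g = np.zeros((S, A))
            g[s] = -pi[s]
            g[s, a] += 1.0
            feats[s * A + a] = g.ravel()
    return feats

theta = rng.normal(size=(S, A))
J, q, d, pi = evaluate(theta)
Phi = score_features(pi)
m = (d[:, None] * pi).ravel()               # weight of each (s, a) pair
q_flat = q.ravel()

# Condition 2: weighted least-squares fit of the linear-in-scores critic.
sw = np.sqrt(m)
w_star, *_ = np.linalg.lstsq(sw[:, None] * Phi, sw * q_flat, rcond=None)
q_hat = Phi @ w_star

grad_true = Phi.T @ (m * q_flat)            # uses q^pi
grad_critic = Phi.T @ (m * q_hat)           # uses the compatible critic

# Independent check: central finite differences of J(theta).
eps = 1e-6
grad_fd = np.zeros(S * A)
for i in range(S * A):
    e = np.zeros(S * A)
    e[i] = eps
    e = e.reshape(S, A)
    grad_fd[i] = (evaluate(theta + e)[0] - evaluate(theta - e)[0]) / (2 * eps)

print("max |grad_critic - grad_true|:", np.abs(grad_critic - grad_true).max())
print("max |grad_true - grad_fd|:", np.abs(grad_true - grad_fd).max())
print("max |q_hat - q|:", np.abs(q_hat - q_flat).max())
```

The two gradients agree to floating-point precision even though `q_hat` differs visibly from `q`: every score feature has zero mean under $π_{θ}$ in each state, so the fitted critic cannot represent the state-value part of $q^{π}$, and that part is exactly what the gradient ignores.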

Key Properties / Variants

Two conditions, both required: linear-in-score-features critic and MSE-optimal weights. Drop either and the substitution is, in general, biased.
Subspace, not global, accuracy: the critic need not approximate $q^{π}$ well everywhere — only its projection onto the score-function subspace matters.
Baselines are free under compatibility: subtracting any state-dependent Baseline $b(s)$ (e.g. $\hat{v}(s)$) leaves the gradient unbiased because $E_{a \sim π_{θ}}[\nabla_{θ} \log π_{θ}(a \mid s)\, b(s)] = b(s)\, \nabla_{θ} \sum_{a} π_{θ}(a \mid s) = 0$, an identity that does not depend on the critic. Combining a compatible critic with a value baseline gives the Advantage Function estimate $\hat{A}(s, a) = \hat{q}_{w}(s, a) - b(s)$ while keeping the policy gradient unbiased.
Limited expressiveness: a critic linear in score features is weak. In practice (deep actor-critic, A2C/A3C/PPO) the compatibility conditions are relaxed — a nonlinear neural critic is used, trading exact unbiasedness for representational power. Compatible FA is mainly the theoretical guarantee that an unbiased actor-critic can exist.
Connection to Natural Policy Gradient: with a compatible critic $\hat{q}_{w} = w^{⊤} \nabla_{θ} \log π_{θ}$, the MSE-optimal weights $w^{⋆}$ are exactly the Natural Policy Gradient direction, $w^{⋆} = F^{-1} \nabla_{θ} J$, where $F = E[\nabla_{θ} \log π_{θ}\, \nabla_{θ} \log π_{θ}^{⊤}]$ is the Fisher Information Matrix (assumed invertible). So the compatible critic’s parameters are the natural gradient.
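Two of the properties above, the zero-mean baseline term and $w^{⋆} = F^{-1} \nabla_{θ} J$, are algebraic identities and can be verified directly. The sketch below makes several assumptions for illustration: a softmax policy over hypothetical features `phi(s, a)` with only three parameters (so that $F$ is invertible), a random table `q` standing in for $q^{π}$, and a random distribution `mu` standing in for $μ^{π}$. Under those stand-ins, `grad` plays the role of $\nabla_{θ} J$.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, dim = 4, 3, 3

# Stand-ins chosen for illustration; none of these come from a real task.
phi = rng.normal(size=(S, A, dim))          # features phi(s, a)
theta = rng.normal(size=dim)
q = rng.normal(size=(S, A))                 # plays the role of q^pi
mu = rng.random(S)
mu /= mu.sum()                              # plays the role of mu^pi

logits = phi @ theta                        # shape (S, A)
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

# score(s, a) = phi(s, a) - E_{a' ~ pi}[phi(s, a')]
score = phi - np.einsum("sa,sad->sd", pi, phi)[:, None, :]
X = score.reshape(-1, dim)                  # one row per (s, a)
m = (mu[:, None] * pi).ravel()              # probability of each (s, a)

fisher = X.T @ (m[:, None] * X)             # F = E[score score^T]
grad = X.T @ (m * q.ravel())                # E[score * q]
natural_grad = np.linalg.solve(fisher, grad)

# MSE-optimal compatible critic weights via weighted least squares.
sw = np.sqrt(m)
w_star, *_ = np.linalg.lstsq(sw[:, None] * X, sw * q.ravel(), rcond=None)

# Baseline term E_a[score(s, a) * b(s)] for an arbitrary b(s), per state.
b = rng.normal(size=S)
baseline_term = np.einsum("sa,sad,s->sd", pi, score, b)

print("w_star:      ", w_star)
print("F^-1 grad:   ", natural_grad)
print("max |baseline term|:", np.abs(baseline_term).max())
```

The least-squares normal equations are $F w = E[\nabla_{θ} \log π_{θ}\, q]$, which is why the fitted weights and the natural-gradient direction coincide.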

Algorithm: Compatible Actor-Critic (one update)
────────────────────────────────────────────────
Given policy π_θ, compatible critic q̂_w(s,a) = wᵀ ∇_θ log π_θ(a|s)
 
Loop:
  Sample s ~ μ^π, a ~ π_θ(·|s); observe target for q^π(s,a)
  # Critic step: drive MSE to its minimum
  feat ← ∇_θ log π_θ(a|s)          # compatible features
  err  ← q^π_target(s,a) − wᵀ feat
  w    ← w + β · err · feat         # ⇒ at convergence, error ⟂ feat
  # Actor step: unbiased because critic is compatible + MSE-optimal
  θ    ← θ + α · feat · (wᵀ feat)   # = ∇_θ log π_θ · q̂_w
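A runnable version of the update above might look as follows. This is a minimal sketch, not a reference implementation: it assumes a hypothetical three-armed bandit (a single state, so the noisy reward is an unbiased target for $q^{π}$), a tabular softmax policy, and step sizes picked by hand. The function name `compatible_ac_step` is introduced here for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(theta, a):
    """grad_theta log pi_theta(a) for a single-state tabular softmax policy."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def compatible_ac_step(theta, w, a, q_target, alpha, beta):
    """One critic step followed by one actor step, mirroring the pseudocode."""
    feat = score(theta, a)                  # compatible features
    err = q_target - w @ feat               # critic error on this sample
    w = w + beta * err * feat               # critic: SGD on the squared error
    theta = theta + alpha * feat * (w @ feat)   # actor: score times q_hat
    return theta, w

rng = np.random.default_rng(0)
mean_reward = np.array([1.0, 0.0, 0.5])     # hypothetical bandit arms
theta, w = np.zeros(3), np.zeros(3)

for _ in range(5000):
    a = rng.choice(3, p=softmax(theta))
    q_target = mean_reward[a] + 0.1 * rng.normal()   # noisy sample of q^pi(a)
    theta, w = compatible_ac_step(theta, w, a, q_target, alpha=0.01, beta=0.1)

final_policy = softmax(theta)
print("final policy:", final_policy)
```

The critic step size is kept larger than the actor step size so that `w` can track the MSE minimum for the current policy, which is the situation the unbiasedness result assumes. In a multi-state problem the target would have to come from returns or bootstrapping, which this sketch does not cover.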

Connections

Makes unbiased: Policy Gradient Theorem, Actor-Critic
Critic feature is the: Log-derivative trick score function $\nabla_{θ} \log π_{θ}$
Equivalent weights to: Natural Policy Gradient via Fisher Information Matrix
Pairs with: Baseline, Advantage Function
State weighting: On-Policy Distribution
Relaxed by deep methods: A2C, A3C, Proximal Policy Optimization
Contrasts with: REINFORCE (no critic; the sampled Monte Carlo return stands in for $q^{π}$, so it is unbiased without needing compatibility)

Appears In

RL-L10 - Advanced Policy Search
