Trust Region Policy Optimization (TRPO)

Definition

Trust Region Policy Optimization

TRPO is an on-policy policy gradient algorithm that improves the policy by maximizing a surrogate objective (the expected advantage under the new policy, weighted by an importance ratio) subject to a hard constraint that the new policy stays close to the old one in average KL divergence. The KL constraint defines a “trust region” inside which the surrogate is a reliable approximation of the true performance, guaranteeing monotonic (non-decreasing) policy improvement in the idealized case.

Intuition

Vanilla policy gradient takes a fixed-size step in parameter space ( $θ \leftarrow θ + α \nabla_{θ} J$ ). The problem: a small change in $θ$ can cause a huge change in the actual policy distribution, collapsing performance. Because policy gradient is on-policy, one bad update poisons all future data — there is no recovery from off-policy replay.

TRPO fixes this by measuring step size in policy space instead of parameter space. It asks: “How far can I move the policy and still trust my data-derived estimate of improvement?” The answer is bounded by the KL divergence between old and new policy. Stay inside the trust region $\overline{D}_{KL} \leq δ$ , and the surrogate objective faithfully predicts the true return; the update is then guaranteed not to make things worse.

Why a Constraint Instead of a Penalty

Theory (the MM / minorize-maximize bound) actually suggests a KL penalty with a coefficient tied to the max advantage. But that coefficient forces tiny, conservative steps. TRPO instead uses a hard KL constraint $δ$ as a robust, tunable proxy that allows much larger practical steps while keeping the trust-region guarantee.

Mathematical Formulation

The surrogate objective. TRPO maximizes the expected advantage of the new policy $π_{θ}$ , with state visitation taken from the old policy $π_{θ_{old}}$ and actions reweighted by an importance ratio:

$L_{θ_{old}} (θ) = E_{s \sim ρ^{π_{θ_{old}}}, a \sim π_{θ_{old}}} [\frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )} A^{π_{θ_{old}}} (s, a)]$

The constrained optimization problem solved at each iteration:

$max_{θ} L_{θ_{old}} (θ) subject to \overline{D}_{KL} (π_{θ_{old}} ∥ π_{θ}) \leq δ$

where:

$\frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )}$ — importance ratio correcting for the fact that data was collected under $π_{θ_{old}}$ but we score $π_{θ}$
$A^{π_{θ_{old}}} (s, a)$ — advantage under the old policy (estimated in practice by GAE)
$ρ^{π_{θ_{old}}}$ — discounted state-visitation distribution under the old policy
$\overline{D}_{KL} (π_{θ_{old}} ∥ π_{θ}) = E_{s \sim ρ^{π_{θ_{old}}}} [D_{KL} (π_{θ_{old}} (\cdot ∣ s) ∥ π_{θ} (\cdot ∣ s))]$ — average KL over visited states (the max-KL version is intractable)
$δ$ — trust-region radius (typical: $δ \approx 1 0^{- 2}$ )

Solving it (the natural-gradient connection). Linearize the objective and quadratically approximate the constraint around $θ_{old}$ :

$max_{Δ θ} g^{⊤} Δ θ s.t. \frac{1}{2} Δ θ^{⊤} F Δ θ \leq δ$

where $g = \nabla_{θ} L_{θ_{old}} (θ) ∣_{θ_{old}}$ is the policy gradient and $F$ is the Fisher Information Matrix (the Hessian of the KL constraint). The closed-form solution is the natural gradient step scaled to fill the trust region:

$θ_{new} = θ_{old} + \frac{2 δ}{g ^{⊤} F ^{- 1} g} F^{- 1} g$

where:

$F^{- 1} g$ — the natural gradient direction (search direction $x$ obtained by solving $F x = g$ with conjugate gradient, never forming $F$ explicitly)
$2 δ / (g^{⊤} F^{- 1} g)$ — maximal step size along $F^{- 1} g$ that keeps the quadratic KL at exactly $δ$

Monotonic Improvement Bound

TRPO is derived from a guarantee on the true return $η$ : $η (π_{θ}) \geq L_{θ_{old}} (θ) - C \cdot \overline{D}_{KL}^{m a x} (π_{θ_{old}} ∥ π_{θ}), C = \frac{4 ε γ}{( 1 - γ ) ^{2}}$ with $ε = max_{s, a} ∣ A^{π_{θ_{old}}} (s, a) ∣$ . Maximizing the right-hand side (a minorant of the true return) cannot decrease $η$ — this is the monotonic-improvement property. TRPO replaces the penalty $C \overline{D}_{KL}$ with the trust-region constraint $δ$ for larger, practical steps.

Key Properties / Variants

On-policy. Fresh trajectories are collected from $π_{θ_{old}}$ each iteration; the importance ratio is only a local correction valid near $θ_{old}$ (inside the trust region).
Second-order / natural-gradient method. The Fisher matrix $F$ is the curvature of policy space; TRPO is essentially Natural Policy Gradient with an automatically chosen step size plus safeguards.
Hessian-free. Conjugate gradient solves $F x = g$ using only Fisher-vector products ( $F v$ via a double back-prop through the KL), avoiding the $O (d^{3})$ cost of inverting $F$ .
Backtracking line search enforces the true (non-approximated) constraint and a real improvement in the surrogate — protecting against errors from the quadratic approximation.
Monotonic improvement guarantee (in the exact setting), unlike vanilla policy gradient which can collapse.
Cost / drawbacks: the CG + line search machinery is complex and per-update expensive; it does not naturally share parameters between policy and value networks, and is awkward with architectures involving heavy noise/dropout.
Successor — Proximal Policy Optimization: replaces the hard KL constraint with a clipped surrogate ratio (or a soft KL penalty), recovering most of TRPO’s stability with first-order SGD and far simpler code. PPO is now the default; TRPO is the theoretical anchor.

Algorithm: Trust Region Policy Optimization (TRPO)
──────────────────────────────────────────────────
Input: initial policy θ, value/critic params w, trust radius δ
Loop for each iteration:
  1. Collect trajectories by running π_θ_old (θ_old ← θ) in the environment
  2. Estimate advantages  Â_t   (e.g., GAE with the critic V_w)
  3. Compute policy gradient   g = ∇_θ L(θ) |_θ_old
       L(θ) = mean_t [ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ]
  4. Solve  F x = g  by conjugate gradient    (x ≈ F⁻¹ g, natural-grad direction)
       using Fisher-vector products  F v = ∇_θ ( (∇_θ KL)·v )   (no explicit F)
  5. Compute max step:  β = sqrt( 2δ / (xᵀ F x) )
  6. Backtracking line search over j = 0,1,2,...:
       θ_try = θ_old + (α^j) · β · x          (α ∈ (0,1), e.g. 0.5)
       accept first θ_try with  KL(π_θ_old ‖ π_θ_try) ≤ δ
                               and  L(θ_try) > L(θ_old)
       θ ← θ_try   (if none accepted, keep θ_old)
  7. Fit critic: minimize  Σ_t ( V_w(s_t) − R̂_t )²

Connections

Builds directly on: Natural Policy Gradient, Fisher Information Matrix, Policy Gradient Theorem
Uses: Advantage Function, Generalized Advantage Estimation, Importance Sampling (the policy ratio)
Simplified successor: PPO (clipped surrogate, first-order)
Family: Actor-Critic, Policy Gradient Methods, Deep Reinforcement Learning
Contrast: vanilla policy gradient / REINFORCE (parameter-space step, no improvement guarantee)

Study Notes

Explorer

Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization (TRPO)

Definition

Intuition

Mathematical Formulation

Key Properties / Variants

Connections

Appears In

Graph View

Table of Contents

Backlinks