Trust Region Policy Optimization (TRPO)
Definition
Trust Region Policy Optimization
TRPO is an on-policy policy gradient algorithm that improves the policy by maximizing a surrogate objective (the expected advantage under the new policy, weighted by an importance ratio) subject to a hard constraint that the new policy stays close to the old one in average KL divergence. The KL constraint defines a “trust region” inside which the surrogate is a reliable approximation of the true performance, guaranteeing monotonic (non-decreasing) policy improvement in the idealized case.
Intuition
Vanilla policy gradient takes a fixed-size step in parameter space (). The problem: a small change in can cause a huge change in the actual policy distribution, collapsing performance. Because policy gradient is on-policy, one bad update poisons all future data — there is no recovery from off-policy replay.
TRPO fixes this by measuring step size in policy space instead of parameter space. It asks: “How far can I move the policy and still trust my data-derived estimate of improvement?” The answer is bounded by the KL divergence between old and new policy. Stay inside the trust region , and the surrogate objective faithfully predicts the true return; the update is then guaranteed not to make things worse.
Why a Constraint Instead of a Penalty
Theory (the MM / minorize-maximize bound) actually suggests a KL penalty with a coefficient tied to the max advantage. But that coefficient forces tiny, conservative steps. TRPO instead uses a hard KL constraint as a robust, tunable proxy that allows much larger practical steps while keeping the trust-region guarantee.
Mathematical Formulation
The surrogate objective. TRPO maximizes the expected advantage of the new policy , with state visitation taken from the old policy and actions reweighted by an importance ratio:
The constrained optimization problem solved at each iteration:
where:
- — importance ratio correcting for the fact that data was collected under but we score
- — advantage under the old policy (estimated in practice by GAE)
- — discounted state-visitation distribution under the old policy
- — average KL over visited states (the max-KL version is intractable)
- — trust-region radius (typical: )
Solving it (the natural-gradient connection). Linearize the objective and quadratically approximate the constraint around :
where is the policy gradient and is the Fisher Information Matrix (the Hessian of the KL constraint). The closed-form solution is the natural gradient step scaled to fill the trust region:
where:
- — the natural gradient direction (search direction obtained by solving with conjugate gradient, never forming explicitly)
- — maximal step size along that keeps the quadratic KL at exactly
Monotonic Improvement Bound
TRPO is derived from a guarantee on the true return : with . Maximizing the right-hand side (a minorant of the true return) cannot decrease — this is the monotonic-improvement property. TRPO replaces the penalty with the trust-region constraint for larger, practical steps.
Key Properties / Variants
- On-policy. Fresh trajectories are collected from each iteration; the importance ratio is only a local correction valid near (inside the trust region).
- Second-order / natural-gradient method. The Fisher matrix is the curvature of policy space; TRPO is essentially Natural Policy Gradient with an automatically chosen step size plus safeguards.
- Hessian-free. Conjugate gradient solves using only Fisher-vector products ( via a double back-prop through the KL), avoiding the cost of inverting .
- Backtracking line search enforces the true (non-approximated) constraint and a real improvement in the surrogate — protecting against errors from the quadratic approximation.
- Monotonic improvement guarantee (in the exact setting), unlike vanilla policy gradient which can collapse.
- Cost / drawbacks: the CG + line search machinery is complex and per-update expensive; it does not naturally share parameters between policy and value networks, and is awkward with architectures involving heavy noise/dropout.
- Successor — Proximal Policy Optimization: replaces the hard KL constraint with a clipped surrogate ratio (or a soft KL penalty), recovering most of TRPO’s stability with first-order SGD and far simpler code. PPO is now the default; TRPO is the theoretical anchor.
Algorithm: Trust Region Policy Optimization (TRPO)
──────────────────────────────────────────────────
Input: initial policy θ, value/critic params w, trust radius δ
Loop for each iteration:
1. Collect trajectories by running π_θ_old (θ_old ← θ) in the environment
2. Estimate advantages Â_t (e.g., GAE with the critic V_w)
3. Compute policy gradient g = ∇_θ L(θ) |_θ_old
L(θ) = mean_t [ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ]
4. Solve F x = g by conjugate gradient (x ≈ F⁻¹ g, natural-grad direction)
using Fisher-vector products F v = ∇_θ ( (∇_θ KL)·v ) (no explicit F)
5. Compute max step: β = sqrt( 2δ / (xᵀ F x) )
6. Backtracking line search over j = 0,1,2,...:
θ_try = θ_old + (α^j) · β · x (α ∈ (0,1), e.g. 0.5)
accept first θ_try with KL(π_θ_old ‖ π_θ_try) ≤ δ
and L(θ_try) > L(θ_old)
θ ← θ_try (if none accepted, keep θ_old)
7. Fit critic: minimize Σ_t ( V_w(s_t) − R̂_t )²Connections
- Builds directly on: Natural Policy Gradient, Fisher Information Matrix, Policy Gradient Theorem
- Uses: Advantage Function, Generalized Advantage Estimation, Importance Sampling (the policy ratio)
- Simplified successor: PPO (clipped surrogate, first-order)
- Family: Actor-Critic, Policy Gradient Methods, Deep Reinforcement Learning
- Contrast: vanilla policy gradient / REINFORCE (parameter-space step, no improvement guarantee)