Proximal Policy Optimization (PPO)
PPO
A policy gradient algorithm that constrains the policy update to stay close to the current policy, preventing destructively large updates. It uses a clipped surrogate objective as a simpler alternative to trust-region methods (TRPO).
Intuition
Standard Policy Gradient Methods can take steps that are too large, causing the policy to collapse or oscillate. PPO addresses this by clipping the probability ratio between old and new policies, ensuring conservative updates without needing complex second-order optimization.
Clipped Surrogate Objective
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$
where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ — probability ratio between new and old policy
- $\hat{A}_t$ — estimated advantage at time $t$
- $\epsilon$ — clipping hyperparameter (typically 0.1–0.2)
- The $\min$ operation takes the more pessimistic (conservative) bound
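The objective above is a few lines of array code. A minimal sketch in numpy (function name and batching are illustrative, not from the source):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """L^CLIP over a batch: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))
```

For example, with a positive advantage and a ratio of 1.5, the objective is capped at `1.2 * A` rather than `1.5 * A`, so the optimizer gains nothing by pushing the ratio further from 1.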
Why Clipping Works
If the ratio moves too far from 1 (policy changed a lot), the clip cuts off the objective’s gradient, stopping the update. This means the policy can improve but can’t change drastically in a single step.
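The gradient cutoff can be checked numerically: for a positive advantage, a finite-difference slope of the objective is nonzero inside the clip band but exactly zero once the ratio exceeds $1+\epsilon$ (a toy check, not part of the algorithm):

```python
import numpy as np

def l_clip(r, A, eps=0.2):
    # Per-sample clipped surrogate: min(r*A, clip(r)*A).
    return min(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

A = 1.0  # positive advantage
# Slope just below the boundary: gradient still flows.
inside = (l_clip(1.19, A) - l_clip(1.10, A)) / 0.09
# Slope fully past 1+eps: the objective is flat, so the gradient is zero.
flat = (l_clip(1.50, A) - l_clip(1.40, A)) / 0.10
```

This is exactly why a single step cannot move the policy arbitrarily far: samples whose ratio has already left the clip band contribute no gradient.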
Algorithm Sketch
Algorithm: PPO (Clip version)
──────────────────────────────
For each iteration:
1. Collect T timesteps of data using current policy π_θ_old
2. Compute advantages Â_t (e.g., GAE)
3. For K epochs over the collected data:
- Compute r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
- Compute L^CLIP(θ)
- Update θ via gradient ascent on L^CLIP
4. θ_old ← θ
Key Properties
- Simple to implement compared to TRPO (no conjugate gradient, no KL constraint)
- Empirically strong: works well across many domains (Atari, MuJoCo, robotics)
- Sample efficient: reuses data across K epochs per collection phase
- Approximately conservative: clipping keeps each update close to the old policy, mimicking a trust region, though without TRPO's formal monotonic improvement guarantee
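The algorithm sketch can be exercised end to end on a toy problem. The sketch below runs PPO-Clip on a 2-armed bandit with a softmax policy and a mean-reward baseline in place of a critic; the environment, hyperparameters, and analytic gradient are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 2-armed bandit (assumption): arm 1 pays 1, arm 0 pays 0.
theta = np.zeros(2)
eps, lr, K, T = 0.2, 0.5, 4, 256

for iteration in range(20):
    # 1. Collect T steps with the current (old) policy.
    pi_old = softmax(theta)
    actions = rng.choice(2, size=T, p=pi_old)
    rewards = actions.astype(float)
    # 2. Advantage estimate: reward minus a mean baseline (no critic here).
    adv = rewards - rewards.mean()
    old_probs = pi_old[actions]
    # 3. K epochs of clipped-surrogate ascent on the same batch.
    for _ in range(K):
        pi = softmax(theta)
        ratio = pi[actions] / old_probs
        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        use_unclipped = ratio * adv <= clipped * adv  # term min() selects
        grad = np.zeros(2)
        for a, A, r, keep in zip(actions, adv, ratio, use_unclipped):
            if keep:  # gradient flows only through the unclipped term
                dlog = -pi.copy()
                dlog[a] += 1.0            # grad of log pi(a) for softmax
                grad += A * r * dlog      # grad of r*A = A * r * grad log pi(a)
        theta += lr * grad / T
    # 4. theta_old <- theta happens implicitly at the top of the next loop.
```

After training, `softmax(theta)` should put most of its mass on arm 1; samples whose ratio has saturated the clip band simply drop out of the gradient, which is step 3's clipping in action.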
Connections
- Extension of REINFORCE and Policy Gradient Theorem
- Uses Actor-Critic framework (policy + value function)
- Alternative to TRPO (Trust Region Policy Optimization)
- Often combined with Generalized Advantage Estimation (GAE)
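Since GAE supplies the $\hat{A}_t$ used in step 2 of the sketch, here is the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (function name and trajectory layout are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have len(rewards) + 1 entries: the extra entry is the
    bootstrap value of the state after the last reward.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # One-step TD error, then discounted accumulation with gamma*lambda.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With $\gamma = \lambda = 1$ and a zero value function, this degenerates to reward-to-go sums, which is a quick sanity check on the recursion.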
Appears In
- RL course Week 5–6 (policy gradient methods)