Proximal Policy Optimization (PPO)
PPO
A policy gradient algorithm that constrains the policy update to stay close to the current policy, preventing destructively large updates. It uses a clipped surrogate objective as a simpler alternative to trust-region methods (TRPO).
Intuition
Standard Policy Gradient Methods can take steps that are too large, causing the policy to collapse or oscillate. PPO addresses this by clipping the probability ratio between old and new policies, ensuring conservative updates without needing complex second-order optimization.
Clipped Surrogate Objective
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$
where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ — probability ratio between new and old policy
- $\hat{A}_t$ — estimated advantage at time $t$
- $\epsilon$ — clipping hyperparameter (typically 0.1–0.2)
- The $\min$ operation takes the more pessimistic (conservative) bound
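The objective above is a few lines of array code. A minimal sketch in numpy (function name and batching are illustrative, not from the source):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """L^CLIP over a batch: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))
```

For example, with a positive advantage and a ratio of 1.5, the objective is capped at `1.2 * A` rather than `1.5 * A`, so the optimizer gains nothing by pushing the ratio further from 1.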
Why Clipping Works
If the ratio moves too far from 1 (policy changed a lot), the clip cuts off the objective’s gradient, stopping the update. This means the policy can improve but can’t change drastically in a single step.
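The gradient cutoff can be checked numerically: for a positive advantage, a finite-difference slope of the objective is nonzero inside the clip band but exactly zero once the ratio exceeds $1+\epsilon$ (a toy check, not part of the algorithm):

```python
import numpy as np

def l_clip(r, A, eps=0.2):
    # Per-sample clipped surrogate: min(r*A, clip(r)*A).
    return min(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

A = 1.0  # positive advantage
# Slope just below the boundary: gradient still flows.
inside = (l_clip(1.19, A) - l_clip(1.10, A)) / 0.09
# Slope fully past 1+eps: the objective is flat, so the gradient is zero.
flat = (l_clip(1.50, A) - l_clip(1.40, A)) / 0.10
```

This is exactly why a single step cannot move the policy arbitrarily far: samples whose ratio has already left the clip band contribute no gradient.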
Algorithm Sketch
Algorithm: PPO (Clip version)
──────────────────────────────
For each iteration:
1. Collect T timesteps of data using current policy π_θ_old
2. Compute advantages Â_t (e.g., GAE)
3. For K epochs over the collected data:
- Compute r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
- Compute L^CLIP(θ)
- Update θ via gradient ascent on L^CLIP
4. θ_old ← θ
Key Properties
- Simple to implement compared to TRPO (no conjugate gradient, no KL constraint)
- Empirically strong: works well across many domains (Atari, MuJoCo, robotics)
- Sample efficient: reuses data across K epochs per collection phase
- Approximately conservative: clipping keeps each update close to the old policy, mimicking a trust region, though without TRPO's formal monotonic improvement guarantee
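The algorithm sketch can be exercised end to end on a toy problem. The sketch below runs PPO-Clip on a 2-armed bandit with a softmax policy and a mean-reward baseline in place of a critic; the environment, hyperparameters, and analytic gradient are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 2-armed bandit (assumption): arm 1 pays 1, arm 0 pays 0.
theta = np.zeros(2)
eps, lr, K, T = 0.2, 0.5, 4, 256

for iteration in range(20):
    # 1. Collect T steps with the current (old) policy.
    pi_old = softmax(theta)
    actions = rng.choice(2, size=T, p=pi_old)
    rewards = actions.astype(float)
    # 2. Advantage estimate: reward minus a mean baseline (no critic here).
    adv = rewards - rewards.mean()
    old_probs = pi_old[actions]
    # 3. K epochs of clipped-surrogate ascent on the same batch.
    for _ in range(K):
        pi = softmax(theta)
        ratio = pi[actions] / old_probs
        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        use_unclipped = ratio * adv <= clipped * adv  # term min() selects
        grad = np.zeros(2)
        for a, A, r, keep in zip(actions, adv, ratio, use_unclipped):
            if keep:  # gradient flows only through the unclipped term
                dlog = -pi.copy()
                dlog[a] += 1.0            # grad of log pi(a) for softmax
                grad += A * r * dlog      # grad of r*A = A * r * grad log pi(a)
        theta += lr * grad / T
    # 4. theta_old <- theta happens implicitly at the top of the next loop.
```

After training, `softmax(theta)` should put most of its mass on arm 1; samples whose ratio has saturated the clip band simply drop out of the gradient, which is step 3's clipping in action.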
Connections
- Extension of REINFORCE and Policy Gradient Theorem
- Uses Actor-Critic framework (policy + value function)
- Alternative to TRPO (Trust Region Policy Optimization)
- Often combined with Generalized Advantage Estimation (GAE)
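Since GAE supplies the $\hat{A}_t$ used in step 2 of the sketch, here is the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (function name and trajectory layout are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have len(rewards) + 1 entries: the extra entry is the
    bootstrap value of the state after the last reward.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # One-step TD error, then discounted accumulation with gamma*lambda.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With $\gamma = \lambda = 1$ and a zero value function, this degenerates to reward-to-go sums, which is a quick sanity check on the recursion.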
Appears In
- RL course Week 5–6 (policy gradient methods)