Policy Gradient Methods
Definition
Policy Gradient (PG) methods are a family of reinforcement learning algorithms that directly optimize a parameterized policy by following the gradient of some performance measure with respect to the policy parameters $\theta$. Unlike value-based methods, they do not require a value function to select actions (though they may use one for variance reduction).
The Policy Gradient Theorem
For any differentiable policy $\pi_\theta(a \mid s)$, the gradient of the performance measure $J(\theta)$ (under certain conditions) is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

In practice, this leads to the stochastic gradient ascent update:

$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where:
- $\theta$ — policy parameters
- $G_t$ — the return (reward-to-go)
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ — the “score function” or eligibility vector
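The update above can be sketched end-to-end on a toy problem. Below is a minimal REINFORCE-style loop for a 2-armed bandit with a softmax policy over action preferences; the bandit, step count, and learning rate are illustrative choices, not from the source. For a softmax policy, the score function has the closed form $\partial \log \pi(a) / \partial \theta_i = \mathbf{1}[i{=}a] - \pi(i)$, which the code uses directly.

```python
import math
import random

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, G, alpha=0.1):
    """One update: theta <- theta + alpha * G * grad log pi(action)."""
    probs = softmax(theta)
    # Score function of a softmax policy: 1[i == action] - pi(i)
    grad_log = [(1.0 if i == action else 0.0) - probs[i]
                for i in range(len(theta))]
    return [t + alpha * G * g for t, g in zip(theta, grad_log)]

# Toy 2-armed bandit: arm 1 always pays 1, arm 0 pays nothing.
random.seed(0)
theta = [0.0, 0.0]
for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = 1.0 if a == 1 else 0.0   # return is just the immediate reward here
    theta = reinforce_step(theta, a, G)
```

After training, the policy should place most of its probability on the rewarding arm; note that actions with zero return produce no update at all, which is one source of REINFORCE's high variance.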
Advantages vs Value-Based Methods
- Continuous Action Spaces: Can output action probabilities or the parameters of a distribution (e.g., mean/std of a Gaussian) directly, whereas solving $\max_a Q(s, a)$ at every step is hard in a continuous action space.
- Stochastic Policies: Can learn the optimal stochastic policy (e.g., in Rock-Paper-Scissors or Aliased Gridworlds), while value-based methods are typically deterministic.
- Convergence: Often have stronger convergence guarantees as changes in parameters lead to smooth changes in the policy.
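The continuous-action case above can be made concrete. As a sketch (the linear parameterization is an illustrative assumption, not from the source), a Gaussian policy outputs a mean and log-std, samples an action, and exposes the closed-form score-function terms a PG update would use:

```python
import math
import random

def gaussian_policy_sample(state, w_mean, w_log_std):
    """Linear-Gaussian policy: mean and log-std are linear in the state
    features. Returns the sampled action plus the grad-log-pi terms."""
    mean = sum(w * s for w, s in zip(w_mean, state))
    log_std = sum(w * s for w, s in zip(w_log_std, state))
    std = math.exp(log_std)
    action = random.gauss(mean, std)
    # Score function of a Gaussian:
    #   d log pi / d mean    = (a - mean) / std^2
    #   d log pi / d log_std = ((a - mean)^2 / std^2) - 1
    d_mean = (action - mean) / std ** 2
    d_log_std = ((action - mean) ** 2) / std ** 2 - 1.0
    return action, d_mean, d_log_std
```

Chaining these derivatives through the linear maps gives the gradient with respect to `w_mean` and `w_log_std`; no argmax over actions is ever needed.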
Key Algorithms
- REINFORCE: The basic Monte Carlo PG algorithm using the full return $G_t$.
- Actor-Critic: Uses a “Critic” (a learned value function) to estimate $Q(s, a)$ or the advantage $A(s, a)$ instead of the full Monte Carlo return, reducing variance.
- PPO (Proximal Policy Optimization): A modern standard that uses a clipped objective to prevent destructively large updates.
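PPO's clipping mechanism is small enough to show in full. A minimal per-sample version of the clipped surrogate (with the common default $\epsilon = 0.2$; `ratio` is the new-to-old policy probability ratio):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single sample. Taking the min of the
    unclipped and clipped terms means large probability ratios cannot
    produce destructively large (over-optimistic) updates."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Note the asymmetry: a favorable ratio is capped at $1 + \epsilon$, but an unfavorable one with negative advantage is not rescued by clipping, since the `min` always keeps the pessimistic term.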
Connections
- Contrast with: Q-Learning, SARSA (Value-based methods)
- Component of: Actor-Critic Methods
- Uses: Neural Networks (usually as the policy function)
- Strategy: Stochastic Gradient Ascent
Appears In
- Future Week 5 Lecture
- RL-Book Ch13 - Policy Gradient Methods