Policy Gradient Methods

Definition

Policy Gradient (PG) methods are a family of reinforcement learning algorithms that directly optimize a parameterized policy $\pi_\theta$ by following the gradient of some performance measure with respect to the policy parameters $\theta$. Unlike value-based methods, they do not require a value function to select actions (though they may use one for variance reduction).

The Policy Gradient Theorem

For any differentiable policy $\pi_\theta(a \mid s)$, the gradient of the performance measure $J(\theta)$ (under certain conditions) is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

In practice, this leads to the stochastic gradient ascent update:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$$

where:

  • $\theta$ — policy parameters
  • $\alpha$ — step size (learning rate)
  • $G_t$ — the return (reward-to-go)
  • $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$ — the “score function” or eligibility vector
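The update above can be sketched in code for a discrete-action policy. The linear-softmax parameterization, the dimensions, and all function names below are illustrative assumptions, not something specified in this note:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over logits z."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_policy_grad(theta, s, a):
    """Score function grad_theta log pi_theta(a|s) for a linear-softmax policy.

    theta: (n_actions, n_features) weight matrix
    s:     (n_features,) state feature vector
    a:     index of the action that was taken
    """
    probs = softmax(theta @ s)        # pi_theta(.|s)
    grad = -np.outer(probs, s)        # -pi(a'|s) * s for every action a'
    grad[a] += s                      # +s for the taken action
    return grad

def reinforce_update(theta, s, a, G, alpha=0.1):
    """theta <- theta + alpha * G * grad_theta log pi_theta(a|s)."""
    return theta + alpha * G * log_policy_grad(theta, s, a)

rng = np.random.default_rng(0)
theta = np.zeros((3, 4))              # 3 actions, 4 state features
s = rng.normal(size=4)
theta = reinforce_update(theta, s, a=1, G=2.0)   # positive return G
```

A positive return $G_t$ pushes the parameters so that the taken action becomes more probable in that state, which is exactly the "reinforce what worked" behavior the update encodes.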

Advantages vs Value-Based Methods

  1. Continuous Action Spaces: Can learn exact action probabilities or the parameters of a distribution (e.g., mean/std of a Gaussian), whereas computing $\max_a Q(s, a)$ is hard in a continuous action space.
  2. Stochastic Policies: Can learn the optimal stochastic policy (e.g., in Rock-Paper-Scissors or Aliased Gridworlds), while value-based methods are typically deterministic.
  3. Convergence: Often have better convergence properties, since small changes in the parameters lead to smooth changes in the policy, whereas value-based methods can change the greedy action discontinuously.
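Point 1 can be sketched as follows, assuming a Gaussian policy whose mean is linear in the state features (an illustrative parameterization, not one prescribed by this note):

```python
import numpy as np

def gaussian_policy_sample(theta, s, sigma, rng):
    """Sample a ~ N(theta . s, sigma^2) and return (a, grad_theta log pi(a|s)).

    theta: (n_features,) parameters of the mean
    s:     (n_features,) state feature vector
    sigma: fixed standard deviation (could also be learned)
    """
    mu = theta @ s
    a = rng.normal(mu, sigma)
    # grad_theta log N(a; mu, sigma^2) = (a - mu) / sigma^2 * s
    score = (a - mu) / sigma**2 * s
    return a, score

rng = np.random.default_rng(1)
theta = np.array([0.5, -0.2])
s = np.array([1.0, 2.0])
a, score = gaussian_policy_sample(theta, s, sigma=0.3, rng=rng)
```

The sampled action `a` is a real number, so no discretization or $\max_a$ search is needed; the score vector plugs directly into the update from the policy gradient theorem.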

Key Algorithms

  • REINFORCE: The basic Monte Carlo PG algorithm using the full return $G_t$.
  • Actor-Critic: Uses a “Critic” (a learned value function) to estimate $Q(s, a)$ or the advantage instead of the full return $G_t$, to reduce variance.
  • PPO (Proximal Policy Optimization): A modern standard that uses a clipped objective to prevent destructively large updates.
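PPO's clipped surrogate objective can be sketched directly; the probability ratios and advantage estimates below are made-up numbers for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: L = mean( min(r * A, clip(r, 1-eps, 1+eps) * A) ).

    ratio:     per-sample probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: per-sample advantage estimates
    eps:       clipping range (0.2 is the commonly used default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise min caps the incentive to move the ratio
    # far from 1, which prevents destructively large policy updates.
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])    # pi_new / pi_old per sample
adv = np.array([1.0, -1.0, 2.0])     # advantage estimates per sample
obj = ppo_clip_objective(ratio, adv)
```

Note how the third sample's contribution is capped at $1.2 \times 2.0 = 2.4$ rather than $1.5 \times 2.0 = 3.0$: once the ratio leaves the trust region $[1-\epsilon,\ 1+\epsilon]$, the gradient incentive to push it further vanishes.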

Connections

Appears In