Policy Gradient Methods
Definition
Policy Gradient (PG) methods are a family of reinforcement learning algorithms that directly optimize a parameterized policy by following the gradient of some performance measure with respect to the policy parameters $\theta$. Unlike value-based methods, they do not require a value function to select actions (though they may use one for variance reduction).
The Policy Gradient Theorem
For any differentiable policy $\pi_\theta(a \mid s)$, the gradient of the performance measure $J(\theta)$ (under certain conditions) is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

In practice, this leads to the stochastic gradient ascent update:

$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where:
- $\theta$ — policy parameters
- $G_t$ — the return (reward-to-go)
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ — the “score function” or eligibility vector
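The update above can be sketched end-to-end on a toy problem. Below is a minimal REINFORCE-style loop for a 2-armed bandit with a softmax policy over action preferences; the bandit, step count, and learning rate are illustrative choices, not from the source. For a softmax policy, the score function has the closed form $\partial \log \pi(a) / \partial \theta_i = \mathbf{1}[i{=}a] - \pi(i)$, which the code uses directly.

```python
import math
import random

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, G, alpha=0.1):
    """One update: theta <- theta + alpha * G * grad log pi(action)."""
    probs = softmax(theta)
    # Score function of a softmax policy: 1[i == action] - pi(i)
    grad_log = [(1.0 if i == action else 0.0) - probs[i]
                for i in range(len(theta))]
    return [t + alpha * G * g for t, g in zip(theta, grad_log)]

# Toy 2-armed bandit: arm 1 always pays 1, arm 0 pays nothing.
random.seed(0)
theta = [0.0, 0.0]
for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = 1.0 if a == 1 else 0.0   # return is just the immediate reward here
    theta = reinforce_step(theta, a, G)
```

After training, the policy should place most of its probability on the rewarding arm; note that actions with zero return produce no update at all, which is one source of REINFORCE's high variance.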
Advantages vs Value-Based Methods
- Continuous Action Spaces: Can output action probabilities or the parameters of a distribution (e.g., mean/std of a Gaussian) directly, whereas solving $\max_a Q(s, a)$ at every step is hard in a continuous action space.
- Stochastic Policies: Can learn the optimal stochastic policy (e.g., in Rock-Paper-Scissors or Aliased Gridworlds), while value-based methods are typically deterministic.
- Convergence: Often have stronger convergence guarantees as changes in parameters lead to smooth changes in the policy.
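The continuous-action case above can be made concrete. As a sketch (the linear parameterization is an illustrative assumption, not from the source), a Gaussian policy outputs a mean and log-std, samples an action, and exposes the closed-form score-function terms a PG update would use:

```python
import math
import random

def gaussian_policy_sample(state, w_mean, w_log_std):
    """Linear-Gaussian policy: mean and log-std are linear in the state
    features. Returns the sampled action plus the grad-log-pi terms."""
    mean = sum(w * s for w, s in zip(w_mean, state))
    log_std = sum(w * s for w, s in zip(w_log_std, state))
    std = math.exp(log_std)
    action = random.gauss(mean, std)
    # Score function of a Gaussian:
    #   d log pi / d mean    = (a - mean) / std^2
    #   d log pi / d log_std = ((a - mean)^2 / std^2) - 1
    d_mean = (action - mean) / std ** 2
    d_log_std = ((action - mean) ** 2) / std ** 2 - 1.0
    return action, d_mean, d_log_std
```

Chaining these derivatives through the linear maps gives the gradient with respect to `w_mean` and `w_log_std`; no argmax over actions is ever needed.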
Key Algorithms
- REINFORCE: The basic Monte Carlo PG algorithm using the full return $G_t$.
- Actor-Critic: Uses a “Critic” (a learned value function) to estimate $Q(s, a)$ or the advantage $A(s, a)$ instead of the full Monte Carlo return, reducing variance.
- PPO (Proximal Policy Optimization): A modern standard that uses a clipped objective to prevent destructively large updates.
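PPO's clipping mechanism is small enough to show in full. A minimal per-sample version of the clipped surrogate (with the common default $\epsilon = 0.2$; `ratio` is the new-to-old policy probability ratio):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single sample. Taking the min of the
    unclipped and clipped terms means large probability ratios cannot
    produce destructively large (over-optimistic) updates."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Note the asymmetry: a favorable ratio is capped at $1 + \epsilon$, but an unfavorable one with negative advantage is not rescued by clipping, since the `min` always keeps the pessimistic term.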
Connections
- Contrast with: Q-Learning, SARSA (Value-based methods)
- Component of: Actor-Critic Methods
- Uses: Neural Networks (usually as the policy function)
- Strategy: Stochastic Gradient Ascent
Appears In
- Future Week 5 Lecture
- RL-Book Ch13 - Policy Gradient Methods