Policy Gradient Theorem

Definition

The Policy Gradient Theorem is a fundamental result in reinforcement learning that expresses the gradient of the expected return with respect to the policy parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\right]$$

This equation is the foundation for all policy gradient methods. It says: to increase expected return, increase the log-probability of actions in high-return trajectories.

Intuition

Why This Makes Sense

Imagine sampling episodes (trajectories) from your current policy:

  • Episodes with high total reward should be reinforced
  • Episodes with low total reward should be deprioritized
  • The log-probability gradient acts as a “handle” to adjust the policy

The algorithm works by:

  1. Sample an episode and measure its return $R(\tau)$
  2. Compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each action
  3. Update: $\theta \leftarrow \theta + \alpha \, R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Result: Good actions become more likely, bad actions less likely.
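The three steps above can be sketched on a tiny softmax-policy bandit. Everything here is illustrative (the toy reward means, function names, and hyperparameters are assumptions, not part of the theorem); each "episode" is a single action, so $R(\tau)$ is just that action's reward:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, a):
    # For a softmax policy, grad_theta log pi_theta(a) = e_a - pi_theta
    g = -softmax(theta)
    g[a] += 1.0
    return g

def reinforce_bandit(true_means, steps=3000, alpha=0.05, seed=0):
    # Toy setup: one action per "episode", deterministic reward per arm.
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(true_means))
    for _ in range(steps):
        a = rng.choice(len(theta), p=softmax(theta))     # 1. sample, observe return
        R = true_means[a]
        theta += alpha * R * grad_log_softmax(theta, a)  # 2.-3. ascend the gradient
    return softmax(theta)

probs = reinforce_bandit(np.array([1.0, 2.0, 3.0]))
print(probs.argmax())  # the highest-reward arm should dominate
```

Note that since all rewards here are positive, every sampled action is reinforced; the better arms win only because they are reinforced more strongly, which is exactly why baselines (discussed later) matter in practice.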

The Log-Derivative Trick

The key technical insight is the log-derivative trick:

$$\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$$

This allows us to move the gradient inside an expectation with respect to a distribution that depends on the parameters:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \, \nabla_\theta \log p_\theta(x)\right]$$

This is why we work with log-probabilities: they make the gradient tractable.
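The identity can be sanity-checked numerically. The sketch below uses a made-up three-outcome categorical distribution with softmax parameters (all values are illustrative) and compares the score-function form of $\nabla_\theta \mathbb{E}[f(x)]$ against a finite-difference gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: x ~ Categorical(softmax(theta)), f(x) a fixed table.
theta = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 4.0, 2.0])
p = softmax(theta)

# Score-function form: sum_x p(x) f(x) grad_theta log p(x),
# where grad log p(x) w.r.t. theta_k is 1[x = k] - p(k) for a softmax.
score_grad = np.array([sum(p[x] * f[x] * ((x == k) - p[k]) for x in range(3))
                       for k in range(3)])

# Finite-difference check of grad_theta E[f(x)] = grad_theta (p @ f):
eps = 1e-6
fd_grad = np.zeros(3)
for k in range(3):
    tp, tm = theta.copy(), theta.copy()
    tp[k] += eps
    tm[k] -= eps
    fd_grad[k] = (softmax(tp) @ f - softmax(tm) @ f) / (2 * eps)

print(np.allclose(score_grad, fd_grad, atol=1e-5))
```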

Mathematical Derivation

Starting Point

For an episodic task with trajectories $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)$, the objective is:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \int p_\theta(\tau) \, R(\tau) \, d\tau$$

Applying the Log-Derivative Trick

Using the log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$.

Therefore:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau) \, R(\tau)\right]$$

Factoring the Trajectory Probability

The trajectory probability factors as:

$$p_\theta(\tau) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$

The log:

$$\log p_\theta(\tau) = \log \rho(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

Gradient w.r.t. $\theta$ (only the policy term depends on $\theta$):

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Final Result

Substituting back:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\right]$$
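As a sanity check, the full theorem can be verified on a toy problem. The sketch below uses a made-up single-state, two-step task (two actions, reward $r(a)$ per action, $R(\tau) = r(a_0) + r(a_1)$; all values are assumptions for illustration). It computes the theorem's expectation exactly by enumerating all four trajectories and compares it to a finite-difference gradient of $J(\theta)$:

```python
import numpy as np

r = np.array([1.0, 3.0])  # toy per-action rewards

def pi(theta):
    # Softmax policy over 2 actions
    e = np.exp(theta - theta.max())
    return e / e.sum()

def J(theta):
    # Exact expected return: two i.i.d. action draws, so J = 2 * E[r(a)]
    return 2 * (pi(theta) @ r)

def grad_theorem(theta):
    # E[(sum_t grad log pi(a_t)) * R(tau)], enumerated over all trajectories
    p = pi(theta)
    g = np.zeros(2)
    for a0 in range(2):
        for a1 in range(2):
            score = np.zeros(2)
            for a in (a0, a1):
                score -= p          # grad log softmax: e_a - p
                score[a] += 1.0
            g += p[a0] * p[a1] * score * (r[a0] + r[a1])
    return g

theta = np.array([0.3, -0.2])
eps = 1e-6
fd = np.array([(J(theta + eps * np.eye(2)[k]) - J(theta - eps * np.eye(2)[k])) / (2 * eps)
               for k in range(2)])
print(np.allclose(grad_theorem(theta), fd, atol=1e-5))
```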

Key Properties

Unbiasedness

The estimator is unbiased: sampling one trajectory gives an unbiased estimate of the true gradient.

Consistency

With enough samples, the empirical average converges to the true gradient (by law of large numbers).

On-Policy

The theorem requires samples from the current policy $\pi_\theta$. Using samples from a different policy (off-policy) requires an importance sampling correction.
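A minimal sketch of that correction, on a hypothetical one-step problem (the policies and returns below are invented for illustration): returns sampled under a behavior policy $b$ are reweighted by the ratio $\pi_\theta(a) / b(a)$ so their average estimates the on-policy expectation:

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.7, 0.2, 0.1])   # target policy pi_theta(a)
b  = np.array([1/3, 1/3, 1/3])   # behavior policy the data came from
R  = np.array([1.0, 5.0, 2.0])   # return observed for each action

actions = rng.choice(3, size=200_000, p=b)
weights = pi[actions] / b[actions]            # importance ratios
is_estimate = np.mean(weights * R[actions])   # off-policy estimate of E_pi[R]
exact = pi @ R                                # the true on-policy expectation
print(abs(is_estimate - exact) < 0.05)
```

The same ratio appears (per trajectory, as a product over steps) when correcting the full policy-gradient estimator; its variance grows quickly as the two policies diverge.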

Direct Dependency on Dynamics is Not Needed

Crucially, the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ drop out of the gradient, because their log-probability terms do not depend on $\theta$. We don't need to know or learn the environment dynamics!

Variations and Extensions

Causality-Aware Version

We can reduce variance by noting that the action $a_t$ only affects rewards from time $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]$$

where $G_t = \sum_{t'=t}^{T-1} r_{t'+1}$ is the reward-to-go.
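The reward-to-go $G_t$ is typically computed with a single backward pass over the episode's rewards; a minimal sketch (the function name and the optional discount parameter are illustrative):

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', via a backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(rewards_to_go([1.0, 2.0, 3.0]))  # [6. 5. 3.]
```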

With Baseline

Subtracting any action-independent baseline $b(s_t)$ preserves unbiasedness, since $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right)\right]$$
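A quick numerical illustration of both effects, unchanged mean and reduced variance, on a hypothetical one-step problem (the policy, returns, and baseline choice below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

pi = np.array([0.5, 0.3, 0.2])   # a fixed softmax policy's probabilities
R  = np.array([10.0, 12.0, 11.0])

def score(a, k):
    # d log pi(a) / d theta_k for a softmax policy: 1[a = k] - pi(k)
    return (a == k) - pi[k]

def estimates(baseline, n=400_000):
    # Per-sample gradient estimates for the first parameter, theta_0
    a = rng.choice(3, size=n, p=pi)
    return score(a, 0) * (R[a] - baseline)

g_nob = estimates(0.0)      # no baseline
g_b   = estimates(pi @ R)   # baseline = expected return

print(abs(g_nob.mean() - g_b.mean()) < 0.05)  # same mean (unbiased)
print(g_b.var() < g_nob.var())                # much lower variance
```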

Continuous Time / Continuing Tasks

The theorem extends to continuing MDPs with discounted returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]$$

where $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_{k+1}$ is the discounted reward-to-go.

State Value Version

The gradient can also be expressed in terms of states and actions rather than whole trajectories:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right]$$

where $d^{\pi_\theta}$ is the (discounted) state visitation distribution and $Q^{\pi_\theta}$ is the action-value function.

Practical Implications

Algorithm Design

The theorem motivates:

  1. REINFORCE: Use the Monte Carlo sample of the return directly
  2. Actor-Critic: Use a learned $Q(s, a)$ or advantage estimate instead of the full return $G_t$
  3. PPO: Efficient trust-region variant of policy gradients
  4. A2C: Parallel actor-critic with a baseline

Gradient Variance

The theorem shows variance comes from:

  • Return sampling: Monte Carlo returns have high variance
  • Policy stochasticity: Exploration adds noise

Variance reduction techniques:

  • Baselines: Subtract expected return
  • Advantage estimates: Use $A(s, a) = Q(s, a) - V(s)$ instead of raw returns
  • Function approximation: Smooth out noisy returns
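The advantage technique from the list above often includes a standardization step in practice (a common implementation trick, not part of the theorem; the function name and inputs here are illustrative):

```python
import numpy as np

def advantages(returns, values, normalize=True):
    """Advantage estimates A_t = G_t - V(s_t), optionally standardized
    across the batch, a standard extra variance-reduction step."""
    adv = np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)
    if normalize:
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv

print(advantages([6.0, 5.0, 3.0], [5.5, 4.0, 3.5]).round(2))
```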

Connections

Appears In