RL-Book Ch13 - Policy Gradient Methods

Overview

In this chapter, we transition from Action-Value Methods (which learn values and then derive a policy) to Policy Gradient Methods, which learn a parameterized Policy that can select actions without consulting a value function. While a value function may still be used to learn the policy parameters, it is not required for action selection.

Policy Gradient Methods

Methods that learn a parameterized policy $\pi(a \mid s, \theta)$ and update the parameters by approximate gradient ascent in a scalar performance measure $J(\theta)$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$$

13.1 Policy Approximation and its Advantages

A common parameterization for discrete action spaces is the soft-max in action preferences:

$$\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$

where $h(s,a,\theta)$ are numerical preferences (e.g., linear in features: $h(s,a,\theta) = \theta^\top \mathbf{x}(s,a)$).
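As a minimal sketch (NumPy; the per-state feature-matrix layout is my assumption, not from the book), the soft-max policy over linear preferences can be computed as:

```python
import numpy as np

def softmax_policy(theta, x):
    """Soft-max over linear action preferences h(s, a, theta) = theta . x(s, a).

    x: hypothetical (num_actions, d) matrix holding the feature vector of
    each action in the current state; returns pi(. | s, theta).
    """
    h = x @ theta                # preferences, one per action
    h = h - h.max()              # shift by max for numerical stability
    e = np.exp(h)
    return e / e.sum()           # probabilities sum to 1
```

Subtracting the maximum preference before exponentiating leaves the probabilities unchanged (soft-max is shift-invariant) while avoiding overflow.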

Advantages over Action-Value Methods

  1. Convergence to Deterministic Policies: Action-value methods with $\varepsilon$-greedy action selection always explore. Policy gradient methods can drive the preferences of optimal actions arbitrarily higher, approaching a deterministic policy.
  2. Stochastic Optimal Policies: In many problems (e.g., imperfect information games like Poker, or the Short Corridor with Switched Actions), the optimal policy is stochastic. Policy gradient methods can learn these specific probabilities naturally.
  3. Simpler Approximation: The policy may be a simpler function to approximate than the value function.
  4. Injection of Prior Knowledge: Parameterization allows specific domain knowledge about the policy’s form to be encoded.
  5. Continuous Action Spaces: Naturally handles infinite action sets by learning distribution statistics (e.g., mean and variance).

13.2 The Policy Gradient Theorem

The challenge in policy gradients is that performance depends on both action selections and the state distribution, the latter of which is often unknown and affected by the policy in complex ways.

The Policy Gradient Theorem (Episodic)

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a \mid s, \theta)$$

where $\mu$ is the on-policy distribution under $\pi$. Crucially, the gradient does not involve the derivative of the state distribution.

Proof Sketch (Episodic Case)

  1. Start with the gradient of the state-value function: $\nabla v_\pi(s) = \nabla \left[ \sum_a \pi(a \mid s)\, q_\pi(s,a) \right]$.
  2. Apply the product rule: $\nabla v_\pi(s) = \sum_a \left[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s)\, \nabla q_\pi(s,a) \right]$.
  3. Expand $\nabla q_\pi(s,a)$ using the Bellman equation: $\nabla q_\pi(s,a) = \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')$.
  4. Unroll the recurrence: $\nabla v_\pi(s) = \sum_x \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x,a)$.
  5. After repeated unrolling, we see the gradient is the sum over all states reachable from the start state, weighted by the probability of being in that state at any time step.

13.3 REINFORCE: Monte Carlo Policy Gradient

REINFORCE approximates the gradient using a single sample at time $t$: replace the sum over states by the expectation under the on-policy distribution, replace the sum over actions by the sampled action $A_t$ (multiplying and dividing by $\pi(A_t \mid S_t, \theta)$), and note that $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$. This expresses the gradient as an expectation:

$$\nabla J(\theta) \propto \mathbb{E}_\pi\!\left[ G_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \right]$$

The Algorithm

The update rule is:

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$

Eligibility Vector

The vector $\nabla \ln \pi(A_t \mid S_t, \theta)$ is the direction in parameter space that most increases the probability of repeating action $A_t$ on future visits to state $S_t$. The update scales this direction by the return $G_t$.
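For the soft-max parameterization, the eligibility vector has a closed form: the features of the chosen action minus the probability-weighted average of all action features. A small sketch (NumPy; the feature-matrix layout is my assumption), verified below against a numerical gradient:

```python
import numpy as np

def grad_ln_pi(theta, x, a):
    """Eligibility vector for a soft-max policy with linear preferences:
    grad ln pi(a | s, theta) = x(s, a) - sum_b pi(b | s, theta) x(s, b).

    x: hypothetical (num_actions, d) matrix of action feature vectors.
    """
    h = x @ theta
    p = np.exp(h - h.max())
    p = p / p.sum()              # action probabilities pi(. | s, theta)
    return x[a] - p @ x          # chosen features minus expected features
```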

Pseudocode (REINFORCE):

Initialize policy parameter theta
Loop forever (for each episode):
    Generate an episode S0, A0, R1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
    Loop for each step t = 0, 1, ..., T-1:
        G = sum_{k=t+1}^{T} gamma^(k-t-1) * R_k
        theta = theta + alpha * gamma^t * G * grad_ln_pi(At | St, theta)
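To make the loop concrete, here is a toy sketch (my own example, not from the book): REINFORCE with soft-max preferences on a two-armed bandit, where each episode is a single step, so $G_t$ is just the sampled reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # one preference per action
alpha = 0.1
means = np.array([0.0, 1.0])          # arm 1 is better in expectation

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    G = means[a] + rng.normal(0, 0.1)  # return = single sampled reward
    grad = -p                          # grad ln pi(a) = one-hot(a) - p
    grad[a] += 1.0
    theta += alpha * G * grad          # REINFORCE update
# probability mass concentrates on the better arm as training proceeds
```

Because arm 1's rewards are consistently larger, its preference grows faster, and the policy approaches a deterministic choice of arm 1.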

13.4 REINFORCE with Baseline

To reduce the high variance of Monte Carlo methods, we subtract a baseline $b(S_t)$ from the return:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left( q_\pi(s,a) - b(s) \right) \nabla \pi(a \mid s, \theta)$$

$b(s)$ can be any function (even a random variable) as long as it does not vary with the action $a$.

Tip

The most natural baseline is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$.

Update Rule:

$$\theta_{t+1} = \theta_t + \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$$


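A single baselined update step can be sketched as follows (all names here are hypothetical helpers of mine; `grad_ln_pi_vec` and `grad_v_vec` stand for the two gradient vectors already computed at step $t$):

```python
import numpy as np

def reinforce_baseline_step(theta, w, G, v_s, grad_ln_pi_vec, grad_v_vec,
                            alpha_theta, alpha_w):
    """One REINFORCE-with-baseline update (hypothetical helper).

    G: Monte Carlo return G_t; v_s: baseline value v_hat(S_t, w).
    """
    delta = G - v_s                               # baselined return
    w = w + alpha_w * delta * grad_v_vec          # move baseline toward G_t
    theta = theta + alpha_theta * delta * grad_ln_pi_vec
    return theta, w
```

Note that the same scalar `delta` drives both updates: the policy step is scaled by how much better the return was than expected, and the value step shrinks that surprise over time.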
13.5 Actor-Critic Methods

While REINFORCE with baseline uses $\hat{v}(S_t, \mathbf{w})$ to reduce variance, it still uses the full return $G_t$, which requires waiting until the end of the episode. Actor-Critic methods instead bootstrap via the one-step TD error $\delta_t$.

Actor and Critic

  • Actor: The learned policy $\pi(a \mid s, \theta)$.
  • Critic: The learned state-value function $\hat{v}(s, \mathbf{w})$.

The TD error assesses the action:

$$\delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$

Pseudocode (One-step Actor-Critic):

Initialize theta, w
Loop forever (for each episode):
    Initialize S, I = 1
    Loop while S is not terminal:
        A ~ pi(.|S, theta)
        Take action A, observe S', R
        delta = R + gamma * v_hat(S', w) - v_hat(S, w)    (v_hat(terminal, w) = 0)
        w = w + alpha_w * delta * grad_v_hat(S, w)
        theta = theta + alpha_theta * I * delta * grad_ln_pi(A|S, theta)
        I = gamma * I
        S = S'
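The pseudocode above, run on a toy three-state corridor of my own (states 0-2, reward -1 per step, episode ends on stepping right from state 2), with tabular actor and critic:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 1.0
theta = np.zeros((n_states, n_actions))   # actor: soft-max preferences
w = np.zeros(n_states)                    # critic: tabular v_hat(s, w)
alpha_w, alpha_theta = 0.2, 0.1

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

for episode in range(500):
    s, I = 0, 1.0
    while True:
        a = rng.choice(n_actions, p=pi(s))
        s_next = s + 1 if a == 1 else max(s - 1, 0)  # 1 = right, 0 = left
        done = s_next == 3                            # stepped off the end
        r = -1.0
        v_next = 0.0 if done else w[s_next]           # v_hat(terminal) = 0
        delta = r + gamma * v_next - w[s]             # TD error
        w[s] += alpha_w * delta                       # critic (tabular grad = 1)
        grad = -pi(s)                                 # grad ln pi for soft-max
        grad[a] += 1.0
        theta[s] += alpha_theta * I * delta * grad    # actor update
        I *= gamma
        if done:
            break
        s = s_next
# the policy should come to prefer "right", and v_hat(0) should approach -3
```

Going right from every state is optimal here (three steps of -1 from state 0), so the critic's estimate of the start state should settle near -3.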

13.7 Policy Parameterization for Continuous Actions

For continuous action spaces, we learn the statistics of a probability distribution, typically a Gaussian Policy.

The mean and standard deviation are parameterized as functions of the state:

$$\mu(s, \theta) = \theta_\mu^\top \mathbf{x}_\mu(s), \qquad \sigma(s, \theta) = \exp\!\left( \theta_\sigma^\top \mathbf{x}_\sigma(s) \right)$$

(the exponential keeps $\sigma$ positive).

The policy is defined by the Gaussian density:

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu(s,\theta))^2}{2\sigma(s,\theta)^2} \right)$$

The eligibility vectors are:

$$\nabla \ln \pi(a \mid s, \theta_\mu) = \frac{a - \mu(s,\theta)}{\sigma(s,\theta)^2}\, \mathbf{x}_\mu(s), \qquad \nabla \ln \pi(a \mid s, \theta_\sigma) = \left( \frac{(a - \mu(s,\theta))^2}{\sigma(s,\theta)^2} - 1 \right) \mathbf{x}_\sigma(s)$$

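A sketch of the Gaussian eligibility vectors in NumPy (`x_mu` and `x_sigma` are assumed state-feature vectors matching the linear parameterization; the numerical check below confirms the closed forms):

```python
import numpy as np

def gaussian_eligibility(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Eligibility vectors of a Gaussian policy with linear mean and
    log-linear standard deviation (sketch; 1-D feature vectors assumed)."""
    mu = theta_mu @ x_mu
    sigma = np.exp(theta_sigma @ x_sigma)                # exp keeps sigma > 0
    g_mu = (a - mu) / sigma**2 * x_mu                    # grad ln pi wrt theta_mu
    g_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_sigma   # grad ln pi wrt theta_sigma
    return g_mu, g_sigma
```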

Summary

  • Policy Gradient Methods learn $\pi(a \mid s, \theta)$ directly via stochastic gradient ascent on a performance measure $J(\theta)$.
  • They handle continuous actions and stochastic optimal policies better than action-value methods.
  • The Policy Gradient Theorem provides the theoretical foundation, removing dependence on the state distribution gradient.
  • REINFORCE is the Monte Carlo version; Actor-Critic adds bootstrapping to reduce variance at the cost of some bias.