RL-Book Ch13 - Policy Gradient Methods

Overview

In this chapter, we transition from Action-Value Methods (which learn values and then derive a policy) to Policy Gradient Methods, which learn a parameterized Policy that can select actions without consulting a value function. While a value function may still be used to learn the policy parameters, it is not required for action selection.

Policy Gradient Methods

Methods that learn a parameterized policy $\pi(a \mid s, \theta)$ and update the parameters by approximate gradient ascent in a scalar performance measure $J(\theta)$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$$

13.1 Policy Approximation and its Advantages

A common parameterization for discrete action spaces is the soft-max in action preferences:

$$\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$

where $h(s,a,\theta)$ are numerical preferences (e.g., linear in features: $h(s,a,\theta) = \theta^\top \mathbf{x}(s,a)$).
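As a minimal sketch (NumPy; the per-state feature-matrix layout is my assumption, not from the book), the soft-max policy over linear preferences can be computed as:

```python
import numpy as np

def softmax_policy(theta, x):
    """Soft-max over linear action preferences h(s, a, theta) = theta . x(s, a).

    x: hypothetical (num_actions, d) matrix holding the feature vector of
    each action in the current state; returns pi(. | s, theta).
    """
    h = x @ theta                # preferences, one per action
    h = h - h.max()              # shift by max for numerical stability
    e = np.exp(h)
    return e / e.sum()           # probabilities sum to 1
```

Subtracting the maximum preference before exponentiating leaves the probabilities unchanged (soft-max is shift-invariant) while avoiding overflow.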

Advantages over Action-Value Methods

  1. Convergence to Deterministic Policies: Action-value methods with $\varepsilon$-greedy action selection always explore. Policy gradient methods can drive the preferences of optimal actions arbitrarily higher, approaching a deterministic policy.
  2. Stochastic Optimal Policies: In many problems (e.g., imperfect information games like Poker, or the Short Corridor with Switched Actions), the optimal policy is stochastic. Policy gradient methods can learn these specific probabilities naturally.
  3. Simpler Approximation: The policy may be a simpler function to approximate than the value function.
  4. Injection of Prior Knowledge: Parameterization allows specific domain knowledge about the policy’s form to be encoded.
  5. Continuous Action Spaces: Naturally handles infinite action sets by learning distribution statistics (e.g., mean and variance).

13.2 The Policy Gradient Theorem

The challenge in policy gradients is that performance depends on both action selections and the state distribution, the latter of which is often unknown and affected by the policy in complex ways.

The Policy Gradient Theorem (Episodic)

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a \mid s, \theta)$$

where $\mu$ is the on-policy distribution under $\pi$. Crucially, the gradient does not involve the derivative of the state distribution.

Proof Sketch (Episodic Case)

  1. Start with the gradient of the state-value function: $\nabla v_\pi(s) = \nabla \left[ \sum_a \pi(a \mid s)\, q_\pi(s,a) \right]$.
  2. Apply the product rule: $\nabla v_\pi(s) = \sum_a \left[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s)\, \nabla q_\pi(s,a) \right]$.
  3. Expand $\nabla q_\pi(s,a)$ using the Bellman equation: $\nabla q_\pi(s,a) = \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')$.
  4. Unroll the recurrence: $\nabla v_\pi(s) = \sum_x \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x,a)$.
  5. After repeated unrolling, we see the gradient is the sum over all states reachable from the start state, weighted by the probability of being in that state at any time step.

13.3 REINFORCE: Monte Carlo Policy Gradient

REINFORCE approximates the gradient using a single sample at time $t$: replace the sum over states by the expectation under the on-policy distribution, replace the sum over actions by the sampled action $A_t$ (multiplying and dividing by $\pi(A_t \mid S_t, \theta)$), and note that $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$. This expresses the gradient as an expectation:

$$\nabla J(\theta) \propto \mathbb{E}_\pi\!\left[ G_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \right]$$

The Algorithm

The update rule is:

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$

Eligibility Vector

The vector $\nabla \ln \pi(A_t \mid S_t, \theta)$ is the direction in parameter space that most increases the probability of repeating action $A_t$ on future visits to state $S_t$. The update scales this direction by the return $G_t$.
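For the soft-max parameterization, the eligibility vector has a closed form: the features of the chosen action minus the probability-weighted average of all action features. A small sketch (NumPy; the feature-matrix layout is my assumption), verified below against a numerical gradient:

```python
import numpy as np

def grad_ln_pi(theta, x, a):
    """Eligibility vector for a soft-max policy with linear preferences:
    grad ln pi(a | s, theta) = x(s, a) - sum_b pi(b | s, theta) x(s, b).

    x: hypothetical (num_actions, d) matrix of action feature vectors.
    """
    h = x @ theta
    p = np.exp(h - h.max())
    p = p / p.sum()              # action probabilities pi(. | s, theta)
    return x[a] - p @ x          # chosen features minus expected features
```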

Pseudocode (REINFORCE):

Initialize policy parameter theta
Loop forever (for each episode):
    Generate an episode S0, A0, R1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
    Loop for each step t = 0, 1, ..., T-1:
        G = sum_{k=t+1}^{T} gamma^(k-t-1) * R_k
        theta = theta + alpha * gamma^t * G * grad_ln_pi(At | St, theta)
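To make the loop concrete, here is a toy sketch (my own example, not from the book): REINFORCE with soft-max preferences on a two-armed bandit, where each episode is a single step, so $G_t$ is just the sampled reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # one preference per action
alpha = 0.1
means = np.array([0.0, 1.0])          # arm 1 is better in expectation

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    G = means[a] + rng.normal(0, 0.1)  # return = single sampled reward
    grad = -p                          # grad ln pi(a) = one-hot(a) - p
    grad[a] += 1.0
    theta += alpha * G * grad          # REINFORCE update
# probability mass concentrates on the better arm as training proceeds
```

Because arm 1's rewards are consistently larger, its preference grows faster, and the policy approaches a deterministic choice of arm 1.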

13.4 REINFORCE with Baseline

To reduce the high variance of Monte Carlo methods, we subtract a baseline $b(S_t)$ from the return:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left( q_\pi(s,a) - b(s) \right) \nabla \pi(a \mid s, \theta)$$

$b(s)$ can be any function (even a random variable) as long as it does not vary with the action $a$.

Tip

The most natural baseline is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$.

Update Rule:

$$\theta_{t+1} = \theta_t + \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$$


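A single baselined update step can be sketched as follows (all names here are hypothetical helpers of mine; `grad_ln_pi_vec` and `grad_v_vec` stand for the two gradient vectors already computed at step $t$):

```python
import numpy as np

def reinforce_baseline_step(theta, w, G, v_s, grad_ln_pi_vec, grad_v_vec,
                            alpha_theta, alpha_w):
    """One REINFORCE-with-baseline update (hypothetical helper).

    G: Monte Carlo return G_t; v_s: baseline value v_hat(S_t, w).
    """
    delta = G - v_s                               # baselined return
    w = w + alpha_w * delta * grad_v_vec          # move baseline toward G_t
    theta = theta + alpha_theta * delta * grad_ln_pi_vec
    return theta, w
```

Note that the same scalar `delta` drives both updates: the policy step is scaled by how much better the return was than expected, and the value step shrinks that surprise over time.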
13.5 Actor-Critic Methods

While REINFORCE with baseline uses $\hat{v}(S_t, \mathbf{w})$ to reduce variance, it still uses the full return $G_t$, which requires waiting until the end of the episode. Actor-Critic methods instead bootstrap via the one-step TD error $\delta_t$.

Actor and Critic

  • Actor: The learned policy $\pi(a \mid s, \theta)$.
  • Critic: The learned state-value function $\hat{v}(s, \mathbf{w})$.

The TD error assesses the action:

$$\delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$

Pseudocode (One-step Actor-Critic):

Initialize theta, w
Loop forever (for each episode):
    Initialize S, I = 1
    Loop while S is not terminal:
        A ~ pi(.|S, theta)
        Take action A, observe S', R
        delta = R + gamma * v_hat(S', w) - v_hat(S, w)    (v_hat(terminal, w) = 0)
        w = w + alpha_w * delta * grad_v_hat(S, w)
        theta = theta + alpha_theta * I * delta * grad_ln_pi(A|S, theta)
        I = gamma * I
        S = S'
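The pseudocode above, run on a toy three-state corridor of my own (states 0-2, reward -1 per step, episode ends on stepping right from state 2), with tabular actor and critic:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 1.0
theta = np.zeros((n_states, n_actions))   # actor: soft-max preferences
w = np.zeros(n_states)                    # critic: tabular v_hat(s, w)
alpha_w, alpha_theta = 0.2, 0.1

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

for episode in range(500):
    s, I = 0, 1.0
    while True:
        a = rng.choice(n_actions, p=pi(s))
        s_next = s + 1 if a == 1 else max(s - 1, 0)  # 1 = right, 0 = left
        done = s_next == 3                            # stepped off the end
        r = -1.0
        v_next = 0.0 if done else w[s_next]           # v_hat(terminal) = 0
        delta = r + gamma * v_next - w[s]             # TD error
        w[s] += alpha_w * delta                       # critic (tabular grad = 1)
        grad = -pi(s)                                 # grad ln pi for soft-max
        grad[a] += 1.0
        theta[s] += alpha_theta * I * delta * grad    # actor update
        I *= gamma
        if done:
            break
        s = s_next
# the policy should come to prefer "right", and v_hat(0) should approach -3
```

Going right from every state is optimal here (three steps of -1 from state 0), so the critic's estimate of the start state should settle near -3.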

13.7 Policy Parameterization for Continuous Actions

For continuous action spaces, we learn the statistics of a probability distribution, typically a Gaussian Policy.

The mean and standard deviation are parameterized as functions of the state:

$$\mu(s, \theta) = \theta_\mu^\top \mathbf{x}_\mu(s), \qquad \sigma(s, \theta) = \exp\!\left( \theta_\sigma^\top \mathbf{x}_\sigma(s) \right)$$

(the exponential keeps $\sigma$ positive).

The policy is defined by the Gaussian density:

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu(s,\theta))^2}{2\sigma(s,\theta)^2} \right)$$

The eligibility vectors are:

$$\nabla \ln \pi(a \mid s, \theta_\mu) = \frac{a - \mu(s,\theta)}{\sigma(s,\theta)^2}\, \mathbf{x}_\mu(s), \qquad \nabla \ln \pi(a \mid s, \theta_\sigma) = \left( \frac{(a - \mu(s,\theta))^2}{\sigma(s,\theta)^2} - 1 \right) \mathbf{x}_\sigma(s)$$

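A sketch of the Gaussian eligibility vectors in NumPy (`x_mu` and `x_sigma` are assumed state-feature vectors matching the linear parameterization; the numerical check below confirms the closed forms):

```python
import numpy as np

def gaussian_eligibility(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Eligibility vectors of a Gaussian policy with linear mean and
    log-linear standard deviation (sketch; 1-D feature vectors assumed)."""
    mu = theta_mu @ x_mu
    sigma = np.exp(theta_sigma @ x_sigma)                # exp keeps sigma > 0
    g_mu = (a - mu) / sigma**2 * x_mu                    # grad ln pi wrt theta_mu
    g_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_sigma   # grad ln pi wrt theta_sigma
    return g_mu, g_sigma
```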

Summary

  • Policy Gradient Methods learn $\pi(a \mid s, \theta)$ directly via stochastic gradient ascent on a performance measure $J(\theta)$.
  • They handle continuous actions and stochastic optimal policies better than action-value methods.
  • The Policy Gradient Theorem provides the theoretical foundation, removing dependence on the state distribution gradient.
  • REINFORCE is the Monte Carlo version; Actor-Critic adds bootstrapping to reduce variance at the cost of some bias.