RL-L09: Policy Gradient Methods
Overview
This lecture introduces policy-based methods, which directly optimize the parameters of a policy function rather than learning a value function. While value-based methods learn $Q(s, a)$ or $V(s)$ and derive a deterministic policy from them, policy-based methods explicitly parameterize the policy $\pi_\theta(a \mid s)$ and optimize it using gradient ascent on the expected return.
Policy-based methods address key limitations of action-value methods:
- Handle continuous action spaces naturally (no argmax required)
- Learn stochastic policies (useful for partial observability and exploration)
- Provide policy smoothness guarantees through step size control
- Allow incorporation of prior knowledge via policy structure
The core insight is the policy gradient theorem: we can compute an unbiased gradient of expected return w.r.t. policy parameters using only samples of trajectories.
Why Policy-Based Methods?
Limitations of Action-Value Methods
Key Problems with Value-Based Approaches
- Continuous actions: Can't efficiently compute $\arg\max_a Q(s, a)$ in continuous action spaces
- Policy instability: Small changes in $Q$-values can cause large changes in the greedy policy
- Stochastic policies: Impossible to learn stochastic optimal policies (e.g., mixed strategies, handling aliased states)
- Exploration: $\epsilon$-greedy exploration is crude; can't learn an optimal exploration strategy
Example of aliased states: If two different states look identical to the agent due to function approximation, the greedy policy might pick the same action for both. A stochastic policy choosing each action with 50% probability could be optimal.
Policy Representation
Stochastic Policies
Instead of learning a value function, we directly parameterize the policy as:
$$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s, \theta)$$
Requirements:
- Differentiability: $\pi_\theta(a \mid s)$ must be differentiable w.r.t. $\theta$ (to compute gradients)
- Stochasticity: Outputs a valid probability distribution over actions
Softmax Policy (Discrete Actions)
For discrete action spaces, use a softmax over action preferences $h_\theta(s, a)$:
$$\pi_\theta(a \mid s) = \frac{e^{h_\theta(s, a)}}{\sum_{b} e^{h_\theta(s, b)}}$$
where $h_\theta(s, a)$ can be linear, a neural network, or any differentiable function.
Intuition
The softmax policy acts like a “soft” argmax: preferences with higher values get higher probability, but all actions retain some probability. The temperature-like behavior makes exploration automatic.
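As an illustration, a minimal softmax policy can be sketched in plain Python (the preference values below are arbitrary placeholders, not from the lecture):

```python
import math

def softmax_policy(prefs):
    """Turn action preferences h(s, a) into action probabilities.

    Subtracting the max preference keeps exp() numerically stable
    without changing the resulting distribution."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Higher preference -> higher probability, but every action keeps some mass.
probs = softmax_policy([2.0, 1.0, 0.0])
```

Note the "soft argmax" behavior: the highest-preference action dominates, yet no probability is ever exactly zero, so exploration never fully dies out.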
Linear Gaussian Policy (Continuous Actions)
For continuous action spaces, parameterize a Gaussian distribution:
$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma^2\right), \qquad \mu_\theta(s) = \theta^\top \phi(s)$$
where:
- Mean: $\mu_\theta(s) = \theta^\top \phi(s)$, linear in the state features $\phi(s)$ with weight vector $\theta$
- Variance: $\sigma^2$ (can be fixed or learned)
Neural Network Policies (Continuous Actions)
With neural networks, output both the mean and the variance:
$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma_\theta^2(s)\right)$$
This gives highly flexible, nonlinear action selection.
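A minimal sketch of Gaussian action selection for a scalar action with fixed variance (function names here are illustrative, not from the lecture):

```python
import math
import random

def gaussian_logprob(a, mu, sigma):
    """log N(a | mu, sigma^2) for a scalar action a."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (a - mu) ** 2 / (2.0 * sigma ** 2))

def sample_action(mu, sigma, rng):
    """Draw one action from the Gaussian policy."""
    return rng.gauss(mu, sigma)

rng = random.Random(0)
action = sample_action(mu=1.0, sigma=0.5, rng=rng)
logp = gaussian_logprob(action, mu=1.0, sigma=0.5)
```

The log-probability is what policy gradient methods actually differentiate; sampling handles exploration automatically through $\sigma$.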
The Policy Gradient Theorem
Objective Function
Every policy $\pi_\theta$ has an expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$$
We want to find:
$$\theta^* = \arg\max_\theta J(\theta)$$
Using gradient ascent:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
Deriving the Gradient
Starting from the definition of $J(\theta)$ for episodic tasks:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau; \theta)\, R(\tau) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)$$
Using the log-derivative trick ($\nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)$):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\right]$$
Factoring the Trajectory Probability
The trajectory probability factors as:
$$P(\tau; \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Taking the log and the gradient, only the policy terms depend on $\theta$:
$$\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
(The dynamics and initial-state terms have zero gradient w.r.t. $\theta$.)
Final Result: The Policy Gradient Theorem
Policy Gradient Theorem (Episodic)
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
Interpretation: To increase expected return, increase the log-probability of actions taken in high-return trajectories.
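The theorem can be sanity-checked on a tiny two-armed bandit, where the expectation over actions can be computed exactly (the bandit and its rewards are made up for illustration):

```python
import math

# Two-armed bandit: arm 0 always pays 1, arm 1 always pays 0.
REWARDS = [1.0, 0.0]

def policy(theta):
    """Softmax over preferences [theta, 0] reduces to a sigmoid."""
    p0 = 1.0 / (1.0 + math.exp(-theta))
    return [p0, 1.0 - p0]

def true_grad(theta):
    """J(theta) = p0 * 1 + p1 * 0 = sigmoid(theta), so dJ/dtheta = p0 (1 - p0)."""
    p0 = policy(theta)[0]
    return p0 * (1.0 - p0)

def score_function_grad(theta):
    """Exact expectation of grad log pi(a) * R(a) over both actions."""
    p0, p1 = policy(theta)
    # d log p0 / dtheta = 1 - p0;  d log p1 / dtheta = -p0
    return p0 * (1.0 - p0) * REWARDS[0] + p1 * (-p0) * REWARDS[1]
```

The two gradients agree exactly, which is the content of the theorem: sampling actions and weighting $\nabla_\theta \log \pi_\theta$ by return recovers $\nabla_\theta J(\theta)$ in expectation.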
REINFORCE: The Original Policy Gradient Algorithm
Algorithm
The simplest practical implementation: sample trajectories and average the gradient estimate.
REINFORCE Algorithm
Hyperparameters: Step size $\alpha$, episode length $T$
Repeat:
- Sample an episode (trajectory) following $\pi_\theta$: $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)$
- Compute the return: $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
- Update the policy: $\theta \leftarrow \theta + \alpha\, R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Or with $m$ sampled trajectories (batch update):
$$\theta \leftarrow \theta + \alpha\, \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$
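The loop can be sketched on a two-armed bandit where each episode is a single action, so the trajectory return is just the immediate reward (a minimal toy sketch, not a full episodic implementation):

```python
import math
import random

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a made-up two-armed bandit: arm 0 always pays 1,
    arm 1 always pays 0. Policy: pi(arm 0) = sigmoid(theta)."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p0 = 1.0 / (1.0 + math.exp(-theta))    # probability of arm 0
        a = 0 if rng.random() < p0 else 1      # sample from the policy
        r = 1.0 if a == 0 else 0.0             # bandit reward
        # grad of log pi(a): (1 - p0) for arm 0, -p0 for arm 1
        grad_log = (1.0 - p0) if a == 0 else -p0
        theta += alpha * r * grad_log          # REINFORCE update
    return theta

theta = reinforce_bandit()
p0 = 1.0 / (1.0 + math.exp(-theta))            # probability of the better arm
```

After training, the policy concentrates on the rewarding arm; note it stays stochastic, only approaching determinism in the limit.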
Example: Bernoulli Policy
For a Bernoulli policy with two actions, parameterize $\pi_\theta(a = 1 \mid s) = \sigma\!\left(\theta^\top \phi(s)\right)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function.
Gradient computation:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \left(a - \sigma\!\left(\theta^\top \phi(s)\right)\right) \phi(s)$$
Update rule: if action $a$ was taken and the return is $R$:
$$\theta \leftarrow \theta + \alpha\, R\, \left(a - \sigma\!\left(\theta^\top \phi(s)\right)\right) \phi(s)$$
Properties
Tip
- Unbiased: $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$ — our estimate has the correct expectation
- Consistent: Converges as sample size increases
- Easy: Just requires computing log-policy gradients, no need to know dynamics
- On-policy: Must sample from current policy
Limitations
- High variance: Uses full trajectory return, which compounds over time
- Episodic only: Requires episodes of defined length
- Slow learning: May need many episodes to estimate gradient accurately
REINFORCE with Baseline
Motivation
A fundamental issue: all actions in a trajectory share credit/blame for the final return.
Intuition
If an episode has:
- Time $t = 1$: good action → good reward
- Time $t = 2$: bad action → bad reward
- …
- Time $T$: mediocre action
The REINFORCE update uses the same total return $R(\tau)$ for all actions. The good action gets blamed for later bad actions, and the bad action gets credit for earlier good rewards.
The Fix: Causality-Aware Gradient
Key insight: The action $a_t$ can only affect rewards at time $t$ and later, never before!
This leads to an estimator that only uses the return from time $t$ onward:
REINFORCE v2 (with causality)
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$ is the return from time $t$ onward.
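Computing the return from time $t$ onward, $G_t = r_{t+1} + \gamma\, G_{t+1}$, for every $t$ takes a single backward pass over the episode's rewards; a sketch:

```python
def rewards_to_go(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for every t with one
    backward pass over the episode's reward sequence."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

G = rewards_to_go([1.0, 2.0, 3.0], gamma=0.5)   # [2.75, 3.5, 3.0]
```

Note that later timesteps get smaller, less noisy return targets, which is exactly where the variance reduction comes from.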
Adding a Baseline
Further variance reduction: subtract any baseline $b(s_t)$ (typically a learned value function $\hat{V}(s_t)$):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b(s_t)\right)\right]$$
Why is this valid? The baseline doesn't depend on the action $a_t$, so its contribution vanishes in expectation:
$$\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$
Baseline
A baseline is any function $b(s)$ that estimates the expected return from a state. Commonly, use a learned value function $\hat{V}(s)$ or a simple running average. Baselines reduce variance without introducing bias.
Learning the Baseline
Often learn $\hat{V}_w(s)$ alongside the policy using TD or MC updates:
This is straightforward with Monte Carlo: for each state $s_t$ visited with return $G_t$, update:
$$w \leftarrow w + \beta\, \left(G_t - \hat{V}_w(s_t)\right) \nabla_w \hat{V}_w(s_t)$$
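For a tabular baseline the Monte Carlo update reduces to a running-average step toward each observed return (a minimal sketch; the dictionary-based value table and step size are illustrative choices):

```python
def update_baseline(V, states, returns, beta=0.1):
    """Tabular Monte Carlo baseline update: move V(s_t) toward the
    observed return G_t for every state visited in the episode."""
    for s, G in zip(states, returns):
        V[s] = V.get(s, 0.0) + beta * (G - V.get(s, 0.0))
    return V

# Repeatedly seeing return 5.0 from state "s0" drives V("s0") toward 5.0.
V = {}
for _ in range(200):
    V = update_baseline(V, states=["s0"], returns=[5.0])
```

With function approximation the same idea applies, but the update follows the gradient $\nabla_w \hat{V}_w(s_t)$ instead of indexing a table.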
Alternative Parametrizations
Softmax Policy Details
For linear action preferences $h_\theta(s, a) = \theta^\top \phi(s, a)$:
$$\pi_\theta(a \mid s) = \frac{e^{\theta^\top \phi(s, a)}}{\sum_b e^{\theta^\top \phi(s, b)}}$$
Gradient:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_b \pi_\theta(b \mid s)\, \phi(s, b)$$
This naturally includes an exploration bonus: the second term subtracts the expected feature vector (the expected preference gradient) under the current policy.
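The gradient formula can be verified against central finite differences for a small linear-preference policy (the features and parameters below are arbitrary, chosen only for the check):

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical setup: 2 actions, 2 features, h(s, a) = theta . phi(s, a).
PHI = [[1.0, 0.0], [0.0, 1.0]]   # phi(s, a) for each action
THETA = [0.3, -0.2]

def log_pi(theta, a):
    prefs = [sum(t * f for t, f in zip(theta, PHI[b])) for b in range(2)]
    return math.log(softmax(prefs)[a])

def analytic_grad(theta, a):
    """phi(s, a) minus the policy-expected feature vector."""
    prefs = [sum(t * f for t, f in zip(theta, PHI[b])) for b in range(2)]
    p = softmax(prefs)
    expected = [sum(p[b] * PHI[b][i] for b in range(2)) for i in range(2)]
    return [PHI[a][i] - expected[i] for i in range(2)]

# Central finite differences on each parameter, for action a = 0.
eps = 1e-6
numeric = []
for i in range(2):
    tp, tm = list(THETA), list(THETA)
    tp[i] += eps
    tm[i] -= eps
    numeric.append((log_pi(tp, 0) - log_pi(tm, 0)) / (2.0 * eps))
```

The numeric and analytic gradients should agree to high precision, confirming the expected-feature correction term.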
Gaussian Policy (Continuous)
For $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma^2\right)$ with $\mu_\theta(s) = \theta^\top \phi(s)$:
Gradient w.r.t. $\theta$:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\left(a - \mu_\theta(s)\right) \phi(s)}{\sigma^2}$$
This shows: increase the mean in the direction of good actions, weighted by how far they were from the current mean.
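This gradient can likewise be checked numerically for a scalar action with a linear mean (the values below are arbitrary, chosen for the check):

```python
import math

def logprob(theta, a, phi, sigma):
    """log N(a | theta * phi, sigma^2): scalar action, linear mean."""
    mu = theta * phi
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (a - mu) ** 2 / (2.0 * sigma ** 2))

def analytic_grad(theta, a, phi, sigma):
    """(a - mu) * phi / sigma^2 from the formula above."""
    return (a - theta * phi) * phi / sigma ** 2

# Arbitrary values for the check.
theta, a, phi, sigma = 0.7, 1.3, 2.0, 0.5
eps = 1e-6
numeric = (logprob(theta + eps, a, phi, sigma)
           - logprob(theta - eps, a, phi, sigma)) / (2.0 * eps)
```

Since the log-probability is quadratic in $\theta$, the central difference matches the analytic gradient essentially exactly.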
Comparison: Policy Gradient Methods
REINFORCE vs Finite Differences
| Aspect | Finite Differences | REINFORCE |
|---|---|---|
| Gradient type | Black-box (0-order) | White-box (1st-order) |
| Variance | Very high (noisy evaluation) | Lower (one sample per step) |
| Efficiency | Low (needs many rollouts) | Higher |
| Requires differentiability | No | Yes |
REINFORCE v2 vs Original REINFORCE
| Aspect | Original | v2 (causality) | v2 + Baseline |
|---|---|---|---|
| Unbiased | ✓ | ✓ | ✓ |
| Variance | High | Lower | Much lower |
| Implementation | Simple | Simple | Requires value learning |
| Practical performance | Poor | Good | Best |
Strengths and Weaknesses
Advantages of Policy-Based Methods
Tip
- Continuous actions: Natural handling without discretization
- Stochastic policies: Can learn optimal exploration/randomness
- Convergence guarantees: To local optimum under mild conditions
- Prior knowledge: Easy to initialize with expert policies
- Smooth updates: Step size control → smooth policy changes
Weaknesses
Warning
- High variance: Monte Carlo returns have high variance, especially for long episodes
- Episodic setting: Current algorithms require complete episodes
- Deterministic policies: Can’t learn truly deterministic optimal policies (though near-deterministic is possible)
- Computational cost: Need many trajectory samples to estimate gradients reliably
- Slow convergence: Can be slower than value-based methods
Key Concepts Introduced
New Concepts (Concept Notes Created)
The following new concepts are introduced in this lecture and deserve separate study:
- Softmax Policy - Stochastic policy using softmax over action preferences
- Gaussian Policy - Stochastic policy for continuous actions as Gaussian distribution
- Baseline - Value function subtracted from returns to reduce variance in policy gradients
- Policy Gradient Theorem - Fundamental result: gradient of expected return w.r.t. policy parameters
- REINFORCE - Monte Carlo policy gradient algorithm
Existing Concepts Referenced
- Policy Gradient Methods - Central topic
- Reinforcement Learning - Field
- Policy - Parameterized as $\pi_\theta(a \mid s)$
- Return - Discounted sum of rewards
- Discount Factor - $\gamma$
- Stochastic Gradient Descent - Optimization method
- Function Approximation - Using neural networks for $\pi_\theta$
- Neural Networks - For policy representation
- Gradient Descent - Core update rule
- Value Function - $\hat{V}(s)$ as baseline
- Markov Decision Process - Underlying environment model
- Monte Carlo Methods - REINFORCE uses MC sampling
- Exploration vs Exploitation - Handled via policy stochasticity
- On-Policy Learning - Must sample from $\pi_\theta$
- Temporal Difference Learning - Value learning alternative to MC
- Deep Reinforcement Learning - When using neural networks
Summary and Takeaways
Big Picture
Policy gradient methods directly optimize policy parameters using gradient ascent. The policy gradient theorem gives us:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
This says: increase the log-probability of actions with high return.
REINFORCE implements this via Monte Carlo sampling. Improvements:
- Causality: Use only forward returns (not full trajectory return)
- Baseline: Subtract value function to reduce variance
These methods naturally handle:
- Continuous action spaces
- Stochastic optimal policies
- Exploration via policy entropy
But they struggle with:
- Variance from long episodes
- Sample efficiency
- Episodic-only settings (so far)
Exam-Ready Facts
- Policy gradients = directly optimize $\pi_\theta(a \mid s)$, not a value function
- Core equation: $\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
- REINFORCE: unbiased but high variance
- Baselines reduce variance without introducing bias
- Softmax for discrete, Gaussian for continuous actions
- On-policy: must sample from current policy
- Advantages: continuous actions, stochastic policies, smooth updates
- Disadvantages: high variance, slow convergence, episodic only (in basic form)