RL-L09: Policy Gradient Methods
Overview
This lecture introduces policy-based methods, which directly optimize the parameters of a policy function rather than learning a value function. While value-based methods learn $Q(s, a)$ or $V(s)$ and derive a deterministic policy from them, policy-based methods explicitly parameterize the policy $\pi_\theta(a \mid s)$ and optimize it using gradient ascent on the expected return.
Policy-based methods address key limitations of action-value methods:
- Handle continuous action spaces naturally (no argmax required)
- Learn stochastic policies (useful for partial observability and exploration)
- Provide policy smoothness guarantees through step size control
- Allow incorporation of prior knowledge via policy structure
The core insight is the policy gradient theorem: we can compute an unbiased gradient of expected return w.r.t. policy parameters using only samples of trajectories.
Why Policy-Based Methods?
Limitations of Action-Value Methods
Key Problems with Value-Based Approaches
- Continuous actions: Can't efficiently compute $\arg\max_a Q(s, a)$ in continuous action spaces
- Policy instability: Small changes in $Q$-values can cause large changes in the greedy policy
- Stochastic policies: Impossible to learn stochastic optimal policies (e.g., mixed strategies, handling aliased states)
- Exploration: $\epsilon$-greedy exploration is crude; can't learn an optimal exploration strategy
Example of aliased states: If two different states look identical to the agent due to function approximation, the greedy policy might pick the same action for both. A stochastic policy choosing each action with 50% probability could be optimal.
Policy Representation
Stochastic Policies
Instead of learning a value function, we directly parameterize the policy as:
$$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s, \theta)$$
Requirements:
- Differentiability: $\pi_\theta(a \mid s)$ must be differentiable w.r.t. $\theta$ (to compute gradients)
- Stochasticity: Outputs a valid probability distribution over actions
Softmax Policy (Discrete Actions)
For discrete action spaces, use a softmax over action preferences $h_\theta(s, a)$:
$$\pi_\theta(a \mid s) = \frac{e^{h_\theta(s, a)}}{\sum_{b} e^{h_\theta(s, b)}}$$
where $h_\theta(s, a)$ can be linear, a neural network, or any differentiable function.
Intuition
The softmax policy acts like a “soft” argmax: preferences with higher values get higher probability, but all actions retain some probability. The temperature-like behavior makes exploration automatic.
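As an illustration, a minimal softmax policy can be sketched in plain Python (the preference values below are arbitrary placeholders, not from the lecture):

```python
import math

def softmax_policy(prefs):
    """Turn action preferences h(s, a) into action probabilities.

    Subtracting the max preference keeps exp() numerically stable
    without changing the resulting distribution."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Higher preference -> higher probability, but every action keeps some mass.
probs = softmax_policy([2.0, 1.0, 0.0])
```

Note the "soft argmax" behavior: the highest-preference action dominates, yet no probability is ever exactly zero, so exploration never fully dies out.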
Linear Gaussian Policy (Continuous Actions)
For continuous action spaces, parameterize a Gaussian distribution:
$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma^2\right), \qquad \mu_\theta(s) = \theta^\top \phi(s)$$
where:
- Mean: $\mu_\theta(s) = \theta^\top \phi(s)$, linear in the state features $\phi(s)$ with weight vector $\theta$
- Variance: $\sigma^2$ (can be fixed or learned)
Neural Network Policies (Continuous Actions)
With neural networks, output both the mean and the variance:
$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma_\theta^2(s)\right)$$
This gives highly flexible, nonlinear action selection.
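A minimal sketch of Gaussian action selection for a scalar action with fixed variance (function names here are illustrative, not from the lecture):

```python
import math
import random

def gaussian_logprob(a, mu, sigma):
    """log N(a | mu, sigma^2) for a scalar action a."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (a - mu) ** 2 / (2.0 * sigma ** 2))

def sample_action(mu, sigma, rng):
    """Draw one action from the Gaussian policy."""
    return rng.gauss(mu, sigma)

rng = random.Random(0)
action = sample_action(mu=1.0, sigma=0.5, rng=rng)
logp = gaussian_logprob(action, mu=1.0, sigma=0.5)
```

The log-probability is what policy gradient methods actually differentiate; sampling handles exploration automatically through $\sigma$.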
The Policy Gradient Theorem
Objective Function
Every policy $\pi_\theta$ has an expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$$
We want to find:
$$\theta^* = \arg\max_\theta J(\theta)$$
Using gradient ascent:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
Deriving the Gradient
Starting from the definition of $J(\theta)$ for episodic tasks:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau; \theta)\, R(\tau) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)$$
Using the log-derivative trick ($\nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)$):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\right]$$
Factoring the Trajectory Probability
The trajectory probability factors as:
$$P(\tau; \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Taking the log and the gradient, only the policy terms depend on $\theta$:
$$\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
(The dynamics and initial-state terms have zero gradient w.r.t. $\theta$.)
Final Result: The Policy Gradient Theorem
Policy Gradient Theorem (Episodic)
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
Interpretation: To increase expected return, increase the log-probability of actions taken in high-return trajectories.
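The theorem can be sanity-checked on a tiny two-armed bandit, where the expectation over actions can be computed exactly (the bandit and its rewards are made up for illustration):

```python
import math

# Two-armed bandit: arm 0 always pays 1, arm 1 always pays 0.
REWARDS = [1.0, 0.0]

def policy(theta):
    """Softmax over preferences [theta, 0] reduces to a sigmoid."""
    p0 = 1.0 / (1.0 + math.exp(-theta))
    return [p0, 1.0 - p0]

def true_grad(theta):
    """J(theta) = p0 * 1 + p1 * 0 = sigmoid(theta), so dJ/dtheta = p0 (1 - p0)."""
    p0 = policy(theta)[0]
    return p0 * (1.0 - p0)

def score_function_grad(theta):
    """Exact expectation of grad log pi(a) * R(a) over both actions."""
    p0, p1 = policy(theta)
    # d log p0 / dtheta = 1 - p0;  d log p1 / dtheta = -p0
    return p0 * (1.0 - p0) * REWARDS[0] + p1 * (-p0) * REWARDS[1]
```

The two gradients agree exactly, which is the content of the theorem: sampling actions and weighting $\nabla_\theta \log \pi_\theta$ by return recovers $\nabla_\theta J(\theta)$ in expectation.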
REINFORCE: The Original Policy Gradient Algorithm
Algorithm
The simplest practical implementation: sample trajectories and average the gradient estimate.
REINFORCE Algorithm
Hyperparameters: Step size $\alpha$, episode length $T$
Repeat:
- Sample an episode (trajectory) following $\pi_\theta$: $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)$
- Compute the return: $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
- Update the policy: $\theta \leftarrow \theta + \alpha\, R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Or with $m$ sampled trajectories (batch update):
$$\theta \leftarrow \theta + \alpha\, \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$$
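The loop can be sketched on a two-armed bandit where each episode is a single action, so the trajectory return is just the immediate reward (a minimal toy sketch, not a full episodic implementation):

```python
import math
import random

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a made-up two-armed bandit: arm 0 always pays 1,
    arm 1 always pays 0. Policy: pi(arm 0) = sigmoid(theta)."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p0 = 1.0 / (1.0 + math.exp(-theta))    # probability of arm 0
        a = 0 if rng.random() < p0 else 1      # sample from the policy
        r = 1.0 if a == 0 else 0.0             # bandit reward
        # grad of log pi(a): (1 - p0) for arm 0, -p0 for arm 1
        grad_log = (1.0 - p0) if a == 0 else -p0
        theta += alpha * r * grad_log          # REINFORCE update
    return theta

theta = reinforce_bandit()
p0 = 1.0 / (1.0 + math.exp(-theta))            # probability of the better arm
```

After training, the policy concentrates on the rewarding arm; note it stays stochastic, only approaching determinism in the limit.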
Example: Bernoulli Policy
For a Bernoulli policy with two actions, parameterize $\pi_\theta(a = 1 \mid s) = \sigma\!\left(\theta^\top \phi(s)\right)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function.
Gradient computation:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \left(a - \sigma\!\left(\theta^\top \phi(s)\right)\right) \phi(s)$$
Update rule: if action $a$ was taken and the return is $R$:
$$\theta \leftarrow \theta + \alpha\, R\, \left(a - \sigma\!\left(\theta^\top \phi(s)\right)\right) \phi(s)$$
Properties
Tip
- Unbiased: $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$ — our estimate has the correct expectation
- Consistent: Converges as sample size increases
- Easy: Just requires computing log-policy gradients, no need to know dynamics
- On-policy: Must sample from current policy
Limitations
- High variance: Uses full trajectory return, which compounds over time
- Episodic only: Requires episodes of defined length
- Slow learning: May need many episodes to estimate gradient accurately
REINFORCE with Baseline
Motivation
A fundamental issue: all actions in a trajectory share credit/blame for the final return.
Intuition
If an episode has:
- Time $t = 1$: good action → good reward
- Time $t = 2$: bad action → bad reward
- …
- Time $T$: mediocre action
The REINFORCE update uses the same total return $R(\tau)$ for all actions. The good action gets blamed for later bad actions, and the bad action gets credit for earlier good rewards.
The Fix: Causality-Aware Gradient
Key insight: The action $a_t$ can only affect rewards at time $t$ and later, never before!
This leads to an estimator that only uses the return from time $t$ onward:
REINFORCE v2 (with causality)
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$ is the return from time $t$ onward.
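Computing the return from time $t$ onward, $G_t = r_{t+1} + \gamma\, G_{t+1}$, for every $t$ takes a single backward pass over the episode's rewards; a sketch:

```python
def rewards_to_go(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for every t with one
    backward pass over the episode's reward sequence."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

G = rewards_to_go([1.0, 2.0, 3.0], gamma=0.5)   # [2.75, 3.5, 3.0]
```

Note that later timesteps get smaller, less noisy return targets, which is exactly where the variance reduction comes from.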
Adding a Baseline
Further variance reduction: subtract any baseline $b(s_t)$ (typically a learned value function $\hat{V}(s_t)$):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b(s_t)\right)\right]$$
Why is this valid? The baseline doesn't depend on the action $a_t$, so its contribution vanishes in expectation:
$$\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$
Baseline
A baseline is any function $b(s)$ that estimates the expected return from a state. Commonly, use a learned value function $\hat{V}(s)$ or a simple running average. Baselines reduce variance without introducing bias.
Learning the Baseline
Often learn $\hat{V}_w(s)$ alongside the policy using TD or MC updates:
This is straightforward with Monte Carlo: for each state $s_t$ visited with return $G_t$, update:
$$w \leftarrow w + \beta\, \left(G_t - \hat{V}_w(s_t)\right) \nabla_w \hat{V}_w(s_t)$$
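For a tabular baseline the Monte Carlo update reduces to a running-average step toward each observed return (a minimal sketch; the dictionary-based value table and step size are illustrative choices):

```python
def update_baseline(V, states, returns, beta=0.1):
    """Tabular Monte Carlo baseline update: move V(s_t) toward the
    observed return G_t for every state visited in the episode."""
    for s, G in zip(states, returns):
        V[s] = V.get(s, 0.0) + beta * (G - V.get(s, 0.0))
    return V

# Repeatedly seeing return 5.0 from state "s0" drives V("s0") toward 5.0.
V = {}
for _ in range(200):
    V = update_baseline(V, states=["s0"], returns=[5.0])
```

With function approximation the same idea applies, but the update follows the gradient $\nabla_w \hat{V}_w(s_t)$ instead of indexing a table.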
Alternative Parametrizations
Softmax Policy Details
For linear action preferences $h_\theta(s, a) = \theta^\top \phi(s, a)$:
$$\pi_\theta(a \mid s) = \frac{e^{\theta^\top \phi(s, a)}}{\sum_b e^{\theta^\top \phi(s, b)}}$$
Gradient:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_b \pi_\theta(b \mid s)\, \phi(s, b)$$
This naturally includes an exploration bonus: the second term subtracts the expected feature vector (the expected preference gradient) under the current policy.
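The gradient formula can be verified against central finite differences for a small linear-preference policy (the features and parameters below are arbitrary, chosen only for the check):

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical setup: 2 actions, 2 features, h(s, a) = theta . phi(s, a).
PHI = [[1.0, 0.0], [0.0, 1.0]]   # phi(s, a) for each action
THETA = [0.3, -0.2]

def log_pi(theta, a):
    prefs = [sum(t * f for t, f in zip(theta, PHI[b])) for b in range(2)]
    return math.log(softmax(prefs)[a])

def analytic_grad(theta, a):
    """phi(s, a) minus the policy-expected feature vector."""
    prefs = [sum(t * f for t, f in zip(theta, PHI[b])) for b in range(2)]
    p = softmax(prefs)
    expected = [sum(p[b] * PHI[b][i] for b in range(2)) for i in range(2)]
    return [PHI[a][i] - expected[i] for i in range(2)]

# Central finite differences on each parameter, for action a = 0.
eps = 1e-6
numeric = []
for i in range(2):
    tp, tm = list(THETA), list(THETA)
    tp[i] += eps
    tm[i] -= eps
    numeric.append((log_pi(tp, 0) - log_pi(tm, 0)) / (2.0 * eps))
```

The numeric and analytic gradients should agree to high precision, confirming the expected-feature correction term.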
Gaussian Policy (Continuous)
For $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma^2\right)$ with $\mu_\theta(s) = \theta^\top \phi(s)$:
Gradient w.r.t. $\theta$:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\left(a - \mu_\theta(s)\right) \phi(s)}{\sigma^2}$$
This shows: increase the mean in the direction of good actions, weighted by how far they were from the current mean.
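This gradient can likewise be checked numerically for a scalar action with a linear mean (the values below are arbitrary, chosen for the check):

```python
import math

def logprob(theta, a, phi, sigma):
    """log N(a | theta * phi, sigma^2): scalar action, linear mean."""
    mu = theta * phi
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (a - mu) ** 2 / (2.0 * sigma ** 2))

def analytic_grad(theta, a, phi, sigma):
    """(a - mu) * phi / sigma^2 from the formula above."""
    return (a - theta * phi) * phi / sigma ** 2

# Arbitrary values for the check.
theta, a, phi, sigma = 0.7, 1.3, 2.0, 0.5
eps = 1e-6
numeric = (logprob(theta + eps, a, phi, sigma)
           - logprob(theta - eps, a, phi, sigma)) / (2.0 * eps)
```

Since the log-probability is quadratic in $\theta$, the central difference matches the analytic gradient essentially exactly.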
Comparison: Policy Gradient Methods
REINFORCE vs Finite Differences
| Aspect | Finite Differences | REINFORCE |
|---|---|---|
| Gradient type | Black-box (0-order) | White-box (1st-order) |
| Variance | Very high (noisy evaluation) | Lower (one sample per step) |
| Efficiency | Low (needs many rollouts) | Higher |
| Requires differentiability | No | Yes |
REINFORCE v2 vs Original REINFORCE
| Aspect | Original | v2 (causality) | v2 + Baseline |
|---|---|---|---|
| Unbiased | ✓ | ✓ | ✓ |
| Variance | High | Lower | Much lower |
| Implementation | Simple | Simple | Requires value learning |
| Practical performance | Poor | Good | Best |
Strengths and Weaknesses
Advantages of Policy-Based Methods
Tip
- Continuous actions: Natural handling without discretization
- Stochastic policies: Can learn optimal exploration/randomness
- Convergence guarantees: To local optimum under mild conditions
- Prior knowledge: Easy to initialize with expert policies
- Smooth updates: Step size control → smooth policy changes
Weaknesses
Warning
- High variance: Monte Carlo returns have high variance, especially for long episodes
- Episodic setting: Current algorithms require complete episodes
- Deterministic policies: Can’t learn truly deterministic optimal policies (though near-deterministic is possible)
- Computational cost: Need many trajectory samples to estimate gradients reliably
- Slow convergence: Can be slower than value-based methods
Key Concepts Introduced
New Concepts (Concept Notes Created)
The following new concepts are introduced in this lecture and deserve separate study:
- Softmax Policy - Stochastic policy using softmax over action preferences
- Gaussian Policy - Stochastic policy for continuous actions as Gaussian distribution
- Baseline - Value function subtracted from returns to reduce variance in policy gradients
- Policy Gradient Theorem - Fundamental result: gradient of expected return w.r.t. policy parameters
- REINFORCE - Monte Carlo policy gradient algorithm
Existing Concepts Referenced
- Policy Gradient Methods - Central topic
- Reinforcement Learning - Field
- Policy - Parameterized as $\pi_\theta(a \mid s)$
- Return - Discounted sum of rewards
- Discount Factor - $\gamma$
- Stochastic Gradient Descent - Optimization method
- Function Approximation - Using neural networks for $\pi_\theta$
- Neural Networks - For policy representation
- Gradient Descent - Core update rule
- Value Function - $\hat{V}(s)$ as baseline
- Markov Decision Process - Underlying environment model
- Monte Carlo Methods - REINFORCE uses MC sampling
- Exploration vs Exploitation - Handled via policy stochasticity
- On-Policy Learning - Must sample from $\pi_\theta$
- Temporal Difference Learning - Value learning alternative to MC
- Deep Reinforcement Learning - When using neural networks
Summary and Takeaways
Big Picture
Policy gradient methods directly optimize policy parameters using gradient ascent. The policy gradient theorem gives us:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
This says: increase the log-probability of actions with high return.
REINFORCE implements this via Monte Carlo sampling. Improvements:
- Causality: Use only forward returns (not full trajectory return)
- Baseline: Subtract value function to reduce variance
These methods naturally handle:
- Continuous action spaces
- Stochastic optimal policies
- Exploration via policy entropy
But they struggle with:
- Variance from long episodes
- Sample efficiency
- Episodic-only settings (so far)
Exam-Ready Facts
- Policy gradients = directly optimize $\pi_\theta(a \mid s)$, not a value function
- Core equation: $\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
- REINFORCE: unbiased but high variance
- Baselines reduce variance without introducing bias
- Softmax for discrete, Gaussian for continuous actions
- On-policy: must sample from current policy
- Advantages: continuous actions, stochastic policies, smooth updates
- Disadvantages: high variance, slow convergence, episodic only (in basic form)