RL Lecture 10 - Advanced Policy Search Methods
Overview & Motivation
This lecture builds on foundational policy gradient methods, moving from simple stochastic policies to more sophisticated approaches. We tackle two main challenges: high variance in policy gradient estimates and the need for deterministic policies in continuous action spaces. The lecture introduces the Policy Gradient Theorem (PGT), actor-critic methods that use critics to reduce variance, and deterministic policy gradients for learning greedy policies off-policy.
The core tension in policy-based RL is the bias-variance tradeoff: Monte Carlo returns are unbiased but high-variance, while TD estimates are lower-variance but potentially biased.
REINFORCE v2: Revisited
Before moving to advanced methods, we recap the Monte Carlo policy gradient approach:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, G_t\right]$

This can be estimated from a sampled trajectory as:

$\hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, G_t, \quad \text{where } G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
Tip
Only future rewards ($G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$) contribute to the gradient at time $t$, since only future rewards can be affected by the action taken at time $t$ (causality).
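The reward-to-go estimator above can be sketched for a tabular softmax policy. This is a minimal illustration, not the lecture's reference implementation; the tabular state indexing is an assumption for simplicity:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """Monte Carlo policy gradient with reward-to-go (causality).

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    states/actions/rewards: one sampled episode.
    """
    # Reward-to-go G_t, computed backwards through the episode.
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    grad = np.zeros_like(theta)
    for t, (s, a) in enumerate(zip(states, actions)):
        pi = softmax(theta[s])
        # grad of log pi(a|s) w.r.t. the logits of state s: one_hot(a) - pi
        score = -pi
        score[a] += 1.0
        grad[s] += score * returns[t]  # only future rewards weight the step-t score
    return grad
```

Note that each timestep's score is weighted by $G_t$, not the full episode return, which is exactly the causality point above.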
Policy Gradient Theorem
Formal Statement
The Policy Gradient Theorem elegantly replaces hard-to-estimate expected returns with the action-value function:

$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a Q^{\pi}(s,a)\, \nabla_\theta \pi(a \mid s, \theta)$

Equivalently (restating as an expectation):

$\nabla_\theta J(\theta) \propto \mathbb{E}_{S \sim \mu,\, A \sim \pi}\!\left[Q^{\pi}(S,A)\, \nabla_\theta \log \pi(A \mid S, \theta)\right]$

where $\mu(s)$ is the on-policy distribution of states.
Key Insight
- State-action pairs at each timestep contribute equally to the total gradient
- This is proportional to an expectation over the on-policy distribution over state-action pairs
- We can replace expected returns with learned estimates
Reference
See Sutton et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation for the formal proof (including the continuing case).
Actor-Critic Methods
Motivation: Addressing Variance
Intuition
Instead of waiting for the full episode return (Monte Carlo), we can use a learned critic to estimate the value, reducing variance at the cost of potential bias.
The Actor-Critic Update Rule

$\theta \leftarrow \theta + \alpha\, \hat{q}_w(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$

where:
- Actor: the policy $\pi_\theta$, updated using policy gradients
- Critic: value function $\hat{v}_w$ (or action-value $\hat{q}_w$) estimating the discounted future return
Definition
Actor-Critic: A method that uses both a parametrized policy (actor) and a parametrized value function (critic), with parameters $\theta$ and $w$ respectively.
Actor-Critic with Baseline
We can further improve by subtracting a baseline $\hat{v}_w(S_t)$:

$\theta \leftarrow \theta + \alpha\,\big(R_{t+1} + \gamma \hat{v}_w(S_{t+1}) - \hat{v}_w(S_t)\big)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$

The quantity in parentheses is the temporal difference (TD) error, which serves as an advantage estimate:

$\delta_t = R_{t+1} + \gamma \hat{v}_w(S_{t+1}) - \hat{v}_w(S_t)$
Formula
- TD Advantage Estimate: $\delta_t = R_{t+1} + \gamma \hat{v}_w(S_{t+1}) - \hat{v}_w(S_t)$
- Advantage Function: $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
- In practice: $\delta_t$ is used as a stochastic estimate of $A^\pi(S_t, A_t)$
- Subtracting the value baseline reduces variance while remaining unbiased if the critic is perfect
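One TD(0) actor-critic step with a tabular critic can be sketched as follows. The tabular representation and learning rates are illustrative assumptions, not prescribed by the lecture:

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """One TD(0) actor-critic update for a tabular softmax policy.

    theta: (n_states, n_actions) policy logits; V: (n_states,) critic values.
    Updates both in place and returns the TD error (advantage estimate).
    """
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]            # TD error = one-step advantage estimate
    V[s] += alpha_critic * delta     # critic: move toward the TD target
    pi = np.exp(theta[s] - theta[s].max())
    pi /= pi.sum()
    score = -pi
    score[a] += 1.0                  # grad of log pi(a|s) for softmax logits
    theta[s] += alpha_actor * delta * score  # actor: policy gradient with TD advantage
    return delta
```

The same TD error $\delta_t$ drives both updates: the critic regresses toward the bootstrapped target, and the actor scales its score by $\delta_t$.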
Advantages & Disadvantages
Advantages:
- Reduces variance compared to REINFORCE (Monte Carlo)
- Can be used in both episodic and continuing settings
- More sample-efficient than pure actor-only methods
Disadvantages:
- Introduces bias (critic may be inaccurate)
- “Fiddly”: requires managing two function approximators (actor + critic)
- Requires stochastic policies (what if a deterministic policy is optimal?)
- Many hyperparameters and moving parts
Deterministic Policy Gradients (DPG)
Motivation: From Stochastic to Deterministic
Intuition
All policy gradients so far learned a stochastic policy. But deterministic policies can be more sample-efficient and optimal for many tasks (e.g., continuous control where the optimal action is to apply maximum force).
Problem with direct application: With a deterministic policy, the action distribution collapses to a point mass, so the score $\nabla_\theta \log \pi_\theta(a \mid s)$ is no longer well defined, and the standard score-function policy gradient cannot be used; as a stochastic policy's variance shrinks toward zero, the variance of the score-function estimator blows up.
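The variance blow-up can be seen numerically. In this toy sketch (all details assumed for illustration), actions come from a Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ with reward $R(a) = a$, so the true gradient $\partial_\mu \mathbb{E}[R]$ is exactly 1 for every $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

def score_grad_estimates(mu, sigma, n=100_000):
    """Per-sample score-function (REINFORCE) estimates of d/d_mu E[R(a)],
    for a ~ N(mu, sigma^2) and the toy reward R(a) = a."""
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma**2       # d/d_mu log N(a; mu, sigma^2)
    return score * a                  # unbiased per-sample gradient estimates

mu = 2.0
for sigma in (1.0, 0.1, 0.01):
    g = score_grad_estimates(mu, sigma)
    print(f"sigma={sigma}: mean={g.mean():.3f}, var={g.var():.1f}")
```

The sample mean stays near the true gradient (1) for each $\sigma$, but the variance grows like $\mu^2/\sigma^2$ as the policy approaches a deterministic one, which is precisely why a different gradient is needed.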
The DPG Idea: Off-Policy Learning
To address this, we use two policies:
- Behavior policy $\beta$: typically a noisy version of the actor, e.g. $\beta(s) = \pi_\theta(s) + \mathcal{N}(0, \sigma^2)$ (Gaussian exploration noise)
- Target/actor policy $\pi_\theta$: deterministic, $a = \pi_\theta(s)$; this is what we want to learn

We collect data under $\beta$ but optimize $\pi_\theta$.
Changing the Objective
Instead of maximizing returns under the policy we're learning, we optimize:

$J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\!\left[Q^{\pi}(s, \pi_\theta(s))\right]$

where $\rho^\beta$ is the state distribution under behavior policy $\beta$.
Key
Crucial: This objective depends only on states, not sampled actions! We integrate over state space only, removing the need for importance weights on actions.
The Deterministic Policy Gradient
Taking the gradient of the objective:

$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\!\left[\nabla_\theta Q^{\pi}(s, \pi_\theta(s))\right]$

Using the chain rule, with the critic's action gradient evaluated at $a = \pi_\theta(s)$:

$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\!\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q^{\pi}(s,a)\big|_{a=\pi_\theta(s)}\right]$

This is the off-policy deterministic policy gradient.
Formula
DPG:
- Deterministic actor: $a = \pi_\theta(s)$ (outputs an action, not a distribution)
- Critic: $\hat{q}_w(s,a)$ estimates the action-value $Q^{\pi}(s,a)$
- Gradient: $\nabla_\theta J_\beta(\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\!\left[\nabla_\theta \pi_\theta(s)\, \nabla_a \hat{q}_w(s,a)\big|_{a=\pi_\theta(s)}\right]$
- No importance sampling needed!
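The chain-rule update can be sketched with a linear actor and a stand-in critic whose action gradient is known in closed form. The critic's quadratic form $Q(s,a) = -(a - a^*(s))^2$ is an assumption made purely so that $\nabla_a Q$ is available analytically:

```python
import numpy as np

def dpg_update(theta, states, a_star, lr=0.1):
    """One deterministic policy gradient step.

    Linear actor pi_theta(s) = theta @ s (scalar action). Stand-in critic
    Q(s, a) = -(a - a_star(s))^2, so grad_a Q = -2 (a - a_star(s)).
    """
    grad = np.zeros_like(theta)
    for s in states:
        a = theta @ s                    # deterministic action
        dQ_da = -2.0 * (a - a_star(s))   # critic's action gradient
        grad += dQ_da * s                # chain rule: d a / d theta = s
    theta = theta + lr * grad / len(states)
    return theta
```

Repeated updates push the actor toward the critic's maximizing action at each state; there are no sampled actions in the update, hence no importance weights.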
Advantages of DPG
- Off-policy learning: Can learn from data collected by any behavior policy
- Deterministic policy: Can learn exact greedy policies
- No importance sampling: Avoids variance explosion from importance weights
- More sample efficient: Better for continuous control
Connection to Q-Learning
Like Q-learning, DPG finds the greedy policy that maximizes $Q$, but tractably for continuous actions: instead of computing $\arg\max_a \hat{q}_w(s,a)$ explicitly, the actor ascends the critic's action gradient $\nabla_a \hat{q}_w(s,a)$.
Natural Policy Gradient (NPG)
Definition
Natural Gradient: Instead of moving in the direction of steepest ascent in parameter space, move in the direction of steepest ascent in policy space (as measured by KL divergence).
The standard (vanilla) gradient is $\nabla_\theta J(\theta)$. The natural gradient is:

$\tilde{\nabla}_\theta J(\theta) = F_\theta^{-1}\, \nabla_\theta J(\theta)$

where $F_\theta$ is the Fisher Information Matrix:

$F_\theta = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right]$
Intuition
The Fisher Information Matrix rescales gradients to account for the geometry of the policy distribution. Directions that change the policy more receive smaller steps.
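For a single-state softmax policy the Fisher rescaling can be computed exactly. This is a toy sketch (the single-state setting and per-action advantage vector are assumptions for illustration):

```python
import numpy as np

def natural_gradient(theta, advantages):
    """Natural policy gradient for a single-state softmax policy.

    theta: action logits; advantages: per-action advantage values A(a).
    Vanilla gradient: E_pi[ grad log pi(a) * A(a) ].
    Fisher matrix:    E_pi[ grad log pi(a) grad log pi(a)^T ].
    """
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    n = len(theta)
    # Score vectors, one row per action: grad_theta log pi(a) = one_hot(a) - pi
    scores = np.eye(n) - pi
    adv = np.asarray(advantages, dtype=float)
    vanilla = (pi[:, None] * scores * adv[:, None]).sum(axis=0)
    fisher = (pi[:, None, None] * scores[:, :, None] * scores[:, None, :]).sum(axis=0)
    # The softmax Fisher is singular (logits are shift-invariant), so use pinv.
    return np.linalg.pinv(fisher) @ vanilla
```

In this parametrization the natural gradient works out to the mean-centered advantage vector, independent of the current logits, illustrating how the Fisher rescaling removes the distribution's local geometry from the step.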
Advantages
- Reduces variance and stabilizes learning
- Takes into account the curvature of the policy space
- More principled step sizes
- Trade-off: Computational cost (inverting a matrix)
Advantage Function & Generalized Advantage Estimation (GAE)
The Advantage Function
The advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ measures how much better an action is compared to the baseline value.
Formula
TD Advantage: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. Also called the temporal difference error or 1-step advantage.
Generalized Advantage Estimation (GAE)
A single TD step is biased. We can take multiple steps:

$\hat{A}_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t)$

(As $n \to \infty$, this is the Monte Carlo return minus the baseline.)

GAE smooths between these with an exponentially weighted average of TD errors:

$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$
Key
$\lambda$ controls the bias-variance tradeoff:
- $\lambda = 0$: single TD step (low variance, high bias)
- $\lambda = 1$: Monte Carlo (high variance, unbiased)
- $0 < \lambda < 1$: interpolation
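The GAE recursion is usually computed backwards over an episode. A minimal sketch (the convention that `values` carries one extra bootstrap entry is an assumption, though it is common in practice):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    values must have len(rewards) + 1 entries: the last is the bootstrap
    value of the final state (use 0.0 if the episode terminated).
    A_t = sum_l (gamma * lam)^l * delta_{t+l}, computed backwards.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # 1-step TD errors
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0.0` returns the raw TD errors, while `lam=1.0` with a zero critic recovers the Monte Carlo reward-to-go, matching the two endpoints above.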
Trust Region Methods (Overview)
While not detailed here, Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are important advanced methods built on PGT and actor-critic.
Tip
Key idea: Constrain policy updates to a “trust region” where the local approximation of the objective $J(\theta)$ is accurate, preventing large destructive policy changes.
The Policy Search Landscape
RL Methods
```
                     RL Methods
                         |
         ________________|________________
        |                                 |
   Value-Based                      Policy-Based
        |                                 |
   Q-learning               Policy Gradient Methods
   SARSA                   /      |        |       \
   MC                     /       |        |        \
                  REINFORCE      PGT  Actor-Critic  DPG
                     (v1)       (v2) (w/ baseline) (deterministic)
                                        /   |   \
                                   1-step  GAE  ∞-step
```
When to Use Policy-Based Methods
Policy search methods are typically preferred when:
- Continuous action spaces: Easy to parameterize continuous policies; hard to handle with Q-learning
- Stochastic policies needed: Exploration naturally built in
- Prior knowledge about policy structure: Can encode domain knowledge in policy architecture
- Small policy updates required: For physical systems (robots) that can’t handle sudden policy changes
- Deterministic optimal policies: Use DPG for true greedy policies
Summary & Key Takeaways
Summary
Core Contributions of This Lecture:
Policy Gradient Theorem: Replaces hard-to-estimate returns with -function expectations, enabling actor-critic methods
Actor-Critic: Combines policy updates (actor) with value function learning (critic) to reduce variance while maintaining theoretical grounding
Deterministic Policy Gradients: Enables off-policy learning of deterministic policies without importance sampling weights
Advantage Functions & GAE: Provides a principled way to interpolate between biased TD and unbiased MC advantage estimates
Natural Policy Gradient: Incorporates policy space geometry for more stable, principled updates
New Concepts to Explore
The following concepts are introduced but require deeper study:
- Deterministic Policy Gradient - Off-policy learning of deterministic policies
- Natural Policy Gradient - Fisher Information Matrix and policy space geometry
- Advantage Function - Baseline-corrected policy updates
- Generalized Advantage Estimation - Bias-variance tradeoff in TD advantage estimates
- Trust Region Policy Optimization (TRPO) - Constrained policy updates
- Proximal Policy Optimization (PPO) - Practical approximation to TRPO
- Compatible Function Approximation - Conditions for unbiased critic in actor-critic
- Soft Actor-Critic (SAC) - Maximum entropy RL with deterministic policies
References
- Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS.
- Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. ICML.
- Peters, J., & Schaal, S. (2008). Reinforcement Learning of Motor Skills with Policy Search. Handbook of Robotics.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.