RL Lecture 10 - Advanced Policy Search Methods

Overview & Motivation

This lecture builds on foundational policy gradient methods, moving from simple stochastic policies to more sophisticated approaches. We tackle two main challenges: high variance in policy gradient estimates and the need for deterministic policies in continuous action spaces. The lecture introduces the Policy Gradient Theorem (PGT), actor-critic methods that use critics to reduce variance, and deterministic policy gradients for learning greedy policies off-policy.

The core tension in policy-based RL is the bias-variance tradeoff: Monte Carlo returns are unbiased but high-variance, while TD estimates are lower-variance but potentially biased.


REINFORCE v2: Revisited

Before moving to advanced methods, we recap the Monte Carlo policy gradient approach:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

This can be estimated from sampled trajectories as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, G_t^i, \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$$

Tip

Only future rewards ($G_t$) contribute to the gradient at time $t$, since actions taken at time $t$ cannot affect rewards already received (causality).
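The return-to-go and the resulting gradient estimate can be sketched as follows. This is a minimal illustration assuming a linear-softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(s^\top \theta)$; the function names and the policy form are illustrative choices, not part of the lecture.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Monte Carlo return G_t = sum_{k>=t} gamma^(k-t) * r_{k+1}:
    only rewards from time t onward contribute (causality)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """REINFORCE estimate: sum_t grad log pi(a_t|s_t) * G_t for an
    assumed linear-softmax policy pi(a|s) = softmax(s @ theta)."""
    G = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, G):
        probs = softmax(s @ theta)
        onehot = np.zeros(theta.shape[1])
        onehot[a] = 1.0
        # grad log pi(a|s) for linear softmax: outer(s, onehot(a) - probs)
        grad += np.outer(s, onehot - probs) * g
    return grad
```

Note that each $G_t$ sums only rewards from $t$ onward, matching the causality tip above.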


Policy Gradient Theorem

Formal Statement

The Policy Gradient Theorem elegantly replaces hard-to-estimate expected returns with the action-value function:

$$\nabla_\theta J(\theta) \propto \sum_{s} \mu(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s)$$

Equivalently (restating as an expectation):

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu,\, a \sim \pi_\theta}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$$

where $\mu(s)$ is the on-policy distribution of states.

Key Insight

  • State-action pairs at each timestep contribute equally to the total gradient
  • This is proportional to an expectation over the on-policy distribution over state-action pairs
  • We can replace expected returns with learned estimates

Reference

See Sutton et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation for the formal proof (including the continuing case).


Actor-Critic Methods

Motivation: Addressing Variance

Intuition

Instead of waiting for the full episode return (Monte Carlo), we can use a learned critic to estimate the value, reducing variance at the cost of potential bias.

The Actor-Critic Update Rule

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)$$

where:

  • Actor: the policy $\pi_\theta$, updated using policy gradients
  • Critic: a value function $V_w$ (or action-value function $Q_w$) estimating the discounted future return

Definition

Actor-Critic: A method that uses both a parametrized policy (actor) and a parametrized value function (critic), with parameters $\theta$ and $w$ respectively.

Actor-Critic with Baseline

We can further improve by subtracting a baseline $V_w(s_t)$:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\big)$$

The quantity in parentheses is the temporal difference (TD) error, used as an advantage estimate:

Formula

TD Advantage Estimate: $\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$

  • Advantage Function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
  • In practice: $A(s_t, a_t) \approx \delta_t$
  • Subtracting the value baseline reduces variance, and the gradient remains unbiased when the critic is exact
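A single online actor-critic step with a TD-error advantage can be sketched as below. The linear critic $V_w(s) = w^\top s$ and linear-softmax actor are illustrative assumptions for the sketch, not the lecture's prescribed architectures.

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One online actor-critic update (sketch, assumed linear forms)."""
    v_s = w @ s
    v_next = 0.0 if done else w @ s_next
    delta = r + gamma * v_next - v_s            # TD error = advantage estimate
    w = w + alpha_critic * delta * s            # critic: semi-gradient TD(0)
    z = s @ theta                               # actor: linear-softmax policy
    probs = np.exp(z - z.max())
    probs = probs / probs.sum()
    onehot = np.zeros(theta.shape[1])
    onehot[a] = 1.0
    grad_log_pi = np.outer(s, onehot - probs)
    theta = theta + alpha_actor * delta * grad_log_pi  # delta replaces G_t
    return theta, w, delta
```

Compared with REINFORCE, the TD error $\delta_t$ stands in for the full return, so updates happen every step rather than at episode end.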

Advantages & Disadvantages

Advantages:

  • Reduces variance compared to REINFORCE (Monte Carlo)
  • Can be used in both episodic and continuing settings
  • More sample-efficient than pure actor-only methods

Disadvantages:

  • Introduces bias (critic may be inaccurate)
  • “Fiddly”: requires managing two function approximators (actor + critic)
  • Requires stochastic policies (what if a deterministic policy is optimal?)
  • Many hyperparameters and moving parts

Deterministic Policy Gradients (DPG)

Motivation: From Stochastic to Deterministic

Intuition

All policy gradients so far learned a stochastic policy. But deterministic policies can be more sample-efficient and optimal for many tasks (e.g., continuous control where the optimal action is to apply maximum force).

Problem with direct application: A deterministic policy puts all probability mass on a single action, so the score $\nabla_\theta \log \pi_\theta(a \mid s)$ is no longer well defined and the standard (stochastic) policy gradient cannot be used.

The DPG Idea: Off-Policy Learning

To address this, we use two policies:

  • Behavior policy $\beta$: typically the actor's output perturbed by exploration noise (e.g., additive Gaussian noise)
  • Target/actor policy $\mu_\theta$: deterministic, what we want to learn

We collect data under $\beta$ but optimize $\mu_\theta$.
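A minimal sketch of such an exploratory behavior policy, assuming additive Gaussian noise around the deterministic actor (the function name and `sigma` scale are illustrative):

```python
import numpy as np

def behavior_action(mu, s, sigma=0.1, rng=None):
    """Behavior policy beta: the deterministic actor's action mu(s) plus
    additive Gaussian exploration noise (sigma is an assumed scale)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.asarray(mu(s), dtype=float)
    return a + sigma * rng.normal(size=a.shape)
```

With `sigma = 0` this reduces to the deterministic target policy itself.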

Changing the Objective

Instead of maximizing returns under the policy we’re learning, we optimize:

$$J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ Q^{\mu_\theta}(s, \mu_\theta(s)) \right]$$

where $\rho^\beta$ is the state distribution under the behavior policy $\beta$.

Key

Crucial: This objective depends only on states, not sampled actions! We integrate over state space only, removing the need for importance weights on actions.

The Deterministic Policy Gradient

Taking the gradient of the objective:

$$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) \right]$$

Using the chain rule, with the critic's action-gradient evaluated at $a = \mu_\theta(s)$:

$$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right]$$

This is the off-policy deterministic policy gradient.

Formula

DPG:

  • Deterministic actor: $a = \mu_\theta(s)$ (outputs an action, not a distribution)
  • Critic: $Q_w(s, a)$ estimates the action-value
  • Gradient: $\nabla_\theta J_\beta(\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q_w(s, a) \big|_{a = \mu_\theta(s)} \right]$
  • No importance sampling needed!
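The chain-rule gradient can be sketched concretely. This toy version assumes a linear actor $\mu_\theta(s) = \theta^\top s$ with a scalar action and takes the critic's action-derivative as a supplied function; all names are illustrative.

```python
import numpy as np

def dpg_gradient(theta, states, q_grad_a):
    """Deterministic policy gradient for an assumed linear actor
    mu_theta(s) = theta @ s (scalar action). q_grad_a(s, a) returns dQ/da;
    states are samples from the behavior distribution rho^beta."""
    grad = np.zeros_like(theta)
    for s in states:
        a = theta @ s                  # deterministic action mu_theta(s)
        # chain rule: grad_theta mu_theta(s) = s, scaled by dQ/da at a
        grad += s * q_grad_a(s, a)
    return grad / len(states)
```

For example, with $Q(s, a) = -(a - \sum_i s_i)^2$ we have $\partial Q / \partial a = -2(a - \sum_i s_i)$, and ascending this gradient drives $\mu_\theta(s)$ toward $\sum_i s_i$ with no importance weights, regardless of which behavior policy produced the states.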

Advantages of DPG

  • Off-policy learning: Can learn from data collected by any behavior policy
  • Deterministic policy: Can learn exact greedy policies
  • No importance sampling: Avoids variance explosion from importance weights
  • More sample efficient: Better for continuous control

Connection to Q-Learning

Like Q-learning, DPG finds the greedy policy that maximizes $Q$ (but tractably for continuous actions):

$$\mu_\theta(s) \approx \arg\max_a Q_w(s, a)$$


Natural Policy Gradient (NPG)

Definition

Natural Gradient: Instead of moving in the direction of steepest ascent in parameter space, move in the direction of steepest ascent in policy space (as measured by KL divergence).

The standard (vanilla) gradient is $\nabla_\theta J(\theta)$. The natural gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$

where $F(\theta)$ is the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]$$

Intuition

The Fisher Information Matrix rescales gradients to account for the geometry of the policy distribution. Directions that change the policy more receive smaller steps.

Advantages

  • Reduces variance and stabilizes learning
  • Takes into account the curvature of the policy space
  • More principled step sizes
  • Trade-off: Computational cost (inverting a matrix)
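The computation above, including the matrix inversion trade-off, can be sketched as follows. The Fisher matrix is estimated as the empirical second moment of sampled score vectors; the damping term is an assumed regularizer, not part of the lecture.

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, damping=1e-3):
    """Natural gradient F^{-1} g, with F estimated as the empirical
    second moment of score vectors grad log pi (sketch; damping is an
    assumed regularizer that keeps F invertible)."""
    S = np.asarray(score_samples)             # shape (N, d)
    F = S.T @ S / len(S)                      # F ~ E[score score^T]
    F = F + damping * np.eye(F.shape[0])
    return np.linalg.solve(F, vanilla_grad)   # solve, not explicit inverse
```

Solving the linear system rather than forming $F^{-1}$ explicitly is the usual way to pay the inversion cost; for large parameter counts, practical methods approximate this solve (e.g., with conjugate gradient).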

Advantage Function & Generalized Advantage Estimation (GAE)

The Advantage Function

The advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better an action is than the baseline value of the state.

Formula

TD Advantage: $\hat{A}_t = \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. Also called the temporal difference error or 1-step advantage.

Generalized Advantage Estimation (GAE)

A single TD step is biased. We can take multiple steps:

$$\hat{A}_t^{(k)} = \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{k-1} \delta_{t+k-1} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^k V(s_{t+k}) - V(s_t)$$

(As $k \to \infty$, this is the Monte Carlo return minus the baseline.)

GAE smooths between these with an exponentially $\lambda$-weighted average of the TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}$$

Key

$\lambda$ controls the bias-variance tradeoff:

  • $\lambda = 0$: Single TD step (low variance, high bias)
  • $\lambda = 1$: Monte Carlo (high variance, unbiased)
  • $0 < \lambda < 1$: Interpolation between the two
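The GAE sum has a simple backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$, which can be sketched as (function name and array convention are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l delta_{t+l}.
    Assumes values has length len(rewards) + 1, i.e. it includes a
    bootstrap value for the state after the last reward."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # recursion
        adv[t] = running
    return adv
```

Setting `lam=0` reduces each $\hat{A}_t$ to the single TD error $\delta_t$, while `lam=1` recovers the discounted return minus $V(s_t)$, matching the two endpoints above.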

Trust Region Methods (Overview)

While not detailed here, Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are important advanced methods built on PGT and actor-critic.

Tip

Key idea: Constrain policy updates to a “trust region” where the local approximation of the objective $J(\theta)$ is accurate, preventing large destructive policy changes.


The Policy Search Landscape

                    RL Methods
                        |
        ________________|________________
       |                                 |
   Value-Based                    Policy-Based
       |                                 |
   Q-learning          Policy Gradient Methods
   SARSA              /      |          |      \
   MC                /       |          |       \
                 REINFORCE  PGT    Actor-Critic  DPG
                     |       |          |         |
                   (v1)     (v2)    (w/baseline) (deterministic)
                                      /  |  \
                                  1-step GAE ∞-step

When to Use Policy-Based Methods

Policy search methods are typically preferred when:

  1. Continuous action spaces: Easy to parameterize continuous policies; hard to handle with Q-learning
  2. Stochastic policies needed: Exploration naturally built in
  3. Prior knowledge about policy structure: Can encode domain knowledge in policy architecture
  4. Small policy updates required: For physical systems (robots) that can’t handle sudden policy changes
  5. Deterministic optimal policies: Use DPG for true greedy policies

Summary & Key Takeaways

Summary

Core Contributions of This Lecture:

  1. Policy Gradient Theorem: Replaces hard-to-estimate returns with $Q$-function expectations, enabling actor-critic methods

  2. Actor-Critic: Combines policy updates (actor) with value function learning (critic) to reduce variance while maintaining theoretical grounding

  3. Deterministic Policy Gradients: Enables off-policy learning of deterministic policies without importance sampling weights

  4. Advantage Functions & GAE: Provides a principled way to interpolate between biased TD and unbiased MC advantage estimates

  5. Natural Policy Gradient: Incorporates policy space geometry for more stable, principled updates


New Concepts to Explore

The following concepts are introduced but require deeper study:

  • Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), sketched only briefly above


References

  • Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS.
  • Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. ICML.
  • Peters, J., & Schaal, S. (2008). Reinforcement Learning of Motor Skills with Policy Search. Handbook of Robotics.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.