RL Lecture 10 - Advanced Policy Search Methods

Overview & Motivation

This lecture builds on foundational policy gradient methods, moving from simple stochastic policies to more sophisticated approaches. We tackle two main challenges: high variance in policy gradient estimates and the need for deterministic policies in continuous action spaces. The lecture introduces the Policy Gradient Theorem (PGT), actor-critic methods that use critics to reduce variance, and deterministic policy gradients for learning greedy policies off-policy.

The core tension in policy-based RL is the bias-variance tradeoff: Monte Carlo returns are unbiased but high-variance, while TD estimates are lower-variance but potentially biased.


REINFORCE v2: Revisited

Before moving to advanced methods, we recap the Monte Carlo policy gradient approach:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

This can be estimated from sampled trajectories as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, G_t^i, \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$$

Tip

Only future rewards ($G_t$) contribute to the gradient at time $t$, since actions taken at time $t$ cannot affect rewards already received (causality).
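The return-to-go and the resulting gradient estimate can be sketched as follows. This is a minimal illustration assuming a linear-softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(s^\top \theta)$; the function names and the policy form are illustrative choices, not part of the lecture.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Monte Carlo return G_t = sum_{k>=t} gamma^(k-t) * r_{k+1}:
    only rewards from time t onward contribute (causality)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """REINFORCE estimate: sum_t grad log pi(a_t|s_t) * G_t for an
    assumed linear-softmax policy pi(a|s) = softmax(s @ theta)."""
    G = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, G):
        probs = softmax(s @ theta)
        onehot = np.zeros(theta.shape[1])
        onehot[a] = 1.0
        # grad log pi(a|s) for linear softmax: outer(s, onehot(a) - probs)
        grad += np.outer(s, onehot - probs) * g
    return grad
```

Note that each $G_t$ sums only rewards from $t$ onward, matching the causality tip above.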


Policy Gradient Theorem

Formal Statement

The Policy Gradient Theorem elegantly replaces hard-to-estimate expected returns with the action-value function:

$$\nabla_\theta J(\theta) \propto \sum_{s} \mu(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s)$$

Equivalently (restating as an expectation):

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu,\, a \sim \pi_\theta}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$$

where $\mu(s)$ is the on-policy distribution of states.

Key Insight

  • State-action pairs at each timestep contribute equally to the total gradient
  • This is proportional to an expectation over the on-policy distribution over state-action pairs
  • We can replace expected returns with learned estimates

Reference

See Sutton et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation for the formal proof (including the continuing case).


Actor-Critic Methods

Motivation: Addressing Variance

Intuition

Instead of waiting for the full episode return (Monte Carlo), we can use a learned critic to estimate the value, reducing variance at the cost of potential bias.

The Actor-Critic Update Rule

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)$$

where:

  • Actor: the policy $\pi_\theta$, updated using policy gradients
  • Critic: a value function $V_w$ (or action-value function $Q_w$) estimating the discounted future return

Definition

Actor-Critic: A method that uses both a parametrized policy (actor) and a parametrized value function (critic), with parameters $\theta$ and $w$ respectively.

Actor-Critic with Baseline

We can further improve by subtracting a baseline $V_w(s_t)$:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\big)$$

The quantity in parentheses is the temporal difference (TD) error, used as an advantage estimate:

Formula

TD Advantage Estimate: $\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$

  • Advantage Function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
  • In practice: $A(s_t, a_t) \approx \delta_t$
  • Subtracting the value baseline reduces variance, and the gradient remains unbiased when the critic is exact
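A single online actor-critic step with a TD-error advantage can be sketched as below. The linear critic $V_w(s) = w^\top s$ and linear-softmax actor are illustrative assumptions for the sketch, not the lecture's prescribed architectures.

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One online actor-critic update (sketch, assumed linear forms)."""
    v_s = w @ s
    v_next = 0.0 if done else w @ s_next
    delta = r + gamma * v_next - v_s            # TD error = advantage estimate
    w = w + alpha_critic * delta * s            # critic: semi-gradient TD(0)
    z = s @ theta                               # actor: linear-softmax policy
    probs = np.exp(z - z.max())
    probs = probs / probs.sum()
    onehot = np.zeros(theta.shape[1])
    onehot[a] = 1.0
    grad_log_pi = np.outer(s, onehot - probs)
    theta = theta + alpha_actor * delta * grad_log_pi  # delta replaces G_t
    return theta, w, delta
```

Compared with REINFORCE, the TD error $\delta_t$ stands in for the full return, so updates happen every step rather than at episode end.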

Advantages & Disadvantages

Advantages:

  • Reduces variance compared to REINFORCE (Monte Carlo)
  • Can be used in both episodic and continuing settings
  • More sample-efficient than pure actor-only methods

Disadvantages:

  • Introduces bias (critic may be inaccurate)
  • “Fiddly”: requires managing two function approximators (actor + critic)
  • Requires stochastic policies (what if a deterministic policy is optimal?)
  • Many hyperparameters and moving parts

Deterministic Policy Gradients (DPG)

Motivation: From Stochastic to Deterministic

Intuition

All policy gradients so far learned a stochastic policy. But deterministic policies can be more sample-efficient and optimal for many tasks (e.g., continuous control where the optimal action is to apply maximum force).

Problem with direct application: A deterministic policy puts all probability mass on a single action, so the score $\nabla_\theta \log \pi_\theta(a \mid s)$ is no longer well defined and the standard (stochastic) policy gradient cannot be used.

The DPG Idea: Off-Policy Learning

To address this, we use two policies:

  • Behavior policy $\beta$: typically the actor's output perturbed by exploration noise (e.g., additive Gaussian noise)
  • Target/actor policy $\mu_\theta$: deterministic, what we want to learn

We collect data under $\beta$ but optimize $\mu_\theta$.
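A minimal sketch of such an exploratory behavior policy, assuming additive Gaussian noise around the deterministic actor (the function name and `sigma` scale are illustrative):

```python
import numpy as np

def behavior_action(mu, s, sigma=0.1, rng=None):
    """Behavior policy beta: the deterministic actor's action mu(s) plus
    additive Gaussian exploration noise (sigma is an assumed scale)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.asarray(mu(s), dtype=float)
    return a + sigma * rng.normal(size=a.shape)
```

With `sigma = 0` this reduces to the deterministic target policy itself.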

Changing the Objective

Instead of maximizing returns under the policy we’re learning, we optimize:

$$J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ Q^{\mu_\theta}(s, \mu_\theta(s)) \right]$$

where $\rho^\beta$ is the state distribution under the behavior policy $\beta$.

Key

Crucial: This objective depends only on states, not sampled actions! We integrate over state space only, removing the need for importance weights on actions.

The Deterministic Policy Gradient

Taking the gradient of the objective:

$$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) \right]$$

Using the chain rule, with the critic's action-gradient evaluated at $a = \mu_\theta(s)$:

$$\nabla_\theta J_\beta(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right]$$

This is the off-policy deterministic policy gradient.

Formula

DPG:

  • Deterministic actor: $a = \mu_\theta(s)$ (outputs an action, not a distribution)
  • Critic: $Q_w(s, a)$ estimates the action-value
  • Gradient: $\nabla_\theta J_\beta(\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q_w(s, a) \big|_{a = \mu_\theta(s)} \right]$
  • No importance sampling needed!
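The chain-rule gradient can be sketched concretely. This toy version assumes a linear actor $\mu_\theta(s) = \theta^\top s$ with a scalar action and takes the critic's action-derivative as a supplied function; all names are illustrative.

```python
import numpy as np

def dpg_gradient(theta, states, q_grad_a):
    """Deterministic policy gradient for an assumed linear actor
    mu_theta(s) = theta @ s (scalar action). q_grad_a(s, a) returns dQ/da;
    states are samples from the behavior distribution rho^beta."""
    grad = np.zeros_like(theta)
    for s in states:
        a = theta @ s                  # deterministic action mu_theta(s)
        # chain rule: grad_theta mu_theta(s) = s, scaled by dQ/da at a
        grad += s * q_grad_a(s, a)
    return grad / len(states)
```

For example, with $Q(s, a) = -(a - \sum_i s_i)^2$ we have $\partial Q / \partial a = -2(a - \sum_i s_i)$, and ascending this gradient drives $\mu_\theta(s)$ toward $\sum_i s_i$ with no importance weights, regardless of which behavior policy produced the states.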

Advantages of DPG

  • Off-policy learning: Can learn from data collected by any behavior policy
  • Deterministic policy: Can learn exact greedy policies
  • No importance sampling: Avoids variance explosion from importance weights
  • More sample efficient: Better for continuous control

Connection to Q-Learning

Like Q-learning, DPG finds the greedy policy that maximizes $Q$ (but tractably for continuous actions):

$$\mu_\theta(s) \approx \arg\max_a Q_w(s, a)$$


Natural Policy Gradient (NPG)

Definition

Natural Gradient: Instead of moving in the direction of steepest ascent in parameter space, move in the direction of steepest ascent in policy space (as measured by KL divergence).

The standard (vanilla) gradient is $\nabla_\theta J(\theta)$. The natural gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$

where $F(\theta)$ is the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]$$

Intuition

The Fisher Information Matrix rescales gradients to account for the geometry of the policy distribution. Directions that change the policy more receive smaller steps.

Advantages

  • Reduces variance and stabilizes learning
  • Takes into account the curvature of the policy space
  • More principled step sizes
  • Trade-off: Computational cost (inverting a matrix)
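The computation above, including the matrix inversion trade-off, can be sketched as follows. The Fisher matrix is estimated as the empirical second moment of sampled score vectors; the damping term is an assumed regularizer, not part of the lecture.

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, damping=1e-3):
    """Natural gradient F^{-1} g, with F estimated as the empirical
    second moment of score vectors grad log pi (sketch; damping is an
    assumed regularizer that keeps F invertible)."""
    S = np.asarray(score_samples)             # shape (N, d)
    F = S.T @ S / len(S)                      # F ~ E[score score^T]
    F = F + damping * np.eye(F.shape[0])
    return np.linalg.solve(F, vanilla_grad)   # solve, not explicit inverse
```

Solving the linear system rather than forming $F^{-1}$ explicitly is the usual way to pay the inversion cost; for large parameter counts, practical methods approximate this solve (e.g., with conjugate gradient).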

Advantage Function & Generalized Advantage Estimation (GAE)

The Advantage Function

The advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better an action is than the baseline value of the state.

Formula

TD Advantage: $\hat{A}_t = \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. Also called the temporal difference error or 1-step advantage.

Generalized Advantage Estimation (GAE)

A single TD step is biased. We can take multiple steps:

$$\hat{A}_t^{(k)} = \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{k-1} \delta_{t+k-1} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^k V(s_{t+k}) - V(s_t)$$

(As $k \to \infty$, this is the Monte Carlo return minus the baseline.)

GAE smooths between these with an exponentially $\lambda$-weighted average of the TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}$$

Key

$\lambda$ controls the bias-variance tradeoff:

  • $\lambda = 0$: Single TD step (low variance, high bias)
  • $\lambda = 1$: Monte Carlo (high variance, unbiased)
  • $0 < \lambda < 1$: Interpolation between the two
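The GAE sum has a simple backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$, which can be sketched as (function name and array convention are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l delta_{t+l}.
    Assumes values has length len(rewards) + 1, i.e. it includes a
    bootstrap value for the state after the last reward."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # recursion
        adv[t] = running
    return adv
```

Setting `lam=0` reduces each $\hat{A}_t$ to the single TD error $\delta_t$, while `lam=1` recovers the discounted return minus $V(s_t)$, matching the two endpoints above.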

Trust Region Methods (Overview)

While not detailed here, Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are important advanced methods built on PGT and actor-critic.

Tip

Key idea: Constrain policy updates to a “trust region” where the local approximation of the objective $J(\theta)$ is accurate, preventing large destructive policy changes.


The Policy Search Landscape

                    RL Methods
                        |
        ________________|________________
       |                                 |
   Value-Based                    Policy-Based
       |                                 |
   Q-learning          Policy Gradient Methods
   SARSA              /      |          |      \
   MC                /       |          |       \
                 REINFORCE  PGT    Actor-Critic  DPG
                     |       |          |         |
                   (v1)     (v2)    (w/baseline) (deterministic)
                                      /  |  \
                                  1-step GAE ∞-step

When to Use Policy-Based Methods

Policy search methods are typically preferred when:

  1. Continuous action spaces: Easy to parameterize continuous policies; hard to handle with Q-learning
  2. Stochastic policies needed: Exploration naturally built in
  3. Prior knowledge about policy structure: Can encode domain knowledge in policy architecture
  4. Small policy updates required: For physical systems (robots) that can’t handle sudden policy changes
  5. Deterministic optimal policies: Use DPG for true greedy policies

Summary & Key Takeaways

Summary

Core Contributions of This Lecture:

  1. Policy Gradient Theorem: Replaces hard-to-estimate returns with $Q$-function expectations, enabling actor-critic methods

  2. Actor-Critic: Combines policy updates (actor) with value function learning (critic) to reduce variance while maintaining theoretical grounding

  3. Deterministic Policy Gradients: Enables off-policy learning of deterministic policies without importance sampling weights

  4. Advantage Functions & GAE: Provides a principled way to interpolate between biased TD and unbiased MC advantage estimates

  5. Natural Policy Gradient: Incorporates policy space geometry for more stable, principled updates


New Concepts to Explore

The following concepts are introduced but require deeper study:

  • Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), sketched only briefly above


References

  • Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS.
  • Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. ICML.
  • Peters, J., & Schaal, S. (2008). Reinforcement Learning of Motor Skills with Policy Search. Handbook of Robotics.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.