Deterministic Policy Gradient (DPG)
Definition
Deterministic Policy Gradient is a policy gradient method that learns a deterministic policy (outputting a single action per state) using off-policy data. Unlike standard stochastic policy gradients, DPG does not require importance sampling weights, making it more sample-efficient.
Intuition
Problem: Standard policy gradients learn stochastic policies, but:
- Stochastic policies have inherent exploration but are sample-inefficient
- For many continuous control tasks, the optimal policy is deterministic (e.g., apply maximum torque)
Solution: Use a deterministic actor policy while learning from data collected by a stochastic behavior policy. The key insight is that the objective function depends only on states (not sampled actions), so importance weights are unnecessary.
Mathematical Formulation
Off-Policy Objective
Instead of maximizing return under the policy being learned, optimize:

$$J_\beta(\mu_\theta) = \int_S \rho^\beta(s)\, Q^\mu(s, \mu_\theta(s))\, ds$$

where:
- $\rho^\beta(s)$ = state visitation distribution under the behavior policy $\beta$
- $Q^\mu(s, a)$ = learned action-value function
- $\mu_\theta(s)$ = deterministic actor policy (outputs a single action, not a distribution)
Crucial: The integral is only over states, not actions. This avoids the need for importance weights on actions.
The DPG Formula
Taking the gradient:

$$\nabla_\theta J_\beta(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}\right]$$

At $a = \mu_\theta(s)$: the critic's action-gradient $\nabla_a Q^\mu(s, a)$ is evaluated at the action the actor outputs, and the chain rule propagates it back through the actor's parameters $\theta$.
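The chain rule above can be checked numerically on a toy problem. This is a minimal sketch: the linear actor $\mu_\theta(s) = \theta s$ and the quadratic critic $Q(s, a) = -(a - 2s)^2$ are illustrative assumptions, not from the text.

```python
# Numerically verify the DPG chain rule on a 1-D toy problem.
# Assumed forms: mu_theta(s) = theta * s, Q(s, a) = -(a - 2s)^2.

def mu(theta, s):
    return theta * s

def Q(s, a):
    return -(a - 2.0 * s) ** 2

def dpg_gradient(theta, s):
    # Chain rule: dJ/dtheta = (dmu/dtheta) * (dQ/da) evaluated at a = mu_theta(s)
    a = mu(theta, s)
    dmu_dtheta = s                   # derivative of theta*s w.r.t. theta
    dQ_da = -2.0 * (a - 2.0 * s)     # derivative of the toy critic w.r.t. a
    return dmu_dtheta * dQ_da

def numerical_gradient(theta, s, eps=1e-6):
    # Finite-difference gradient of J(theta) = Q(s, mu_theta(s))
    return (Q(s, mu(theta + eps, s)) - Q(s, mu(theta - eps, s))) / (2 * eps)

theta, s = 0.5, 1.5
print(dpg_gradient(theta, s), numerical_gradient(theta, s))  # both 6.75
```

The finite-difference estimate matching the analytic chain-rule product is exactly the content of the DPG formula: the actor gradient is routed through the critic's action-gradient.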
Practical Update Rule
Actor update (DPG):

$$\theta \leftarrow \theta + \alpha_\theta\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}$$

Critic update (Q-learning):

$$\delta = r + \gamma\, Q_w(s', \mu_\theta(s')) - Q_w(s, a), \qquad w \leftarrow w + \alpha_w\, \delta\, \nabla_w Q_w(s, a)$$
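One step of these coupled updates can be sketched with linear function approximators. The feature maps, learning rates, and the single transition below are illustrative assumptions; only the update equations come from the section above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)   # actor parameters: mu_theta(s) = theta @ phi(s)
w = rng.normal(size=4)       # critic parameters: Q_w(s, a) = w @ psi(s, a)
alpha_theta, alpha_w, gamma = 0.01, 0.05, 0.99

def phi(s):                  # state features (assumed)
    return np.array([1.0, s, s * s])

def psi(s, a):               # state-action features (assumed)
    return np.array([1.0, s, a, s * a])

def mu(s):
    return float(theta @ phi(s))

def Q(s, a):
    return float(w @ psi(s, a))

# One transition (s, a, r, s') collected by a behavior policy (values assumed)
s, a, r, s_next = 0.3, mu(0.3) + 0.1, 1.0, 0.5

# Critic update (Q-learning): the TD target bootstraps with the actor's action
delta = r + gamma * Q(s_next, mu(s_next)) - Q(s, a)
w = w + alpha_w * delta * psi(s, a)

# Actor update (DPG): grad_theta mu(s) * grad_a Q_w(s, a) at a = mu(s)
grad_a_Q = w[2] + w[3] * s   # d/da of w @ psi(s, a) for these linear features
theta = theta + alpha_theta * phi(s) * grad_a_Q
```

Note the asymmetry: the critic learns from the sampled action $a$, while the actor is improved through the critic's gradient at its own action $\mu_\theta(s)$.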
Key Properties
1. Off-Policy Learning
- Learns from data collected by any behavior policy
- Behavior policy typically: $\beta(s) = \mu_\theta(s) + \mathcal{N}$ (deterministic policy plus exploration noise)
- Much more sample-efficient than on-policy methods
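The noisy behavior policy above can be sketched in a few lines. The Gaussian noise scale, the stand-in actor, and the clipping range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def mu(s):
    return 0.8 * s        # stand-in deterministic actor (assumed)

def behavior_action(s, noise_std=0.2, low=-1.0, high=1.0):
    # beta(s) = mu(s) + noise, clipped to the valid action range
    a = mu(s) + rng.normal(0.0, noise_std)
    return float(np.clip(a, low, high))

actions = [behavior_action(0.5) for _ in range(5)]
print(actions)  # noisy variations around mu(0.5) = 0.4
```

Gaussian noise is the simplest choice; DDPG originally used temporally correlated Ornstein-Uhlenbeck noise for the same purpose.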
2. No Importance Sampling Weights
- Standard off-policy methods need importance weights $\pi_\theta(a \mid s) / \beta(a \mid s)$
- This can have very high variance (variance explosion with continuous actions)
- DPG avoids this entirely
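The variance problem can be demonstrated empirically. This sketch assumes the target and behavior policies are Gaussians with slightly different means, a setup chosen only to illustrate the ratio blowing up; DPG never forms these ratios at all.

```python
import math
import random

random.seed(0)

def gauss_pdf(x, mean, std):
    # Density of a 1-D Gaussian, written out to stay dependency-free
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

mu_target, mu_behavior, std = 1.0, 0.0, 0.5
samples = [random.gauss(mu_behavior, std) for _ in range(10000)]
weights = [gauss_pdf(a, mu_target, std) / gauss_pdf(a, mu_behavior, std)
           for a in samples]

mean_w = sum(weights) / len(weights)
var_w = sum((w - mean_w) ** 2 for w in weights) / len(weights)
print(mean_w, var_w)  # mean stays near 1, but the variance is enormous
```

Even with this modest mismatch the weight variance dwarfs the mean; as the action dimension grows, the ratio of products of densities makes this exponentially worse.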
3. Deterministic Policy
- Can learn policies that always take the same action in a given state
- Matches the greedy solution: $\mu(s) = \arg\max_a Q(s, a)$
- Often optimal for well-shaped reward functions
Variants & Extensions
Deep Deterministic Policy Gradient (DDPG)
Applied DPG with deep neural networks:
- Actor network: $\mu_\theta(s)$, mapping states to actions
- Critic network: $Q_w(s, a)$, estimating action values
- Target networks for stability
- Experience replay buffer
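Two of the stabilization components listed above can be sketched without any deep-learning framework: an experience replay buffer and Polyak (soft) target-network updates. The capacity, `tau` value, and flat parameter lists are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; old entries are evicted."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

def soft_update(target_params, online_params, tau=0.005):
    # target <- tau * online + (1 - tau) * target, applied per parameter
    return [tau * p + (1.0 - tau) * t
            for p, t in zip(online_params, target_params)]

buf = ReplayBuffer()
for i in range(100):
    buf.add(i, 0.0, 1.0, i + 1, False)
batch = buf.sample(8)

target = soft_update(target_params=[0.0, 0.0], online_params=[1.0, 2.0])
print(target)  # [0.005, 0.01]
```

The slowly moving target network keeps the critic's bootstrap target stable, which is what makes Q-learning with function approximation workable here.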
Soft Actor-Critic (SAC)
- Extends DPG to maximum entropy RL
- Maintains a stochastic policy (for exploration); its reparameterized policy gradient is closely related to the deterministic gradient in DPG
- Entropy regularization: $J(\pi) = \mathbb{E}\left[\sum_t r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right]$
TD3 (Twin Delayed DDPG)
- Addresses overestimation bias in Q-learning
- Uses two critic networks (twin Q-networks)
- Delayed policy updates
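TD3's core fix, the clipped double-Q target, is a one-line change to the critic's bootstrap. The concrete reward, discount, and critic values below are illustrative assumptions.

```python
# TD3's clipped double-Q target: bootstrap with the minimum of the two
# critics' estimates at the next state to curb overestimation bias.

def td3_target(r, gamma, q1_next, q2_next, done):
    # y = r + gamma * min(Q1', Q2') for non-terminal transitions
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return r + bootstrap

y = td3_target(r=1.0, gamma=0.99, q1_next=10.0, q2_next=8.0, done=False)
print(y)  # 1.0 + 0.99 * 8.0 = 8.92
```

Taking the minimum trades a small underestimation bias for protection against the runaway overestimation that a single bootstrapped critic can exhibit.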
Comparison: Stochastic vs. Deterministic
| Aspect | Stochastic PG | DPG |
|---|---|---|
| Policy | $\pi_\theta(a \mid s)$ (distribution) | $\mu_\theta(s)$ (deterministic) |
| Exploration | Built-in (entropy) | Behavior policy adds noise |
| Off-policy | Requires importance weights | No weights needed |
| Gradient | $\mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)]$ | $\mathbb{E}[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)]$ |
| Sample efficiency | Lower | Higher |
| Optimal policy | Can be stochastic | Deterministic |
Advantages
✓ High sample efficiency - Off-policy learning with continuous actions
✓ No importance weights - Avoids variance explosion
✓ Deterministic optimal policies - Matches greedy solution
✓ Continuous action spaces - Natural for continuous control
Disadvantages
✗ Requires learning both actor and critic (two networks)
✗ Can suffer from Q-function overestimation (addressed by TD3)
✗ Needs careful tuning of target networks and learning rates
✗ Less exploration than stochastic policies (relies on behavior policy)
Connections
- Related to: Q-Learning, Actor-Critic, Policy Gradient Methods
- Extends: Policy Gradient Theorem from stochastic to deterministic
- Foundation for: Soft Actor-Critic (SAC), DDPG, TD3
- Appears in: Deep Reinforcement Learning
Key References
- Silver, D., et al. (2014). Deterministic Policy Gradient Algorithms. ICML.
- Lillicrap, T., et al. (2016). Continuous Control with Deep Reinforcement Learning (DDPG). ICLR.
- Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods (TD3). ICML.