Deterministic Policy Gradient (DPG)

Definition

Deterministic Policy Gradient is a policy gradient method that learns a deterministic policy (outputting a single action per state) using off-policy data. Unlike standard stochastic policy gradients, DPG does not require importance sampling weights, making it more sample-efficient.


Intuition

Problem: Standard policy gradients learn stochastic policies, but:

  • Stochastic policies have inherent exploration but are sample-inefficient
  • For many continuous control tasks, the optimal policy is deterministic (e.g., apply maximum torque)

Solution: Use a deterministic actor policy while learning from data collected by a stochastic behavior policy. The key insight is that the objective function depends only on states (not sampled actions), so importance weights are unnecessary.


Mathematical Formulation

Off-Policy Objective

Instead of maximizing return under the policy being learned, optimize the critic evaluated at the actor's action, averaged over states visited by the behavior policy β:

  J_β(θ) = ∫_S ρ^β(s) Q^μ(s, μ_θ(s)) ds

where:

  • ρ^β(s) = state visitation distribution under the behavior policy β
  • Q^μ(s, a) = learned action-value function
  • μ_θ(s) = deterministic actor policy (outputs a single action, not a distribution)

Crucial: The integral is only over states, not actions. This avoids the need for importance weights on actions.

The DPG Formula

Taking the gradient and applying the chain rule through the actor:

  ∇_θ J_β(θ) ≈ E_{s∼ρ^β} [ ∇_θ μ_θ(s) ∇_a Q^μ(s, a) |_{a=μ_θ(s)} ]

The critic is differentiated with respect to the action, evaluated at the actor's output a = μ_θ(s), and the result is chained through the actor's parameters θ.

Practical Update Rule

  1. Actor update (DPG):

     θ ← θ + α_θ ∇_θ μ_θ(s) ∇_a Q_w(s, a) |_{a=μ_θ(s)}

  2. Critic update (Q-learning):

     δ = r + γ Q_w(s′, μ_θ(s′)) − Q_w(s, a)
     w ← w + α_w δ ∇_w Q_w(s, a)

Key Properties

1. Off-Policy Learning

  • Learns from data collected by any behavior policy
  • Behavior policy typically: β(s) = μ_θ(s) + 𝒩 (the deterministic policy plus exploration noise)
  • Much more sample-efficient than on-policy methods
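As a concrete example of such a behavior policy, the original DDPG implementation adds temporally correlated Ornstein-Uhlenbeck noise to the deterministic action. A minimal sketch (class name and hyperparameters are illustrative):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: mean-reverting, temporally correlated
    noise added to mu_theta(s) to form the behavior policy beta."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * dW  (Euler step)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt)
              * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x
```

Because consecutive samples are correlated, the resulting exploration is smoother than i.i.d. Gaussian noise, which suits physical control tasks.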

2. No Importance Sampling Weights

  • Standard off-policy methods need importance weights π_θ(a|s) / β(a|s)
  • This can have very high variance (variance explosion with continuous actions)
  • DPG avoids this entirely
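A small numerical illustration of why per-action importance ratios are problematic (toy Gaussians, not from the source): even for a single step in one dimension, the ratio has mean 1 but large variance, and products of such ratios over a trajectory compound exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(a, mean, std):
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

# actions sampled from the behavior policy beta = N(0, 1)
a = rng.normal(0.0, 1.0, size=200_000)

# importance ratios pi(a) / beta(a) for a target policy pi = N(2, 0.5)
log_ratio = log_normal_pdf(a, 2.0, 0.5) - log_normal_pdf(a, 0.0, 1.0)
ratio = np.exp(log_ratio)

# E_beta[ratio] = 1 by construction, but the variance is an order of
# magnitude larger than the mean even in this mild 1-D case
mean, var = ratio.mean(), ratio.var()
```

DPG sidesteps this entirely because its objective averages only over states, so no ratio over actions ever appears.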

3. Deterministic Policy

  • Can learn policies that always take the same action in a given state
  • Matches the greedy solution: μ(s) = argmax_a Q(s, a)
  • Often optimal for well-shaped reward functions

Variants & Extensions

Deep Deterministic Policy Gradient (DDPG)

Applied DPG with deep neural networks:

  • Actor network: μ_θ(s)
  • Critic network: Q_w(s, a)
  • Target networks for stability
  • Experience replay buffer
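The replay-buffer and target-network machinery above can be sketched as follows (a minimal toy version; names and hyperparameters are illustrative):

```python
import random
from collections import deque
import numpy as np

# Experience replay: a bounded FIFO of transitions, sampled uniformly
buffer = deque(maxlen=100_000)

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def sample_batch(batch_size):
    batch = random.sample(buffer, batch_size)
    # transpose list of transitions into arrays per field
    return [np.array(column) for column in zip(*batch)]

def soft_update(target_params, online_params, tau=0.005):
    # Polyak averaging: target networks slowly track the online networks,
    # which stabilizes the bootstrapped critic targets
    for name in target_params:
        target_params[name] = (
            tau * online_params[name] + (1.0 - tau) * target_params[name]
        )
```

Uniform sampling from the buffer breaks the temporal correlation of consecutive transitions, and the slow-moving targets keep the critic's regression target from chasing itself.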

Soft Actor-Critic (SAC)

  • Extends these ideas to maximum entropy RL
  • Maintains a stochastic policy (for exploration) and trains it with reparameterized, pathwise gradients through the critic, closely related in spirit to the deterministic gradient
  • Entropy regularization: J(π) = Σ_t E_{(s_t, a_t)∼ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ]

TD3 (Twin Delayed DDPG)

  • Addresses overestimation bias in Q-learning
  • Uses two critic networks (twin Q-networks)
  • Delayed policy updates
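A hedged sketch of TD3's target computation, combining the twin critics with target policy smoothing (function names and defaults are assumptions, loosely following Fujimoto et al.):

```python
import numpy as np

def td3_target(q1, q2, mu_target, r, s_next, done, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    # Target policy smoothing: perturb the target action with clipped noise
    noise = np.clip(rng.normal(0.0, noise_std), -noise_clip, noise_clip)
    a_next = np.clip(mu_target(s_next) + noise, a_low, a_high)
    # Clipped double-Q: bootstrap from the pessimistic (minimum) twin critic,
    # which counteracts the overestimation bias of a single critic
    q_min = min(q1(s_next, a_next), q2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min
```

The "delayed" part is simply updating the actor (and targets) once per several critic updates, so the actor always ascends a relatively settled Q-estimate.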

Comparison: Stochastic vs. Deterministic

Aspect            | Stochastic PG                    | DPG
Policy            | distribution π_θ(a|s)            | deterministic μ_θ(s)
Exploration       | Built-in (entropy)               | Behavior policy adds noise
Off-policy        | Requires importance weights      | No weights needed
Gradient          | ∇_θ log π_θ(a|s) Q^π(s, a)       | ∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}
Sample efficiency | Lower                            | Higher
Optimal policy    | Can be stochastic                | Deterministic

Advantages

✓ High sample efficiency - off-policy learning with continuous actions
✓ No importance weights - avoids variance explosion
✓ Deterministic optimal policies - matches the greedy solution
✓ Continuous action spaces - natural for continuous control


Disadvantages

✗ Requires learning both actor and critic (two networks)
✗ Can suffer from Q-function overestimation (addressed by TD3)
✗ Needs careful tuning of target networks and learning rates
✗ Less exploration than stochastic policies (relies on behavior policy)



Key References

  1. Silver, D., et al. (2014). Deterministic Policy Gradient Algorithms. ICML.
  2. Lillicrap, T., et al. (2016). Continuous Control with Deep Reinforcement Learning (DDPG). ICLR.
  3. Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods (TD3). ICML.