Gaussian Policy
Definition
A Gaussian policy is a stochastic policy for continuous action spaces that models the action distribution as a multivariate Gaussian (normal) distribution:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \Sigma_\theta(s)\big)$$

where:
- $\mu_\theta(s)$ is the mean (mean action), parameterized by $\theta$
- $\Sigma_\theta(s)$ is the covariance matrix (controls exploration magnitude)
- $a \in \mathbb{R}^d$ is the continuous action
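As a concrete illustration of this definition, here is a minimal sketch of a diagonal Gaussian policy in PyTorch. The architecture, the `obs_dim`/`act_dim` values, and the zero-initialized log-std are illustrative assumptions, not prescribed by anything above.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: mean from a small MLP, one learned log-std per action dim."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log-std, exponentiated to guarantee sigma > 0.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Usage: sample an action and evaluate its log-probability.
policy = GaussianPolicy(obs_dim=3, act_dim=2)
dist = policy(torch.randn(3))
action = dist.sample()
log_prob = dist.log_prob(action).sum()  # sum over independent action dimensions
```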
Intuition
For continuous control (e.g., robot joint angles, continuous force), we need a policy that:
- Learns a preferred action (the mean)
- Maintains uncertainty/exploration around that mean
- Adjusts both mean and variance based on state
A Gaussian policy naturally provides all three: it is differentiable, supported on all of $\mathbb{R}^d$, and captures both exploitation (the mean) and exploration (the variance).
Mathematical Formulation
Probability Density
For a diagonal Gaussian (a common simplification, with $\Sigma = \mathrm{diag}(\sigma_1^2(s), \ldots, \sigma_d^2(s))$):

$$\pi_\theta(a \mid s) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\,\sigma_i^2(s)}} \exp\!\left(-\frac{(a_i - \mu_i(s))^2}{2\sigma_i^2(s)}\right)$$
Log-Policy (for gradient computation)
$$\log \pi_\theta(a \mid s) = -\sum_{i=1}^{d} \left[ \frac{(a_i - \mu_i(s))^2}{2\sigma_i^2(s)} + \log \sigma_i(s) \right] - \frac{d}{2}\log(2\pi)$$
Gradient w.r.t. Mean
$$\nabla_{\mu_i} \log \pi_\theta(a \mid s) = \frac{a_i - \mu_i(s)}{\sigma_i^2(s)}$$
Interpretation: update the mean in the direction of the action error $a_i - \mu_i(s)$, scaled by the inverse variance.
Gradient w.r.t. Variance
$$\nabla_{\sigma_i} \log \pi_\theta(a \mid s) = \frac{(a_i - \mu_i(s))^2 - \sigma_i^2(s)}{\sigma_i^3(s)}$$
This shows the variance should increase when sampled actions land more than one standard deviation from the mean (the numerator is positive) and decrease when they land closer.
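The two gradients above are easy to verify numerically. The sketch below compares the analytic expressions against PyTorch autograd for a single action dimension; the values of `mu`, `sigma`, and `a` are arbitrary test inputs.

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(0.8, requires_grad=True)
a = torch.tensor(1.3)

# Autograd gradients of log pi(a | s) w.r.t. mu and sigma.
torch.distributions.Normal(mu, sigma).log_prob(a).backward()

with torch.no_grad():
    analytic_dmu = (a - mu) / sigma**2                     # (a - mu) / sigma^2
    analytic_dsigma = ((a - mu)**2 - sigma**2) / sigma**3  # ((a - mu)^2 - sigma^2) / sigma^3

print(torch.allclose(mu.grad, analytic_dmu))        # True
print(torch.allclose(sigma.grad, analytic_dsigma))  # True
```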
Key Properties/Variants
Mean Parameterization
Common choices (both sketched in the code below):
- Linear: $\mu_\theta(s) = \theta^\top \phi(s)$ for state features $\phi(s)$
  - Simple, interpretable
  - Good for linear relationships
- Neural network: $\mu_\theta(s) = \mathrm{NN}_\theta(s)$
  - Highly expressive
  - Standard for deep RL
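A minimal sketch of both choices in PyTorch; the dimensions and hidden width are arbitrary, and the linear version uses the raw state as its feature vector.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 2

# Linear mean: mu(s) = W s + b, with phi(s) = s.
linear_mean = nn.Linear(obs_dim, act_dim)

# Neural-network mean: a small MLP, the standard choice in deep RL.
mlp_mean = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

s = torch.randn(obs_dim)
print(linear_mean(s).shape, mlp_mean(s).shape)  # torch.Size([2]) torch.Size([2])
```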
Variance Parameterization
- Fixed variance: $\sigma^2$ is a hyperparameter, not learned
  - Simpler, faster
  - May require careful tuning
- Learned scalar variance: one $\sigma_i$ per action dimension, independent of state
  - Adapts exploration per action dimension
  - Common in practice
- State-dependent variance: $\sigma_\theta(s)$ is also learned (see the sketch after this list)
  - Maximum flexibility
  - Needs careful initialization
- Log-variance: often parameterize $\log \sigma$ and exponentiate, to ensure positivity
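For the state-dependent case, a common pattern is a shared trunk with separate mean and log-std heads. The sketch below assumes this layout; the clamp bounds on the log-std are a widely used numerical heuristic (e.g. in SAC implementations), not a requirement.

```python
import torch
import torch.nn as nn

class StateDependentGaussianHead(nn.Module):
    """Mean and state-dependent log-std predicted from a shared trunk."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        h = self.trunk(obs)
        # Clamp keeps sigma in a numerically safe range (heuristic bounds).
        log_std = self.log_std_head(h).clamp(-20.0, 2.0)
        return torch.distributions.Normal(self.mean_head(h), log_std.exp())
```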
Diagonal vs Full Covariance
- Diagonal (most common): $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$
  - Simpler gradient computation
  - Assumes action dimensions are independent
- Full covariance: allows correlation between action dimensions (both variants are sketched in code below)
  - More expressive, more expensive
  - Rarely needed
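Both covariance structures are directly expressible with `torch.distributions`; the mean and covariance values below are arbitrary examples.

```python
import torch
from torch.distributions import Independent, MultivariateNormal, Normal

mean = torch.zeros(2)

# Diagonal: d independent Normals reinterpreted as one 2-D distribution.
diag = Independent(Normal(mean, torch.tensor([0.5, 1.0])), 1)

# Full covariance: off-diagonal terms allow correlated action dimensions.
cov = torch.tensor([[0.25, 0.10],
                    [0.10, 1.00]])
full = MultivariateNormal(mean, covariance_matrix=cov)

a = diag.sample()
print(diag.log_prob(a), full.log_prob(a))  # scalar log-densities over both dims
```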
Connections
- Related to: Normal distribution, Continuous control
- Basis for: Policy Gradient Methods for continuous actions
- Alternative to: Softmax Policy (which is for discrete actions)
- Enables: Smooth, differentiable action sampling
Appears In
- Policy Gradient Methods — Standard for continuous action spaces
- REINFORCE — Continuous control variant
- Actor-Critic — Continuous action actor
- PPO — Continuous benchmark tasks
- Deep Deterministic Policy Gradient — Alternative to Gaussian (deterministic policy)
- Soft Actor-Critic (SAC) — Uses Gaussian policies with entropy regularization