Softmax Policy
Definition
A softmax policy is a stochastic policy for discrete action spaces that converts action preference values (scores) into a probability distribution using the softmax function:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{b} e^{h(s, b, \theta)}}$$

where:
- $h(s, a, \theta)$ is the preference function (can be linear, a neural network, etc.)
- The softmax function normalizes preferences into valid probabilities
- Actions with higher preference values receive higher probability
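The definition above can be sketched directly in NumPy. This is a minimal illustration (the function name `softmax_policy` and the example preference values are mine, not from the text); subtracting the maximum preference before exponentiating is a standard numerical-stability trick that does not change the resulting probabilities:

```python
import numpy as np

def softmax_policy(prefs):
    """Convert a vector of action preferences into a probability distribution."""
    z = prefs - np.max(prefs)  # stability shift; probabilities are unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax_policy(np.array([2.0, 1.0, 0.1]))
# probs is positive, sums to 1, and gives the highest-preference
# action the largest share of the mass
```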
Intuition
Think of softmax as a “soft” version of argmax:
- Argmax: Picks action with highest preference with probability 1
- Softmax: Weights actions by their preference values, maintaining some probability for all actions
Exploration is therefore built in: even low-preference actions retain probability mass. The degree of exploration is controlled by the preference magnitudes (or by an explicit temperature parameter, discussed below).
Mathematical Formulation
Log-Policy Gradient
For implementation, the log-probability is:

$$\log \pi(a \mid s, \theta) = h(s, a, \theta) - \log \sum_{b} e^{h(s, b, \theta)}$$

The gradient w.r.t. $\theta$:

$$\nabla_\theta \log \pi(a \mid s, \theta) = \nabla_\theta h(s, a, \theta) - \sum_{b} \pi(b \mid s, \theta)\, \nabla_\theta h(s, b, \theta)$$
This shows two components:
- Positive term: Increases the preference for the action actually taken
- Negative term: Decreases the preferences of all actions in proportion to their current probabilities, pulling the distribution back toward balance
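For the linear preference case $h(s, a, \theta) = \theta^\top x(s, a)$, the gradient above reduces to the taken action's features minus the policy-weighted average of all feature rows. A small sketch (the function names and the toy feature matrix are illustrative, not from the text):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_policy_gradient(features, theta, action):
    """Gradient of log pi(action | s) for linear preferences h = features @ theta.

    features: (n_actions, n_dims) matrix whose rows are x(s, a).
    Returns x(s, action) minus the policy-weighted average feature vector,
    i.e. the positive and negative terms listed above.
    """
    probs = softmax(features @ theta)
    return features[action] - probs @ features

# toy example: 3 actions, 2 features, uniform initial policy (theta = 0)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
theta = np.zeros(2)
grad = log_policy_gradient(features, theta, action=0)
# with a uniform policy the weighted average row is [2/3, 2/3],
# so grad = [1, 0] - [2/3, 2/3] = [1/3, -2/3]
```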
Properties
- Normalized: Always sums to 1 over actions
- Differentiable: Smooth w.r.t. $\theta$, enabling gradient-based optimization
- Exploration: All actions have positive probability (never zero unless explicit constraint)
- Entropy: The distribution has a well-defined entropy, which can serve as an exploration measure or regularizer
Key Properties/Variants
Preference Function Choice
The preference function can be:
- Linear: $h(s, a, \theta) = \theta^\top x(s, a)$ (linear in state-action features)
- Neural network: $h(s, a, \theta) = f_\theta(s, a)$ (nonlinear)
- Single output head: $h(s, \cdot, \theta) = f_\theta(s)$ (network outputs all action preferences at once)
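The single-output-head variant can be sketched as one linear layer mapping state features to one preference per action, so a single forward pass yields the whole preference vector. The shapes and names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical single-output parameterization: one linear layer maps
# the state features to a preference per action
n_features, n_actions = 4, 3
W = rng.normal(size=(n_actions, n_features))
b = np.zeros(n_actions)

def preferences(state):
    """One forward pass returns preferences for every action at once."""
    return W @ state + b

state = rng.normal(size=n_features)
prefs = preferences(state)  # shape (n_actions,), ready for the softmax
```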
Temperature Scaling
Often seen with an explicit temperature parameter $\tau$:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)/\tau}}{\sum_{b} e^{h(s, b, \theta)/\tau}}$$

- Low temperature ($\tau \to 0$): More greedy, sharper distribution
- High temperature ($\tau \to \infty$): More exploration, approaching a uniform distribution
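The effect of the temperature is easy to see numerically. A small sketch (function name and example values are illustrative): dividing the same preferences by a small vs. a large $\tau$ produces a near-greedy vs. a near-uniform distribution.

```python
import numpy as np

def softmax_with_temperature(prefs, tau):
    """Softmax over preferences scaled by temperature tau > 0."""
    z = prefs / tau
    z = z - np.max(z)  # numerical-stability shift
    e = np.exp(z)
    return e / e.sum()

prefs = np.array([2.0, 1.0, 0.1])
sharp = softmax_with_temperature(prefs, tau=0.1)   # nearly all mass on action 0
flat = softmax_with_temperature(prefs, tau=10.0)   # close to uniform
```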
Connections
- Related to: Boltzmann distribution in statistical mechanics
- Basis for: REINFORCE algorithm for discrete actions
- Explores via: Entropy of the policy distribution
- Contrasts with: Epsilon-Greedy Policy (less smooth, harder to optimize)
Appears In
- Policy Gradient Methods — Most natural choice for discrete actions
- Actor-Critic — Policy component
- PPO — Soft policy parameterization
- Deep Reinforcement Learning — When using neural networks for preferences