Softmax Policy
Definition
A softmax policy is a stochastic policy for discrete action spaces that converts action preference values (scores) into a probability distribution using the softmax function:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{b} e^{h(s, b, \theta)}}$$

where:
- $h(s, a, \theta)$ is the preference function (can be linear, a neural network, etc.)
- The softmax function normalizes preferences into valid probabilities
- Actions with higher preference values receive higher probability
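The definition above can be sketched directly in NumPy. This is a minimal illustration (the function name `softmax_policy` and the example preference values are mine, not from the text); subtracting the maximum preference before exponentiating is a standard numerical-stability trick that does not change the resulting probabilities:

```python
import numpy as np

def softmax_policy(prefs):
    """Convert a vector of action preferences into a probability distribution."""
    z = prefs - np.max(prefs)  # stability shift; probabilities are unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax_policy(np.array([2.0, 1.0, 0.1]))
# probs is positive, sums to 1, and gives the highest-preference
# action the largest share of the mass
```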
Intuition
Think of softmax as a “soft” version of argmax:
- Argmax: Picks action with highest preference with probability 1
- Softmax: Weights actions by their preference values, maintaining some probability for all actions
Exploration is therefore built in: even low-preference actions retain probability mass. The degree of exploration is controlled by the preference magnitudes (or by an explicit temperature parameter, discussed below).
Mathematical Formulation
Log-Policy Gradient
For implementation, the log-probability is:

$$\log \pi(a \mid s, \theta) = h(s, a, \theta) - \log \sum_{b} e^{h(s, b, \theta)}$$

The gradient w.r.t. $\theta$:

$$\nabla_\theta \log \pi(a \mid s, \theta) = \nabla_\theta h(s, a, \theta) - \sum_{b} \pi(b \mid s, \theta)\, \nabla_\theta h(s, b, \theta)$$
This shows two components:
- Positive term: Increases the preference for the action actually taken
- Negative term: Decreases the preferences of all actions in proportion to their current probabilities, pulling the distribution back toward balance
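For the linear preference case $h(s, a, \theta) = \theta^\top x(s, a)$, the gradient above reduces to the taken action's features minus the policy-weighted average of all feature rows. A small sketch (the function names and the toy feature matrix are illustrative, not from the text):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_policy_gradient(features, theta, action):
    """Gradient of log pi(action | s) for linear preferences h = features @ theta.

    features: (n_actions, n_dims) matrix whose rows are x(s, a).
    Returns x(s, action) minus the policy-weighted average feature vector,
    i.e. the positive and negative terms listed above.
    """
    probs = softmax(features @ theta)
    return features[action] - probs @ features

# toy example: 3 actions, 2 features, uniform initial policy (theta = 0)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
theta = np.zeros(2)
grad = log_policy_gradient(features, theta, action=0)
# with a uniform policy the weighted average row is [2/3, 2/3],
# so grad = [1, 0] - [2/3, 2/3] = [1/3, -2/3]
```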
Properties
- Normalized: Always sums to 1 over actions
- Differentiable: Smooth w.r.t. $\theta$, enabling gradient-based optimization
- Exploration: All actions have positive probability (never zero unless explicit constraint)
- Entropy: The distribution has a well-defined entropy, which can serve as an exploration measure or regularizer
Key Properties/Variants
Preference Function Choice
The preference function can be:
- Linear: $h(s, a, \theta) = \theta^\top x(s, a)$ (linear in state-action features)
- Neural network: $h(s, a, \theta) = f_\theta(s, a)$ (nonlinear)
- Single output head: $h(s, \cdot, \theta) = f_\theta(s)$ (network outputs all action preferences at once)
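The single-output-head variant can be sketched as one linear layer mapping state features to one preference per action, so a single forward pass yields the whole preference vector. The shapes and names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical single-output parameterization: one linear layer maps
# the state features to a preference per action
n_features, n_actions = 4, 3
W = rng.normal(size=(n_actions, n_features))
b = np.zeros(n_actions)

def preferences(state):
    """One forward pass returns preferences for every action at once."""
    return W @ state + b

state = rng.normal(size=n_features)
prefs = preferences(state)  # shape (n_actions,), ready for the softmax
```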
Temperature Scaling
Often seen with an explicit temperature parameter $\tau$:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)/\tau}}{\sum_{b} e^{h(s, b, \theta)/\tau}}$$

- Low temperature ($\tau \to 0$): More greedy, sharper distribution
- High temperature ($\tau \to \infty$): More exploration, approaching a uniform distribution
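The effect of the temperature is easy to see numerically. A small sketch (function name and example values are illustrative): dividing the same preferences by a small vs. a large $\tau$ produces a near-greedy vs. a near-uniform distribution.

```python
import numpy as np

def softmax_with_temperature(prefs, tau):
    """Softmax over preferences scaled by temperature tau > 0."""
    z = prefs / tau
    z = z - np.max(z)  # numerical-stability shift
    e = np.exp(z)
    return e / e.sum()

prefs = np.array([2.0, 1.0, 0.1])
sharp = softmax_with_temperature(prefs, tau=0.1)   # nearly all mass on action 0
flat = softmax_with_temperature(prefs, tau=10.0)   # close to uniform
```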
Connections
- Related to: Boltzmann distribution in statistical mechanics
- Basis for: REINFORCE algorithm for discrete actions
- Explores via: Entropy of the policy distribution
- Contrasts with: Epsilon-Greedy Policy (less smooth, harder to optimize)
Appears In
- Policy Gradient Methods — Most natural choice for discrete actions
- Actor-Critic — Policy component
- PPO — Soft policy parameterization
- Deep Reinforcement Learning — When using neural networks for preferences