Softmax Policy

Definition

A softmax policy is a stochastic policy for discrete action spaces that converts action preference values (scores) into a probability distribution using the softmax function:

π(a|s; θ) = exp(h(s, a; θ)) / Σ_b exp(h(s, b; θ))

where:

  • h(s, a; θ) is the preference function for action a in state s (it can be linear, a neural network, etc.)
  • The softmax function normalizes the preferences into a valid probability distribution
  • Actions with higher preference values receive higher probability
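The definition above can be sketched directly in NumPy. This is a minimal illustration, not a particular library's API; the preference values are hypothetical:

```python
import numpy as np

def softmax(prefs):
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the resulting probabilities unchanged
    z = prefs - np.max(prefs)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical preferences h(s, a) for 3 actions in some state s
prefs = np.array([2.0, 1.0, 0.1])
probs = softmax(prefs)

# Sample an action from the resulting distribution
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
```

Note the max-subtraction trick: without it, large preference values overflow `np.exp`.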

Intuition

Think of softmax as a “soft” version of argmax:

  • Argmax: Picks the action with the highest preference with probability 1
  • Softmax: Weights actions by their preference values, maintaining some probability for all actions

This built-in exploration is automatic: even low-preference actions retain probability mass. The degree of exploration is controlled by the magnitudes of the preferences: the larger the gaps between preference values, the more probability concentrates on the best action.
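The contrast can be made concrete with a small sketch (preference values are hypothetical):

```python
import numpy as np

prefs = np.array([2.0, 1.0, 0.1])  # hypothetical action preferences

# Argmax: all probability mass on the single best action
greedy = np.zeros_like(prefs)
greedy[np.argmax(prefs)] = 1.0

# Softmax: every action keeps some probability mass
e = np.exp(prefs - prefs.max())
soft = e / e.sum()
```

With these preferences, `greedy` is [1, 0, 0], while `soft` assigns strictly positive probability to all three actions.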

Mathematical Formulation

Log-Policy Gradient

For implementation, the log-probability is:

log π(a|s; θ) = h(s, a; θ) − log Σ_b exp(h(s, b; θ))

The gradient w.r.t. θ is:

∇_θ log π(a|s; θ) = ∇_θ h(s, a; θ) − Σ_b π(b|s; θ) ∇_θ h(s, b; θ)

This shows two components:

  • Positive term: Increases the preference for the taken action
  • Negative term: Decreases all preferences in proportion to their current probabilities (subtracting the expected preference gradient), which regularizes toward balanced exploration
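For the simple case where each action has its own independent preference parameter (h(s, a; θ) = θ[a], as in the bandit setting), the gradient reduces to one-hot(a) − π. A finite-difference check confirms this; the parameter values below are arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -0.2, 1.0])  # one preference per action (hypothetical)
a = 1                               # the taken action

# Analytic gradient of log π(a): positive term (one-hot) minus expected term (π)
pi = softmax(theta)
grad = -pi.copy()
grad[a] += 1.0

# Finite-difference check of d/dθ_i log π(a)
eps = 1e-6
num_grad = np.zeros_like(theta)
for i in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    num_grad[i] = (np.log(softmax(tp)[a]) - np.log(softmax(tm)[a])) / (2 * eps)
```

The analytic and numerical gradients should agree to several decimal places.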

Properties

  • Normalized: Always sums to 1 over actions
  • Differentiable: Smooth w.r.t. θ, enabling gradient-based optimization
  • Exploration: All actions have positive probability (never zero unless explicit constraint)
  • Entropy: As a full probability distribution, it has a natural, well-defined entropy

Key Properties/Variants

Preference Function Choice

The preference function can be:

  1. Linear: h(s, a; θ) = θᵀ x(s, a) (linear in state–action features x(s, a))
  2. Neural network: h(s, a; θ) = f_θ(s, a) (nonlinear function approximator)
  3. Single output: h(s, ·; θ) = f_θ(s) (the network outputs all action preferences at once)
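Two of these choices can be sketched in NumPy. The feature dimensions, weights, and the `preference_head` helper are illustrative assumptions, not part of any particular framework:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 4, 3

# 1. Linear preferences: h(s, a; θ) = θᵀ x(s, a),
#    implemented here as one weight row per action applied to state features
theta = rng.normal(size=(n_actions, n_features))
x_s = rng.normal(size=n_features)   # hypothetical state features x(s)
prefs_linear = theta @ x_s          # one preference per action, shape (n_actions,)

# 3. Single-output style: one function maps the state to all preferences at once
def preference_head(state, W, b):
    # A stand-in for a network's final layer producing all action preferences
    return W @ state + b

W = rng.normal(size=(n_actions, n_features))
b = np.zeros(n_actions)
prefs_all = preference_head(x_s, W, b)
```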

Temperature Scaling

Often seen with an explicit temperature parameter τ:

π(a|s; θ) = exp(h(s, a; θ)/τ) / Σ_b exp(h(s, b; θ)/τ)

  • Low temperature (τ → 0): More greedy; the distribution sharpens toward argmax
  • High temperature (τ → ∞): More exploration; the distribution approaches uniform
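The effect of τ is easy to see numerically. A minimal sketch with hypothetical preference values:

```python
import numpy as np

def softmax_temp(prefs, tau):
    # Temperature-scaled softmax: divide preferences by τ before normalizing
    z = prefs / tau
    e = np.exp(z - z.max())
    return e / e.sum()

prefs = np.array([2.0, 1.0, 0.1])
cold = softmax_temp(prefs, 0.1)    # low τ: nearly all mass on the best action
hot = softmax_temp(prefs, 100.0)   # high τ: close to a uniform distribution
```

Dividing by a small τ stretches the gaps between preferences, so the exponentials of the best action dominate; a large τ shrinks the gaps toward zero, making all exponentials nearly equal.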

Connections

Appears In