Entropy

Definition

(Shannon) Entropy

The entropy of a discrete probability distribution measures its uncertainty: the expected amount of “surprise” (in nats if using $\ln$, or bits if using $\log_2$) when sampling from it. For a policy $\pi(\cdot \mid s)$ over actions, the entropy is

$$H(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s).$$

It is maximized by the uniform distribution (maximum uncertainty / exploration) and minimized ($= 0$) by a deterministic distribution (a point mass: full certainty / greedy).

Intuition

Entropy answers: “how spread out is this distribution?”

  • A uniform policy over $|\mathcal{A}|$ actions has the largest entropy, $\log |\mathcal{A}|$: every action is equally likely, so you are maximally uncertain and maximally exploratory.
  • A deterministic / peaked policy (one action has probability $1$) has entropy $0$: no surprise, but also no exploration.
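The two extremes above can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical 4-action problem; the helper name `entropy` and the example distributions are illustrative, not part of the original note.

```python
import numpy as np

def entropy(p, base=np.e):
    """Shannon entropy of a discrete distribution p (nats by default)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # skip zero-probability outcomes (convention: 0 * log 0 = 0)
    return float(-np.sum(p[nz] * np.log(p[nz])) / np.log(base))

n_actions = 4  # hypothetical action count
uniform = np.full(n_actions, 1.0 / n_actions)
greedy = np.array([1.0, 0.0, 0.0, 0.0])      # point mass on one action
peaked = np.array([0.85, 0.05, 0.05, 0.05])  # mostly, but not fully, greedy

print("uniform (nats):", entropy(uniform), "vs log|A|:", np.log(n_actions))
print("uniform (bits):", entropy(uniform, base=2))
print("peaked  (nats):", entropy(peaked))
print("greedy  (nats):", entropy(greedy))
```

The peaked policy lands strictly between the two extremes, which is the "how spread out" reading of entropy.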

In RL we exploit this directly: adding an entropy term to the objective discourages the policy from collapsing too early onto a single action. This keeps the policy stochastic, preserving exploration and preventing premature convergence to a suboptimal deterministic policy. The information-theoretic reading is that $-\log p(x)$ is the self-information (“surprisal”) of outcome $x$; entropy is its expectation.

Mathematical Formulation

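Entropy of a policy. For state $s$,

$$H(\pi(\cdot \mid s)) = -\sum_{a \in \mathcal{A}} \pi(a \mid s) \log \pi(a \mid s)$$

where:

  • $\pi(a \mid s)$: probability the policy assigns to action $a$ in state $s$
  • the sum runs over all actions; for continuous actions it becomes an integral (differential entropy)
  • $H \ge 0$ for discrete distributions, with the convention $0 \log 0 = 0$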

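Entropy regularization (entropy bonus). Policy-gradient methods add an entropy term to encourage exploration. For REINFORCE / Actor-Critic the per-step objective gradient becomes

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big(G_t - b(s_t)\big) + \beta \, \nabla_\theta H(\pi_\theta(\cdot \mid s_t))$$

where:

  • $G_t - b(s_t)$: return minus Baseline (the Advantage signal driving the policy update)
  • $\beta$: entropy coefficient (regularization strength); larger $\beta$ $\Rightarrow$ more exploration
  • $\nabla_\theta H(\pi_\theta(\cdot \mid s_t))$: pushes $\pi_\theta$ toward higher entropy (more uniform)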
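As a concrete instance of the entropy-regularized gradient, here is a hedged sketch for the simplest case, where the parameters are the logits of a softmax policy in a single state (a tabular assumption, not something the note specifies). The function name `regularized_grad` and the example numbers are hypothetical. For softmax logits, the gradient of $\log \pi(a)$ is `one_hot(a) - p` and the gradient of the entropy is `-p * (log p + H)`.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities for a 1-D logit vector."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def regularized_grad(logits, action, advantage, beta):
    """Per-step gradient w.r.t. the logits:
    advantage * grad log pi(action) + beta * grad H(pi)."""
    logits = np.asarray(logits, dtype=float)
    logp = log_softmax(logits)
    p = np.exp(logp)
    H = -np.sum(p * logp)
    grad_logp = -p.copy()
    grad_logp[action] += 1.0       # one_hot(action) - p
    grad_H = -p * (logp + H)       # analytic entropy gradient for softmax logits
    return advantage * grad_logp + beta * grad_H

logits = np.array([2.0, 0.0, -1.0])  # hypothetical logits for 3 actions
g = regularized_grad(logits, action=1, advantage=1.5, beta=0.3)
print("ascent direction:", g)
```

With `advantage = 0` only the entropy term remains, and a small ascent step along the result moves the policy toward uniform.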

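Maximum-entropy objective. Soft Actor-Critic (SAC) augments the reward with an entropy term at every step, yielding the Maximum Entropy RL objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha \, H(\pi(\cdot \mid s_t)) \big]$$

where:

  • $r(s_t, a_t)$: environment reward
  • $\alpha$: temperature, trading off reward vs. entropy ($\alpha \to 0$ recovers standard RL)
  • $H(\pi(\cdot \mid s_t))$: policy entropy, here treated as an intrinsic reward for acting stochastically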
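To make the "entropy as intrinsic reward" reading concrete, the sketch below evaluates the entropy-augmented return of a single sampled trajectory. It is an illustration only: the function name `max_entropy_return`, the undiscounted sum, and the reward and entropy values are assumptions, and a real SAC implementation estimates this quantity through learned soft value functions rather than summing a trajectory.

```python
import numpy as np

def max_entropy_return(rewards, entropies, alpha):
    """Single-trajectory estimate of sum_t [ r_t + alpha * H_t ] (undiscounted)."""
    rewards = np.asarray(rewards, dtype=float)
    entropies = np.asarray(entropies, dtype=float)
    return float(np.sum(rewards + alpha * entropies))

rewards = [1.0, 0.0, 2.0]    # hypothetical environment rewards
entropies = [0.5, 1.0, 0.1]  # hypothetical policy entropies at the visited states

print(max_entropy_return(rewards, entropies, alpha=0.0))  # plain return
print(max_entropy_return(rewards, entropies, alpha=0.2))  # reward plus entropy bonus
```

Setting `alpha = 0` gives the ordinary return, matching the remark that a vanishing temperature recovers standard RL.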

Key Properties / Variants

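  • Bounds: $0 \le H(\pi) \le \log |\mathcal{A}|$ (discrete). Maximum at uniform $\pi$, minimum at a deterministic $\pi$.
  • Concavity: $H$ is a concave function of the distribution, so an entropy bonus is a concave regularizer (well-behaved for gradient ascent). Concavity holds in the probabilities themselves, not necessarily in the policy parameters.
  • Self-information: $H(p) = \mathbb{E}_{x \sim p}[-\log p(x)]$; the quantity inside the expectation is the surprisal of a single outcome.
  • Relation to cross-entropy / KL: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, i.e. cross-entropy $=$ entropy $+$ KL divergence. Minimizing KL with fixed $p$ is the same as minimizing cross-entropy.
  • Temperature link: in a Softmax Policy $\pi(a \mid s) \propto \exp(z_a / \tau)$ with logits $z_a$, raising $\tau$ raises entropy (toward uniform); lowering $\tau$ drives entropy to $0$ (toward argmax).
  • Differential entropy: for a continuous policy (e.g. a Gaussian Policy) entropy depends on the variance; a Gaussian’s entropy is $\tfrac{1}{2} \log(2 \pi e \sigma^2)$ per dimension. Unlike the discrete case it can be negative.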
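Three of the properties above (the cross-entropy / KL identity, the temperature link, and negative differential entropy) are easy to verify numerically. This is a small sketch assuming NumPy; the helper names, the distributions `p` and `q`, the logits, and the temperature symbol `tau` are all illustrative choices.

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def softmax_with_temperature(logits, tau):
    z = np.asarray(logits, dtype=float) / tau
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print("H(p,q) - [H(p) + KL(p||q)]:", cross_entropy(p, q) - (entropy(p) + kl(p, q)))

logits = np.array([2.0, 1.0, 0.0])
for tau in (0.1, 1.0, 10.0):
    print("tau =", tau, "entropy =", entropy(softmax_with_temperature(logits, tau)))

print("Gaussian entropy, sigma=1.0:", gaussian_entropy(1.0))
print("Gaussian entropy, sigma=0.1:", gaussian_entropy(0.1))  # negative
```

The entropy grows with `tau` toward `log(3)`, and the narrow Gaussian has negative differential entropy, which cannot happen in the discrete case.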

Computing an entropy bonus for a softmax policy:

Function: entropy_bonus(logits, beta)
─────────────────────────────────────
  p   ← softmax(logits)                 # action probabilities π(a|s)
  logp ← log_softmax(logits)            # numerically stable log π(a|s)
  H   ← -Σ_a  p[a] * logp[a]            # Shannon entropy of the policy
  return beta * H                       # add to objective (gradient ASCENT on H)
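The pseudocode above translates almost line for line into NumPy. The version below is a sketch rather than a reference implementation: it assumes logits arrive as an array whose last axis indexes actions, and it derives the probabilities from the stable log-softmax so that extreme logits do not produce NaNs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log pi(a|s) along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def entropy_bonus(logits, beta):
    """beta * H(pi(.|s)); add this to an objective that is being maximized."""
    logits = np.asarray(logits, dtype=float)
    logp = log_softmax(logits)      # log pi(a|s)
    p = np.exp(logp)                # pi(a|s)
    H = -np.sum(p * logp, axis=-1)  # Shannon entropy per state
    return beta * H

# Hypothetical batch of two states with three actions each.
batch_logits = np.array([[0.0, 0.0, 0.0],    # uniform policy
                         [8.0, 0.0, 0.0]])   # strongly peaked policy
print(entropy_bonus(batch_logits, beta=0.01))
```

The first entry equals `0.01 * log(3)`, the maximum possible bonus for three actions, while the peaked state contributes almost nothing.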

Sign and Coefficient

Entropy is added to the objective for gradient ascent (or, equivalently, subtracted from a loss for gradient descent). Get the sign wrong and you penalize exploration, collapsing the policy. The coefficient ($\beta$ or temperature $\alpha$) must be tuned/annealed: too high keeps the policy near-uniform and it never exploits; too low gives no exploration benefit. In SAC, $\alpha$ is often learned automatically to hit a target entropy.
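The sign convention and a simple annealing schedule can be sketched as follows. Both helpers are hypothetical, and the start and end coefficients are placeholder values rather than recommended settings; linear decay is only one of several possible schedules.

```python
def total_loss(policy_loss, entropy, beta):
    """Loss to MINIMIZE: the entropy term enters with a minus sign,
    so higher entropy lowers the loss."""
    return policy_loss - beta * entropy

def linear_anneal(step, total_steps, beta_start=0.01, beta_end=0.001):
    """Linearly decay the entropy coefficient from beta_start to beta_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + frac * (beta_end - beta_start)

for step in (0, 50, 100):
    beta = linear_anneal(step, total_steps=100)
    print(step, beta, total_loss(policy_loss=1.0, entropy=0.5, beta=beta))
```

Flipping the sign in `total_loss` (writing `+ beta * entropy`) would reward low entropy and push the policy toward early collapse, which is the failure mode described above.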

Connections

Appears In