Maximum Entropy RL

A framework that augments the standard RL objective with an entropy bonus: the agent maximizes expected return while acting as randomly as possible, i.e., it simultaneously maximizes reward and policy entropy.

Objective

Maximum Entropy Objective

J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π} [ r(s_t, a_t) + α · H(π(· | s_t)) ]

where:

  • H(π(· | s_t)) — entropy of the policy at state s_t
  • α — temperature parameter controlling the exploration–exploitation tradeoff
  • ρ_π — state-action distribution induced by π
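As a sanity check of the objective, here is a minimal sketch on a hypothetical one-step (bandit) problem with three actions: expected reward and policy entropy are computed separately, then combined with the temperature α. The reward vector and α value are made-up illustration numbers.

```python
import numpy as np

r = np.array([1.0, 0.5, 0.0])   # hypothetical per-action rewards r(a)
alpha = 0.5                     # temperature

# Softmax (Boltzmann) policy pi(a) ∝ exp(r(a)/alpha) — the optimizer
# of the one-step entropy-regularized objective.
pi = np.exp(r / alpha)
pi /= pi.sum()

expected_reward = float(pi @ r)
entropy = float(-(pi * np.log(pi)).sum())    # H(pi)
objective = expected_reward + alpha * entropy
```

For this softmax policy the objective equals α · log Σ_a exp(r(a)/α), the "soft maximum" of the rewards, which is a quick way to verify the computation.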

Intuition

Why Add Entropy?

Standard RL finds a single optimal action per state. Maximum Entropy RL says: “Among all policies that achieve high reward, prefer the one that is most random.” This has several benefits:

  • Better exploration: the agent is incentivized to try diverse actions
  • Robustness: the policy doesn’t collapse to a single brittle action
  • Multi-modality: can capture multiple near-optimal strategies
  • Composability: entropy-regularized policies combine well across tasks

Effect of Temperature

Temperature α    Behavior
α → 0            Standard (greedy) RL — exploit only
α small          Slight exploration bonus
α large          Highly stochastic — explore aggressively
α → ∞            Uniform random policy
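The two limits can be checked directly with a Boltzmann policy over Q-values; the Q-values below are arbitrary illustration numbers.

```python
import numpy as np

def softmax_policy(q, alpha):
    """Boltzmann policy pi(a) ∝ exp(Q(a)/alpha)."""
    z = (q - q.max()) / alpha   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([2.0, 1.0, 0.0])            # hypothetical Q-values
low  = softmax_policy(q, alpha=0.01)     # alpha → 0: near-greedy
high = softmax_policy(q, alpha=100.0)    # alpha large: near-uniform
```

With α = 0.01 essentially all probability mass sits on the best action, while α = 100 gives a nearly uniform distribution.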

Soft Bellman Equation

The entropy bonus modifies the Bellman equations:

Soft Value Functions

Q_soft(s_t, a_t) = r(s_t, a_t) + γ · E_{s_{t+1}} [ V_soft(s_{t+1}) ]

V_soft(s_t) = α · log Σ_a exp( Q_soft(s_t, a) / α )

The soft value is a "soft maximum" (log-sum-exp) over Q-values; as α → 0 it recovers the standard hard max. The corresponding optimal policy is π*(a | s) ∝ exp( Q_soft(s, a) / α ).

Key Properties

  • Provides a principled way to trade off exploration and exploitation via the temperature α
  • Leads to stochastic optimal policies (unlike standard RL which yields deterministic ones)
  • Foundation for Soft Actor-Critic (SAC)
  • Can be interpreted as KL-regularized RL (keeping policy close to a uniform prior)
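The last point follows from the identity H(π) = log|A| − KL(π ∥ uniform): maximizing entropy is the same as minimizing KL to a uniform prior, up to a constant. A quick numerical check (the policy below is an arbitrary example distribution):

```python
import numpy as np

pi = np.array([0.7, 0.2, 0.1])   # hypothetical policy over 3 actions
u = np.full(3, 1 / 3)            # uniform prior

H = float(-(pi * np.log(pi)).sum())          # entropy H(pi)
KL = float((pi * np.log(pi / u)).sum())      # KL(pi || uniform)
# Identity: H(pi) = log|A| - KL(pi || uniform)
```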
