Soft Actor-Critic (SAC)

Definition

Soft Actor-Critic is an off-policy actor-critic algorithm that uses entropy regularization. Instead of maximizing just the expected return, the agent maximizes the expected return plus the entropy of the policy, encouraging exploration and robustness.

Objective Function

Soft RL Objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

where:

  • $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$ — Entropy of the policy
  • $\alpha$ — Temperature parameter that controls the trade-off between reward and entropy (exploration)
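As a toy illustration of the objective (not SAC itself), the entropy-regularized return of a discrete single-step policy can be computed directly; the policies, rewards, and α below are made-up values:

```python
import numpy as np

def entropy(pi):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    pi = np.asarray(pi, dtype=float)
    return float(-np.sum(pi * np.log(pi + 1e-12)))

def soft_objective(pi, rewards, alpha):
    """Expected reward under pi plus an alpha-weighted entropy bonus (one step)."""
    pi = np.asarray(pi, dtype=float)
    return float(pi @ rewards + alpha * entropy(pi))

# Made-up example: two actions with identical rewards.
pi_uniform = [0.5, 0.5]
pi_greedy  = [1.0, 0.0]
rewards    = np.array([1.0, 1.0])

# With alpha > 0 the uniform policy scores higher: same reward, more entropy.
print(soft_objective(pi_uniform, rewards, alpha=0.1))  # 1.0 + 0.1 * log 2
print(soft_objective(pi_greedy,  rewards, alpha=0.1))  # ~1.0
```

This is exactly the collapse-prevention effect described below: when rewards are tied, the soft objective prefers the policy that keeps its options open.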

Soft Bellman Equation

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V(s_{t+1}) \right]$$

Soft Value Functions

$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$$
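For a discrete action space this soft value can be checked numerically: when π is the Boltzmann policy softmax(Q/α), the expectation E_π[Q − α log π] collapses to α·logsumexp(Q/α). A small sketch with toy Q-values (the numbers are illustrative):

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for pi = softmax(Q/alpha)."""
    z = q / alpha
    pi = np.exp(z - z.max())
    pi /= pi.sum()
    return float(np.sum(pi * (q - alpha * np.log(pi))))

def alpha_logsumexp(q, alpha):
    """Closed form: alpha * log sum_a exp(Q(s,a)/alpha)."""
    z = q / alpha
    m = z.max()
    return float(alpha * (m + np.log(np.sum(np.exp(z - m)))))

q = np.array([1.0, 2.0, 0.5])   # toy Q-values for one state
print(soft_value(q, alpha=0.5))  # matches alpha_logsumexp(q, 0.5)
```

The identity holds because, under the Boltzmann policy, Q(s, a) − α log π(a|s) is the same constant for every action.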

Soft Policy Improvement

The new policy minimizes the KL divergence to the Boltzmann distribution induced by the Q-function:

$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \;\Big\|\; \frac{\exp\!\left(\frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\text{old}}}(s_t)} \right)$$
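In the discrete case this improvement step can be verified directly: among a handful of candidate policies, the softmax of Q/α attains the smallest KL divergence to the Boltzmann target (namely zero). A sketch with made-up numbers:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def boltzmann(q_values, alpha):
    """Target distribution exp(Q/alpha) / Z."""
    z = np.asarray(q_values, float) / alpha
    p = np.exp(z - z.max())
    return p / p.sum()

q_values = np.array([2.0, 1.0, 0.0])   # toy Q-values
target = boltzmann(q_values, alpha=1.0)

candidates = [np.array([1/3, 1/3, 1/3]),
              np.array([0.8, 0.1, 0.1]),
              target]
divergences = [kl(pi, target) for pi in candidates]
# The Boltzmann policy itself attains the minimum (KL = 0).
```

In continuous action spaces the minimization is done approximately, by gradient descent on the actor's parameters over the restricted policy class Π.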

Key Components

  1. Actor: Parameterized stochastic policy (typically a Gaussian, trained with the Reparameterization Trick).
  2. Critics: Two soft Q-functions ($Q_{\theta_1}$, $Q_{\theta_2}$) — the minimum of the two is used to mitigate overestimation bias (as in Double DQN/TD3).
  3. Target Networks: Exponential moving-average (Polyak) copies of the Q-functions, used to compute stable targets.
  4. Experience Replay: Off-policy; transitions are stored in a buffer and reused for training.
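These components meet in the critic's target computation. A minimal sketch, with networks replaced by plain arrays and every number (batch values, τ = 0.005, γ = 0.99) purely illustrative:

```python
import numpy as np

def soft_td_target(r, q1_next, q2_next, log_pi_next, alpha, gamma, done):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    min_q = np.minimum(q1_next, q2_next)      # clipped double-Q trick
    return r + gamma * (1.0 - done) * (min_q - alpha * log_pi_next)

def polyak_update(target_param, param, tau=0.005):
    """Target network update: theta_target <- tau*theta + (1-tau)*theta_target."""
    return tau * param + (1.0 - tau) * target_param

# Toy batch of two transitions (made-up numbers).
r    = np.array([1.0, 0.0])
q1n  = np.array([5.0, 2.0])   # target critic 1 at (s', a')
q2n  = np.array([4.0, 3.0])   # target critic 2 at (s', a')
logp = np.array([-1.0, -0.5]) # log pi(a'|s') from the actor
y = soft_td_target(r, q1n, q2n, logp, alpha=0.2, gamma=0.99,
                   done=np.array([0.0, 1.0]))
# y[0] = 1 + 0.99 * (min(5, 4) + 0.2 * 1.0) = 5.158; y[1] = 0 (terminal)
```

Both critics are then regressed toward `y`, and the target parameters are nudged with `polyak_update` after each gradient step.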

Intuition

Maximum Entropy RL

By including entropy in the objective, the agent is forced to be as “random” as possible while still obtaining rewards. This prevents the policy from collapsing into a single deterministic action too early. It leads to:

  • Better exploration: The agent keeps trying multiple promising strategies instead of committing to one early.
  • Robustness: The policy can recover from perturbations because it has learned a wider distribution of behaviors.
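The role of the temperature can be seen in a one-line experiment: a Boltzmann policy over fixed Q-values becomes more uniform (higher entropy) as α grows. Toy numbers again:

```python
import numpy as np

def policy_entropy(q_values, alpha):
    """Entropy of pi = softmax(Q/alpha) over a discrete action set."""
    z = np.asarray(q_values, float) / alpha
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

q = [3.0, 1.0, 0.0]  # made-up Q-values
ents = [policy_entropy(q, a) for a in (0.1, 1.0, 10.0)]
# Entropy grows with alpha: near-deterministic at 0.1, near-uniform at 10.
```

In practice α is often tuned automatically during training so that the policy maintains a target entropy level.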

Key Properties

  • Off-policy: More sample-efficient than on-policy methods (like PPO), since transitions from the replay buffer are reused.
  • Stable: One of the most stable and reliable deep RL algorithms for continuous control.
  • Continuous Action Spaces: Primarily designed for continuous tasks (e.g., robotics).
