Soft Actor-Critic (SAC)

Definition

Soft Actor-Critic is an off-policy actor-critic algorithm that uses entropy regularization. Instead of maximizing just the expected return, the agent maximizes the expected return plus the entropy of the policy, encouraging exploration and robustness.

Objective Function

Soft RL Objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

where:

  • $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$ — Entropy of the policy
  • $\alpha$ — Temperature parameter that controls the trade-off between reward and entropy (exploration)
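As a toy illustration of the objective (not SAC itself), the entropy-regularized return of a discrete single-step policy can be computed directly; the policies, rewards, and α below are made-up values:

```python
import numpy as np

def entropy(pi):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    pi = np.asarray(pi, dtype=float)
    return float(-np.sum(pi * np.log(pi + 1e-12)))

def soft_objective(pi, rewards, alpha):
    """Expected reward under pi plus an alpha-weighted entropy bonus (one step)."""
    pi = np.asarray(pi, dtype=float)
    return float(pi @ rewards + alpha * entropy(pi))

# Made-up example: two actions with identical rewards.
pi_uniform = [0.5, 0.5]
pi_greedy  = [1.0, 0.0]
rewards    = np.array([1.0, 1.0])

# With alpha > 0 the uniform policy scores higher: same reward, more entropy.
print(soft_objective(pi_uniform, rewards, alpha=0.1))  # 1.0 + 0.1 * log 2
print(soft_objective(pi_greedy,  rewards, alpha=0.1))  # ~1.0
```

This is exactly the collapse-prevention effect described below: when rewards are tied, the soft objective prefers the policy that keeps its options open.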

Soft Bellman Equation

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V(s_{t+1}) \right]$$

Soft Value Functions

$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$$
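For a discrete action space this soft value can be checked numerically: when π is the Boltzmann policy softmax(Q/α), the expectation E_π[Q − α log π] collapses to α·logsumexp(Q/α). A small sketch with toy Q-values (the numbers are illustrative):

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for pi = softmax(Q/alpha)."""
    z = q / alpha
    pi = np.exp(z - z.max())
    pi /= pi.sum()
    return float(np.sum(pi * (q - alpha * np.log(pi))))

def alpha_logsumexp(q, alpha):
    """Closed form: alpha * log sum_a exp(Q(s,a)/alpha)."""
    z = q / alpha
    m = z.max()
    return float(alpha * (m + np.log(np.sum(np.exp(z - m)))))

q = np.array([1.0, 2.0, 0.5])   # toy Q-values for one state
print(soft_value(q, alpha=0.5))  # matches alpha_logsumexp(q, 0.5)
```

The identity holds because, under the Boltzmann policy, Q(s, a) − α log π(a|s) is the same constant for every action.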

Soft Policy Improvement

The new policy minimizes the KL divergence to the Boltzmann distribution induced by the Q-function:

$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \;\Big\|\; \frac{\exp\!\left(\frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\text{old}}}(s_t)} \right)$$
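In the discrete case this improvement step can be verified directly: among a handful of candidate policies, the softmax of Q/α attains the smallest KL divergence to the Boltzmann target (namely zero). A sketch with made-up numbers:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def boltzmann(q_values, alpha):
    """Target distribution exp(Q/alpha) / Z."""
    z = np.asarray(q_values, float) / alpha
    p = np.exp(z - z.max())
    return p / p.sum()

q_values = np.array([2.0, 1.0, 0.0])   # toy Q-values
target = boltzmann(q_values, alpha=1.0)

candidates = [np.array([1/3, 1/3, 1/3]),
              np.array([0.8, 0.1, 0.1]),
              target]
divergences = [kl(pi, target) for pi in candidates]
# The Boltzmann policy itself attains the minimum (KL = 0).
```

In continuous action spaces the minimization is done approximately, by gradient descent on the actor's parameters over the restricted policy class Π.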

Key Components

  1. Actor: Parameterized stochastic policy (typically a Gaussian, trained with the Reparameterization Trick).
  2. Critics: Two soft Q-functions ($Q_{\theta_1}$, $Q_{\theta_2}$) — the minimum of the two is used to mitigate overestimation bias (as in Double DQN/TD3).
  3. Target Networks: Exponential moving-average (Polyak) copies of the Q-functions, used to compute stable targets.
  4. Experience Replay: Off-policy; transitions are stored in a buffer and reused for training.
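These components meet in the critic's target computation. A minimal sketch, with networks replaced by plain arrays and every number (batch values, τ = 0.005, γ = 0.99) purely illustrative:

```python
import numpy as np

def soft_td_target(r, q1_next, q2_next, log_pi_next, alpha, gamma, done):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    min_q = np.minimum(q1_next, q2_next)      # clipped double-Q trick
    return r + gamma * (1.0 - done) * (min_q - alpha * log_pi_next)

def polyak_update(target_param, param, tau=0.005):
    """Target network update: theta_target <- tau*theta + (1-tau)*theta_target."""
    return tau * param + (1.0 - tau) * target_param

# Toy batch of two transitions (made-up numbers).
r    = np.array([1.0, 0.0])
q1n  = np.array([5.0, 2.0])   # target critic 1 at (s', a')
q2n  = np.array([4.0, 3.0])   # target critic 2 at (s', a')
logp = np.array([-1.0, -0.5]) # log pi(a'|s') from the actor
y = soft_td_target(r, q1n, q2n, logp, alpha=0.2, gamma=0.99,
                   done=np.array([0.0, 1.0]))
# y[0] = 1 + 0.99 * (min(5, 4) + 0.2 * 1.0) = 5.158; y[1] = 0 (terminal)
```

Both critics are then regressed toward `y`, and the target parameters are nudged with `polyak_update` after each gradient step.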

Intuition

Maximum Entropy RL

By including entropy in the objective, the agent is forced to be as “random” as possible while still obtaining rewards. This prevents the policy from collapsing into a single deterministic action too early. It leads to:

  • Better exploration: The agent keeps trying multiple promising strategies instead of committing to one early.
  • Robustness: The policy can recover from perturbations because it has learned a wider distribution of behaviors.
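The role of the temperature can be seen in a one-line experiment: a Boltzmann policy over fixed Q-values becomes more uniform (higher entropy) as α grows. Toy numbers again:

```python
import numpy as np

def policy_entropy(q_values, alpha):
    """Entropy of pi = softmax(Q/alpha) over a discrete action set."""
    z = np.asarray(q_values, float) / alpha
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

q = [3.0, 1.0, 0.0]  # made-up Q-values
ents = [policy_entropy(q, a) for a in (0.1, 1.0, 10.0)]
# Entropy grows with alpha: near-deterministic at 0.1, near-uniform at 10.
```

In practice α is often tuned automatically during training so that the policy maintains a target entropy level.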

Key Properties

  • Off-policy: More sample-efficient than on-policy methods (like PPO), since transitions from the replay buffer are reused.
  • Stable: One of the most stable and reliable deep RL algorithms for continuous control.
  • Continuous Action Spaces: Primarily designed for continuous tasks (e.g., robotics).
