Soft Actor-Critic (SAC)
Definition
Soft Actor-Critic is an off-policy actor-critic algorithm that uses entropy regularization. Instead of maximizing just the expected return, the agent maximizes the expected return plus the entropy of the policy, encouraging exploration and robustness.
Objective Function
Soft RL Objective

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

where:
- $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$ — Entropy of the policy
- $\alpha$ — Temperature parameter that controls the trade-off between reward and entropy (exploration)
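The objective can be made concrete with a one-step, discrete-action example. The numbers below (action probabilities, per-action rewards, and $\alpha$) are made up for illustration; the point is that the soft objective adds $\alpha \mathcal{H}$ on top of the expected reward:

```python
import numpy as np

# Hypothetical single-state example with 3 discrete actions.
# probs, rewards, and alpha are illustrative values, not from the note.
probs = np.array([0.7, 0.2, 0.1])     # policy pi(a|s)
rewards = np.array([1.0, 0.8, 0.2])   # expected reward per action
alpha = 0.2                           # temperature

entropy = -np.sum(probs * np.log(probs))              # H(pi(.|s))
expected_reward = np.dot(probs, rewards)              # standard RL term
soft_objective = expected_reward + alpha * entropy    # soft RL objective

print(round(entropy, 4))         # entropy bonus, in [0, ln 3]
print(round(soft_objective, 4))  # always >= expected_reward for alpha > 0
```

Raising `alpha` rewards a more spread-out policy; `alpha = 0` recovers the standard expected-return objective.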
Key Components
- Actor: Parameterized stochastic policy $\pi_\phi(a \mid s)$.
- Critics: SAC typically uses two soft Q-functions ($Q_{\theta_1}$, $Q_{\theta_2}$) and takes their minimum in the targets to mitigate overestimation bias (similar in spirit to Double DQN).
- Target Networks: Uses slowly updated (Polyak / exponential moving average) copies of the Q-functions to compute stable bootstrap targets.
- Experience Replay: Being off-policy, it stores transitions in a buffer to reuse data.
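The components above come together in the critic update. The sketch below assumes a single sampled transition with toy scalar Q-values (names like `q1_target_next` and `log_prob_next` are mine, not from the note); it shows the clipped double-Q soft target and the Polyak update of the target networks:

```python
import numpy as np

gamma, alpha, tau = 0.99, 0.2, 0.005  # discount, temperature, Polyak rate

# Toy values for one replayed transition (illustrative, not real outputs):
# target-network Q-values at (s', a') and log pi(a'|s') for a' ~ pi(.|s').
q1_target_next = 1.2
q2_target_next = 1.5
log_prob_next = -0.9
reward, done = 0.5, 0.0

# Soft Bellman target with clipped double-Q and the entropy bonus:
# y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))
soft_value_next = min(q1_target_next, q2_target_next) - alpha * log_prob_next
y = reward + gamma * (1.0 - done) * soft_value_next
print(round(y, 4))

# Polyak (moving-average) update of target-critic parameters (toy vectors):
theta = np.array([1.0, -2.0])        # online critic params
theta_targ = np.array([0.5, -1.5])   # target critic params
theta_targ = (1 - tau) * theta_targ + tau * theta
```

Each critic is then regressed toward `y`; the `-alpha * log pi` term is what makes the Q-functions "soft".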
Intuition
Maximum Entropy RL
By including entropy in the objective, the agent is forced to be as “random” as possible while still obtaining rewards. This prevents the policy from collapsing into a single deterministic action too early. It leads to:
- Better exploration: The agent keeps probability on several promising actions rather than committing prematurely.
- Robustness: The policy can recover from perturbations because it has learned a wider distribution of behavior.
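This intuition can be seen directly in the maximum-entropy optimal policy for a single state, which is a softmax over Q-values, $\pi(a) \propto \exp(Q(a)/\alpha)$. The Q-values below are invented for illustration:

```python
import numpy as np

# Made-up Q-values for 3 actions in one state.
q = np.array([2.0, 1.5, 0.0])

def softmax_policy(q, alpha):
    # pi(a) proportional to exp(Q(a)/alpha); subtract max for stability.
    z = np.exp((q - q.max()) / alpha)
    return z / z.sum()

low_temp = softmax_policy(q, alpha=0.05)   # nearly deterministic (greedy)
high_temp = softmax_policy(q, alpha=5.0)   # spread out: more exploration

print(np.round(low_temp, 3))
print(np.round(high_temp, 3))
```

As `alpha` shrinks the policy collapses onto the argmax action; as it grows the policy stays close to uniform while still weighting better actions more.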
Key Properties
- Off-policy: Very sample efficient compared to on-policy methods (like PPO).
- Stable: One of the most stable and reliable deep RL algorithms for continuous control.
- Continuous Action Spaces: Primarily designed for continuous tasks (e.g., robotics).
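For continuous control, a common implementation detail (an assumption here, not stated in the note) is a tanh-squashed Gaussian policy: sample $u \sim \mathcal{N}(\mu, \sigma)$, squash $a = \tanh(u)$ so actions stay bounded in $[-1, 1]$, and correct the log-probability with the change-of-variables term:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy policy parameters for a 1-D action (illustrative values).
mu, log_std = 0.3, -0.5
std = np.exp(log_std)

u = mu + std * rng.standard_normal()   # pre-squash Gaussian sample
a = np.tanh(u)                         # bounded action in (-1, 1)

# Gaussian log-density of u, then the tanh change-of-variables correction:
# log pi(a) = log N(u; mu, std) - log(1 - tanh(u)^2)
log_prob_u = -0.5 * (((u - mu) / std) ** 2 + np.log(2 * np.pi)) - log_std
log_prob_a = log_prob_u - np.log(1.0 - a ** 2 + 1e-6)  # epsilon for stability

print(-1.0 < a < 1.0)
```

The correction term is needed because the entropy bonus in the objective is computed under the squashed distribution, not the raw Gaussian.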
Connections
- An instance of: Actor-Critic
- Improves upon: DDPG (which is often unstable and sensitive to hyperparameters)
- Related concepts: Reward Signal (modified with entropy), State Space
Appears In
- future Week 6 lecture