Actor-Critic

Definition

Actor-Critic Methods

Actor-Critic methods are a class of Reinforcement Learning algorithms that combine both policy-based and value-based approaches. They consist of two components:

  1. Actor: Responsible for selecting actions by maintaining a parameterized policy $\pi_\theta(a \mid s)$.
  2. Critic: Responsible for evaluating the actions taken by the Actor by estimating a value function (usually the state-value function $V_w(s)$ or the action-value function $Q_w(s, a)$).
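
The two components can be sketched as two separate parameterized functions. This is a minimal tabular sketch; the state/action counts and the softmax parameterization are illustrative assumptions, not from the text:

```python
import math
import random

# Illustrative tabular setup (sizes are assumptions for the sketch).
n_states, n_actions = 4, 2
theta = [[0.0] * n_actions for _ in range(n_states)]  # Actor parameters (theta)
w = [0.0] * n_states                                  # Critic parameters (w)

def actor(s):
    """Softmax policy pi_theta(. | s): the Actor selects actions from this."""
    z = [math.exp(theta[s][a]) for a in range(n_actions)]
    total = sum(z)
    return [p / total for p in z]

def critic(s):
    """State-value estimate V_w(s): the Critic evaluates the Actor's behavior."""
    return w[s]

# Sample an action from the Actor's current policy at state 0.
a = random.choices(range(n_actions), weights=actor(0))[0]
```

The two parameter sets are deliberately separate: the Actor's parameters define what to do, the Critic's parameters define how good it is.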

Intuition

Evaluation Reduces Variance

In pure policy gradient methods like REINFORCE, the update uses the full episode return $G_t$, which has high variance because it depends on many random actions and transitions. Actor-Critic methods replace the return with a value estimate from the Critic. This "bootstrapping" significantly reduces variance, leading to faster and more stable learning, though at the cost of introducing some bias from the value estimate.
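
A toy numeric sketch of this variance argument (the chain length, noise level, and the assumption of an exact critic are all illustrative): summing $T$ noisy rewards gives variance roughly $T\sigma^2$, while a one-step bootstrapped target carries only a single reward's noise.

```python
import random
import statistics

# Assumed toy setup: each step yields reward 1 plus zero-mean Gaussian noise.
# The Monte-Carlo return sums T noisy rewards; the one-step TD target uses a
# single noisy reward plus a value estimate (here, exact) of the remainder.
random.seed(0)
T, sigma, episodes = 50, 1.0, 2000

def reward():
    return 1.0 + random.gauss(0.0, sigma)

mc_returns = [sum(reward() for _ in range(T)) for _ in range(episodes)]
td_targets = [reward() + (T - 1) * 1.0 for _ in range(episodes)]  # V(s') exact

print(statistics.variance(mc_returns))  # roughly T * sigma^2 (about 50)
print(statistics.variance(td_targets))  # roughly sigma^2 (about 1)
```

In practice the critic is not exact, which is where the bias mentioned above comes from.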

Mathematical Formulation

The Actor update typically follows the gradient of the performance objective, often using the Advantage function $A(s_t, a_t)$ to indicate how much better an action was than average:

Actor-Critic Update (Advantage Actor-Critic)

Actor Update:

$$\theta \leftarrow \theta + \alpha_\theta \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Critic Update (TD Error):

$$\delta_t = r_{t+1} + \gamma \, V_w(s_{t+1}) - V_w(s_t), \qquad w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t)$$

where:

  • $\delta_t \approx A(s_t, a_t)$ is the Advantage (estimated by the TD error)
  • $\theta$ — Actor parameters
  • $w$ — Critic parameters
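
Putting the two updates together, here is a minimal sketch of one-step Advantage Actor-Critic with tabular parameters. The environment (a hypothetical 5-state chain with reward only at the terminal state), learning rates, and episode count are all assumptions for illustration:

```python
import math
import random

# Hypothetical environment: states 0..4 on a chain; action 1 moves right,
# action 0 moves left (clipped at the ends); reward 1 only on reaching the
# terminal state 4.
random.seed(1)
n_states, n_actions = 5, 2
theta = [[0.0] * n_actions for _ in range(n_states)]  # Actor parameters
V = [0.0] * n_states                                  # Critic parameters (w)
alpha_theta, alpha_w, gamma = 0.1, 0.2, 0.99

def policy(s):
    z = [math.exp(theta[s][a]) for a in range(n_actions)]
    total = sum(z)
    return [p / total for p in z]

for episode in range(500):
    s = 0
    while s != 4:
        probs = policy(s)
        a = random.choices(range(n_actions), weights=probs)[0]
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Critic update: TD error delta = r + gamma * V(s') - V(s), V(terminal) = 0
        delta = r + (0.0 if s_next == 4 else gamma * V[s_next]) - V[s]
        V[s] += alpha_w * delta
        # Actor update: theta <- theta + alpha_theta * delta * grad log pi(a|s)
        # (softmax gradient: indicator(b == a) - pi(b|s))
        for b in range(n_actions):
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha_theta * delta * grad
        s = s_next

print(policy(0))  # the right-action probability should be high after training
```

The TD error $\delta_t$ plays both roles: it drives the Critic toward consistency with the Bellman equation and scales the Actor's log-probability gradient as the Advantage estimate.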

Variants and Key Properties

  • A2C (Advantage Actor-Critic): Synchronous version where multiple workers update a global model.
  • A3C (Asynchronous Advantage Actor-Critic): Asynchronous version where workers update global parameters independently.
  • Relationship to Policy Gradient: It is a specialized form of the Policy Gradient Theorem where the Critic provides the $Q$-value or Advantage estimate.

Connections

Appears In