REINFORCE

Definition

REINFORCE

REINFORCE (also known as Monte Carlo Policy Gradient) is a policy gradient algorithm that uses the observed return from full episodes to estimate the gradient of the performance objective. It was introduced by Ronald Williams in 1992.

Algorithm

The algorithm follows the Policy Gradient Theorem, using the full return G_t as an unbiased (but high-variance) estimate of the action value Q^π(S_t, A_t).

Algorithm: REINFORCE (Monte Carlo Policy Gradient)
────────────────────────────────────────────────────
Input: a differentiable policy parameterization π(a|s,θ)
Algorithm parameters: step size α > 0
Initialize policy parameters θ ∈ ℝ^d (e.g., to 0)
 
Loop forever (for each episode):
  Generate an episode S₀, A₀, R₁, ..., S_{T-1}, A_{T-1}, R_T following π(·|·,θ)
  Loop for each step of the episode t = 0, 1, ..., T-1:
    G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k
    θ ← θ + α γ^t G ∇ log π(A_t|S_t, θ)

Update Rule

REINFORCE Update

θ ← θ + α γ^t G_t ∇_θ log π(A_t|S_t, θ)

where:

  • G_t — the total discounted return from time t to the end of the episode
  • ∇_θ log π(A_t|S_t, θ) — the direction in parameter space that increases the probability of taking action A_t in state S_t
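The ∇_θ log π term is easy to get wrong; for a tabular softmax policy it works out to one-hot(a) − π(·|s). A minimal sketch, assuming a softmax parameterization over action preferences (the finite-difference check is only an illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta_s, a):
    """Analytic gradient of log softmax(theta_s)[a] w.r.t. theta_s."""
    g = -softmax(theta_s)
    g[a] += 1.0  # one-hot(a) − π(·|s)
    return g

# Sanity-check the analytic gradient against central finite differences.
theta_s = np.array([0.2, -0.5, 1.0])
a, eps = 2, 1e-6
num = np.zeros(3)
for i in range(3):
    tp, tm = theta_s.copy(), theta_s.copy()
    tp[i] += eps
    tm[i] -= eps
    num[i] = (np.log(softmax(tp)[a]) - np.log(softmax(tm)[a])) / (2 * eps)
assert np.allclose(num, grad_log_pi(theta_s, a), atol=1e-5)
```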

Key Properties

  • Monte Carlo: Requires full episodes to calculate the return G_t.
  • Unbiased: The gradient estimate is unbiased.
  • High Variance: Because G_t depends on all future rewards and actions in the episode, updates can be very noisy.
  • Simplest PG: Often the first policy gradient method taught because it doesn’t require a Critic/Value Function.

REINFORCE with Baseline

To reduce variance, a baseline b(S_t) (often a learned state-value function v̂(S_t, w)) is subtracted from the return:

θ ← θ + α γ^t (G_t − b(S_t)) ∇_θ log π(A_t|S_t, θ)

Because the baseline does not depend on the action, this maintains unbiasedness while reducing the magnitude (and thus variance) of the updates.
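The variance reduction is easy to see numerically. This sketch uses a hypothetical one-step bandit with noisy returns; subtracting a constant baseline near E[G] leaves the gradient estimate (essentially) unbiased but shrinks its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([10.0, 11.0])  # hypothetical mean rewards for two actions
theta = np.zeros(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_estimates(baseline, n=5000):
    """Sample n one-episode REINFORCE gradient estimates (G − b) ∇ log π."""
    probs = softmax(theta)
    ests = np.empty((n, 2))
    for i in range(n):
        a = rng.choice(2, p=probs)
        G = rng.normal(means[a], 1.0)  # noisy one-step return
        glog = -probs.copy()
        glog[a] += 1.0                 # ∇ log π for softmax
        ests[i] = (G - baseline) * glog
    return ests

no_base = grad_estimates(0.0)
with_base = grad_estimates(means.mean())  # constant baseline ≈ E[G]
# Both estimators target the same gradient, but the baseline
# dramatically shrinks the per-sample variance.
assert with_base.var(axis=0).sum() < no_base.var(axis=0).sum()
```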

Connections

  • Direct application of: Policy Gradient Theorem
  • Precursor to: Actor-Critic (which bootstraps instead of using the full return G_t)
  • Limited by: High variance and off-policy instability

Appears In

  • future Week 5 lecture