Q-Learning

Definition

Q-Learning

Q-learning is an off-policy TD control algorithm. It directly approximates the optimal action-value function $q_*$, regardless of the policy being followed. The key insight: the update target uses $\max_a Q(S', a)$ — the value of the best action in the next state — not the action actually taken.

Update Rule

Q-Learning Update

$$Q(S,A) \leftarrow Q(S,A) + \alpha \big[ R + \gamma \max_a Q(S',a) - Q(S,A) \big]$$

where:

  • $\alpha$ — step size (learning rate)
  • $R + \gamma \max_a Q(S',a)$ — TD target (using the best next action)
  • $\gamma \max_a Q(S',a)$ — bootstrapped estimate of future value under the optimal policy
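Plugging concrete numbers into the rule makes the pieces easy to track. The values below ($\alpha = 0.5$, one reward, two next-state actions) are purely illustrative:

```python
# One Q-learning update with made-up numbers (alpha, gamma, reward, and the
# next-state values are illustrative, not from any benchmark).
alpha, gamma = 0.5, 0.9

q_sa = 0.0               # current estimate Q(S, A)
r = 1.0                  # observed reward R
q_next = [0.0, 2.0]      # Q(S', a) for each available action a

td_target = r + gamma * max(q_next)   # R + γ max_a Q(S', a) = 1 + 0.9·2 = 2.8
q_sa += alpha * (td_target - q_sa)    # moves halfway toward the target: 1.4
```

Note that the target uses `max(q_next)` regardless of which action will actually be taken in $S'$.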

Algorithm

Algorithm: Q-Learning (Off-Policy TD Control)
──────────────────────────────────────────────
Initialize Q(s,a) arbitrarily for all s,a
  (Q(terminal, ·) = 0)
 
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q
      (e.g., ε-greedy w.r.t. Q)
    Take action A, observe R, S'
    Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
    S ← S'
  until S is terminal
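The pseudocode above translates almost line for line into Python. The corridor environment below is a stand-in (any episodic MDP works), and the hyperparameters are arbitrary:

```python
import random

# Tabular Q-learning sketch on a hypothetical 1-D corridor MDP:
# states 0..4, actions {0: left, 1: right}; reaching state 4 gives
# reward +1 and ends the episode. All names here are illustrative.
N_STATES, ACTIONS = 5, (0, 1)
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def env_step(s, a):
    """Deterministic moves; reward 1.0 only on reaching the goal state."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s):
    """Behavior policy: ε-greedy w.r.t. Q, breaking ties randomly."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def q_learning(episodes=500, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s)                 # behavior: ε-greedy
            s2, r, done = env_step(s, a)
            # Q(terminal, ·) = 0, so the bootstrap term vanishes at the goal.
            target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # off-policy update
            s = s2
    return Q

Q = q_learning()
```

After training, `right` dominates `left` in every non-terminal state, and $Q(3, \text{right})$ approaches the true value 1.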

Why Off-Policy?

The Max Makes It Off-Policy

The behavior policy (used to select actions) is typically ε-greedy for exploration. But the update target uses $\max_a Q(S', a)$ — the greedy policy’s value. So we’re learning about the greedy (optimal) policy while following an exploratory policy.

Unlike other off-policy methods (e.g., off-policy Monte Carlo), Q-learning doesn’t need Importance Sampling corrections, because the max operation directly estimates the greedy target $\max_a Q(S', a)$ rather than reweighting returns from the behavior policy.

Q-Learning vs SARSA

| Property | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Target | $R + \gamma \max_a Q(S',a)$ | $R + \gamma Q(S',A')$ |
| Learns about | Optimal (greedy) policy | Current (ε-greedy) policy |
| Cliff Walking behavior | Finds optimal (risky) path | Finds safer path |
| Convergence | To $q_*$ (with conditions) | To $q_\pi$ for the current ε-greedy policy |
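The Target row is the entire difference in code. With hypothetical next-state values (numbers made up for illustration), the two targets diverge exactly when the action actually taken in $S'$ is not the greedy one:

```python
GAMMA = 0.9

# Hypothetical next-state action values Q(S', ·); numbers are made up.
q_next = {"left": 0.2, "right": 0.8}
r = 0.0
a_prime = "left"   # the ε-greedy action actually taken in S' (exploratory)

q_learning_target = r + GAMMA * max(q_next.values())  # greedy: ignores a_prime
sarsa_target = r + GAMMA * q_next[a_prime]            # on-policy: uses a_prime
```

Here the Q-learning target is 0.72 while the SARSA target is 0.18 — SARSA pays for the exploratory pick, which is why it learns the safer cliff-walking path.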

Convergence

Q-learning converges to $q_*$ under standard conditions:

  1. All state-action pairs are visited infinitely often
  2. Step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
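These are the Robbins–Monro step-size conditions. A quick numerical check with the classic schedule $\alpha_t = 1/t$ (one valid choice, not the only one) shows the first sum growing without bound while the second levels off near $\pi^2/6$:

```python
import math

def partial_sums(T):
    """Partial sums of Σ α_t and Σ α_t² up to T, for the schedule α_t = 1/t."""
    s = sum(1.0 / t for t in range(1, T + 1))
    s_sq = sum(1.0 / t ** 2 for t in range(1, T + 1))
    return s, s_sq

s_small, sq_small = partial_sums(1_000)
s_large, sq_large = partial_sums(1_000_000)
# Σ 1/t keeps growing (≈ ln T), while Σ 1/t² converges toward π²/6 ≈ 1.6449.
```

Intuitively, the divergent sum lets the estimates move arbitrarily far from a bad initialization, while the convergent sum of squares forces the noise injected by the updates to die out.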

With Function Approximation

Tabular Q-learning converges. With Function Approximation, Q-learning can diverge (the Deadly Triad). This motivated Deep Q-Network (DQN)’s stabilization techniques (Experience Replay + Target Network).

Connections

Appears In