Q-Learning

Definition

Q-Learning

Q-learning is an off-policy TD control algorithm. It directly approximates the optimal action-value function $q_*$, regardless of the policy being followed. The key insight: the update target uses $\max_a Q(S', a)$ — the value of the best action in the next state — not the action actually taken.

Update Rule

Q-Learning Update

$$Q(S,A) \leftarrow Q(S,A) + \alpha \big[ R + \gamma \max_a Q(S',a) - Q(S,A) \big]$$

where:

  • $\alpha$ — step size (learning rate)
  • $R + \gamma \max_a Q(S',a)$ — TD target (using the best next action)
  • $\gamma \max_a Q(S',a)$ — bootstrapped estimate of future value under the optimal policy
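Plugging concrete numbers into the rule makes the pieces easy to track. The values below ($\alpha = 0.5$, one reward, two next-state actions) are purely illustrative:

```python
# One Q-learning update with made-up numbers (alpha, gamma, reward, and the
# next-state values are illustrative, not from any benchmark).
alpha, gamma = 0.5, 0.9

q_sa = 0.0               # current estimate Q(S, A)
r = 1.0                  # observed reward R
q_next = [0.0, 2.0]      # Q(S', a) for each available action a

td_target = r + gamma * max(q_next)   # R + γ max_a Q(S', a) = 1 + 0.9·2 = 2.8
q_sa += alpha * (td_target - q_sa)    # moves halfway toward the target: 1.4
```

Note that the target uses `max(q_next)` regardless of which action will actually be taken in $S'$.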

Algorithm

Algorithm: Q-Learning (Off-Policy TD Control)
──────────────────────────────────────────────
Initialize Q(s,a) arbitrarily for all s,a
  (Q(terminal, ·) = 0)
 
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q
      (e.g., ε-greedy w.r.t. Q)
    Take action A, observe R, S'
    Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
    S ← S'
  until S is terminal
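The pseudocode above translates almost line for line into Python. The corridor environment below is a stand-in (any episodic MDP works), and the hyperparameters are arbitrary:

```python
import random

# Tabular Q-learning sketch on a hypothetical 1-D corridor MDP:
# states 0..4, actions {0: left, 1: right}; reaching state 4 gives
# reward +1 and ends the episode. All names here are illustrative.
N_STATES, ACTIONS = 5, (0, 1)
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def env_step(s, a):
    """Deterministic moves; reward 1.0 only on reaching the goal state."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s):
    """Behavior policy: ε-greedy w.r.t. Q, breaking ties randomly."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def q_learning(episodes=500, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s)                 # behavior: ε-greedy
            s2, r, done = env_step(s, a)
            # Q(terminal, ·) = 0, so the bootstrap term vanishes at the goal.
            target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # off-policy update
            s = s2
    return Q

Q = q_learning()
```

After training, `right` dominates `left` in every non-terminal state, and $Q(3, \text{right})$ approaches the true value 1.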

Why Off-Policy?

The Max Makes It Off-Policy

The behavior policy (used to select actions) is typically ε-greedy for exploration. But the update target uses $\max_a Q(S', a)$ — the greedy policy’s value. So we’re learning about the greedy (optimal) policy while following an exploratory policy.

Unlike other off-policy methods (e.g., off-policy Monte Carlo), Q-learning doesn’t need Importance Sampling corrections, because the max operation directly estimates the greedy target $\max_a Q(S', a)$ rather than reweighting returns from the behavior policy.

Q-Learning vs SARSA

| Property | Q-Learning | SARSA |
|---|---|---|
| Type | Off-policy | On-policy |
| Target | $R + \gamma \max_a Q(S',a)$ | $R + \gamma Q(S',A')$ |
| Learns about | Optimal (greedy) policy | Current (ε-greedy) policy |
| Cliff Walking behavior | Finds optimal (risky) path | Finds safer path |
| Convergence | To $q_*$ (with conditions) | To $q_\pi$ for the current ε-greedy policy |
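The Target row is the entire difference in code. With hypothetical next-state values (numbers made up for illustration), the two targets diverge exactly when the action actually taken in $S'$ is not the greedy one:

```python
GAMMA = 0.9

# Hypothetical next-state action values Q(S', ·); numbers are made up.
q_next = {"left": 0.2, "right": 0.8}
r = 0.0
a_prime = "left"   # the ε-greedy action actually taken in S' (exploratory)

q_learning_target = r + GAMMA * max(q_next.values())  # greedy: ignores a_prime
sarsa_target = r + GAMMA * q_next[a_prime]            # on-policy: uses a_prime
```

Here the Q-learning target is 0.72 while the SARSA target is 0.18 — SARSA pays for the exploratory pick, which is why it learns the safer cliff-walking path.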

Convergence

Q-learning converges to $q_*$ under standard conditions:

  1. All state-action pairs are visited infinitely often
  2. Step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
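These are the Robbins–Monro step-size conditions. A quick numerical check with the classic schedule $\alpha_t = 1/t$ (one valid choice, not the only one) shows the first sum growing without bound while the second levels off near $\pi^2/6$:

```python
import math

def partial_sums(T):
    """Partial sums of Σ α_t and Σ α_t² up to T, for the schedule α_t = 1/t."""
    s = sum(1.0 / t for t in range(1, T + 1))
    s_sq = sum(1.0 / t ** 2 for t in range(1, T + 1))
    return s, s_sq

s_small, sq_small = partial_sums(1_000)
s_large, sq_large = partial_sums(1_000_000)
# Σ 1/t keeps growing (≈ ln T), while Σ 1/t² converges toward π²/6 ≈ 1.6449.
```

Intuitively, the divergent sum lets the estimates move arbitrarily far from a bad initialization, while the convergent sum of squares forces the noise injected by the updates to die out.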

With Function Approximation

Tabular Q-learning converges. With Function Approximation, Q-learning can diverge (the Deadly Triad). This motivated Deep Q-Network (DQN)’s stabilization techniques (Experience Replay + Target Network).

Connections

Appears In