Action-Value Methods

Definition

Action-Value Methods

Action-value methods estimate the value of each action — the expected reward (or return) of selecting it — and then use those estimates to drive action selection. The policy is implicit: it is derived from the value estimates (e.g. greedy or ε-greedy w.r.t. $Q_{t} (a)$ ), rather than being parameterised and learned directly. They are the canonical approach to the Multi-Armed Bandit problem and the conceptual ancestor of SARSA / Q-Learning in the full MDP setting.

Intuition

Estimate first, act second

The core loop is: keep a running estimate of how good each action is, then prefer the action that looks best (while still exploring). You never store a policy explicitly — you store numbers $Q_{t} (a)$ , and the policy is just “look at the numbers and pick”.

This is the natural contrast to Policy Gradient Methods: action-value methods learn $Q$ and read off a policy; policy-gradient methods skip the values and adjust the policy parameters directly. In a bandit, “value of action $a$ ” is simply the expected reward $q_{*} (a)$ ; in a full MDP it becomes the action-value function $q_{π} (s, a)$ .

Mathematical Formulation

The true value of an action is its expected reward:

$q_{*} (a) ≐ E [R_{t} ∣ A_{t} = a]$

The sample-average estimate averages the rewards actually received for $a$ :

$Q_{t} (a) ≐ \frac{\sum _{i = 1}^{t - 1} R _{i} \cdot 1 _{A_{i} = a}}{\sum _{i = 1}^{t - 1} 1 _{A_{i} = a}}$

To avoid storing all past rewards, this is computed with the incremental update rule:

$Q_{n + 1} = Q_{n} + \frac{1}{n} [R_{n} - Q_{n}]$

which has the general “error-correction” form

$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \, [\text{Target} - \text{OldEstimate}]$

where:

$A_{t}$ — action selected at step $t$ ; $R_{t}$ — reward received at step $t$
$q_{*} (a)$ — true (unknown) expected reward of action $a$
$Q_{t} (a)$ — current estimate of $q_{*} (a)$ at step $t$
$1_{A_{i} = a}$ — indicator, $1$ if action $a$ was taken at step $i$ , else $0$
$N_{t} (a) = \sum_{i < t} 1_{A_{i} = a}$ — number of times $a$ has been selected
$\frac{1}{n}$ — step size; the $n$ -th selection of the action uses step $1/ n$
$[R_{n} - Q_{n}]$ — the error between the latest reward (target) and the current estimate
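
As a quick check that the incremental rule really reproduces the sample average, here is a minimal Python sketch. The function name `incremental_mean`, the seed, and the Gaussian reward stream are illustrative assumptions, not part of the original notes.

```python
import random


def incremental_mean(rewards):
    """Apply Q_{n+1} = Q_n + (1/n) * (R_n - Q_n) over one action's rewards."""
    q = 0.0  # Q_1 is arbitrary: the first update uses step 1/1 and overwrites it
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # step size 1/n times the error [R_n - Q_n]
    return q


rng = random.Random(0)  # hypothetical seed, only for reproducibility
rewards = [rng.gauss(1.0, 1.0) for _ in range(1000)]  # assumed reward stream

batch_mean = sum(rewards) / len(rewards)  # needs every reward stored
incremental = incremental_mean(rewards)   # needs only Q and n
print(abs(batch_mean - incremental))      # differs only by floating-point rounding
```

The point of the sketch is the memory cost: the batch version keeps the whole reward history, while the incremental version keeps one estimate and one counter per action.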

By the law of large numbers, in a stationary problem $Q_{t}(a) \to q_{*}(a)$ as $N_{t}(a) \to \infty$, i.e. provided each action keeps being sampled. For nonstationary problems, replace $1/n$ with a constant step size $α \in (0, 1]$:

$Q_{n + 1} = Q_{n} + α [R_{n} - Q_{n}]$

giving an exponential recency-weighted average (recent rewards weighted more heavily):

$Q_{n + 1} = (1 - α)^{n} Q_{1} + \sum_{i = 1}^{n} α (1 - α)^{n - i} R_{i}$
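
The equivalence between the recursive constant-step update and the closed-form weighted average can be checked numerically. The sketch below is illustrative only; the function names and the sample numbers are assumptions.

```python
def constant_step_estimate(rewards, alpha, q1=0.0):
    """Recursive form: Q_{n+1} = Q_n + alpha * (R_n - Q_n), starting from Q_1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def closed_form_estimate(rewards, alpha, q1=0.0):
    """Closed form: (1-alpha)^n Q_1 + sum_i alpha (1-alpha)^(n-i) R_i."""
    n = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (n - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** n * q1 + weighted


def weights(n, alpha):
    """Weight on Q_1 followed by the weights on R_1..R_n."""
    return [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i)
                                 for i in range(1, n + 1)]


rewards = [1.0, 0.0, 2.0, 5.0]  # made-up rewards for one action
recursive = constant_step_estimate(rewards, alpha=0.1, q1=3.0)
closed = closed_form_estimate(rewards, alpha=0.1, q1=3.0)
print(abs(recursive - closed))  # agreement up to floating-point rounding
```

The weights returned by `weights` sum to 1 and grow toward the most recent reward, which is why the estimate is a weighted average that favours recent data.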

Key Properties / Variants

Selection rule is separate from estimation. Estimation gives $Q_{t} (a)$ ; a selection rule turns it into behaviour:
- Greedy: $A_{t} = \arg\max_{a} Q_{t}(a)$; pure exploitation, which can lock onto a suboptimal arm.
- ε-greedy: greedy with prob. $1 - ε$, uniform-random with prob. $ε$; every action is sampled infinitely often in the limit, so sample-average estimates satisfy $Q_{t}(a) \to q_{*}(a)$.
- UCB: $A_{t} = \arg\max_{a} \left[ Q_{t}(a) + c \sqrt{\frac{\ln t}{N_{t}(a)}} \right]$ with $c > 0$; directs exploration toward uncertain (rarely tried) actions instead of exploring blindly.
- Optimistic Initial Values: set $Q_{1}(a)$ high so early rewards “disappoint” and force a trial of every action; this only aids initial exploration.
Step-size convergence (stochastic approximation): estimates converge w.p. 1 if $\sum_{n} α_{n}(a) = \infty$ and $\sum_{n} α_{n}^{2}(a) < \infty$. The sample-average step size ($1/n$) satisfies both; a constant $α$ violates the second on purpose, so the estimate never settles and keeps tracking a moving target.
Contrast with preference-based methods: Gradient bandit algorithms learn preferences $H_{t} (a)$ via a softmax and stochastic gradient ascent — they do not estimate action values, so they are not action-value methods (they are the bandit-level analogue of policy gradient).
Scaling up: in a full MDP the same “estimate values, derive policy” principle gives TD control methods SARSA (on-policy) and Q-Learning (off-policy), where the target becomes a bootstrapped return rather than a single reward.
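
To make the separation between estimation and selection concrete, the following sketch implements the greedy, ε-greedy, and UCB rules as plain functions over a list of estimates. All function and parameter names are hypothetical. The UCB function uses the standard bonus $c \sqrt{\ln t / N_{t}(a)}$ and, as an assumed convention, tries any action with a zero count first.

```python
import math
import random


def greedy(q, rng):
    """Pick an action with the highest estimate, breaking ties at random."""
    best = max(q)
    return rng.choice([a for a, v in enumerate(q) if v == best])


def epsilon_greedy(q, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return greedy(q, rng)


def ucb(q, counts, t, c):
    """Pick the action maximizing Q(a) + c * sqrt(ln t / N(a))."""
    # Assumed convention: an action never tried yet is selected first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])


rng = random.Random(0)  # hypothetical seed
q = [0.5, 0.4]
print(greedy(q, rng))                             # estimates alone favour action 0
print(ucb(q, counts=[100, 1], t=101, c=2.0))      # bonus favours the rarely tried action 1
```

None of the three functions updates `q`; any of them can be paired with the sample-average or the constant-step update.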

Algorithm: Simple Bandit (ε-greedy Action-Value Method)
─────────────────────────────────────────────────────────
Initialize, for a = 1..k:
    Q(a) ← 0          # value estimate
    N(a) ← 0          # selection count
 
Loop forever:
    # --- Action selection (policy derived from Q) ---
    With probability ε:   A ← random action
    Otherwise:            A ← argmax_a Q(a)   (ties broken randomly)
 
    # --- Take action, observe reward ---
    R ← bandit(A)
 
    # --- Incremental value update ---
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) · [R − Q(A)]
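
The pseudocode above can be sketched in runnable Python as follows. This is one possible rendering, not a reference implementation: `run_bandit`, its parameters, the Gaussian reward model standing in for `bandit(A)`, and the finite step count replacing "loop forever" are all assumptions made for illustration.

```python
import random


def run_bandit(true_means, epsilon, steps, reward_std=1.0, seed=0):
    """Epsilon-greedy action-value method with sample-average updates."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # Q(a): value estimates
    n = [0] * k    # N(a): selection counts

    for _ in range(steps):
        # Action selection (policy derived from Q)
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            best = max(q)
            a = rng.choice([i for i in range(k) if q[i] == best])

        # Take action, observe reward (assumed Gaussian around the true mean)
        r = rng.gauss(true_means[a], reward_std)

        # Incremental value update
        n[a] += 1
        q[a] += (r - q[a]) / n[a]

    return q, n


q, n = run_bandit(true_means=[0.2, 0.8, 0.5], epsilon=0.1, steps=5000)
print(q)  # estimates per arm
print(n)  # how often each arm was pulled
```

With enough steps the counts would typically concentrate on the arm with the highest true mean, while the exploration share keeps the other estimates from going stale.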

Connections

Core setting: Multi-Armed Bandit (action-value methods are its standard solution)
Special case of / scales up to: the action-value function $Q(s, a)$ in a full Markov Decision Process
Selection rules: Epsilon-Greedy Policy, Upper Confidence Bound, Optimistic Initial Values
Central tension: Exploration vs Exploitation
MDP successors: SARSA, Q-Learning, Expected SARSA (value-based TD control)
Contrasted with: Policy Gradient Methods (parameterise the policy directly), Softmax Policy / gradient bandits (learn preferences, not values)
