Action-Value Methods
Definition
Action-Value Methods
Action-value methods estimate the value of each action — the expected reward (or return) of selecting it — and then use those estimates to drive action selection. The policy is implicit: it is derived from the value estimates (e.g. greedy or ε-greedy w.r.t. ), rather than being parameterised and learned directly. They are the canonical approach to the Multi-Armed Bandit problem and the conceptual ancestor of SARSA / Q-Learning in the full MDP setting.
Intuition
Estimate first, act second
The core loop is: keep a running estimate of how good each action is, then prefer the action that looks best (while still exploring). You never store a policy explicitly — you store numbers , and the policy is just “look at the numbers and pick”.
This is the natural contrast to Policy Gradient Methods: action-value methods learn and read off a policy; policy-gradient methods skip the values and adjust the policy parameters directly. In a bandit, “value of action ” is simply the expected reward ; in a full MDP it becomes the action-value function .
Mathematical Formulation
The true value of an action is its expected reward:
The sample-average estimate averages the rewards actually received for :
To avoid storing all past rewards, this is computed with the incremental update rule:
which has the general “error-correction” form
where:
- — action selected at step ; — reward received at step
- — true (unknown) expected reward of action
- — current estimate of at step
- — indicator, if action was taken at step , else
- — number of times has been selected
- — step size; the -th selection of the action uses step
- — the error between the latest reward (target) and the current estimate
By the Law of Large Numbers, as each action is sampled infinitely often. For nonstationary problems, replace with a constant step size :
giving an exponential recency-weighted average (recent rewards weighted more heavily):
Key Properties / Variants
- Selection rule is separate from estimation. Estimation gives ; a selection rule turns it into behaviour:
- Greedy: — pure exploitation, can lock onto a suboptimal arm.
- ε-greedy: greedy with prob. , uniform-random with prob. ; guarantees every action is sampled infinitely often so .
- UCB: — directs exploration toward uncertain actions instead of exploring blindly.
- Optimistic Initial Values: set high so early rewards “disappoint” and force trial of all actions; only aids initial exploration.
- Step-size convergence (stochastic approximation): estimates converge w.p. 1 iff and . Sample-average () satisfies both; constant violates the second on purpose, so it keeps tracking a moving target.
- Contrast with preference-based methods: Gradient bandit algorithms learn preferences via a softmax and stochastic gradient ascent — they do not estimate action values, so they are not action-value methods (they are the bandit-level analogue of policy gradient).
- Scaling up: in a full MDP the same “estimate values, derive policy” principle gives TD control methods SARSA (on-policy) and Q-Learning (off-policy), where the target becomes a bootstrapped return rather than a single reward.
Algorithm: Simple Bandit (ε-greedy Action-Value Method)
─────────────────────────────────────────────────────────
Initialize, for a = 1..k:
Q(a) ← 0 # value estimate
N(a) ← 0 # selection count
Loop forever:
# --- Action selection (policy derived from Q) ---
With probability ε: A ← random action
Otherwise: A ← argmax_a Q(a) (ties broken randomly)
# --- Take action, observe reward ---
R ← bandit(A)
# --- Incremental value update ---
N(A) ← N(A) + 1
Q(A) ← Q(A) + (1 / N(A)) · [R − Q(A)]Connections
- Core setting: Multi-Armed Bandit (action-value methods are its standard solution)
- Special case / scaled to: action-value function and Q(s a) in a full Markov Decision Process
- Selection rules: Epsilon-Greedy Policy, Upper Confidence Bound, Optimistic Initial Values
- Central tension: Exploration vs Exploitation
- MDP successors: SARSA, Q-Learning, Expected SARSA (value-based TD control)
- Contrasted with: Policy Gradient Methods (parameterise the policy directly), Softmax Policy / gradient bandits (learn preferences, not values)