Chapter 2: Multi-armed Bandits
Overview
This chapter explores the evaluative aspect of Reinforcement Learning in a simplified, nonassociative setting. Unlike supervised learning (instructive feedback), RL uses evaluative feedback, which indicates how good an action was, but not whether it was the best possible. The k-armed bandit problem serves as the primary framework for introducing the fundamental conflict between exploration and exploitation.
2.1 A k-armed Bandit Problem
Multi-Armed Bandit Problem
You are faced repeatedly with a choice among different options (actions). After each choice, you receive a numerical reward from a stationary probability distribution depending on the action. The objective is to maximize the expected total reward over a time period (e.g., 1000 steps).
Key Variables
- $A_t$: Action selected at time step $t$.
- $R_t$: Reward received at time step $t$.
- $q_*(a)$: True value (expected reward) of action $a$.
- $Q_t(a)$: Estimated value of action $a$ at time $t$.
Action Value
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
Exploration vs Exploitation
- Exploitation: Choosing the greedy action (the one with the highest current estimate $Q_t(a)$) to maximize immediate reward.
- Exploration: Choosing non-greedy actions to improve estimates of their true values.
- The Conflict: You cannot explore and exploit simultaneously with a single action. Balancing this is a central challenge in RL.
2.2 Action-value Methods
Methods that estimate action values and use them for selection.
Sample-average Method
Sample-Average Estimation
$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$
where $\mathbb{1}_{A_i=a}$ is 1 if $A_i = a$ and 0 otherwise. By the Law of Large Numbers, $Q_t(a) \to q_*(a)$ as the number of times $a$ is selected goes to infinity.
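For concreteness, here is a direct (non-incremental) implementation of the sample-average estimate; the function and variable names are illustrative, not from the text:

```python
def sample_average(actions, rewards, a):
    """Sample-average estimate of q_*(a): mean reward over steps where a was chosen.

    Returns 0.0 (a common default) if a has never been selected.
    """
    selected = [r for act, r in zip(actions, rewards) if act == a]
    return sum(selected) / len(selected) if selected else 0.0

# Action 0 was chosen twice, with rewards 1.0 and 3.0:
print(sample_average([0, 1, 0], [1.0, 5.0, 3.0], a=0))  # -> 2.0
```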
Action Selection Rules
- Greedy: $A_t \doteq \arg\max_a Q_t(a)$.
- $\varepsilon$-greedy: Behave greedily most of the time, but with probability $\varepsilon$, select uniformly at random among all actions (including the greedy one).
- Benefit: Ensures all actions are sampled infinitely often, so $Q_t(a)$ converges to $q_*(a)$.
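The $\varepsilon$-greedy rule can be sketched as follows (function names are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """epsilon-greedy selection over value estimates Q (a list indexed by action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))      # explore: uniform over all actions
    best = max(Q)
    # exploit: greedy action, breaking ties randomly
    return rng.choice([a for a, q in enumerate(Q) if q == best])
```

With `epsilon=0` this reduces to the greedy rule; with `epsilon=1` it is uniform random selection.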
2.3 The 10-armed Testbed
The Testbed Setup
A suite of 2000 randomly generated 10-armed bandit problems.
- True values $q_*(a)$ drawn from a normal distribution $\mathcal{N}(0, 1)$.
- Rewards $R_t$ drawn from $\mathcal{N}(q_*(A_t), 1)$.
Results summary:
- Greedy ($\varepsilon = 0$): Improves slightly faster initially but levels off early at a suboptimal level. Often gets stuck on a suboptimal action (the “unlucky” start).
- $\varepsilon = 0.1$: Explores more, finds the optimal action sooner, but levels off at about 91% optimal selection.
- $\varepsilon = 0.01$: Improves more slowly but eventually outperforms $\varepsilon = 0.1$.
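A minimal, stdlib-only version of this experiment can be sketched as follows (200 tasks rather than 2000 to keep it fast, illustrative function names; a sketch of the setup described above, not the book's code):

```python
import random

def run_task(k=10, steps=1000, epsilon=0.1, rng=None):
    """One bandit task; returns the fraction of steps the optimal arm was chosen."""
    rng = rng or random.Random()
    q_star = [rng.gauss(0, 1) for _ in range(k)]     # true values ~ N(0, 1)
    optimal = q_star.index(max(q_star))
    Q, N, hits = [0.0] * k, [0] * k, 0
    for _ in range(steps):
        if rng.random() < epsilon:                   # explore
            A = rng.randrange(k)
        else:                                        # exploit (random tie-break)
            best = max(Q)
            A = rng.choice([a for a, q in enumerate(Q) if q == best])
        R = rng.gauss(q_star[A], 1)                  # reward ~ N(q_*(A), 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                    # incremental sample average
        hits += (A == optimal)
    return hits / steps

rng = random.Random(0)
for eps in (0.0, 0.01, 0.1):
    avg = sum(run_task(epsilon=eps, rng=rng) for _ in range(200)) / 200
    print(f"epsilon={eps}: {avg:.0%} optimal over the run")
```

Averaged over only 1000 steps, $\varepsilon = 0.1$ typically dominates; the long-run advantage of $\varepsilon = 0.01$ only shows up with more steps.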
2.4 Incremental Implementation
To avoid growing memory/computation requirements, we update averages incrementally.
Incremental Update Rule
$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
General Form:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
Pseudocode: Simple Bandit Algorithm
```
Initialize, for a = 1 to k:
    Q(a) <- 0
    N(a) <- 0

Loop forever:
    # Action Selection
    if random() < epsilon:
        A <- random_action()
    else:
        A <- argmax_a Q(a)    # break ties randomly
    # Execution
    R <- bandit(A)
    N(A) <- N(A) + 1
    # Update
    Q(A) <- Q(A) + (1/N(A)) * (R - Q(A))
```
2.5 Tracking a Nonstationary Problem
In nonstationary problems (reward distributions change over time), recent rewards should carry more weight.
Constant Step-Size Update
$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$
Results in an exponential recency-weighted average:
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$$
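To see the “exponential recency-weighting” concretely: the weight given to reward $R_i$ is $\alpha(1-\alpha)^{n-i}$, the weight on $Q_1$ is $(1-\alpha)^n$, and all of these weights sum to 1. A quick numerical check (the values of $\alpha$ and $n$ are illustrative):

```python
alpha, n = 0.1, 20
w_q1 = (1 - alpha) ** n                                          # weight on Q_1
w_r = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]  # weight on R_i
print(round(w_q1 + sum(w_r), 10))   # -> 1.0  (the weights form a weighted average)
print(w_r[-1] / w_r[-2])            # each step further back shrinks the weight by (1 - alpha)
```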
Convergence Conditions (Stochastic Approximation)
For estimates to converge with probability 1, the step-size sequence $\{\alpha_n\}$ must satisfy:
- $\sum_{n=1}^{\infty} \alpha_n = \infty$ (steps large enough to overcome initial conditions and random fluctuations)
- $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$ (steps eventually small enough to ensure convergence)

Note: A constant $\alpha$ violates the second condition, so the estimate never fully converges; this is desirable in nonstationary problems, where it allows the estimate to keep tracking changes.
2.6 Optimistic Initial Values
Setting $Q_1(a)$ to a high value (e.g., +5 in the 10-armed testbed) forces the agent to explore.
- Mechanism: Any reward received is “disappointing” compared to the estimate, driving the agent to try all other actions.
- Limitation: Only helps with initial exploration; not useful for long-term nonstationarity.
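The mechanism shows up in a tiny deterministic sketch (assumed values: greedy selection with first-index tie-breaking, constant step size, and every reward below the optimistic estimate):

```python
k, alpha, Q1 = 5, 0.1, 5.0
Q = [Q1] * k
tried = []
for _ in range(k):
    A = Q.index(max(Q))            # purely greedy selection
    R = 0.0                        # any realistic reward is "disappointing" vs Q1
    Q[A] += alpha * (R - Q[A])     # estimate drops below the untried arms
    tried.append(A)
print(tried)  # -> [0, 1, 2, 3, 4]: every arm is tried before any repeat
```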
2.7 Upper-Confidence-Bound (UCB) Action Selection
$\varepsilon$-greedy explores indiscriminately. UCB instead directs exploration toward actions whose value estimates are uncertain.
UCB Action Selection
$$A_t \doteq \arg\max_a \left[ Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\, \right]$$
- $c > 0$: Controls the degree of exploration.
- $\sqrt{\ln t / N_t(a)}$: Measure of uncertainty/variance in the estimate of $a$'s value.
- Effect: As $t$ increases, the $\ln t$ term grows for all actions, ensuring every action is tried eventually. As $N_t(a)$ increases, the uncertainty term for $a$ shrinks.
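The selection rule can be sketched as follows, using the common convention that an untried action counts as maximizing (function names are illustrative):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """UCB action selection over estimates Q and selection counts N at time t."""
    for a, n in enumerate(N):
        if n == 0:                 # N_t(a) = 0: treat a as maximally uncertain
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# Arm 1 has the higher estimate, but arm 0 is far less explored:
print(ucb_select([0.5, 0.6], [1, 100], t=101))  # -> 0
```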
2.8 Gradient Bandit Algorithms
Instead of estimating values, we learn a numerical preference $H_t(a)$ for each action.
Soft-max Distribution (Action Probabilities)
$$\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$
Update Rule (Stochastic Gradient Ascent)
Upon receiving reward $R_t$ after selecting $A_t$:
$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$
$$H_{t+1}(a) \doteq H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \text{for all } a \neq A_t$$
- $\bar{R}_t$: The average-reward baseline. It reduces variance without changing the expected update.
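Both cases of the update collapse into one expression using the indicator of whether $a = A_t$; a minimal sketch with illustrative names:

```python
import math

def softmax(H):
    """Soft-max action probabilities pi_t from preferences H_t."""
    m = max(H)                                # shift by max for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_update(H, A, R, R_bar, alpha=0.1):
    """One preference update: H <- H + alpha*(R - R_bar)*(indicator - pi)."""
    pi = softmax(H)
    return [h + alpha * (R - R_bar) * ((1.0 if a == A else 0.0) - pi[a])
            for a, h in enumerate(H)]

H = gradient_update([0.0, 0.0, 0.0], A=1, R=2.0, R_bar=0.5)  # above-baseline reward
print(softmax(H))  # action 1's probability rises above 1/3; the others fall
```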
2.9 Associative Search (Contextual Bandits)
In Contextual Bandits, the learner is given a “clue” or signal about the situation.
- Goal: Learn a policy (mapping from situation/context to action).
- Position: Intermediate between simple bandits (no context) and full RL (actions affect future state/context).
Comparison Summary
| Method | Key Parameter | Exploration Strategy |
|---|---|---|
| Greedy | None | Exploitation only |
| $\varepsilon$-greedy | $\varepsilon$ | Random selection with probability $\varepsilon$ |
| Optimistic Initial Values | $Q_1$ | Initial “disappointment” forces trial of every action |
| UCB | $c$ | Uncertainty-based bonus |
| Gradient Bandit | $\alpha$ | Soft-max over preferences relative to a baseline |
Summary of Parameter Study (Figure 2.6)
All algorithms show an “inverted-U” performance curve as a function of their key parameter. UCB typically performs best on the stationary 10-armed testbed, but it is harder to generalize to large state spaces than $\varepsilon$-greedy or gradient methods.
Key Takeaways
- Evaluative feedback requires a balance of exploration and exploitation.
- Sample averages suit stationary problems; constant step sizes suit nonstationary ones.
- Optimistic initialization is a simple but limited trick for initial exploration.
- UCB provides a more sophisticated exploration based on uncertainty.
- Gradient bandits optimize preferences rather than estimating values directly.
- Baselines in gradient methods dramatically speed up learning by reducing variance.