Chapter 2: Multi-armed Bandits
Overview
This chapter explores the evaluative aspect of Reinforcement Learning in a simplified, nonassociative setting. Unlike supervised learning (instructive feedback), RL uses evaluative feedback, which indicates how good an action was, but not whether it was the best possible. The k-armed bandit problem serves as the primary framework for introducing the fundamental conflict between exploration and exploitation.
2.1 A k-armed Bandit Problem
Multi-Armed Bandit Problem
You are faced repeatedly with a choice among different options (actions). After each choice, you receive a numerical reward from a stationary probability distribution depending on the action. The objective is to maximize the expected total reward over a time period (e.g., 1000 steps).
Key Variables
- $A_t$: Action selected at time step $t$.
- $R_t$: Reward received at time step $t$.
- $q_*(a)$: True value (expected reward) of action $a$.
- $Q_t(a)$: Estimated value of action $a$ at time $t$.
Action Value
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
Exploration vs Exploitation
- Exploitation: Choosing the greedy action (the one with the highest current estimate $Q_t(a)$) to maximize immediate reward.
- Exploration: Choosing non-greedy actions to improve estimates of their true values.
- The Conflict: You cannot explore and exploit simultaneously with a single action. Balancing this is a central challenge in RL.
2.2 Action-value Methods
Methods that estimate action values and use them for selection.
Sample-average Method
Sample-Average Estimation
$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$
where $\mathbb{1}_{A_i=a}$ is 1 if $A_i = a$ and 0 otherwise. By the Law of Large Numbers, $Q_t(a) \to q_*(a)$ as the number of times $a$ is selected goes to infinity.
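For concreteness, here is a direct (non-incremental) implementation of the sample-average estimate; the function and variable names are illustrative, not from the text:

```python
def sample_average(actions, rewards, a):
    """Sample-average estimate of q_*(a): mean reward over steps where a was chosen.

    Returns 0.0 (a common default) if a has never been selected.
    """
    selected = [r for act, r in zip(actions, rewards) if act == a]
    return sum(selected) / len(selected) if selected else 0.0

# Action 0 was chosen twice, with rewards 1.0 and 3.0:
print(sample_average([0, 1, 0], [1.0, 5.0, 3.0], a=0))  # -> 2.0
```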
Action Selection Rules
- Greedy: $A_t \doteq \arg\max_a Q_t(a)$.
- $\varepsilon$-greedy: Behave greedily most of the time, but with probability $\varepsilon$, select uniformly at random among all actions (including the greedy one).
- Benefit: Ensures all actions are sampled infinitely often, so $Q_t(a)$ converges to $q_*(a)$.
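The $\varepsilon$-greedy rule can be sketched as follows (function names are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """epsilon-greedy selection over value estimates Q (a list indexed by action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))      # explore: uniform over all actions
    best = max(Q)
    # exploit: greedy action, breaking ties randomly
    return rng.choice([a for a, q in enumerate(Q) if q == best])
```

With `epsilon=0` this reduces to the greedy rule; with `epsilon=1` it is uniform random selection.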
2.3 The 10-armed Testbed
The Testbed Setup
A suite of 2000 randomly generated 10-armed bandit problems.
- True values $q_*(a)$ drawn from a normal distribution $\mathcal{N}(0, 1)$.
- Rewards $R_t$ drawn from $\mathcal{N}(q_*(A_t), 1)$.
Results summary:
- Greedy ($\varepsilon = 0$): Improves slightly faster initially but levels off early at a suboptimal level. Often gets stuck on a suboptimal action (the “unlucky” start).
- $\varepsilon = 0.1$: Explores more, finds the optimal action sooner, but levels off at about 91% optimal selection.
- $\varepsilon = 0.01$: Improves more slowly but eventually outperforms $\varepsilon = 0.1$.
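A minimal, stdlib-only version of this experiment can be sketched as follows (200 tasks rather than 2000 to keep it fast, illustrative function names; a sketch of the setup described above, not the book's code):

```python
import random

def run_task(k=10, steps=1000, epsilon=0.1, rng=None):
    """One bandit task; returns the fraction of steps the optimal arm was chosen."""
    rng = rng or random.Random()
    q_star = [rng.gauss(0, 1) for _ in range(k)]     # true values ~ N(0, 1)
    optimal = q_star.index(max(q_star))
    Q, N, hits = [0.0] * k, [0] * k, 0
    for _ in range(steps):
        if rng.random() < epsilon:                   # explore
            A = rng.randrange(k)
        else:                                        # exploit (random tie-break)
            best = max(Q)
            A = rng.choice([a for a, q in enumerate(Q) if q == best])
        R = rng.gauss(q_star[A], 1)                  # reward ~ N(q_*(A), 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                    # incremental sample average
        hits += (A == optimal)
    return hits / steps

rng = random.Random(0)
for eps in (0.0, 0.01, 0.1):
    avg = sum(run_task(epsilon=eps, rng=rng) for _ in range(200)) / 200
    print(f"epsilon={eps}: {avg:.0%} optimal over the run")
```

Averaged over only 1000 steps, $\varepsilon = 0.1$ typically dominates; the long-run advantage of $\varepsilon = 0.01$ only shows up with more steps.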
2.4 Incremental Implementation
To avoid growing memory/computation requirements, we update averages incrementally.
Incremental Update Rule
$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
General Form:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
Pseudocode: Simple Bandit Algorithm
```
Initialize, for a = 1 to k:
    Q(a) <- 0
    N(a) <- 0

Loop forever:
    # Action Selection
    if random() < epsilon:
        A <- random_action()
    else:
        A <- argmax_a Q(a)    # break ties randomly
    # Execution
    R <- bandit(A)
    N(A) <- N(A) + 1
    # Update
    Q(A) <- Q(A) + (1/N(A)) * (R - Q(A))
```
2.5 Tracking a Nonstationary Problem
In nonstationary problems (reward distributions change over time), recent rewards should carry more weight.
Constant Step-Size Update
$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$
Results in an exponential recency-weighted average:
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$$
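To see the “exponential recency-weighting” concretely: the weight given to reward $R_i$ is $\alpha(1-\alpha)^{n-i}$, the weight on $Q_1$ is $(1-\alpha)^n$, and all of these weights sum to 1. A quick numerical check (the values of $\alpha$ and $n$ are illustrative):

```python
alpha, n = 0.1, 20
w_q1 = (1 - alpha) ** n                                          # weight on Q_1
w_r = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]  # weight on R_i
print(round(w_q1 + sum(w_r), 10))   # -> 1.0  (the weights form a weighted average)
print(w_r[-1] / w_r[-2])            # each step further back shrinks the weight by (1 - alpha)
```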
Convergence Conditions (Stochastic Approximation)
For estimates to converge with probability 1, the step-size sequence $\{\alpha_n\}$ must satisfy:
- $\sum_{n=1}^{\infty} \alpha_n = \infty$ (steps large enough to overcome initial conditions and random fluctuations)
- $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$ (steps eventually small enough to ensure convergence)

Note: A constant $\alpha$ violates the second condition, so the estimate never fully converges; this is desirable in nonstationary problems, where it allows the estimate to keep tracking changes.
2.6 Optimistic Initial Values
Setting $Q_1(a)$ to a high value (e.g., +5 in the 10-armed testbed) forces the agent to explore.
- Mechanism: Any reward received is “disappointing” compared to the estimate, driving the agent to try all other actions.
- Limitation: Only helps with initial exploration; not useful for long-term nonstationarity.
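The mechanism shows up in a tiny deterministic sketch (assumed values: greedy selection with first-index tie-breaking, constant step size, and every reward below the optimistic estimate):

```python
k, alpha, Q1 = 5, 0.1, 5.0
Q = [Q1] * k
tried = []
for _ in range(k):
    A = Q.index(max(Q))            # purely greedy selection
    R = 0.0                        # any realistic reward is "disappointing" vs Q1
    Q[A] += alpha * (R - Q[A])     # estimate drops below the untried arms
    tried.append(A)
print(tried)  # -> [0, 1, 2, 3, 4]: every arm is tried before any repeat
```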
2.7 Upper-Confidence-Bound (UCB) Action Selection
$\varepsilon$-greedy explores indiscriminately. UCB instead directs exploration toward actions whose value estimates are uncertain.
UCB Action Selection
$$A_t \doteq \arg\max_a \left[ Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\, \right]$$
- $c > 0$: Controls the degree of exploration.
- $\sqrt{\ln t / N_t(a)}$: Measure of uncertainty/variance in the estimate of $a$'s value.
- Effect: As $t$ increases, the $\ln t$ term grows for all actions, ensuring every action is tried eventually. As $N_t(a)$ increases, the uncertainty term for $a$ shrinks.
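The selection rule can be sketched as follows, using the common convention that an untried action counts as maximizing (function names are illustrative):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """UCB action selection over estimates Q and selection counts N at time t."""
    for a, n in enumerate(N):
        if n == 0:                 # N_t(a) = 0: treat a as maximally uncertain
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# Arm 1 has the higher estimate, but arm 0 is far less explored:
print(ucb_select([0.5, 0.6], [1, 100], t=101))  # -> 0
```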
2.8 Gradient Bandit Algorithms
Instead of estimating values, we learn a numerical preference $H_t(a)$ for each action.
Soft-max Distribution (Action Probabilities)
$$\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$
Update Rule (Stochastic Gradient Ascent)
Upon receiving reward $R_t$ after selecting $A_t$:
$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$
$$H_{t+1}(a) \doteq H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \text{for all } a \neq A_t$$
- $\bar{R}_t$: The average-reward baseline. It reduces variance without changing the expected update.
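Both cases of the update collapse into one expression using the indicator of whether $a = A_t$; a minimal sketch with illustrative names:

```python
import math

def softmax(H):
    """Soft-max action probabilities pi_t from preferences H_t."""
    m = max(H)                                # shift by max for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_update(H, A, R, R_bar, alpha=0.1):
    """One preference update: H <- H + alpha*(R - R_bar)*(indicator - pi)."""
    pi = softmax(H)
    return [h + alpha * (R - R_bar) * ((1.0 if a == A else 0.0) - pi[a])
            for a, h in enumerate(H)]

H = gradient_update([0.0, 0.0, 0.0], A=1, R=2.0, R_bar=0.5)  # above-baseline reward
print(softmax(H))  # action 1's probability rises above 1/3; the others fall
```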
2.9 Associative Search (Contextual Bandits)
In Contextual Bandits, the learner is given a “clue” or signal about the situation.
- Goal: Learn a policy (mapping from situation/context to action).
- Position: Intermediate between simple bandits (no context) and full RL (actions affect future state/context).
Comparison Summary
| Method | Key Parameter | Exploration Strategy |
|---|---|---|
| Greedy | None | Exploitation only |
| $\varepsilon$-greedy | $\varepsilon$ | Random selection with probability $\varepsilon$ |
| Optimistic Initial Values | $Q_1$ | Initial “disappointment” forces trial of every action |
| UCB | $c$ | Uncertainty-based bonus |
| Gradient Bandit | $\alpha$ | Soft-max over preferences relative to a baseline |
Summary of Parameter Study (Figure 2.6)
All algorithms show an “inverted-U” performance curve as a function of their key parameter. UCB typically performs best on the stationary 10-armed testbed, but it is harder to generalize to large state spaces than $\varepsilon$-greedy or gradient methods.
Key Takeaways
- Evaluative feedback requires a balance of exploration and exploitation.
- Sample averages suit stationary problems; constant step sizes suit nonstationary ones.
- Optimistic initialization is a simple but limited trick for initial exploration.
- UCB provides a more sophisticated exploration based on uncertainty.
- Gradient bandits optimize preferences rather than estimating values directly.
- Baselines in gradient methods dramatically speed up learning by reducing variance.