Chapter 2: Multi-armed Bandits

Overview

This chapter explores the evaluative aspect of reinforcement learning in a simplified, nonassociative setting. Unlike supervised learning, which provides instructive feedback (the correct action), RL uses evaluative feedback, which indicates how good the action taken was but not whether it was the best possible. The k-armed bandit problem serves as the primary framework for introducing the fundamental conflict between exploration and exploitation.


2.1 A k-armed Bandit Problem

Multi-Armed Bandit Problem

You are faced repeatedly with a choice among k different options (actions). After each choice, you receive a numerical reward drawn from a stationary probability distribution that depends on the action selected. The objective is to maximize the expected total reward over some time period (e.g., 1000 steps).

Key Variables

  • $A_t$: Action selected at time step $t$.
  • $R_t$: Reward received at time step $t$.
  • $q_*(a)$: True value (expected reward) of action $a$.
  • $Q_t(a)$: Estimated value of action $a$ at time $t$.

Action Value

The value of an action is its expected reward given that the action is selected:

$$q_*(a) \doteq \mathbb{E}\bigl[R_t \mid A_t = a\bigr]$$

Exploration vs Exploitation

  • Exploitation: Choosing the greedy action (the one with the highest current estimate $Q_t(a)$) to maximize immediate reward.
  • Exploration: Choosing non-greedy actions to improve estimates of their true values.
  • The Conflict: You cannot explore and exploit simultaneously with a single action. Balancing this is a central challenge in RL.

2.2 Action-value Methods

Methods that estimate action values and use them for selection.

Sample-average Method

Sample-Average Estimation

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

where $\mathbb{1}_{\text{predicate}}$ is 1 if the predicate is true and 0 otherwise (if the denominator is 0, $Q_t(a)$ is defined as some default value such as 0). By the law of large numbers, $Q_t(a) \to q_*(a)$ as the number of times $a$ has been selected goes to infinity.

Action Selection Rules

  1. Greedy: $A_t \doteq \arg\max_a Q_t(a)$.
  2. $\varepsilon$-greedy: Behave greedily most of the time, but with probability $\varepsilon$ select an action at random from all actions with equal probability (a minimal sketch of this rule follows the list).
    • Benefit: Ensures all actions are sampled infinitely often, so every $Q_t(a)$ converges to $q_*(a)$.
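
A minimal Python sketch of the $\varepsilon$-greedy rule over a table of estimates (the names epsilon_greedy and q_estimates are illustrative, not from the text); the full incremental algorithm appears in Section 2.4.

import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.integers(len(q_estimates))            # explore: uniform random action
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return rng.choice(best)                              # exploit, breaking ties randomly

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.2, 0.5, 0.5]), epsilon=0.1, rng=rng))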

2.3 The 10-armed Testbed

The Testbed Setup

A suite of 2000 randomly generated 10-armed bandit problems.

  • True values: each $q_*(a)$ is drawn from a normal distribution with mean 0 and variance 1.
  • Rewards: each reward is drawn from a normal distribution with mean $q_*(A_t)$ and variance 1 (a generation sketch follows).
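
As a rough sketch under the distributional assumptions above (function names are illustrative, not from the text), one such testbed problem could be generated and sampled like this:

import numpy as np

rng = np.random.default_rng(42)

def make_problem(k=10):
    """True action values q_*(a), each drawn from N(0, 1)."""
    return rng.normal(loc=0.0, scale=1.0, size=k)

def pull(q_star, action):
    """Reward for an action, drawn from N(q_*(action), 1)."""
    return rng.normal(loc=q_star[action], scale=1.0)

q_star = make_problem()
print(q_star.argmax(), pull(q_star, 3))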

Results summary:

  • Greedy ($\varepsilon = 0$): Improves slightly faster initially but levels off early at a suboptimal level; it often gets stuck on a suboptimal action (the “unlucky” start).
  • $\varepsilon = 0.1$: Explores more and finds the optimal action sooner, but levels off at about 91% optimal action selection.
  • $\varepsilon = 0.01$: Improves more slowly but eventually outperforms $\varepsilon = 0.1$.

2.4 Incremental Implementation

To avoid growing memory/computation requirements, we update averages incrementally.

Incremental Update Rule

$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$$

General Form:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$

Pseudocode: Simple Bandit Algorithm

Initialize, for a = 1 to k:
    Q(a) <- 0
    N(a) <- 0
 
Loop forever:
    # Action Selection
    if random() < epsilon:
        A <- random_action()
    else:
        A <- argmax(Q(a)) # break ties randomly
    
    # Execution
    R <- bandit(A)
    N(A) <- N(A) + 1
    
    # Update
    Q(A) <- Q(A) + (1/N(A)) * (R - Q(A))
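
The pseudocode translates fairly directly into runnable Python. The sketch below is one possible rendering (names like run_bandit are illustrative); it samples rewards from the 10-armed testbed of Section 2.3 and runs a fixed number of steps instead of looping forever.

import numpy as np

def run_bandit(q_star, epsilon=0.1, steps=1000, seed=0):
    """Simple bandit algorithm with incremental sample-average updates."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)                                   # value estimates Q(a)
    N = np.zeros(k)                                   # action counts N(a)
    rewards = []
    for _ in range(steps):
        # epsilon-greedy action selection, breaking ties randomly
        if rng.random() < epsilon:
            A = rng.integers(k)
        else:
            A = rng.choice(np.flatnonzero(Q == Q.max()))
        R = rng.normal(q_star[A], 1.0)                # bandit(A): reward ~ N(q_*(A), 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                     # incremental update, step size 1/N(A)
        rewards.append(R)
    return Q, float(np.mean(rewards))

q_star = np.random.default_rng(1).normal(size=10)
print(run_bandit(q_star))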

2.5 Tracking a Nonstationary Problem

In nonstationary problems (reward distributions change over time), recent rewards should carry more weight.

Constant Step-Size Update

$$Q_{n+1} \doteq Q_n + \alpha\bigl[R_n - Q_n\bigr], \qquad \alpha \in (0, 1]$$

This results in an exponential recency-weighted average:

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$

Convergence Conditions (Stochastic Approximation)

For the estimates to converge with probability 1, the step-size sequence $\{\alpha_n(a)\}$ must satisfy:

  1. $\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ (steps large enough to eventually overcome initial conditions and random fluctuations).
  2. $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$ (steps small enough to eventually ensure convergence).

Note: A constant step size $\alpha$ violates the second condition, so the estimates never fully converge and instead keep following nonstationary changes (both update rules are sketched below).
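
A minimal side-by-side of the two update rules referenced in the note above (function names are illustrative):

def sample_average_update(Q, R, n):
    """Q_{n+1} = Q_n + (1/n)[R_n - Q_n]: step size decays, suited to stationary problems."""
    return Q + (R - Q) / n

def constant_alpha_update(Q, R, alpha=0.1):
    """Q_{n+1} = Q_n + alpha[R_n - Q_n]: fixed step size, keeps tracking a moving target."""
    return Q + alpha * (R - Q)

With the constant-$\alpha$ rule, the influence of a reward received $i$ steps ago decays like $(1-\alpha)^i$, which is exactly the exponential recency weighting shown above.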

2.6 Optimistic Initial Values

Setting the initial estimates $Q_1(a)$ to a high value (e.g., +5 in the 10-armed testbed) forces the agent to explore (a rough sketch follows the list below).

  • Mechanism: Any reward received is “disappointing” compared to the estimate, driving the agent to try all other actions.
  • Limitation: Only helps with initial exploration; not useful for long-term nonstationarity.
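
A rough sketch of the mechanism, assuming purely greedy selection, a constant step size, and the +5 initialization from the testbed example (variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(size=10)        # true values q_*(a) ~ N(0, 1)

Q = np.full(10, 5.0)                # optimistic initial estimates, far above any true value
alpha = 0.1                         # constant step size

for t in range(200):
    A = rng.choice(np.flatnonzero(Q == Q.max()))     # purely greedy selection
    R = rng.normal(q_star[A], 1.0)
    Q[A] += alpha * (R - Q[A])      # each reward "disappoints" the inflated estimate

Because every arm starts out looking equally (and unrealistically) good, the greedy agent cycles through all of them before the estimates settle near their true values.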

2.7 Upper-Confidence-Bound (UCB) Action Selection

$\varepsilon$-greedy explores indiscriminately, treating all non-greedy actions alike. UCB instead explores by favoring actions whose value estimates are uncertain.

UCB Action Selection

$$A_t \doteq \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

  • $c > 0$: Controls the degree of exploration.
  • $\sqrt{\ln t / N_t(a)}$: A measure of the uncertainty (variance) in the estimate of $a$'s value.
  • Effect: As $t$ increases, the $\ln t$ term grows for all actions, ensuring every action is eventually tried; as $N_t(a)$ increases, the uncertainty term for $a$ shrinks (see the sketch below).
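
A sketch of the selection rule in Python (the name ucb_action is illustrative); actions that have never been tried are treated as maximizing and selected first, following the convention for $N_t(a) = 0$:

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])          # N(a) = 0: treat a as a maximizing action
    bound = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(bound))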

2.8 Gradient Bandit Algorithms

Instead of estimating action values, we learn a numerical preference $H_t(a)$ for each action.

Soft-max Distribution

Action Probabilities:

$$\Pr\{A_t = a\} \doteq \pi_t(a) \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

Update Rule (Stochastic Gradient Ascent)

Upon receiving reward $R_t$:

Preference Update

$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\bigl(R_t - \bar{R}_t\bigr)\bigl(1 - \pi_t(A_t)\bigr)$$
$$H_{t+1}(a) \doteq H_t(a) - \alpha\bigl(R_t - \bar{R}_t\bigr)\pi_t(a) \qquad \text{for all } a \neq A_t$$

$\bar{R}_t$: The average of the rewards up to time $t$, used as a baseline. It reduces the variance of the updates without changing the expected update (a minimal update sketch follows).
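
A minimal sketch of one preference update (names are illustrative); the soft-max uses a max-shift purely for numerical stability, and the baseline is assumed to be the running average of rewards observed so far:

import numpy as np

def gradient_bandit_step(H, A, R, baseline, alpha=0.1):
    """One stochastic-gradient-ascent update of the preferences H after taking action A."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                        # soft-max action probabilities pi_t(a)
    H = H - alpha * (R - baseline) * pi   # every action moves down in proportion to pi_t(a)...
    H[A] += alpha * (R - baseline)        # ...and the selected action gets the extra +alpha(R - baseline)
    return H

Passing baseline=0 leaves the expected update unchanged but increases its variance, which is why the reward baseline speeds up learning.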


2.9 Associative Search (Contextual Bandits)

In Contextual Bandits, the learner is given a “clue” or signal about the situation.

  • Goal: Learn a policy (mapping from situation/context to action).
  • Position: Intermediate between simple bandits (no context) and full RL (actions affect future state/context).

Comparison Summary

| Method | Key Parameter | Exploration Strategy |
| --- | --- | --- |
| Greedy | None | Exploitation only |
| $\varepsilon$-greedy | $\varepsilon$ | Random selection with probability $\varepsilon$ |
| Optimistic Initial Values | $Q_1$ | Initial “disappointment” forces trial of every action |
| UCB | $c$ | Uncertainty-based |
| Gradient Bandit | $\alpha$ | Soft-max preferences relative to a baseline |

Summary of Parameter Study (Figure 2.6)

All algorithms show an “inverted-U” performance curve as a function of their parameter: performance is best at an intermediate parameter value. UCB typically performs best on the stationary 10-armed testbed, but it is harder to generalize to large state spaces than $\varepsilon$-greedy or gradient methods.


Key Takeaways

  1. Evaluative feedback requires a balance of exploration and exploitation.
  2. Sample averages are appropriate for stationary problems; constant step sizes track nonstationary ones.
  3. Optimistic initialization is a simple but limited trick for initial exploration.
  4. UCB provides a more sophisticated exploration based on uncertainty.
  5. Gradient bandits optimize preferences rather than estimating values directly.
  6. Baselines in gradient methods dramatically speed up learning by reducing variance.