Upper Confidence Bound

Jun 06, 2026 · 1 min read

foundations

Upper Confidence Bound (UCB)

UCB Action Selection

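$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$

where:

$Q_t(a)$: estimated value of action $a$ (exploitation term)

$c \sqrt{\frac{\ln t}{N_t(a)}}$: exploration bonus (decreases as action $a$ is tried more, and grows slowly with $t$ while $a$ goes unselected)

$N_t(a)$: number of times action $a$ has been selected before time $t$ (if $N_t(a) = 0$, $a$ is treated as a maximizing action)

$t$: current time step ($\ln t$ is its natural logarithm)

$c > 0$: controls the degree of exploration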
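The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ucb_select`, the list-based inputs, and the default `c=2.0` are assumptions made for the example, and untried actions are simply picked first, in index order.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return the index of the action chosen by UCB at time step t (t >= 1).

    q_values[a] plays the role of Q_t(a), counts[a] the role of N_t(a).
    All names here are hypothetical, chosen for this sketch.
    """
    # An action with N_t(a) = 0 is treated as maximizing, so try it first.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    # Otherwise score each action: value estimate plus exploration bonus.
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# The rarely tried action wins despite its lower estimate.
print(ucb_select([0.5, 0.4], [100, 1], t=101, c=2.0))
# With a very small c the bonus barely matters and the greedy action wins.
print(ucb_select([0.5, 0.4], [100, 1], t=101, c=0.01))
```

With `c=2.0` the second action is chosen because its bonus is large; with `c=0.01` the bonus is negligible and the choice falls back to the higher estimate.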

Optimism in the Face of Uncertainty

UCB adds a bonus to actions that haven’t been tried much. The less you know about an action, the higher its bonus. As you try it more, the bonus shrinks. This systematically explores uncertain options before settling on the best.
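To make the shrinking bonus concrete, the sketch below evaluates the exploration term for a few visit counts. The helper name `exploration_bonus` and the values `t = 1000`, `c = 2` are assumptions picked only for illustration.

```python
import math

def exploration_bonus(t, n, c=2.0):
    """Exploration term c * sqrt(ln t / N_t(a)) for an action tried n >= 1 times."""
    return c * math.sqrt(math.log(t) / n)

# At a fixed time step, more visits mean a smaller bonus.
for n in (1, 10, 100):
    print(n, round(exploration_bonus(1000, n), 3))

# For a fixed visit count, the bonus creeps up as time passes,
# so a neglected action is eventually revisited.
for t in (10, 100, 1000):
    print(t, round(exploration_bonus(t, 5), 3))
```

At `t = 1000` with `c = 2`, the bonus is roughly 5.26 after one try, 1.66 after ten, and 0.53 after a hundred.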

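UCB is more principled than the Epsilon-Greedy Policy: it preferentially explores actions whose value estimates are still uncertain, whereas epsilon-greedy picks its exploratory actions uniformly at random, regardless of how often each has been tried.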
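The contrast can be illustrated with a small deterministic comparison. This is a sketch under assumed numbers: the helper names `epsilon_greedy_probs` and `ucb_scores`, the three-action setup, and the values of `epsilon` and `c` are all hypothetical.

```python
import math

def epsilon_greedy_probs(q_values, epsilon):
    """Selection probability of each action under epsilon-greedy."""
    k = len(q_values)
    greedy = max(range(k), key=q_values.__getitem__)
    probs = [epsilon / k] * k       # uniform exploration share
    probs[greedy] += 1.0 - epsilon  # the rest goes to the greedy action
    return probs

def ucb_scores(q_values, counts, t, c):
    """UCB score of each action (assumes every count is at least 1)."""
    return [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]

q = [0.6, 0.5, 0.5]  # actions 1 and 2 have the same value estimate
n = [49, 50, 1]      # but action 2 has been tried only once
t = sum(n) + 1

probs = epsilon_greedy_probs(q, epsilon=0.1)
scores = ucb_scores(q, n, t, c=1.0)
print(probs)   # actions 1 and 2 get the same exploration probability
print(scores)  # action 2 gets by far the highest UCB score
```

Epsilon-greedy cannot tell the well-explored action 1 from the barely tried action 2, while UCB ranks action 2 first because its estimate is the least certain.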

Appears In

RL-L01 - Intro, MDPs & Bandits, RL-Book Ch2 - Multi-Armed Bandits
