Upper Confidence Bound (UCB)
UCB Action Selection
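$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
where:
- $Q_t(a)$: estimated value of action $a$ (exploitation term)
- $c \sqrt{\ln t / N_t(a)}$: exploration bonus (decreases as action $a$ is tried more; $t$ is the current time step)
- $N_t(a)$: number of times action $a$ has been selected before time $t$
- $c > 0$: controls the degree of exploration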
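As a minimal sketch, the selection rule can be written as a function that scores each action by its estimated value plus c * sqrt(ln t / N(a)) and returns the highest-scoring one. The function name `ucb_select`, the default `c=2.0`, and the choice to pick untried actions first (their bonus would otherwise divide by zero) are assumptions made for illustration, not part of the notes.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return the index of the action with the highest UCB score.

    q_values: estimated value of each action
    counts:   number of times each action has been selected
    t:        current time step (1-based)
    c:        exploration coefficient (illustrative default)
    """
    # An untried action has an unbounded bonus, so select it first.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    # Score = exploitation term + exploration bonus.
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, `ucb_select([0.5, 0.4], [10, 1], t=11)` picks action 1: its estimate is lower, but it has been tried only once, so its bonus dominates. Setting `c=0.0` removes the bonus and reduces the rule to greedy selection.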
Optimism in the Face of Uncertainty
UCB adds a bonus to actions that haven’t been tried much. The less you know about an action, the higher its bonus. As you try it more, the bonus shrinks. This systematically explores uncertain options before settling on the best.
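To make the shrinking bonus concrete, the sketch below evaluates a bonus of the form c * sqrt(ln t / N) at a fixed time step for increasing selection counts. The helper name `exploration_bonus`, the value `c=2.0`, and the time step `t = 1000` are assumptions chosen only for illustration.

```python
import math

def exploration_bonus(t, n, c=2.0):
    # Bonus for an action selected n times by time step t.
    return c * math.sqrt(math.log(t) / n)

t = 1000
selection_counts = (1, 10, 100, 1000)
bonuses = [exploration_bonus(t, n) for n in selection_counts]

for n, bonus in zip(selection_counts, bonuses):
    print(f"N={n:>4}  bonus={bonus:.3f}")
```

Each tenfold increase in the count divides the bonus by sqrt(10), so a frequently tried action is eventually judged almost entirely on its estimated value.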
UCB is more principled than the Epsilon-Greedy Policy: it preferentially explores actions whose value estimates are uncertain, rather than exploring uniformly at random.
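The contrast can be illustrated with a small sketch. The value estimates, counts, seed, and function names here are hypothetical; the UCB score is assumed to be the estimate plus c * sqrt(ln t / N(a)), and epsilon is set to 1 only to isolate the exploratory picks of epsilon-greedy.

```python
import math
import random

def epsilon_greedy_select(q_values, epsilon, rng):
    # With probability epsilon, explore: every action is equally likely,
    # no matter how often it has already been tried.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def ucb_scores(q_values, counts, t, c=2.0):
    # Assumes every action has been tried at least once.
    return [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]

q_values = [0.50, 0.45, 0.45]   # hypothetical value estimates
counts = [50, 48, 2]            # action 2 has barely been tried
t = sum(counts)

scores = ucb_scores(q_values, counts, t)
ucb_choice = max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
# epsilon = 1.0 forces exploration on every step.
explore_picks = [epsilon_greedy_select(q_values, 1.0, rng) for _ in range(3000)]
explore_share = [explore_picks.count(a) / len(explore_picks) for a in range(3)]

print("UCB picks action", ucb_choice)
print("epsilon-greedy exploration shares:", explore_share)
```

UCB directs its choice to the rarely tried action 2, while the exploratory picks of epsilon-greedy spread roughly evenly over all three actions, including the two that are already well estimated.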