Exploration vs Exploitation

Definition

Exploration vs Exploitation Dilemma

The fundamental trade-off in reinforcement learning (RL) between:

  • Exploitation: Choosing the action with the highest estimated value (using current knowledge to maximize immediate reward)
  • Exploration: Choosing non-greedy actions to gather more information about their values (potentially discovering better long-term strategies)
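The two choices above can be combined in a few lines. A minimal sketch (the function name, the toy value estimates, and the 0.1 exploration rate are illustrative, not from the source):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (pick a uniformly random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

# Toy value estimates for three actions; epsilon=0.1 exploits 90% of the time.
q = [0.2, 0.9, 0.5]
action = epsilon_greedy(q, epsilon=0.1)
```

With `epsilon=0` this is pure exploitation (always the current argmax); with `epsilon=1` it is pure exploration (uniformly random), so the single parameter interpolates between the two extremes.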

The Restaurant Analogy

You know a good restaurant nearby (exploitation). But there might be an even better one you haven’t tried (exploration). Every night you eat at the known-good place, you miss the chance to discover something better. But every night you try somewhere random, you risk a bad meal. The optimal strategy is somewhere in between.

Why It Matters

  • If you only exploit: You may converge to a suboptimal policy because you never discovered better actions
  • If you only explore: You waste time on clearly bad actions and never capitalize on what you’ve learned
  • The optimal balance depends on the time horizon: more exploration early, more exploitation later
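The "more exploration early, more exploitation later" point is often realized with a decaying exploration rate. One possible schedule, sketched here with assumed start/end values and decay horizon (none of these numbers come from the source):

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear schedule: fully exploratory at step 0, then anneal toward a
    small residual exploration rate over decay_steps, and hold it there."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Plugging this into an epsilon-greedy rule gives heavy exploration while value estimates are unreliable and mostly exploitation once they have converged; exponential decay is an equally common alternative.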

Methods for Balancing

Method                     | Type             | Key Idea
---------------------------|------------------|-----------------------------------------------------
Epsilon-Greedy Policy      | Action selection | Random action with probability ε, greedy otherwise
Upper Confidence Bound     | Action selection | Bonus for uncertainty: prefer under-explored actions
Optimistic Initial Values  | Initialization   | High initial values drive early exploration
Gradient Bandit            | Policy gradient  | Softmax over learned preferences
Exploring Starts           | Episode init     | Random starting state-action pairs
Boltzmann/Softmax          | Action selection | Temperature-controlled randomness
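Two of the table's action-selection rules can be sketched briefly. These are illustrative implementations under assumed defaults (the exploration constant `c=2.0` and `temperature=1.0` are choices made here, not values from the source):

```python
import math
import random

def ucb_action(q_values, counts, t, c=2.0):
    """Upper Confidence Bound: add an uncertainty bonus that shrinks as an
    action is tried more often; any never-tried action is selected first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

def softmax_action(preferences, temperature=1.0, rng=random):
    """Boltzmann/softmax: sample in proportion to exp(preference / T).
    High temperature -> near-uniform (exploration); low -> near-greedy."""
    m = max(p / temperature for p in preferences)          # for numeric stability
    exps = [math.exp(p / temperature - m) for p in preferences]
    total = sum(exps)
    r, cum = rng.random(), 0.0
    for a, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return a
    return len(exps) - 1
```

Note how both rules encode exploration differently: UCB explores deterministically by boosting under-sampled actions, while softmax explores stochastically, with the temperature playing the role epsilon plays in epsilon-greedy.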

In Different RL Contexts

Connections

Appears In