Study Notes

Epsilon-Greedy Policy

Jun 06, 2026 · 1 min read

foundations
exam-topic

ε-Greedy Policy

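An ε-greedy policy selects the greedy action (highest estimated value) with probability $1 - ε$, and a uniformly random action with probability $ε$. Because the random draw is over all actions and can also land on the greedy one, the greedy action's total probability is $1 - ε + ε / |A(s)|$.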
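The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy_action`, the list-of-floats representation of $Q(s, \cdot)$, and the choice to break ties by lowest index are all assumptions made for the example.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    q_values: estimated values Q(s, a) for one state (hypothetical layout).
    epsilon:  exploration probability in [0, 1].
    rng:      anything with random() and randrange(), e.g. random.Random(seed).
    """
    if rng.random() < epsilon:
        # Explore: uniform over ALL actions, including the greedy one.
        return rng.randrange(len(q_values))
    # Exploit: index of the highest estimate (assumed tie-break: lowest index).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Usage: count how often each action is chosen with epsilon = 0.1.
rng = random.Random(0)
q = [0.1, 0.9, 0.3]
counts = [0] * len(q)
for _ in range(10_000):
    counts[epsilon_greedy_action(q, 0.1, rng)] += 1
```

With a small $ε$, action 1 (the greedy one here) should dominate `counts`, while the other two are still picked occasionally.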

ε-Greedy Action Probabilities

$π(a \mid s) = \begin{cases} 1 - ε + \frac{ε}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{ε}{|A(s)|} & \text{otherwise} \end{cases}$
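As a quick check of these probabilities, the sketch below builds the full distribution $π(\cdot \mid s)$ for one state. The helper name `epsilon_greedy_probs` and the example values are hypothetical, and a unique greedy action is assumed (ties go to the lowest index). For $ε = 0.2$ and four actions, the greedy action gets $1 - 0.2 + 0.2/4 = 0.85$ and each other action gets $0.2/4 = 0.05$.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return [pi(a | s) for each action a] under an epsilon-greedy policy."""
    n = len(q_values)
    # Greedy action; assumed tie-break: lowest index.
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action receives the exploration share epsilon / |A(s)| ...
    probs = [epsilon / n] * n
    # ... and the greedy action additionally receives 1 - epsilon.
    probs[greedy] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.9, 0.3, 0.2], epsilon=0.2)
# The probabilities form a valid distribution (up to floating-point error).
assert abs(sum(probs) - 1.0) < 1e-12
```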

The simplest method for balancing Exploration vs Exploitation.
ε-greedy is a special case of ε-soft policies (where $π(a \mid s) \geq ε / |A(s)|$ for all $a$).
It is common to decay $ε$ over time: high early (explore) → low later (exploit).
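One possible decay schedule is sketched below. The note does not prescribe a particular schedule, so the exponential form and the parameter names and defaults (`eps_start`, `eps_end`, `decay_rate`) are assumptions chosen for illustration.

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_rate=0.995):
    """Exponentially decay epsilon from eps_start toward the floor eps_end.

    step: number of episodes (or time steps) elapsed, starting at 0.
    All default values are hypothetical; tune them per problem.
    """
    return eps_end + (eps_start - eps_end) * decay_rate ** step

# Epsilon at a few points in training: high early, low later.
schedule = [decayed_epsilon(t) for t in (0, 100, 1000)]
```

Keeping the floor `eps_end` above zero means every action retains probability of at least `eps_end / |A(s)|`, so the policy stays ε-soft throughout training.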

Connections

Used by: SARSA, Q-Learning, Monte Carlo Control, Deep Q-Network (DQN)
Alternatives: Upper Confidence Bound, Boltzmann/softmax

Appears In

RL-L01 - Intro, MDPs & Bandits, RL-L03 - Monte Carlo Methods, RL-L04 - Temporal Difference Learning
