Expected SARSA
Expected SARSA Update
Between SARSA and Q-Learning
Instead of sampling a single next action (as SARSA does) or taking the maximum over next actions (as Q-learning does), Expected SARSA uses the expected next-action value under the current target policy. This removes the variance introduced by sampling the next action in SARSA, while remaining more general than Q-learning.
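In symbols, the update replaces SARSA's sampled next-action value with its expectation under the policy π:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
$$

The sum over actions is what distinguishes it: SARSA uses $Q(S_{t+1}, A_{t+1})$ for a sampled $A_{t+1}$, and Q-learning uses $\max_a Q(S_{t+1}, a)$.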
- With a greedy target policy: Expected SARSA = Q-learning
- With an ε-greedy target policy: Expected SARSA accounts for the exploration probability
- Generally lower variance than SARSA, at the cost of slightly more computation per step (a sum over all actions)
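The points above can be sketched as a single tabular update step. This is a minimal illustration (the function name and arguments are ours, not from the text), assuming an ε-greedy target policy over a 2-D table of action values:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """One Expected SARSA update with an epsilon-greedy target policy.

    Q: array of shape (n_states, n_actions), updated in place.
    """
    n_actions = Q.shape[1]
    # epsilon-greedy probabilities over next actions:
    # each action gets epsilon / n_actions, the greedy action gets the rest
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    # expectation over next actions instead of a sampled one (SARSA)
    # or the max (Q-learning)
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```

Note that with `epsilon = 0.0` the probability vector puts all mass on the greedy action, so `expected_q` reduces to `max(Q[s_next])` and the update is exactly Q-learning, matching the first bullet above.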