Expected SARSA
Expected SARSA Update
Between SARSA and Q-Learning
Instead of sampling a single next action (as SARSA does) or taking the maximum over next actions (as Q-learning does), Expected SARSA uses the expected next-action value under the current target policy. This removes the variance introduced by sampling the next action in SARSA, while remaining more general than Q-learning.
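In symbols, the update replaces SARSA's sampled next-action value with its expectation under the policy π:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
$$

The sum over actions is what distinguishes it: SARSA uses $Q(S_{t+1}, A_{t+1})$ for a sampled $A_{t+1}$, and Q-learning uses $\max_a Q(S_{t+1}, a)$.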
- With a greedy target policy: Expected SARSA = Q-learning
- With an ε-greedy target policy: Expected SARSA accounts for the exploration probability
- Generally lower variance than SARSA, at the cost of slightly more computation per step (a sum over all actions)
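The points above can be sketched as a single tabular update step. This is a minimal illustration (the function name and arguments are ours, not from the text), assuming an ε-greedy target policy over a 2-D table of action values:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """One Expected SARSA update with an epsilon-greedy target policy.

    Q: array of shape (n_states, n_actions), updated in place.
    """
    n_actions = Q.shape[1]
    # epsilon-greedy probabilities over next actions:
    # each action gets epsilon / n_actions, the greedy action gets the rest
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    # expectation over next actions instead of a sampled one (SARSA)
    # or the max (Q-learning)
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```

Note that with `epsilon = 0.0` the probability vector puts all mass on the greedy action, so `expected_q` reduces to `max(Q[s_next])` and the update is exactly Q-learning, matching the first bullet above.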