On-Policy Learning
Methods that evaluate and improve the same policy that is used to generate behavior (data collection). The agent learns about the policy it is currently following.
This is one side of the On-Policy vs Off-Policy distinction.
Intuition
In on-policy learning, you learn by doing — the data you collect comes from the policy you’re trying to improve. This means every update reflects your actual behavior, but you can’t learn from data generated by a different (e.g., exploratory) policy.
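Because exploration must live inside the learned policy itself, on-policy methods typically use a stochastic policy such as ε-greedy. A minimal sketch of ε-greedy action selection (the function name and signature are illustrative, not from any particular library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon,
    otherwise explore uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # argmax over action indices
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this degenerates to pure greedy selection; the exploration rate is often annealed toward zero over training.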
Key On-Policy Algorithms
- SARSA — TD control using the action actually taken
- Monte Carlo Control with ε-greedy (without Exploring Starts)
- Semi-gradient SARSA with Function Approximation
- REINFORCE and Policy Gradient Methods
- Actor-Critic methods (on-policy variants)
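SARSA is the canonical on-policy TD control method: its update target bootstraps from the action the current policy actually takes next, not from a greedy max. A minimal tabular sketch (the `defaultdict` Q-table and hyperparameter values are illustrative assumptions):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy TD update.

    The target uses Q[s_next, a_next], where a_next is the action the
    current (e.g., epsilon-greedy) policy actually selects in s_next --
    unlike Q-learning, which would use max over actions in s_next.
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```

Note that the next action must be chosen *before* the update, since the update depends on it; this is what ties the learned values to the behavior policy.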
Trade-offs
| | On-Policy | Off-Policy |
|---|---|---|
| Data from | Current policy | Any policy (behavior policy) |
| Stability | More stable (no Importance Sampling) | Can diverge (Deadly Triad) |
| Data efficiency | Lower (can’t reuse old data) | Higher (replay buffers) |
| Exploration | Must be built into the learned policy (e.g., ε-greedy) | Handled by a separate behavior policy |
| Examples | SARSA, REINFORCE, A2C | Q-Learning, DQN |
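The "data from" row of the table comes down to how the bootstrap target is formed. A side-by-side sketch of the two targets (function names are illustrative):

```python
def sarsa_target(r, gamma, q_next, a_next):
    # On-policy: bootstrap from the action the behavior policy actually took.
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    # Off-policy: bootstrap from the greedy action, regardless of what
    # the behavior policy did in the next state.
    return r + gamma * max(q_next)
```

The two targets coincide only when the behavior policy happens to act greedily, which is why on-policy data cannot simply be replayed after the policy changes.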
Connections
- Paired with Off-Policy Learning
- Full comparison: On-Policy vs Off-Policy
- On-policy distribution matters for convergence with Function Approximation (see On-Policy Distribution)