On-Policy Learning
Methods that evaluate and improve the same policy that is used to generate behavior (data collection). The agent learns about the policy it is currently following.
This is one side of the On-Policy vs Off-Policy distinction.
Intuition
In on-policy learning, you learn by doing — the data you collect comes from the policy you’re trying to improve. This means every update reflects your actual behavior, but you can’t learn from data generated by a different (e.g., exploratory) policy.
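Because exploration must live inside the learned policy itself, on-policy methods typically use a stochastic policy such as ε-greedy. A minimal sketch of ε-greedy action selection (the function name and signature are illustrative, not from any particular library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon,
    otherwise explore uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # argmax over action indices
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this degenerates to pure greedy selection; the exploration rate is often annealed toward zero over training.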
Key On-Policy Algorithms
- SARSA — TD control using the action actually taken
- Monte Carlo Control with ε-greedy (without Exploring Starts)
- Semi-gradient SARSA with Function Approximation
- REINFORCE and Policy Gradient Methods
- Actor-Critic methods (on-policy variants)
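SARSA is the canonical on-policy TD control method: its update target bootstraps from the action the current policy actually takes next, not from a greedy max. A minimal tabular sketch (the `defaultdict` Q-table and hyperparameter values are illustrative assumptions):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy TD update.

    The target uses Q[s_next, a_next], where a_next is the action the
    current (e.g., epsilon-greedy) policy actually selects in s_next --
    unlike Q-learning, which would use max over actions in s_next.
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```

Note that the next action must be chosen *before* the update, since the update depends on it; this is what ties the learned values to the behavior policy.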
Trade-offs
| | On-Policy | Off-Policy |
|---|---|---|
| Data from | Current policy | Any policy (behavior policy) |
| Stability | More stable (no Importance Sampling) | Can diverge (Deadly Triad) |
| Data efficiency | Lower (can’t reuse old data) | Higher (replay buffers) |
| Exploration | Must be built into the learned policy (e.g., ε-greedy) | Handled by a separate behavior policy |
| Examples | SARSA, REINFORCE, A2C | Q-Learning, DQN |
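The "data from" row of the table comes down to how the bootstrap target is formed. A side-by-side sketch of the two targets (function names are illustrative):

```python
def sarsa_target(r, gamma, q_next, a_next):
    # On-policy: bootstrap from the action the behavior policy actually took.
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    # Off-policy: bootstrap from the greedy action, regardless of what
    # the behavior policy did in the next state.
    return r + gamma * max(q_next)
```

The two targets coincide only when the behavior policy happens to act greedily, which is why on-policy data cannot simply be replayed after the policy changes.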
Connections
- Paired with Off-Policy Learning
- Full comparison: On-Policy vs Off-Policy
- On-policy distribution matters for convergence with Function Approximation (see On-Policy Distribution)