On-Policy Learning

On-policy methods evaluate and improve the same policy that is used to generate behavior (data collection). The agent learns about the policy it is currently following.

This is one side of the On-Policy vs Off-Policy distinction.

Intuition

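In on-policy learning, you learn by doing: the data you collect comes from the policy you’re trying to improve. Every update therefore reflects your actual behavior, exploration included, but you can’t directly learn from data generated by a different policy (e.g., a separate exploratory policy, or an older version of the current one) without an off-policy correction such as importance sampling.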
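As a concrete illustration, here is a minimal sketch of tabular SARSA, one of the on-policy algorithms. The chain task, the function names (`epsilon_greedy`, `sarsa_update`, `run_sarsa`), and the hyperparameters are assumptions made for this example only. The point to notice is that `a_next` in the update target is drawn from the same epsilon-greedy policy that acts in the environment.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Sample from the current policy: random with prob. epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(q[state])
    # Break ties at random so an all-zero table does not always pick action 0.
    best_actions = [a for a in range(n_actions) if q[state][a] == best]
    return rng.choice(best_actions)


def sarsa_update(q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """One SARSA step. The target uses a_next, the action the policy actually took."""
    target = r if done else r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def run_sarsa(n_states=5, episodes=200, epsilon=0.1, seed=0):
    """Hypothetical chain task: action 1 moves right, action 0 moves left.

    Reaching the rightmost state ends the episode with reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q, s, 2, epsilon, rng)
        for _ in range(100):  # step cap so an episode always ends
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # The next action comes from the SAME policy that is being improved.
            a_next = epsilon_greedy(q, s_next, 2, epsilon, rng)
            sarsa_update(q, s, a, r, s_next, a_next, done)
            if done:
                break
            s, a = s_next, a_next
    return q
```

Calling `run_sarsa()` returns the learned table. Because exploratory actions feed into the targets, the values describe the epsilon-greedy policy itself rather than a purely greedy one.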

Key On-Policy Algorithms

Trade-offs

| | On-Policy | Off-Policy |
|---|---|---|
| Data from | Current policy | Any policy (behavior policy) |
| Stability | More stable (no Importance Sampling) | Can diverge (Deadly Triad) |
| Data efficiency | Lower (can’t reuse old data) | Higher (replay buffers) |
| Exploration | Must balance within the policy | Separate behavior policy |
| Examples | SARSA, REINFORCE, A2C | Q-Learning, DQN |
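The data-efficiency trade-off can be sketched in terms of how collected transitions are stored. The two classes below, `OnPolicyBuffer` and `ReplayBuffer`, are hypothetical names for this illustration and are not taken from any particular library. The on-policy buffer hands each transition out once and then forgets it, while the replay buffer keeps old transitions and can sample them repeatedly.

```python
import random
from collections import deque


class OnPolicyBuffer:
    """Holds only data from the current policy; emptied after every update."""

    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def drain(self):
        # Return everything collected so far and discard it: once the policy
        # has been updated, these transitions no longer come from it.
        batch, self.data = self.data, []
        return batch


class ReplayBuffer:
    """Keeps transitions from older policies so they can be reused."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest entries drop out when full

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        # The same transition may be drawn again in later calls.
        return rng.sample(list(self.data), batch_size)
```

In this sketch an on-policy learner would call `drain()` once per update, whereas an off-policy learner would call `sample()` many times over the same stored data.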

Connections

Appears In