On-Policy vs Off-Policy

  • On-policy: The agent learns about the same policy it uses to make decisions. The behavior policy equals the target policy ($b = \pi$).
  • Off-policy: The agent learns about a different policy (the target $\pi$) from the one generating data (the behavior $b$). Requires coverage: $b(a|s) > 0$ wherever $\pi(a|s) > 0$.

Comparison

| Property | On-Policy | Off-Policy |
| --- | --- | --- |
| Example | SARSA, On-policy MC | Q-Learning, Off-policy MC |
| Behavior = Target? | Yes ($b = \pi$) | No ($b \neq \pi$) |
| Needs IS correction? | No | Sometimes (Importance Sampling) |
| Can reuse old data? | No (data becomes stale) | Yes (with corrections) |
| Convergence | Generally more stable | Can diverge with FA (Deadly Triad) |
| Explores how? | Must be built into the policy (ε-greedy) | Behavior policy can be anything exploratory |
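The importance-sampling correction mentioned above can be sketched as follows. This is a minimal, illustrative example of the ordinary per-episode IS ratio for off-policy Monte Carlo evaluation; the dictionary-based policy representation and the toy one-state episode are assumptions for the demo, not a real implementation.

```python
# Ordinary importance sampling for off-policy Monte Carlo evaluation:
# a return generated under behavior policy b is reweighted by
# rho = prod_t pi(a_t|s_t) / b(a_t|s_t) to estimate the value under pi.

def is_weight(episode, pi, b):
    """episode: list of (state, action) pairs; pi, b: dicts mapping
    (state, action) -> probability. Returns the importance ratio rho."""
    rho = 1.0
    for s, a in episode:
        # Coverage requirement: b(a|s) > 0 wherever pi(a|s) > 0.
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

# Toy check: one state, two actions; behavior is uniform, target always picks a0.
pi = {("s", "a0"): 1.0, ("s", "a1"): 0.0}
b  = {("s", "a0"): 0.5, ("s", "a1"): 0.5}
print(is_weight([("s", "a0")], pi, b))  # 2.0: this trajectory counts double
print(is_weight([("s", "a1")], pi, b))  # 0.0: impossible under the target policy
```

Trajectories the target policy would take more often than the behavior policy get up-weighted, and vice versa, which is why the estimator remains unbiased but can have high variance over long episodes.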

The Key Benefit of Off-Policy

Off-policy methods can learn the optimal policy while following an exploratory policy. They can also learn from demonstrations, old data, or other agents’ experience. This flexibility comes at the cost of potential instability.
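The "learn from old data" point can be illustrated with a replay-style sketch: Q-learning consuming a fixed batch of stored transitions rather than fresh on-policy experience. The tiny chain MDP, the buffer contents, and the hyperparameters below are all assumptions chosen for the demo.

```python
import random

# Q-learning can consume transitions generated by any sufficiently exploratory
# behavior policy, so old data stored in a replay buffer remains usable.
# Toy deterministic chain: state 0 --right--> 1 --right--> 2 (terminal, reward 1).

buffer = [  # (s, a, r, s_next, done), collected earlier by some behavior policy
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
    (0, "left", 0.0, 0, False),
]

Q = {(s, a): 0.0 for s in (0, 1) for a in ("left", "right")}
alpha, gamma = 0.5, 0.9

random.seed(0)
for _ in range(200):  # replay the same stored experience many times
    s, a, r, s2, done = random.choice(buffer)
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ("left", "right"))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(1, "right")])  # approaches 1.0
print(Q[(0, "right")])  # approaches gamma * 1.0 = 0.9
```

Note that an on-policy method like SARSA could not safely reuse this buffer as-is: its update assumes the next action was chosen by the current policy, which is false for stale data.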

Q-Learning's Trick

Q-Learning is off-policy but doesn’t need importance sampling for control. The $\max_{a'} Q(s', a')$ in its update directly targets the greedy (optimal) policy. This is a special property of Q-learning — most off-policy methods need IS corrections.
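The contrast can be made concrete by comparing the two one-step targets side by side. A minimal sketch, with an assumed toy Q-table; the function names are illustrative, not a standard API.

```python
# SARSA (on-policy) bootstraps from the action the behavior policy actually
# took next; Q-learning (off-policy) bootstraps from the greedy action via max,
# which is why its one-step target needs no importance-sampling ratio.

def sarsa_target(r, gamma, Q, s_next, a_next):
    return r + gamma * Q[(s_next, a_next)]  # follows the behavior policy

def q_learning_target(r, gamma, Q, s_next, actions):
    return r + gamma * max(Q[(s_next, a)] for a in actions)  # greedy target

Q = {("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
# Suppose the eps-greedy behavior policy happened to explore with a0 next:
print(sarsa_target(1.0, 0.9, Q, "s1", "a0"))                # 1.0: explored action
print(q_learning_target(1.0, 0.9, Q, "s1", ["a0", "a1"]))   # 2.8: max action
```

When the behavior policy explores a suboptimal action, SARSA's target reflects that exploration while Q-learning's target ignores it and keeps chasing the greedy policy.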
