On-Policy Distribution
On-Policy Distribution ($\mu$)
The on-policy distribution, denoted $\mu(s)$, is the stationary distribution of states encountered while following a policy $\pi$. It represents the fraction of time the agent spends in each state in the limit as time goes to infinity.
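As a stationary distribution, $\mu$ is a fixed point of the state-transition dynamics induced by $\pi$, and can be approximated by repeatedly pushing any starting distribution through the transition matrix. Below is a minimal sketch, assuming a hypothetical 3-state chain `P` (the transition probabilities are made up for illustration):

```python
# Hypothetical 3-state Markov chain induced by a fixed policy pi:
# P[s][t] = probability of transitioning from state s to state t under pi.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.0, 0.9],
]

def stationary_distribution(P, iters=10_000):
    """Approximate mu by repeatedly pushing a distribution through P."""
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: mu' = mu @ P
        mu = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
    return mu

mu = stationary_distribution(P)
print([round(x, 3) for x in mu])  # long-run fraction of time in each state
```

For this particular chain the agent settles into state 0 most often, even though the iteration started uniform; that long-run visitation frequency is exactly what $\mu$ captures.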
The Weighting
In function approximation, the on-policy distribution is used to weight the Mean Squared Value Error (MSVE):

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \mathbf{w})\bigr]^2$$

where:
- $\mu(s)$ — the probability of being in state $s$ under policy $\pi$.
- $v_\pi(s)$ — the true value function.
- $\hat{v}(s, \mathbf{w})$ — the approximate value function with parameters $\mathbf{w}$.
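The sum above translates directly into code. A minimal sketch, with all numbers (the distribution, true values, and approximate values) invented for illustration:

```python
# Hypothetical values for a 3-state problem (all numbers assumed).
mu = [0.7, 0.2, 0.1]          # on-policy distribution mu(s)
v_true = [1.0, 2.0, 3.0]      # true values v_pi(s)
v_hat = [1.1, 1.8, 2.0]       # approximate values v_hat(s, w)

def msve(mu, v_true, v_hat):
    """Mean squared value error, weighted by the on-policy distribution."""
    return sum(m * (vt - vh) ** 2 for m, vt, vh in zip(mu, v_true, v_hat))

print(msve(mu, v_true, v_hat))
```

Note how the large error in state 2 (a full unit off) is discounted because $\mu$ assigns it only 10% weight, while the small errors in the frequently visited states dominate less than they would under uniform weighting.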
Why the Weighting Matters
We cannot usually approximate the value function perfectly for all states. The on-policy distribution tells us which states are the most important to get right—namely, those we visit most often. If $\mu(s)$ is high, an error in state $s$ contributes more to the total loss than an error in a state the agent rarely visits.
Properties in RL
- Self-Weighting: In on-policy learning, the updates naturally follow $\mu$ because states are sampled by interacting with the environment using $\pi$.
- Objective Function: It defines the “average” error we are trying to minimize during gradient descent updates to the value function parameters.
Connections
- Used in: Value Function Approximation, Mean Squared Value Error (MSVE)
- Contrast with: Off-Policy Learning, where states are sampled from the distribution induced by a separate behavior policy rather than from $\mu$.