On-Policy Distribution
On-Policy Distribution ($\mu$)
The on-policy distribution, denoted $\mu(s)$, is the stationary distribution of states encountered while following a policy $\pi$. It represents the fraction of time the agent spends in each state in the limit as time goes to infinity.
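As a stationary distribution, $\mu$ is a fixed point of the state-transition dynamics induced by $\pi$, and can be approximated by repeatedly pushing any starting distribution through the transition matrix. Below is a minimal sketch, assuming a hypothetical 3-state chain `P` (the transition probabilities are made up for illustration):

```python
# Hypothetical 3-state Markov chain induced by a fixed policy pi:
# P[s][t] = probability of transitioning from state s to state t under pi.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.0, 0.9],
]

def stationary_distribution(P, iters=10_000):
    """Approximate mu by repeatedly pushing a distribution through P."""
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: mu' = mu @ P
        mu = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
    return mu

mu = stationary_distribution(P)
print([round(x, 3) for x in mu])  # long-run fraction of time in each state
```

For this particular chain the agent settles into state 0 most often, even though the iteration started uniform; that long-run visitation frequency is exactly what $\mu$ captures.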
The Weighting
In function approximation, the on-policy distribution is used to weight the Mean Squared Value Error (MSVE):

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \mathbf{w})\bigr]^2$$

where:
- $\mu(s)$ — the probability of being in state $s$ under policy $\pi$.
- $v_\pi(s)$ — the true value function.
- $\hat{v}(s, \mathbf{w})$ — the approximate value function with parameters $\mathbf{w}$.
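The sum above translates directly into code. A minimal sketch, with all numbers (the distribution, true values, and approximate values) invented for illustration:

```python
# Hypothetical values for a 3-state problem (all numbers assumed).
mu = [0.7, 0.2, 0.1]          # on-policy distribution mu(s)
v_true = [1.0, 2.0, 3.0]      # true values v_pi(s)
v_hat = [1.1, 1.8, 2.0]       # approximate values v_hat(s, w)

def msve(mu, v_true, v_hat):
    """Mean squared value error, weighted by the on-policy distribution."""
    return sum(m * (vt - vh) ** 2 for m, vt, vh in zip(mu, v_true, v_hat))

print(msve(mu, v_true, v_hat))
```

Note how the large error in state 2 (a full unit off) is discounted because $\mu$ assigns it only 10% weight, while the small errors in the frequently visited states dominate less than they would under uniform weighting.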
Why the Weighting Matters
We cannot usually approximate the value function perfectly for all states. The on-policy distribution tells us which states are the most important to get right—namely, those we visit most often. If $\mu(s)$ is high, an error in state $s$ contributes more to the total loss than an error in a state the agent rarely visits.
Properties in RL
- Self-Weighting: In on-policy learning, the updates naturally follow $\mu$ because states are sampled by interacting with the environment using $\pi$.
- Objective Function: It defines the “average” error we are trying to minimize during gradient descent updates to the value function parameters.
Connections
- Used in: Value Function Approximation, Mean Squared Value Error (MSVE)
- Contrast with: Off-Policy Learning, where states are sampled from the distribution induced by a separate behavior policy rather than from $\mu$.