# Policy

## Definition

A policy $\pi$ is a mapping from states to probabilities of selecting each possible action. If an agent follows policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$.
- Deterministic policy: $a = \pi(s)$ — maps each state to exactly one action
- Stochastic policy: $\pi(a \mid s)$ — a probability distribution over actions for each state, where $\sum_a \pi(a \mid s) = 1$
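The two policy types above can be sketched in a few lines of Python. This is a minimal illustration with made-up states (`"s0"`, `"s1"`) and actions; the names are hypothetical, not from any library.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def act(policy, state, rng=random):
    """Select an action: directly for a deterministic policy, by sampling
    from pi(a|s) for a stochastic one."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic: sample from the distribution
        actions, probs = zip(*choice.items())
        return rng.choices(actions, weights=probs, k=1)[0]
    return choice  # deterministic: exactly one action per state

print(act(deterministic_policy, "s0"))  # always "left"
```

Note that each stochastic row sums to 1, matching the constraint $\sum_a \pi(a \mid s) = 1$.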
## Types of Policies

### Greedy Policy

Always picks the action with the highest estimated value, $a = \arg\max_a Q(s, a)$. Pure exploitation, no exploration.
### ε-Greedy Policy (Epsilon-Greedy Policy)

Mostly greedy, but with probability $\varepsilon$ picks a uniformly random action. Balances Exploration vs Exploitation.
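A minimal sketch of ε-greedy selection over a row of action values $Q(s, \cdot)$; setting $\varepsilon = 0$ recovers the pure greedy policy described above.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of action values Q(s, .).

    With probability epsilon: uniform random action (exploration).
    Otherwise: the argmax action (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon = 0 reduces to the pure greedy policy:
print(epsilon_greedy([1.0, 3.0, 2.0], epsilon=0.0))  # 1 (index of the max value)
```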
### Softmax / Boltzmann Policy

Selects each action with probability proportional to its exponentiated value estimate:

$$\pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$

The temperature $\tau$ controls exploration: high $\tau$ → nearly uniform, low $\tau$ → nearly greedy.
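The effect of the temperature can be seen directly in a small sketch (made-up Q-values):

```python
import math

def softmax_policy(q_values, tau):
    """Boltzmann distribution over actions: pi(a) proportional to exp(Q(a)/tau)."""
    # Subtract the max before exponentiating, for numerical stability.
    m = max(q / tau for q in q_values)
    exps = [math.exp(q / tau - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 3.0]
print(softmax_policy(q, tau=100.0))  # high tau: close to uniform (each ~1/3)
print(softmax_policy(q, tau=0.1))    # low tau: almost all mass on the argmax
```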
## Optimal Policy

A policy $\pi_*$ is optimal if $v_{\pi_*}(s) \ge v_{\pi}(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$. There always exists at least one optimal policy for any finite MDP.

All optimal policies share the same optimal value functions $v_*$ and $q_*$. Given $q_*$, acting greedily is optimal: $\pi_*(s) = \arg\max_a q_*(s, a)$.
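Extracting a greedy policy from $q_*$ is a one-liner; here is a sketch with a hypothetical 2-state, 2-action Q-table (the values are invented for illustration):

```python
# Hypothetical optimal action-value table q*(s, a).
q_star = {
    "s0": {"a0": 0.5, "a1": 1.2},
    "s1": {"a0": 2.0, "a1": 0.3},
}

# pi*(s) = argmax_a q*(s, a): act greedily with respect to q*.
pi_star = {s: max(actions, key=actions.get) for s, actions in q_star.items()}
print(pi_star)  # {'s0': 'a1', 's1': 'a0'}
```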
## Policy in Different RL Methods
| Method | How policy is used |
|---|---|
| Policy Iteration | Explicit policy, alternates evaluation and improvement |
| Value Iteration | Implicit policy (greedy w.r.t. current $V$) |
| Monte Carlo Methods | Generates episodes, improved via ε-greedy |
| SARSA | On-policy: follows and improves ε-greedy |
| Q-Learning | Off-policy: follows ε-greedy, learns about greedy |
| REINFORCE | Directly parameterized: $\pi_\theta(a \mid s)$ |
## On-Policy vs Off-Policy

### Key Distinction

- Behavior policy $b$: the policy used to generate data (select actions)
- Target policy $\pi$: the policy being evaluated or improved
- On-policy: $b = \pi$ (same policy)
- Off-policy: $b \neq \pi$ (different policies, requires Importance Sampling correction)
See On-Policy vs Off-Policy for details.
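The importance sampling correction mentioned above can be sketched as the product of per-step ratios $\pi(a \mid s) / b(a \mid s)$ along a trajectory; this reweights returns generated by the behavior policy so they estimate values under the target policy. The policies and trajectory below are made up for illustration.

```python
def is_ratio(target_probs, behavior_probs, trajectory):
    """Product of pi(a|s) / b(a|s) over a trajectory of (state, action) pairs."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_probs[state][action] / behavior_probs[state][action]
    return rho

pi = {"s0": {"a0": 1.0, "a1": 0.0}}  # target: greedy, always picks a0
b  = {"s0": {"a0": 0.5, "a1": 0.5}}  # behavior: uniform over both actions
print(is_ratio(pi, b, [("s0", "a0")]))  # 2.0: upweights the under-sampled action
```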
## Connections
- Acts within: Markov Decision Process
- Evaluated by: Value Function, Bellman Equation
- Improved by: Policy Iteration, Generalized Policy Iteration
- Parameterized: REINFORCE, Policy Gradient Theorem