# Policy

## Definition

A policy $\pi$ is a mapping from states to probabilities of selecting each possible action. If an agent follows policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$.
- Deterministic policy: $a = \pi(s)$ — maps each state to exactly one action
- Stochastic policy: $\pi(a \mid s)$ — a probability distribution over actions for each state, where $\sum_a \pi(a \mid s) = 1$
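The two policy types above can be sketched in a few lines of Python. This is a minimal illustration with made-up states (`"s0"`, `"s1"`) and actions; the names are hypothetical, not from any library.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def act(policy, state, rng=random):
    """Select an action: directly for a deterministic policy, by sampling
    from pi(a|s) for a stochastic one."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic: sample from the distribution
        actions, probs = zip(*choice.items())
        return rng.choices(actions, weights=probs, k=1)[0]
    return choice  # deterministic: exactly one action per state

print(act(deterministic_policy, "s0"))  # always "left"
```

Note that each stochastic row sums to 1, matching the constraint $\sum_a \pi(a \mid s) = 1$.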
## Types of Policies

### Greedy Policy

Always picks the action with the highest estimated value, $a = \arg\max_a Q(s, a)$. Pure exploitation, no exploration.
### ε-Greedy Policy (Epsilon-Greedy Policy)

Mostly greedy, but with probability $\varepsilon$ picks a uniformly random action. Balances Exploration vs Exploitation.
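A minimal sketch of ε-greedy selection over a row of action values $Q(s, \cdot)$; setting $\varepsilon = 0$ recovers the pure greedy policy described above.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of action values Q(s, .).

    With probability epsilon: uniform random action (exploration).
    Otherwise: the argmax action (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon = 0 reduces to the pure greedy policy:
print(epsilon_greedy([1.0, 3.0, 2.0], epsilon=0.0))  # 1 (index of the max value)
```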
### Softmax / Boltzmann Policy

Selects each action with probability proportional to its exponentiated value estimate:

$$\pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$

The temperature $\tau$ controls exploration: high $\tau$ → nearly uniform, low $\tau$ → nearly greedy.
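The effect of the temperature can be seen directly in a small sketch (made-up Q-values):

```python
import math

def softmax_policy(q_values, tau):
    """Boltzmann distribution over actions: pi(a) proportional to exp(Q(a)/tau)."""
    # Subtract the max before exponentiating, for numerical stability.
    m = max(q / tau for q in q_values)
    exps = [math.exp(q / tau - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 3.0]
print(softmax_policy(q, tau=100.0))  # high tau: close to uniform (each ~1/3)
print(softmax_policy(q, tau=0.1))    # low tau: almost all mass on the argmax
```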
## Optimal Policy

A policy $\pi_*$ is optimal if $v_{\pi_*}(s) \ge v_{\pi}(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$. There always exists at least one optimal policy for any finite MDP.

All optimal policies share the same optimal value functions $v_*$ and $q_*$. Given $q_*$, acting greedily is optimal: $\pi_*(s) = \arg\max_a q_*(s, a)$.
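Extracting a greedy policy from $q_*$ is a one-liner; here is a sketch with a hypothetical 2-state, 2-action Q-table (the values are invented for illustration):

```python
# Hypothetical optimal action-value table q*(s, a).
q_star = {
    "s0": {"a0": 0.5, "a1": 1.2},
    "s1": {"a0": 2.0, "a1": 0.3},
}

# pi*(s) = argmax_a q*(s, a): act greedily with respect to q*.
pi_star = {s: max(actions, key=actions.get) for s, actions in q_star.items()}
print(pi_star)  # {'s0': 'a1', 's1': 'a0'}
```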
## Policy in Different RL Methods
| Method | How policy is used |
|---|---|
| Policy Iteration | Explicit policy, alternates evaluation and improvement |
| Value Iteration | Implicit policy (greedy w.r.t. current $V$) |
| Monte Carlo Methods | Generates episodes, improved via ε-greedy |
| SARSA | On-policy: follows and improves ε-greedy |
| Q-Learning | Off-policy: follows ε-greedy, learns about greedy |
| REINFORCE | Directly parameterized: $\pi_\theta(a \mid s)$ |
## On-Policy vs Off-Policy

### Key Distinction

- Behavior policy $b$: the policy used to generate data (select actions)
- Target policy $\pi$: the policy being evaluated or improved
- On-policy: $b = \pi$ (same policy)
- Off-policy: $b \neq \pi$ (different policies, requires Importance Sampling correction)
See On-Policy vs Off-Policy for details.
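The importance sampling correction mentioned above can be sketched as the product of per-step ratios $\pi(a \mid s) / b(a \mid s)$ along a trajectory; this reweights returns generated by the behavior policy so they estimate values under the target policy. The policies and trajectory below are made up for illustration.

```python
def is_ratio(target_probs, behavior_probs, trajectory):
    """Product of pi(a|s) / b(a|s) over a trajectory of (state, action) pairs."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_probs[state][action] / behavior_probs[state][action]
    return rho

pi = {"s0": {"a0": 1.0, "a1": 0.0}}  # target: greedy, always picks a0
b  = {"s0": {"a0": 0.5, "a1": 0.5}}  # behavior: uniform over both actions
print(is_ratio(pi, b, [("s0", "a0")]))  # 2.0: upweights the under-sampled action
```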
## Connections
- Acts within: Markov Decision Process
- Evaluated by: Value Function, Bellman Equation
- Improved by: Policy Iteration, Generalized Policy Iteration
- Parameterized: REINFORCE, Policy Gradient Theorem