Value Function
Definition
A value function estimates “how good” it is for an agent to be in a given state (or to take a given action in a state). “How good” is defined in terms of expected future rewards — specifically, the expected Return.
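Concretely, the Return being estimated is the discounted sum of future rewards (standard definition, with discount factor $\gamma \in [0, 1]$):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$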
There are two types:
State-Value Function

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The expected Return when starting in state $s$ and following policy $\pi$ thereafter.

Action-Value Function

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The expected Return when starting in state $s$, taking action $a$, and following $\pi$ thereafter.
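These definitions can be checked numerically. A minimal sketch that estimates $v_\pi(s)$ by averaging sampled returns, assuming a toy deterministic two-state chain (all names, dynamics, and rewards here are illustrative, not a standard benchmark):

```python
# Toy sketch: estimate v(s) for a deterministic 2-state chain where every
# episode visits state 0 (reward +1), then state 1 (reward +1), then ends.
GAMMA = 0.5

def run_episode():
    # Fixed dynamics: (state, reward received on leaving that state)
    return [(0, 1.0), (1, 1.0)]

def mc_state_values(num_episodes=1000):
    returns = {0: [], 1: []}
    for _ in range(num_episodes):
        g = 0.0
        # Walk the episode backwards, accumulating the discounted return G_t
        for state, reward in reversed(run_episode()):
            g = reward + GAMMA * g
            returns[state].append(g)
    # v(s) is the average of the returns observed from s
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

v = mc_state_values()
# Deterministic chain, so the averages are exact: v(1) = 1, v(0) = 1 + 0.5 * 1 = 1.5
```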
Relationship
$v_\pi$ vs $q_\pi$

- $v_\pi(s)$: “How good is this state?” (averaged over what my policy would do)
- $q_\pi(s, a)$: “How good is taking this specific action in this state?”

For control (finding the best policy), we usually need $q_\pi$: acting greedily with respect to $q_\pi$ is just $\arg\max_a q_\pi(s, a)$, while acting greedily with respect to $v_\pi$ requires a one-step lookahead through the model $p(s', r \mid s, a)$, which we often don’t have.
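The asymmetry above can be shown in a few lines. A sketch with a hypothetical one-state model (`P`, `R`, `v`, `q` are toy stand-ins, not a real library API):

```python
# Why control is easier with q than with v: greedy action selection.
GAMMA = 0.9

# Hypothetical model: P[(s, a)] = next state, R[(s, a)] = reward
P = {(0, "left"): 0, (0, "right"): 1}
R = {(0, "left"): 0.0, (0, "right"): 1.0}
v = {0: 0.0, 1: 0.0}                        # state-value estimates
q = {(0, "left"): 0.0, (0, "right"): 1.0}   # action-value estimates

def greedy_from_q(s, actions=("left", "right")):
    # q alone suffices: pick the action with the highest q(s, a)
    return max(actions, key=lambda a: q[(s, a)])

def greedy_from_v(s, actions=("left", "right")):
    # v needs the model: one-step lookahead through P and R
    return max(actions, key=lambda a: R[(s, a)] + GAMMA * v[P[(s, a)]])

# Both pick "right" here, but greedy_from_v could not run without P and R.
```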
Optimal Value Functions
Optimal State-Value Function

$$v_*(s) = \max_\pi v_\pi(s)$$

The best possible value of state $s$ under any policy.

Optimal Action-Value Function

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

The best possible value of taking action $a$ in state $s$.

If we know $q_*$, the optimal Policy is trivially:

$$\pi_*(s) = \arg\max_a q_*(s, a)$$
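In the tabular case that argmax is a one-liner. A minimal sketch with an illustrative Q-table (states, actions, and values are made up):

```python
# Extract the greedy policy pi_*(s) = argmax_a q_*(s, a) from a Q-table.
q_star = {
    (0, "up"): 0.2, (0, "down"): 0.8,
    (1, "up"): 0.5, (1, "down"): 0.1,
}
states = {0, 1}
actions = ("up", "down")

# For each state, pick the action with the largest q_* value
pi_star = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
# pi_star == {0: "down", 1: "up"}
```

No model, no lookahead: the Q-table already encodes everything needed to act.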
Estimation Methods
| Method | How it estimates $v$ or $q$ |
|---|---|
| Dynamic Programming | Solves the Bellman Equation exactly (requires a model) |
| Monte Carlo Methods | Averages sampled returns |
| Temporal Difference Learning | Bootstraps: $V(s) \leftarrow V(s) + \alpha \, [R + \gamma V(s') - V(s)]$ |
| Function Approximation | Parameterized $\hat{v}(s; \mathbf{w})$ trained with SGD |
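The TD row in the table can be made concrete. A sketch of the tabular TD(0) update on a toy deterministic two-state chain (step size, discount, and dynamics are illustrative assumptions):

```python
# Tabular TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
ALPHA, GAMMA = 0.1, 0.5
V = {0: 0.0, 1: 0.0, "terminal": 0.0}

def td0_update(s, r, s_next):
    # The bracketed quantity is the TD error: target minus current estimate
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

# Replay the same deterministic transitions many times
for _ in range(2000):
    td0_update(0, 1.0, 1)              # 0 -> 1, reward +1
    td0_update(1, 1.0, "terminal")     # 1 -> terminal, reward +1
# V converges toward v(1) = 1 and v(0) = 1 + 0.5 * 1 = 1.5
```

Unlike Monte Carlo, each update uses the current estimate $V(s')$ instead of waiting for the full return, which is what “bootstrapping” means here.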
Key Properties
- Value functions satisfy the Bellman Equation (recursive relationship)
- There exists a partial ordering over policies defined by value functions: $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
- At least one policy is better than or equal to all others — the optimal policy
- All optimal policies share the same $v_*$ and $q_*$
Tabular vs Approximate
In tabular settings, $V$ is stored as a table with one entry per state. With Function Approximation, we instead use $\hat{v}(s; \mathbf{w})$, a parameterized function of a weight vector $\mathbf{w}$. The fundamental concept is the same, but convergence guarantees differ.
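A hedged sketch of the approximate case: $\hat{v}(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ trained by stochastic gradient descent toward Monte Carlo targets. The feature map, targets, and constants are toy assumptions (with one-hot features, linear approximation reduces back to a table):

```python
# Linear value-function approximation trained with SGD.
ALPHA = 0.05

def features(s):
    # One-hot features over 2 states; a real feature map would generalize
    x = [0.0, 0.0]
    x[s] = 1.0
    return x

def v_hat(w, s):
    # v_hat(s; w) = w . x(s)
    return sum(wi * xi for wi, xi in zip(w, features(s)))

w = [0.0, 0.0]
for _ in range(2000):
    # Assumed Monte Carlo return targets: G = 1.5 from state 0, G = 1.0 from state 1
    for s, g in ((0, 1.5), (1, 1.0)):
        x = features(s)
        error = g - v_hat(w, s)
        # Gradient of the squared error w.r.t. w is -error * x(s)
        w = [wi + ALPHA * error * xi for wi, xi in zip(w, x)]
# w approaches [1.5, 1.0], matching the tabular values
```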
Connections
- Defined on: Markov Decision Process
- Recursive structure: Bellman Equation
- Estimated by: Monte Carlo Methods, Temporal Difference Learning, Dynamic Programming
- Approximated by: Function Approximation, Linear Function Approximation, Deep Q-Network (DQN)
Appears In
- RL-L01 - Intro, MDPs & Bandits (definition, intuition)
- RL-L02 - Dynamic Programming (computation)
- RL-L05 - Tabular to Approximation (approximation)
- All exercise sets