Value Function
Definition
A value function estimates “how good” it is for an agent to be in a given state (or to take a given action in a state). “How good” is defined in terms of expected future rewards — specifically, the expected Return.
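Concretely, the Return being estimated is the discounted sum of future rewards (standard definition, with discount factor $\gamma \in [0, 1]$):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$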
There are two types:
State-Value Function

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The expected Return when starting in state $s$ and following policy $\pi$ thereafter.

Action-Value Function

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The expected Return when starting in state $s$, taking action $a$, and following $\pi$ thereafter.
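These definitions can be checked numerically. A minimal sketch that estimates $v_\pi(s)$ by averaging sampled returns, assuming a toy deterministic two-state chain (all names, dynamics, and rewards here are illustrative, not a standard benchmark):

```python
# Toy sketch: estimate v(s) for a deterministic 2-state chain where every
# episode visits state 0 (reward +1), then state 1 (reward +1), then ends.
GAMMA = 0.5

def run_episode():
    # Fixed dynamics: (state, reward received on leaving that state)
    return [(0, 1.0), (1, 1.0)]

def mc_state_values(num_episodes=1000):
    returns = {0: [], 1: []}
    for _ in range(num_episodes):
        g = 0.0
        # Walk the episode backwards, accumulating the discounted return G_t
        for state, reward in reversed(run_episode()):
            g = reward + GAMMA * g
            returns[state].append(g)
    # v(s) is the average of the returns observed from s
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

v = mc_state_values()
# Deterministic chain, so the averages are exact: v(1) = 1, v(0) = 1 + 0.5 * 1 = 1.5
```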
Relationship
$v_\pi$ vs $q_\pi$

- $v_\pi(s)$: “How good is this state?” (averaged over what my policy would do)
- $q_\pi(s, a)$: “How good is taking this specific action in this state?”

For control (finding the best policy), we usually need $q_\pi$: acting greedily with respect to $q_\pi$ is just $\arg\max_a q_\pi(s, a)$, while acting greedily with respect to $v_\pi$ requires a one-step lookahead through the model $p(s', r \mid s, a)$, which we often don’t have.
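The asymmetry above can be shown in a few lines. A sketch with a hypothetical one-state model (`P`, `R`, `v`, `q` are toy stand-ins, not a real library API):

```python
# Why control is easier with q than with v: greedy action selection.
GAMMA = 0.9

# Hypothetical model: P[(s, a)] = next state, R[(s, a)] = reward
P = {(0, "left"): 0, (0, "right"): 1}
R = {(0, "left"): 0.0, (0, "right"): 1.0}
v = {0: 0.0, 1: 0.0}                        # state-value estimates
q = {(0, "left"): 0.0, (0, "right"): 1.0}   # action-value estimates

def greedy_from_q(s, actions=("left", "right")):
    # q alone suffices: pick the action with the highest q(s, a)
    return max(actions, key=lambda a: q[(s, a)])

def greedy_from_v(s, actions=("left", "right")):
    # v needs the model: one-step lookahead through P and R
    return max(actions, key=lambda a: R[(s, a)] + GAMMA * v[P[(s, a)]])

# Both pick "right" here, but greedy_from_v could not run without P and R.
```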
Optimal Value Functions
Optimal State-Value Function

$$v_*(s) = \max_\pi v_\pi(s)$$

The best possible value of state $s$ under any policy.

Optimal Action-Value Function

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

The best possible value of taking action $a$ in state $s$.

If we know $q_*$, the optimal Policy is trivially:

$$\pi_*(s) = \arg\max_a q_*(s, a)$$
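In the tabular case that argmax is a one-liner. A minimal sketch with an illustrative Q-table (states, actions, and values are made up):

```python
# Extract the greedy policy pi_*(s) = argmax_a q_*(s, a) from a Q-table.
q_star = {
    (0, "up"): 0.2, (0, "down"): 0.8,
    (1, "up"): 0.5, (1, "down"): 0.1,
}
states = {0, 1}
actions = ("up", "down")

# For each state, pick the action with the largest q_* value
pi_star = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
# pi_star == {0: "down", 1: "up"}
```

No model, no lookahead: the Q-table already encodes everything needed to act.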
Estimation Methods
| Method | How it estimates $v$ or $q$ |
|---|---|
| Dynamic Programming | Solves the Bellman Equation exactly (requires a model) |
| Monte Carlo Methods | Averages sampled returns |
| Temporal Difference Learning | Bootstraps: $V(s) \leftarrow V(s) + \alpha \, [R + \gamma V(s') - V(s)]$ |
| Function Approximation | Parameterized $\hat{v}(s; \mathbf{w})$ trained with SGD |
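The TD row in the table can be made concrete. A sketch of the tabular TD(0) update on a toy deterministic two-state chain (step size, discount, and dynamics are illustrative assumptions):

```python
# Tabular TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
ALPHA, GAMMA = 0.1, 0.5
V = {0: 0.0, 1: 0.0, "terminal": 0.0}

def td0_update(s, r, s_next):
    # The bracketed quantity is the TD error: target minus current estimate
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

# Replay the same deterministic transitions many times
for _ in range(2000):
    td0_update(0, 1.0, 1)              # 0 -> 1, reward +1
    td0_update(1, 1.0, "terminal")     # 1 -> terminal, reward +1
# V converges toward v(1) = 1 and v(0) = 1 + 0.5 * 1 = 1.5
```

Unlike Monte Carlo, each update uses the current estimate $V(s')$ instead of waiting for the full return, which is what “bootstrapping” means here.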
Key Properties
- Value functions satisfy the Bellman Equation (recursive relationship)
- There exists a partial ordering over policies defined by value functions: $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
- At least one policy is better than or equal to all others — the optimal policy
- All optimal policies share the same $v_*$ and $q_*$
Tabular vs Approximate
In tabular settings, $V$ is stored as a table with one entry per state. With Function Approximation, we instead use $\hat{v}(s; \mathbf{w})$, a parameterized function of a weight vector $\mathbf{w}$. The fundamental concept is the same, but convergence guarantees differ.
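A hedged sketch of the approximate case: $\hat{v}(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ trained by stochastic gradient descent toward Monte Carlo targets. The feature map, targets, and constants are toy assumptions (with one-hot features, linear approximation reduces back to a table):

```python
# Linear value-function approximation trained with SGD.
ALPHA = 0.05

def features(s):
    # One-hot features over 2 states; a real feature map would generalize
    x = [0.0, 0.0]
    x[s] = 1.0
    return x

def v_hat(w, s):
    # v_hat(s; w) = w . x(s)
    return sum(wi * xi for wi, xi in zip(w, features(s)))

w = [0.0, 0.0]
for _ in range(2000):
    # Assumed Monte Carlo return targets: G = 1.5 from state 0, G = 1.0 from state 1
    for s, g in ((0, 1.5), (1, 1.0)):
        x = features(s)
        error = g - v_hat(w, s)
        # Gradient of the squared error w.r.t. w is -error * x(s)
        w = [wi + ALPHA * error * xi for wi, xi in zip(w, x)]
# w approaches [1.5, 1.0], matching the tabular values
```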
Connections
- Defined on: Markov Decision Process
- Recursive structure: Bellman Equation
- Estimated by: Monte Carlo Methods, Temporal Difference Learning, Dynamic Programming
- Approximated by: Function Approximation, Linear Function Approximation, Deep Q-Network (DQN)
Appears In
- RL-L01 - Intro, MDPs & Bandits (definition, intuition)
- RL-L02 - Dynamic Programming (computation)
- RL-L05 - Tabular to Approximation (approximation)
- All exercise sets