Chapter 3: Finite Markov Decision Processes
Overview
This chapter formalizes the problem of sequential decision making through the framework of Finite Markov Decision Processes (MDPs). Unlike bandit problems, MDPs involve an associative aspect—choosing different actions in different situations—and account for delayed rewards, where current actions influence future states and thus future rewards.
3.1 The Agent–Environment Interface
The Reinforcement Learning problem is framed as a continuous interaction between two entities:
- Agent: The learner and decision-maker.
- Environment: Everything outside the agent.
The Interaction Loop
At each discrete time step $t = 0, 1, 2, \ldots$:
- The agent receives a representation of the environment’s state, $S_t \in \mathcal{S}$.
- The agent selects an action, $A_t \in \mathcal{A}(s)$.
- One time step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.
This creates a trajectory: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$
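The interaction loop above can be sketched in a few lines of Python. The environment, its two states, and the random policy here are all hypothetical toy constructions, not anything from the chapter:

```python
import random

# A minimal sketch of the agent-environment loop, using a made-up
# two-state environment ("A", "B") and two actions ("go", "stay").
def step(state, action):
    """Environment: returns (reward, next_state) one time step later."""
    if action == "go":
        return (1.0, "B") if state == "A" else (0.0, "A")
    return (0.0, state)  # "stay" keeps the current state

def random_policy(state):
    return random.choice(["go", "stay"])

state = "A"
trajectory = [state]                 # S0, A0, R1, S1, A1, R2, S2, ...
for t in range(3):
    action = random_policy(state)    # agent acts based on S_t
    reward, state = step(state, action)
    trajectory += [action, reward, state]
print(trajectory)
```

The alternating state–action–reward pattern of the printed list mirrors the trajectory notation above.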
Dynamics Function
In a Finite MDP, the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are finite. The environment’s dynamics are completely defined by the four-argument probability function $p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$ for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$.
Derived Dynamics Functions
- State-transition probabilities: $p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$
- Expected rewards for state–action pairs: $r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
- Expected rewards for state–action–next-state triples: $r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \dfrac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
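Given a tabular four-argument dynamics function, the derived quantities are just sums over it. The following sketch uses a made-up two-state MDP; the table entries and names are illustrative, not from the chapter:

```python
# p(s', r | s, a) stored as a dict keyed by (s_next, reward, s, action).
# All numbers below are invented for illustration.
p = {
    ("A", 0.0, "A", "stay"): 1.0,
    ("B", 1.0, "A", "go"):   0.8,
    ("A", 0.0, "A", "go"):   0.2,
}

def transition_prob(s_next, s, a):
    """p(s' | s, a): marginalize the reward out of p(s', r | s, a)."""
    return sum(prob for (sn, r, st, ac), prob in p.items()
               if sn == s_next and st == s and ac == a)

def expected_reward(s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all outcomes."""
    return sum(r * prob for (sn, r, st, ac), prob in p.items()
               if st == s and ac == a)

print(transition_prob("B", "A", "go"))   # 0.8
print(expected_reward("A", "go"))        # 0.8*1.0 + 0.2*0.0 = 0.8
```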
The Markov Property
The state must include information about all aspects of the past interaction that matter for the future. If the transition probabilities depend only on the immediately preceding state and action, rather than on the full history, the state is said to have the Markov property.
Recycling Robot
A robot collects empty cans. States: battery levels, $\mathcal{S} = \{\text{high}, \text{low}\}$. Actions: $\mathcal{A} = \{\text{search}, \text{wait}, \text{recharge}\}$.
- Searching retrieves cans (reward) but drains battery.
- Waiting retrieves fewer cans but saves battery.
- Recharging is only possible when battery is low.
- Transition Graph: State nodes (large circles) and action nodes (small solid circles) show the transition probabilities $p(s' \mid s, a)$ and expected rewards $r(s, a, s')$.
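The robot’s dynamics fit in a small table. The sketch below assumes concrete values for the search/wait probabilities and rewards (the book leaves them as symbols such as $\alpha$, $\beta$, $r_{\text{search}}$, $r_{\text{wait}}$); the numbers are assumptions for illustration:

```python
# Assumed parameters: alpha/beta are the probabilities the battery level
# is unchanged after searching; the -3 is the penalty for being rescued.
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# p(s', r | s, a) as lists of (next_state, reward, probability) outcomes.
p = {
    ("high", "search"):   [("high", R_SEARCH, ALPHA), ("low", R_SEARCH, 1 - ALPHA)],
    ("low",  "search"):   [("low", R_SEARCH, BETA), ("high", -3.0, 1 - BETA)],
    ("high", "wait"):     [("high", R_WAIT, 1.0)],
    ("low",  "wait"):     [("low", R_WAIT, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

# Sanity check: outcome probabilities sum to 1 for every (s, a) pair.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```

Note that (search, low) can end in the high state only via the rescue-and-recharge outcome, which carries the negative reward.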
3.2 Goals and Rewards
The agent’s goal is to maximize the cumulative reward in the long run.
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Key Insight: Rewards should communicate what you want the agent to achieve, not how you want it achieved. Sub-goals (e.g., in Chess) should not be rewarded directly if they might conflict with the ultimate goal (winning).
3.3 Returns and Episodes
The Return, $G_t$, is the specific function of the reward sequence that the agent seeks to maximize.
Episodic Tasks
Tasks that break naturally into finite subsequences (episodes), each ending in a terminal state at time $T$. The return is the simple sum of rewards: $G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$.
Continuing Tasks
Tasks that go on without limit ($T = \infty$). We must use a discount factor $\gamma \in [0, 1)$ to keep the return finite: $G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
Recursive Relation of Returns
$G_t = R_{t+1} + \gamma G_{t+1}$
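The recursion makes returns cheap to compute from a reward sequence by sweeping backward. A minimal sketch, with an invented reward sequence:

```python
# Compute G_t = R_{t+1} + gamma * G_{t+1} for every t by working
# backward from the end of the episode.
def returns(rewards, gamma=0.9):
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g            # the recursive relation
        out.append(g)
    return list(reversed(out))       # out[t] == G_t

print(returns([1.0, 0.0, 2.0], gamma=0.5))   # [1.5, 1.0, 2.0]
```

Each return is computed in constant time per step, instead of re-summing the discounted tail for every $t$.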
3.4 Unified Notation
We unify episodic and continuing tasks by treating episode termination as entering a special absorbing state that transitions only to itself and generates zero rewards. This allows using the discounted sum formula for both.
3.5 Policies and Value Functions
A Policy, $\pi$, is a mapping from states to probabilities of selecting each possible action: $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$.
A Value Function estimates “how good” it is to be in a certain state or perform a certain action.
State-Value Function ($v_\pi$)
The expected return when starting in state $s$ and following policy $\pi$ thereafter: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$.
Action-Value Function ($q_\pi$)
The expected return starting from $s$, taking action $a$, and thereafter following policy $\pi$: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
The Bellman Equation for $v_\pi$
Values of states satisfy recursive relationships: the value of a state equals the expected immediate reward plus the discounted value of its successor states.
Bellman Equation
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_\pi(s')\,]$
Backup Diagrams
Backup diagrams show the flow of value information back to a state from its possible successors. For $v_\pi$:
- Root node: State $s$ (open circle).
- Branches: Actions chosen according to $\pi$.
- Leaves: Next states $s'$ (open circles) and rewards $r$ determined by the dynamics $p$. The value of $s$ is the average, over actions and outcomes, of the expected immediate reward plus the discounted value of the successor state.
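Treating the Bellman equation as an update rule gives iterative policy evaluation, previewed here as a minimal sketch on an invented two-state MDP (all names and numbers are hypothetical):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
# for v_pi as an assignment until the values converge.
GAMMA = 0.9
states, actions = ["A", "B"], ["left", "right"]
pi = {s: {a: 0.5 for a in actions} for s in states}  # equiprobable policy

def dynamics(s, a):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if s == "A":
        return ("B", 1.0) if a == "right" else ("A", 0.0)
    return ("A", 0.0) if a == "left" else ("B", 0.5)

v = {s: 0.0 for s in states}
for _ in range(1000):                       # sweep until convergence
    v = {s: sum(pi[s][a] * (r + GAMMA * v[s2])
                for a in actions
                for (s2, r) in [dynamics(s, a)])
         for s in states}
print(v)
```

At convergence the values satisfy the Bellman equation exactly, which is how the result can be checked.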
Gridworld
- Dynamics: 5x5 grid. Actions: N, S, E, W. Bumping into the edge leaves the state unchanged and gives a reward of -1.
- Special States: Transitions from A give +10 and land in A’. Transitions from B give +5 and land in B’.
- Value Function: Under the equiprobable random policy ($\pi(a \mid s) = 0.25$ for each action), states near the edges have lower or negative values, while states near A have high values.
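The gridworld values can be reproduced with the same Bellman-equation sweep. This is a sketch assuming A at grid position (0, 1) with A' at (4, 1), and B at (0, 3) with B' at (2, 3), and $\gamma = 0.9$:

```python
# Evaluate the equiprobable random policy on the 5x5 gridworld.
GAMMA, N = 0.9, 5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # N, S, W, E

def step(s, move):
    if s == (0, 1):
        return (4, 1), 10.0                          # state A -> A'
    if s == (0, 3):
        return (2, 3), 5.0                           # state B -> B'
    r, c = s[0] + move[0], s[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0                                   # bumped into the edge

v = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(2000):                                # sweep to convergence
    v = {s: sum(0.25 * (rew + GAMMA * v[s2])
                for m in MOVES
                for (s2, rew) in [step(s, m)])
         for s in v}
print(round(v[(0, 1)], 1))   # state A; the book reports about 8.8
```

Note that A’s value is below its immediate reward of +10 because every path from A passes through A', near the low-valued bottom edge.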
3.6 Optimal Policies and Optimal Value Functions
An Optimal Policy, $\pi_*$, is better than or equal to all other policies: $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$.
Optimal Value Functions
- Optimal State-Value: $v_*(s) \doteq \max_\pi v_\pi(s)$ for all $s \in \mathcal{S}$
- Optimal Action-Value: $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
- Note: $q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$
The Bellman Optimality Equation
Bellman Optimality Equation for $v_*$
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_*(s')\,]$
Bellman Optimality Equation for $q_*$
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma \max_{a'} q_*(s', a')\,]$
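Used as an update rule, the optimality equation for $v_*$ gives value iteration, sketched here on an invented two-state MDP (the dynamics and numbers are hypothetical):

```python
# Value iteration: apply the Bellman optimality equation for v* as an
# update, replacing the policy-weighted sum with a max over actions.
GAMMA = 0.9
actions = ["left", "right"]

def dynamics(s, a):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if s == "A":
        return ("B", 1.0) if a == "right" else ("A", 0.0)
    return ("A", 0.0) if a == "left" else ("B", 0.5)

v = {"A": 0.0, "B": 0.0}
for _ in range(1000):                        # sweep until convergence
    v = {s: max(r + GAMMA * v[s2]
                for a in actions
                for (s2, r) in [dynamics(s, a)])
         for s in v}
print(v)
```

The only change from policy evaluation is the `max` in place of the expectation over $\pi$, yet the fixed point is now $v_*$ rather than $v_\pi$.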
Greedy Policies
Once $v_*$ is known, the optimal policy is greedy with respect to it. Because $v_*$ already accounts for all future rewards, a local one-step search is sufficient for global optimality. With $q_*$, even the one-step search is unnecessary; the agent simply picks $\pi_*(s) = \arg\max_a q_*(s, a)$.
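With a $q_*$ table in hand, acting optimally is just a lookup plus an argmax; no model of the dynamics is needed. The table values below are made up for illustration:

```python
# Hypothetical q* table for a two-state, two-action MDP.
q_star = {
    ("A", "left"): 4.95, ("A", "right"): 5.5,
    ("B", "left"): 4.95, ("B", "right"): 5.0,
}

def greedy(state, q, actions=("left", "right")):
    """Pick the action with the highest q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy("A", q_star))   # right
```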
3.7 Optimality and Approximation
Finding the exact solution to the Bellman Optimality Equation is often impossible due to:
- Model Uncertainty: We don’t know the dynamics $p(s', r \mid s, a)$.
- Computational Constraints: The State Space is too large (e.g., Chess, Backgammon).
- Memory Constraints: We cannot store a table for all states.
RL methods focus on approximating $v_*$ and $q_*$, often prioritizing frequently visited states to make the best use of limited resources.
Summary
- MDPs provide a mathematical framework for goal-directed learning.
- Value functions ($v_\pi$ and $q_\pi$) are central, representing expected future returns.
- Bellman equations provide the recursive structure needed to compute these values.
- Optimality is an ideal reached through greedy behavior relative to optimal value functions.
- RL agents must usually settle for approximations due to the curse of dimensionality and unknown dynamics.