Chapter 3: Finite Markov Decision Processes
Overview
This chapter formalizes the problem of sequential decision making through the framework of Finite Markov Decision Processes (MDPs). Unlike bandit problems, MDPs involve an associative aspect—choosing different actions in different situations—and account for delayed rewards, where current actions influence future states and thus future rewards.
3.1 The Agent–Environment Interface
The Reinforcement Learning problem is framed as a continuous interaction between two entities:
- Agent: The learner and decision-maker.
- Environment: Everything outside the agent.
The Interaction Loop
At each discrete time step $t = 0, 1, 2, \ldots$:
- The agent receives a representation of the environment’s state, $S_t \in \mathcal{S}$.
- The agent selects an action, $A_t \in \mathcal{A}(s)$.
- One time step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.
This creates a trajectory: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$
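The interaction loop above can be sketched in a few lines of Python. The environment, its two states, and the random policy here are all hypothetical toy constructions, not anything from the chapter:

```python
import random

# A minimal sketch of the agent-environment loop, using a made-up
# two-state environment ("A", "B") and two actions ("go", "stay").
def step(state, action):
    """Environment: returns (reward, next_state) one time step later."""
    if action == "go":
        return (1.0, "B") if state == "A" else (0.0, "A")
    return (0.0, state)  # "stay" keeps the current state

def random_policy(state):
    return random.choice(["go", "stay"])

state = "A"
trajectory = [state]                 # S0, A0, R1, S1, A1, R2, S2, ...
for t in range(3):
    action = random_policy(state)    # agent acts based on S_t
    reward, state = step(state, action)
    trajectory += [action, reward, state]
print(trajectory)
```

The alternating state–action–reward pattern of the printed list mirrors the trajectory notation above.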
Dynamics Function
In a Finite MDP, the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are finite. The environment’s dynamics are completely defined by the four-argument probability function $p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$ for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$.
Derived Dynamics Functions
- State-transition probabilities: $p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$
- Expected rewards for state–action pairs: $r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
- Expected rewards for state–action–next-state triples: $r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \dfrac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
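Given a tabular four-argument dynamics function, the derived quantities are just sums over it. The following sketch uses a made-up two-state MDP; the table entries and names are illustrative, not from the chapter:

```python
# p(s', r | s, a) stored as a dict keyed by (s_next, reward, s, action).
# All numbers below are invented for illustration.
p = {
    ("A", 0.0, "A", "stay"): 1.0,
    ("B", 1.0, "A", "go"):   0.8,
    ("A", 0.0, "A", "go"):   0.2,
}

def transition_prob(s_next, s, a):
    """p(s' | s, a): marginalize the reward out of p(s', r | s, a)."""
    return sum(prob for (sn, r, st, ac), prob in p.items()
               if sn == s_next and st == s and ac == a)

def expected_reward(s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all outcomes."""
    return sum(r * prob for (sn, r, st, ac), prob in p.items()
               if st == s and ac == a)

print(transition_prob("B", "A", "go"))   # 0.8
print(expected_reward("A", "go"))        # 0.8*1.0 + 0.2*0.0 = 0.8
```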
The Markov Property
The state must include information about all aspects of the past interaction that matter for the future. If the transition probabilities depend only on the immediately preceding state and action, rather than on the full history, the state is said to have the Markov property.
Recycling Robot
A robot collects empty cans. States: battery levels, $\mathcal{S} = \{\text{high}, \text{low}\}$. Actions: $\mathcal{A} = \{\text{search}, \text{wait}, \text{recharge}\}$.
- Searching retrieves cans (reward) but drains battery.
- Waiting retrieves fewer cans but saves battery.
- Recharging is only possible when battery is low.
- Transition Graph: State nodes (large circles) and action nodes (small solid circles) show the transition probabilities $p(s' \mid s, a)$ and expected rewards $r(s, a, s')$.
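The robot’s dynamics fit in a small table. The sketch below assumes concrete values for the search/wait probabilities and rewards (the book leaves them as symbols such as $\alpha$, $\beta$, $r_{\text{search}}$, $r_{\text{wait}}$); the numbers are assumptions for illustration:

```python
# Assumed parameters: alpha/beta are the probabilities the battery level
# is unchanged after searching; the -3 is the penalty for being rescued.
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# p(s', r | s, a) as lists of (next_state, reward, probability) outcomes.
p = {
    ("high", "search"):   [("high", R_SEARCH, ALPHA), ("low", R_SEARCH, 1 - ALPHA)],
    ("low",  "search"):   [("low", R_SEARCH, BETA), ("high", -3.0, 1 - BETA)],
    ("high", "wait"):     [("high", R_WAIT, 1.0)],
    ("low",  "wait"):     [("low", R_WAIT, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

# Sanity check: outcome probabilities sum to 1 for every (s, a) pair.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```

Note that (search, low) can end in the high state only via the rescue-and-recharge outcome, which carries the negative reward.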
3.2 Goals and Rewards
The agent’s goal is to maximize the cumulative reward in the long run.
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Key Insight: Rewards should communicate what you want the agent to achieve, not how you want it achieved. Sub-goals (e.g., in Chess) should not be rewarded directly if they might conflict with the ultimate goal (winning).
3.3 Returns and Episodes
The Return, $G_t$, is the specific function of the reward sequence that the agent seeks to maximize.
Episodic Tasks
Tasks that break naturally into finite subsequences (episodes), each ending in a terminal state at time $T$. The return is the simple sum of rewards: $G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$.
Continuing Tasks
Tasks that go on without limit ($T = \infty$). We must use a discount factor $\gamma \in [0, 1)$ to keep the return finite: $G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
Recursive Relation of Returns
$G_t = R_{t+1} + \gamma G_{t+1}$
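The recursion makes returns cheap to compute from a reward sequence by sweeping backward. A minimal sketch, with an invented reward sequence:

```python
# Compute G_t = R_{t+1} + gamma * G_{t+1} for every t by working
# backward from the end of the episode.
def returns(rewards, gamma=0.9):
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g            # the recursive relation
        out.append(g)
    return list(reversed(out))       # out[t] == G_t

print(returns([1.0, 0.0, 2.0], gamma=0.5))   # [1.5, 1.0, 2.0]
```

Each return is computed in constant time per step, instead of re-summing the discounted tail for every $t$.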
3.4 Unified Notation
We unify episodic and continuing tasks by treating episode termination as entering a special absorbing state that transitions only to itself and generates zero rewards. This allows using the discounted sum formula for both.
3.5 Policies and Value Functions
A Policy, $\pi$, is a mapping from states to probabilities of selecting each possible action: $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$.
A Value Function estimates “how good” it is to be in a certain state or perform a certain action.
State-Value Function ($v_\pi$)
The expected return when starting in state $s$ and following policy $\pi$ thereafter: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$.
Action-Value Function ($q_\pi$)
The expected return starting from $s$, taking action $a$, and thereafter following policy $\pi$: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
The Bellman Equation for $v_\pi$
Values of states satisfy recursive relationships: the value of a state equals the expected immediate reward plus the discounted value of its successor states.
Bellman Equation
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_\pi(s')\,]$
Backup Diagrams
Backup diagrams show the flow of value information back to a state from its possible successors. For $v_\pi$:
- Root node: State $s$ (open circle).
- Branches: Actions chosen according to $\pi$.
- Leaves: Next states $s'$ (open circles) and rewards $r$ determined by the dynamics $p$. The value of $s$ is the average, over actions and outcomes, of the expected immediate reward plus the discounted value of the successor state.
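Treating the Bellman equation as an update rule gives iterative policy evaluation, previewed here as a minimal sketch on an invented two-state MDP (all names and numbers are hypothetical):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
# for v_pi as an assignment until the values converge.
GAMMA = 0.9
states, actions = ["A", "B"], ["left", "right"]
pi = {s: {a: 0.5 for a in actions} for s in states}  # equiprobable policy

def dynamics(s, a):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if s == "A":
        return ("B", 1.0) if a == "right" else ("A", 0.0)
    return ("A", 0.0) if a == "left" else ("B", 0.5)

v = {s: 0.0 for s in states}
for _ in range(1000):                       # sweep until convergence
    v = {s: sum(pi[s][a] * (r + GAMMA * v[s2])
                for a in actions
                for (s2, r) in [dynamics(s, a)])
         for s in states}
print(v)
```

At convergence the values satisfy the Bellman equation exactly, which is how the result can be checked.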
Gridworld
- Dynamics: 5x5 grid. Actions: N, S, E, W. Bumping into the edge leaves the state unchanged and gives a reward of -1.
- Special States: Transitions from A give +10 and land in A’. Transitions from B give +5 and land in B’.
- Value Function: Under the equiprobable random policy ($\pi(a \mid s) = 0.25$ for each action), states near the edges have lower or negative values, while states near A have high values.
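The gridworld values can be reproduced with the same Bellman-equation sweep. This is a sketch assuming A at grid position (0, 1) with A' at (4, 1), and B at (0, 3) with B' at (2, 3), and $\gamma = 0.9$:

```python
# Evaluate the equiprobable random policy on the 5x5 gridworld.
GAMMA, N = 0.9, 5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # N, S, W, E

def step(s, move):
    if s == (0, 1):
        return (4, 1), 10.0                          # state A -> A'
    if s == (0, 3):
        return (2, 3), 5.0                           # state B -> B'
    r, c = s[0] + move[0], s[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0                                   # bumped into the edge

v = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(2000):                                # sweep to convergence
    v = {s: sum(0.25 * (rew + GAMMA * v[s2])
                for m in MOVES
                for (s2, rew) in [step(s, m)])
         for s in v}
print(round(v[(0, 1)], 1))   # state A; the book reports about 8.8
```

Note that A’s value is below its immediate reward of +10 because every path from A passes through A', near the low-valued bottom edge.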
3.6 Optimal Policies and Optimal Value Functions
An Optimal Policy, $\pi_*$, is better than or equal to all other policies: $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$.
Optimal Value Functions
- Optimal State-Value: $v_*(s) \doteq \max_\pi v_\pi(s)$ for all $s \in \mathcal{S}$
- Optimal Action-Value: $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
- Note: $q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$
The Bellman Optimality Equation
Bellman Optimality Equation for $v_*$
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_*(s')\,]$
Bellman Optimality Equation for $q_*$
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma \max_{a'} q_*(s', a')\,]$
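Used as an update rule, the optimality equation for $v_*$ gives value iteration, sketched here on an invented two-state MDP (the dynamics and numbers are hypothetical):

```python
# Value iteration: apply the Bellman optimality equation for v* as an
# update, replacing the policy-weighted sum with a max over actions.
GAMMA = 0.9
actions = ["left", "right"]

def dynamics(s, a):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if s == "A":
        return ("B", 1.0) if a == "right" else ("A", 0.0)
    return ("A", 0.0) if a == "left" else ("B", 0.5)

v = {"A": 0.0, "B": 0.0}
for _ in range(1000):                        # sweep until convergence
    v = {s: max(r + GAMMA * v[s2]
                for a in actions
                for (s2, r) in [dynamics(s, a)])
         for s in v}
print(v)
```

The only change from policy evaluation is the `max` in place of the expectation over $\pi$, yet the fixed point is now $v_*$ rather than $v_\pi$.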
Greedy Policies
Once $v_*$ is known, the optimal policy is greedy with respect to it. Because $v_*$ already accounts for all future rewards, a local one-step search is sufficient for global optimality. With $q_*$, even the one-step search is unnecessary; the agent simply picks $\pi_*(s) = \arg\max_a q_*(s, a)$.
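With a $q_*$ table in hand, acting optimally is just a lookup plus an argmax; no model of the dynamics is needed. The table values below are made up for illustration:

```python
# Hypothetical q* table for a two-state, two-action MDP.
q_star = {
    ("A", "left"): 4.95, ("A", "right"): 5.5,
    ("B", "left"): 4.95, ("B", "right"): 5.0,
}

def greedy(state, q, actions=("left", "right")):
    """Pick the action with the highest q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy("A", q_star))   # right
```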
3.7 Optimality and Approximation
Finding the exact solution to the Bellman Optimality Equation is often impossible due to:
- Model Uncertainty: We don’t know the dynamics $p(s', r \mid s, a)$.
- Computational Constraints: The State Space is too large (e.g., Chess, Backgammon).
- Memory Constraints: We cannot store a table for all states.
RL methods focus on approximating $v_*$ and $q_*$, often prioritizing frequently visited states to make the best use of limited resources.
Summary
- MDPs provide a mathematical framework for goal-directed learning.
- Value functions ($v_\pi$ and $q_\pi$) are central, representing expected future returns.
- Bellman equations provide the recursive structure needed to compute these values.
- Optimality is an ideal reached through greedy behavior relative to optimal value functions.
- RL agents must usually settle for approximations due to the curse of dimensionality and unknown dynamics.