RL Lecture 2: Dynamic Programming

Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). While classical DP is limited by the requirement of a model and computational cost, it provides the theoretical foundation for almost all other RL methods, which can be viewed as approximations of DP.

Key Assumption

DP assumes the environment is a finite MDP with known dynamics p(s', r | s, a).


1. Unified Notation & Recap

To handle both episodic and continuing tasks simultaneously, we use a single notation for Returns:

  • For episodic tasks, we consider termination as entering a special absorbing state that transitions only to itself with reward 0.
  • This allows us to use the infinite sum notation G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}, with the possibility of γ = 1 if all episodes eventually terminate.

Discounted Return

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ = Σ_{k=0}^∞ γ^k R_{t+k+1}

where:

  • G_t — the total discounted return from time t.
  • γ ∈ [0, 1] — Discount Factor.
  • R_{t+k+1} — reward received k+1 steps after time t.
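The recursive identity G_t = R_{t+1} + γ G_{t+1} gives a simple backward way to compute a return. A minimal sketch in Python (the function name and reward sequence are illustrative, not from the lecture):

```python
# Compute G_0 for a finite reward sequence by folding backwards with
# G_t = R_{t+1} + gamma * G_{t+1} (rewards after termination are 0).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```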

2. Policy Evaluation (Prediction)

Policy Evaluation is the computation of the state-value function v_π for an arbitrary policy π.

Bellman Equation for v_π

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]

Term-by-term explanation:

  • v_π(s): Value of state s under policy π.
  • Σ_a π(a|s): Expectation over actions a chosen by policy π in state s.
  • Σ_{s',r} p(s', r|s, a): Expectation over next states s' and rewards r given action a (dynamics).
  • r + γ v_π(s'): Immediate reward plus discounted value of the next state.

Iterative Policy Evaluation

If the dynamics are known, the Bellman equation forms a system of |S| simultaneous linear equations in |S| unknowns. This is solvable directly, but in practice we solve it iteratively.

Iterative Update Rule

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_k(s')]   for all s ∈ S

This is an expected update. The sequence {v_k} converges to v_π as k → ∞ (under γ < 1 or guaranteed termination).

Algorithm: Iterative Policy Evaluation

Algorithm: Iterative Policy Evaluation, for estimating V ≈ v_π
─────────────────────────────────────────────────────────────
Input: π (policy to be evaluated)
Parameter: θ > 0 (small threshold determining accuracy)
Initialize V(s) arbitrarily (e.g., 0), for all s ∈ S; V(terminal) = 0
 
Loop:
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
    Δ ← max(Δ, |v - V(s)|)
until Δ < θ
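A minimal Python sketch of this algorithm on a hypothetical two-state chain (states 0 → 1 → terminal state 2, reward -1 per step; the dynamics dict P and uniform-random policy are made up for illustration):

```python
# p(s', r | s, a) for a tiny chain MDP: each entry maps (state, action)
# to a list of (probability, next_state, reward) outcomes. State 2 is terminal.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def policy_evaluation(P, policy, states, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    while True:
        delta = 0.0
        for s in states:                       # one sweep over the state set
            v = V[s]
            V[s] = sum(prob_a * sum(p * (r + gamma * V[s2])
                                    for p, s2, r in P[(s, a)])
                       for a, prob_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                      # stop when a sweep barely changes V
            return V

uniform = {s: {'right': 0.5, 'stay': 0.5} for s in (0, 1)}
V = policy_evaluation(P, uniform, states=[0, 1], terminal=2)
print(V)  # roughly {0: -4.0, 1: -2.0, 2: 0.0}
```

Under the random policy each state's value is the negative expected number of steps to termination, consistent with the gridworld example later in these notes.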

Backup Diagram for v_π

       (s)          <-- State being updated
      / | \
     o  o  o        <-- Actions (a) chosen by policy π
    / \
  (s') (s')         <-- Possible next states (s') from dynamics p

The value flows back from the leaf nodes (s') through the actions to the root (s).


3. Policy Improvement

Once we have v_π, we want to find a better policy π'.

Policy Improvement Theorem

Let π and π' be any pair of deterministic policies such that, for all s ∈ S:

  q_π(s, π'(s)) ≥ v_π(s)

Then π' must be as good as, or better than, π:

  v_{π'}(s) ≥ v_π(s)   for all s ∈ S

Proof Sketch

We start with v_π(s) ≤ q_π(s, π'(s)) and repeatedly expand the right side:

  1. q_π(s, π'(s)) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = π'(s)]; apply the inequality again to v_π(S_{t+1}).
  2. Continue expanding, accumulating the discounted expectation of rewards under π'.
  3. In the limit, the expansion yields v_{π'}(s), giving v_π(s) ≤ v_{π'}(s).

Greedy Policy Improvement

A natural candidate for π' is the greedy policy with respect to v_π:

π'(s) = arg max_a q_π(s, a) = arg max_a Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]
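In code, greedy improvement is just an arg max over one-step lookahead values. A sketch using a hypothetical dynamics dict mapping (state, action) to (probability, next_state, reward) outcomes:

```python
# Greedy improvement: pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')].
# The two-state chain MDP below (terminal state 2) is purely illustrative.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def greedy(P, V, states, actions, gamma=1.0):
    pi = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
             for a in actions}                 # one-step lookahead q(s, a)
        pi[s] = max(q, key=q.get)
    return pi

V = {0: -4.0, 1: -2.0, 2: 0.0}                 # v_pi of the uniform-random policy
print(greedy(P, V, states=[0, 1], actions=['right', 'stay']))
# {0: 'right', 1: 'right'}
```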


4. Policy Iteration

Policy Iteration is the alternating sequence of Evaluation and Improvement.

The PI Loop

π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ⋯ →(I) π_* →(E) v_*

Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy π_* in a finite number of iterations.

Algorithm: Policy Iteration

Algorithm: Policy Iteration (for finding π ≈ π*)
─────────────────────────────────────────────────────────────
1. Initialization
   V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S; θ > 0 (small threshold)
 
2. Policy Evaluation
   Loop:
     Δ ← 0
     For each s ∈ S:
       v ← V(s)
       V(s) ← Σ_{s',r} p(s', r|s, π(s)) [r + γ V(s')]
       Δ ← max(Δ, |v - V(s)|)
   until Δ < θ
 
3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
     old-action ← π(s)
     π(s) ← arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
     If old-action ≠ π(s), then policy-stable ← false
   If policy-stable, then stop; else go to step 2
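The full loop can be sketched in Python on a hypothetical two-state chain (state 2 terminal, reward -1 per step; the dynamics dict is made up for illustration):

```python
# Policy Iteration: alternate full evaluation of the current deterministic
# policy pi with greedy improvement, until pi stops changing.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def policy_iteration(P, states, actions, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    # With gamma = 1 the initial policy must reach the terminal state;
    # here actions[0] = 'right' does, so evaluation converges.
    pi = {s: actions[0] for s in states}
    while True:
        while True:                            # 2. Policy Evaluation
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, pi[s])])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        stable = True                          # 3. Policy Improvement
        for s in states:
            old = pi[s]
            pi[s] = max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                                   for p, s2, r in P[(s, a)]))
            if pi[s] != old:
                stable = False
        if stable:
            return pi, V

pi, V = policy_iteration(P, states=[0, 1], actions=['right', 'stay'], terminal=2)
print(pi)  # {0: 'right', 1: 'right'}
```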

5. Value Iteration

One drawback of Policy Iteration is that each evaluation step itself requires an iterative process. Value Iteration combines evaluation and improvement into a single sweep.

Value Iteration Update Rule

v_{k+1}(s) = max_a Σ_{s',r} p(s', r|s, a) [r + γ v_k(s')]   for all s ∈ S

This turns the Bellman Optimality Equation into an update rule.

Algorithm: Value Iteration

Algorithm: Value Iteration, for estimating π ≈ π*
─────────────────────────────────────────────────────────────
Initialize V arbitrarily (e.g., V(s) = 0), V(terminal) = 0
 
Loop:
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
    Δ ← max(Δ, |v - V(s)|)
until Δ < θ
 
Output a deterministic policy π, such that
π(s) = arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
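A sketch of this algorithm in Python, folding the max over actions into each sweep and extracting the greedy policy once at the end (the two-state chain dynamics are hypothetical):

```python
# Value Iteration: Bellman optimality backups in-place, one sweep at a time.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def value_iteration(P, states, actions, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                       for a in actions)       # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    pi = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[(s, a)]))
          for s in states}
    return pi, V

pi, V = value_iteration(P, states=[0, 1], actions=['right', 'stay'], terminal=2)
print(V)  # {0: -2.0, 1: -1.0, 2: 0.0}
```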

Relation to Policy Iteration

Value Iteration is equivalent to Policy Iteration with only one sweep of policy evaluation between improvement steps.


6. Generalized Policy Iteration (GPI)

GPI is the general framework describing any interaction of policy evaluation and policy improvement.

  • Evaluation: Makes the value function consistent with the current policy.
  • Improvement: Makes the policy greedy with respect to the current value function.

The GPI Diagram

Imagine two lines (or manifolds) in the value-policy space:

  1. v = v_π: where the value function is consistent with the current policy.
  2. π = greedy(v): where the policy is greedy with respect to the current value function.

Evaluation pulls us toward line 1; Improvement pulls us toward line 2. They compete and eventually intersect at the joint fixed point (v_*, π_*).
graph TD
    E[Policy Evaluation] --> |Estimate V| I[Policy Improvement]
    I --> |Greedy Policy| E
    style E fill:#f9f,stroke:#333
    style I fill:#ccf,stroke:#333

7. Asynchronous Dynamic Programming

A major drawback of classical DP is that it involves “sweeps” over the entire state set (updates all states). Asynchronous DP updates states in any order.

  • Key idea: Use whatever values are available.
  • Benefits: Can focus computation on “important” or frequently visited states. No need to wait for a full sweep to finish.
  • Requirement: Must continue to update all states (none can be ignored forever) for convergence to .
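The idea can be sketched by applying Bellman optimality backups to single, randomly chosen states instead of full sweeps (toy two-state chain dynamics assumed, as elsewhere these are illustrative):

```python
import random

# Asynchronous DP sketch: back up one state at a time, in arbitrary order,
# using whatever values of the other states are currently available.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}
states, actions = [0, 1], ['right', 'stay']
V = {0: 0.0, 1: 0.0, 2: 0.0}                   # state 2 is terminal

rng = random.Random(0)
for _ in range(1000):
    s = rng.choice(states)                     # any order, as long as no state is starved
    V[s] = max(sum(p * (r + 1.0 * V[s2]) for p, s2, r in P[(s, a)])
               for a in actions)

print(V)  # converges to {0: -2.0, 1: -1.0, 2: 0.0}
```

Because every state keeps being selected with positive probability, the values settle at the same fixed point a full-sweep method would reach.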

8. Summary of Bellman Equations

Name                       Formula
─────────────────────────────────────────────────────────────
Bellman Expectation (v_π)  v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]
Bellman Expectation (q_π)  q_π(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ Σ_{a'} π(a'|s') q_π(s', a')]
Bellman Optimality (v_*)   v_*(s) = max_a Σ_{s',r} p(s', r|s, a) [r + γ v_*(s')]
Bellman Optimality (q_*)   q_*(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ max_{a'} q_*(s', a')]

9. Examples from Lecture

9.1 Gridworld (Iterative Policy Evaluation)

A 4x4 grid where transitions have reward -1 until termination.

  • Under a random policy, the values represent the negative expected number of steps to the exit.
  • After a single policy improvement step, the greedy policy already becomes optimal for this simple grid.

9.2 Transition Graph Stochasticity

Recycling Robot

Arcs are labeled with transition probability p(s'|s, a) and expected reward r(s, a, s').

  • If the battery is low, the action search either stays in low (prob β) or depletes the battery and requires “rescue” (transition to high, prob 1 − β, reward -3).
  • This complexity is handled inherently by the expectation over p(s', r|s, a).

Big Picture of Policy Learning

  1. Model-based: Learn the dynamics, then plan (DP).
  2. Model-free Value-based: Learn v or q directly (TD, Q-learning).
  3. Model-free Policy-based: Directly optimize the policy π (Policy Gradient).


Book Reference: Sutton & Barto Ch 3.4-3.8, Ch 4.