RL Lecture 2: Dynamic Programming

Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). While classical DP is limited by the requirement of a model and computational cost, it provides the theoretical foundation for almost all other RL methods, which can be viewed as approximations of DP.

Key Assumption

DP assumes the environment is a finite MDP with known dynamics p(s', r | s, a).


1. Unified Notation & Recap

To handle both episodic and continuing tasks simultaneously, we use a single notation for Returns:

  • For episodic tasks, we consider termination as entering a special absorbing state that transitions only to itself with reward 0.
  • This allows us to use the infinite sum notation G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}, with the possibility of γ = 1 if all episodes eventually terminate.

Discounted Return

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ = Σ_{k=0}^∞ γ^k R_{t+k+1}

where:

  • G_t — the total discounted return from time t.
  • γ ∈ [0, 1] — Discount Factor.
  • R_{t+k+1} — reward received k+1 steps after time t.
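The recursive identity G_t = R_{t+1} + γ G_{t+1} gives a simple backward way to compute a return. A minimal sketch in Python (the function name and reward sequence are illustrative, not from the lecture):

```python
# Compute G_0 for a finite reward sequence by folding backwards with
# G_t = R_{t+1} + gamma * G_{t+1} (rewards after termination are 0).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```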

2. Policy Evaluation (Prediction)

Policy Evaluation is the computation of the state-value function v_π for an arbitrary policy π.

Bellman Equation for v_π

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]

Term-by-term explanation:

  • v_π(s): Value of state s under policy π.
  • Σ_a π(a|s): Expectation over actions a chosen by policy π in state s.
  • Σ_{s',r} p(s', r|s, a): Expectation over next states s' and rewards r given action a (dynamics).
  • r + γ v_π(s'): Immediate reward plus discounted value of the next state.

Iterative Policy Evaluation

If the dynamics are known, the Bellman equation forms a system of |S| simultaneous linear equations in |S| unknowns. This is solvable directly, but in practice we solve it iteratively.

Iterative Update Rule

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_k(s')]   for all s ∈ S

This is an expected update. The sequence {v_k} converges to v_π as k → ∞ (under γ < 1 or guaranteed termination).

Algorithm: Iterative Policy Evaluation

Algorithm: Iterative Policy Evaluation, for estimating V ≈ v_π
─────────────────────────────────────────────────────────────
Input: π (policy to be evaluated)
Parameter: θ > 0 (small threshold determining accuracy)
Initialize V(s) arbitrarily (e.g., 0), for all s ∈ S; V(terminal) = 0
 
Loop:
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
    Δ ← max(Δ, |v - V(s)|)
until Δ < θ
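A minimal Python sketch of this algorithm on a hypothetical two-state chain (states 0 → 1 → terminal state 2, reward -1 per step; the dynamics dict P and uniform-random policy are made up for illustration):

```python
# p(s', r | s, a) for a tiny chain MDP: each entry maps (state, action)
# to a list of (probability, next_state, reward) outcomes. State 2 is terminal.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def policy_evaluation(P, policy, states, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    while True:
        delta = 0.0
        for s in states:                       # one sweep over the state set
            v = V[s]
            V[s] = sum(prob_a * sum(p * (r + gamma * V[s2])
                                    for p, s2, r in P[(s, a)])
                       for a, prob_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                      # stop when a sweep barely changes V
            return V

uniform = {s: {'right': 0.5, 'stay': 0.5} for s in (0, 1)}
V = policy_evaluation(P, uniform, states=[0, 1], terminal=2)
print(V)  # roughly {0: -4.0, 1: -2.0, 2: 0.0}
```

Under the random policy each state's value is the negative expected number of steps to termination, consistent with the gridworld example later in these notes.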

Backup Diagram for v_π

       (s)          <-- State being updated
      / | \
     o  o  o        <-- Actions (a) chosen by policy π
    / \
  (s') (s')         <-- Possible next states (s') from dynamics p

The value flows back from the leaf nodes (s') through the actions to the root (s).


3. Policy Improvement

Once we have v_π, we want to find a better policy π'.

Policy Improvement Theorem

Let π and π' be any pair of deterministic policies such that, for all s ∈ S:

  q_π(s, π'(s)) ≥ v_π(s)

Then π' must be as good as, or better than, π:

  v_{π'}(s) ≥ v_π(s)   for all s ∈ S

Proof Sketch

We start with v_π(s) ≤ q_π(s, π'(s)) and repeatedly expand the right side:

  1. q_π(s, π'(s)) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = π'(s)]; apply the inequality again to v_π(S_{t+1}).
  2. Continue expanding, accumulating the discounted expectation of rewards under π'.
  3. In the limit, the expansion yields v_{π'}(s), giving v_π(s) ≤ v_{π'}(s).

Greedy Policy Improvement

A natural candidate for π' is the greedy policy with respect to v_π:

π'(s) = arg max_a q_π(s, a) = arg max_a Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]
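In code, greedy improvement is just an arg max over one-step lookahead values. A sketch using a hypothetical dynamics dict mapping (state, action) to (probability, next_state, reward) outcomes:

```python
# Greedy improvement: pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')].
# The two-state chain MDP below (terminal state 2) is purely illustrative.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def greedy(P, V, states, actions, gamma=1.0):
    pi = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
             for a in actions}                 # one-step lookahead q(s, a)
        pi[s] = max(q, key=q.get)
    return pi

V = {0: -4.0, 1: -2.0, 2: 0.0}                 # v_pi of the uniform-random policy
print(greedy(P, V, states=[0, 1], actions=['right', 'stay']))
# {0: 'right', 1: 'right'}
```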


4. Policy Iteration

Policy Iteration is the alternating sequence of Evaluation and Improvement.

The PI Loop

π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ⋯ →(I) π_* →(E) v_*

Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy π_* in a finite number of iterations.

Algorithm: Policy Iteration

Algorithm: Policy Iteration (for finding π ≈ π*)
─────────────────────────────────────────────────────────────
1. Initialization
   V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S; θ > 0 (small threshold)
 
2. Policy Evaluation
   Loop:
     Δ ← 0
     For each s ∈ S:
       v ← V(s)
       V(s) ← Σ_{s',r} p(s', r|s, π(s)) [r + γ V(s')]
       Δ ← max(Δ, |v - V(s)|)
   until Δ < θ
 
3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
     old-action ← π(s)
     π(s) ← arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
     If old-action ≠ π(s), then policy-stable ← false
   If policy-stable, then stop; else go to step 2
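The full loop can be sketched in Python on a hypothetical two-state chain (state 2 terminal, reward -1 per step; the dynamics dict is made up for illustration):

```python
# Policy Iteration: alternate full evaluation of the current deterministic
# policy pi with greedy improvement, until pi stops changing.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def policy_iteration(P, states, actions, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    # With gamma = 1 the initial policy must reach the terminal state;
    # here actions[0] = 'right' does, so evaluation converges.
    pi = {s: actions[0] for s in states}
    while True:
        while True:                            # 2. Policy Evaluation
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, pi[s])])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        stable = True                          # 3. Policy Improvement
        for s in states:
            old = pi[s]
            pi[s] = max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                                   for p, s2, r in P[(s, a)]))
            if pi[s] != old:
                stable = False
        if stable:
            return pi, V

pi, V = policy_iteration(P, states=[0, 1], actions=['right', 'stay'], terminal=2)
print(pi)  # {0: 'right', 1: 'right'}
```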

5. Value Iteration

One drawback of Policy Iteration is that each evaluation step itself requires an iterative process. Value Iteration combines evaluation and improvement into a single sweep.

Value Iteration Update Rule

v_{k+1}(s) = max_a Σ_{s',r} p(s', r|s, a) [r + γ v_k(s')]   for all s ∈ S

This turns the Bellman Optimality Equation into an update rule.

Algorithm: Value Iteration

Algorithm: Value Iteration, for estimating π ≈ π*
─────────────────────────────────────────────────────────────
Initialize V arbitrarily (e.g., V(s) = 0), V(terminal) = 0
 
Loop:
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
    Δ ← max(Δ, |v - V(s)|)
until Δ < θ
 
Output a deterministic policy π, such that
π(s) = arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
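A sketch of this algorithm in Python, folding the max over actions into each sweep and extracting the greedy policy once at the end (the two-state chain dynamics are hypothetical):

```python
# Value Iteration: Bellman optimality backups in-place, one sweep at a time.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}

def value_iteration(P, states, actions, terminal, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in states}
    V[terminal] = 0.0
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                       for a in actions)       # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    pi = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[(s, a)]))
          for s in states}
    return pi, V

pi, V = value_iteration(P, states=[0, 1], actions=['right', 'stay'], terminal=2)
print(V)  # {0: -2.0, 1: -1.0, 2: 0.0}
```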

Relation to Policy Iteration

Value Iteration is equivalent to Policy Iteration with only one sweep of policy evaluation between improvement steps.


6. Generalized Policy Iteration (GPI)

GPI is the general framework describing any interaction of policy evaluation and policy improvement.

  • Evaluation: Makes the value function consistent with the current policy.
  • Improvement: Makes the policy greedy with respect to the current value function.

The GPI Diagram

Imagine two lines (or manifolds) in the value-policy space:

  1. v = v_π: where the value function is consistent with the current policy.
  2. π = greedy(v): where the policy is greedy with respect to the current value function.

Evaluation pulls us toward line 1; Improvement pulls us toward line 2. They compete and eventually intersect at the joint fixed point (v_*, π_*).
graph TD
    E[Policy Evaluation] --> |Estimate V| I[Policy Improvement]
    I --> |Greedy Policy| E
    style E fill:#f9f,stroke:#333
    style I fill:#ccf,stroke:#333

7. Asynchronous Dynamic Programming

A major drawback of classical DP is that it involves “sweeps” over the entire state set (updates all states). Asynchronous DP updates states in any order.

  • Key idea: Use whatever values are available.
  • Benefits: Can focus computation on “important” or frequently visited states. No need to wait for a full sweep to finish.
  • Requirement: Must continue to update all states (none can be ignored forever) for convergence to .
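The idea can be sketched by applying Bellman optimality backups to single, randomly chosen states instead of full sweeps (toy two-state chain dynamics assumed, as elsewhere these are illustrative):

```python
import random

# Asynchronous DP sketch: back up one state at a time, in arbitrary order,
# using whatever values of the other states are currently available.
P = {
    (0, 'right'): [(1.0, 1, -1.0)], (0, 'stay'): [(1.0, 0, -1.0)],
    (1, 'right'): [(1.0, 2, -1.0)], (1, 'stay'): [(1.0, 1, -1.0)],
}
states, actions = [0, 1], ['right', 'stay']
V = {0: 0.0, 1: 0.0, 2: 0.0}                   # state 2 is terminal

rng = random.Random(0)
for _ in range(1000):
    s = rng.choice(states)                     # any order, as long as no state is starved
    V[s] = max(sum(p * (r + 1.0 * V[s2]) for p, s2, r in P[(s, a)])
               for a in actions)

print(V)  # converges to {0: -2.0, 1: -1.0, 2: 0.0}
```

Because every state keeps being selected with positive probability, the values settle at the same fixed point a full-sweep method would reach.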

8. Summary of Bellman Equations

Name                       Formula
─────────────────────────────────────────────────────────────
Bellman Expectation (v_π)  v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]
Bellman Expectation (q_π)  q_π(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ Σ_{a'} π(a'|s') q_π(s', a')]
Bellman Optimality (v_*)   v_*(s) = max_a Σ_{s',r} p(s', r|s, a) [r + γ v_*(s')]
Bellman Optimality (q_*)   q_*(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ max_{a'} q_*(s', a')]

9. Examples from Lecture

9.1 Gridworld (Iterative Policy Evaluation)

A 4x4 grid where transitions have reward -1 until termination.

  • Under a random policy, the values represent the negative expected number of steps to the exit.
  • After a single policy improvement step, the greedy policy already becomes optimal for this simple grid.

9.2 Transition Graph Stochasticity

Recycling Robot

Arcs are labeled with transition probability p(s'|s, a) and expected reward r(s, a, s').

  • If the battery is low, the action search either stays in low (prob β) or depletes the battery and requires “rescue” (transition to high, prob 1 − β, reward -3).
  • This complexity is handled inherently by the expectation over p(s', r|s, a).

Big Picture of Policy Learning

  1. Model-based: Learn the dynamics, then plan (DP).
  2. Model-free Value-based: Learn v or q directly (TD, Q-learning).
  3. Model-free Policy-based: Directly optimize the policy π (Policy Gradient).


Book Reference: Sutton & Barto Ch 3.4-3.8, Ch 4.