Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). While classical DP is limited by the requirement of a model and computational cost, it provides the theoretical foundation for almost all other RL methods, which can be viewed as approximations of DP.
Key Assumption
DP assumes the environment is a finite MDP with known dynamics p(s′,r∣s,a).
1. Unified Notation & Recap
To handle both episodic and continuing tasks simultaneously, we use a single notation for returns:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

For episodic tasks, we treat termination as entering a special absorbing state that transitions only to itself with reward 0. This allows us to use the infinite-sum notation, with the possibility of γ=1 if all episodes eventually terminate.

The Bellman equation for vπ then reads:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$$

- $\sum_a \pi(a \mid s)$: expectation over actions chosen by policy π.
- $\sum_{s', r} p(s', r \mid s, a)$: expectation over next states s′ and rewards r given action a (the dynamics).
- $[r + \gamma v_\pi(s')]$: immediate reward plus discounted value of the next state.
2. Iterative Policy Evaluation
If the dynamics are known, the Bellman equation forms a system of ∣S∣ simultaneous linear equations in ∣S∣ unknowns. Rather than solving it directly, we solve it iteratively.
Iterative Update Rule
$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_k(s')]$$
This is an expected update. The sequence {vk} converges to vπ as k→∞ (under γ<1 or guaranteed termination).
Algorithm: Iterative Policy Evaluation
```
Algorithm: Iterative Policy Evaluation, for estimating V ≈ v_π
─────────────────────────────────────────────────────────────
Input: π (policy to be evaluated)
Parameter: θ > 0 (small threshold determining accuracy)
Initialize V(s) arbitrarily (e.g., 0), for all s ∈ S; V(terminal) = 0

Loop:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
        Δ ← max(Δ, |v - V(s)|)
until Δ < θ
```
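As a concrete sketch, the update above can be implemented in a few lines of Python. The 2-state MDP below (the dynamics table `P` and all its numbers) is invented purely for illustration:

```python
# A hypothetical 2-state MDP. In each state the agent can "stay" (a=0) or
# "switch" (a=1). P[s][a] lists the possible outcomes as
# (probability, next_state, reward) triples, mirroring p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def policy_evaluation(pi, P, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place (Gauss-Seidel style) sweeps."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            # Expected update: average over actions under pi, then over dynamics.
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                      for p, s2, r in outcomes)
                       for a, outcomes in P[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Evaluate the equiprobable random policy pi(a|s) = 0.5.
pi = {s: {0: 0.5, 1: 0.5} for s in P}
V = policy_evaluation(pi, P)
print(V)   # for this toy MDP, V converges near {0: 7.25, 1: 7.75}
```

Note the in-place variant: the sweep immediately reuses freshly updated values of other states, which typically converges faster than keeping two separate arrays.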
Backup Diagram for vπ
```
        (s)         <-- State being updated
       / | \
      o  o  o       <-- Actions (a) chosen by policy π
     / \
  (s') (s')         <-- Possible next states (s') from dynamics p
```
The value flows back from the leaf nodes (s′) through the actions to the root (s).
3. Policy Improvement
Once we have vπ, we want to find a better policy π′.
Policy Improvement Theorem
Let π and π′ be any pair of deterministic policies such that, for all s∈S:

$$q_\pi(s, \pi'(s)) \ge v_\pi(s)$$

Then π′ must be as good as, or better than, π:

$$v_{\pi'}(s) \ge v_\pi(s) \quad \forall s \in S$$
Proof Sketch
We start with $v_\pi(s) \le q_\pi(s, \pi'(s))$ and repeatedly expand the right-hand side:

$$v_\pi(s) \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

Apply the inequality again inside the expectation:

$$v_\pi(s) \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s]$$

Continuing to expand the expectation of rewards in this way, the bound telescopes into the full discounted return under π′, which is exactly $v_{\pi'}(s)$.
Greedy Policy Improvement
A natural candidate for π′ is the greedy policy with respect to vπ:
$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$$
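A minimal sketch of greedy improvement in Python, on a hypothetical 2-state MDP (all dynamics and numbers invented for illustration):

```python
# P[s][a] lists outcomes as (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def greedy_policy(V, P, gamma=0.9):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    pi = {}
    for s in P:
        # One-step lookahead: compute q_pi(s, a) from V, then take the argmax.
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
             for a, outcomes in P[s].items()}
        pi[s] = max(q, key=q.get)   # ties broken arbitrarily
    return pi

V = {0: 7.25, 1: 7.75}       # assumed values, e.g. from policy evaluation
print(greedy_policy(V, P))   # picks the higher-reward action in both states
```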
4. Policy Iteration
Policy Iteration is the alternating sequence of Evaluation and Improvement.
The PI Loop
$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$
Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy in finite steps.
Algorithm: Policy Iteration
```
Algorithm: Policy Iteration (for finding π ≈ π*)
─────────────────────────────────────────────────────────────
1. Initialization
    V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S

2. Policy Evaluation
    Loop:
        Δ ← 0
        For each s ∈ S:
            v ← V(s)
            V(s) ← Σ_{s',r} p(s', r|s, π(s)) [r + γ V(s')]
            Δ ← max(Δ, |v - V(s)|)
    until Δ < θ

3. Policy Improvement
    policy-stable ← true
    For each s ∈ S:
        old-action ← π(s)
        π(s) ← arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
        If old-action ≠ π(s), then policy-stable ← false
    If policy-stable, then stop; else go to step 2
```
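The full loop can be sketched in Python on a hypothetical 2-state MDP (dynamics and numbers invented for illustration):

```python
# P[s][a] lists outcomes as (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def q_value(s, a, V, P, gamma):
    """One-step lookahead: sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    pi = {s: next(iter(P[s])) for s in P}        # arbitrary initial policy
    while True:
        # 2. Policy Evaluation for the current deterministic policy pi
        while True:
            delta = 0.0
            for s in P:
                v = V[s]
                V[s] = q_value(s, pi[s], V, P, gamma)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy Improvement: make pi greedy with respect to V
        policy_stable = True
        for s in P:
            old = pi[s]
            pi[s] = max(P[s], key=lambda a: q_value(s, a, V, P, gamma))
            if pi[s] != old:
                policy_stable = False
        if policy_stable:
            return V, pi

V, pi = policy_iteration(P)
print(pi)   # converges to the higher-reward action (a=1) in both states here
```

One caveat worth knowing: if several actions tie for the argmax, the pseudocode as written can oscillate between equally good policies; comparing action values rather than action identities avoids this in practice.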
5. Value Iteration
One drawback of Policy Iteration is that each evaluation step itself requires an iterative process. Value Iteration combines evaluation and improvement into a single sweep.
Value Iteration Update Rule
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_k(s')]$$
This turns the Bellman Optimality Equation into an update rule.
Algorithm: Value Iteration
```
Algorithm: Value Iteration, for estimating π ≈ π*
─────────────────────────────────────────────────────────────
Initialize V arbitrarily (e.g., V(s) = 0), V(terminal) = 0

Loop:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
        Δ ← max(Δ, |v - V(s)|)
until Δ < θ

Output a deterministic policy π, such that
    π(s) = arg max_a Σ_{s',r} p(s', r|s, a) [r + γ V(s')]
```
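In code the change from policy evaluation is a single line: a `max` over actions replaces the expectation under a fixed policy. A sketch on a hypothetical 2-state MDP (numbers invented for illustration):

```python
# P[s][a] lists outcomes as (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            # Bellman optimality update: max over actions instead of an
            # expectation under a fixed policy.
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in P[s].values())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Output the deterministic greedy policy with respect to the final V.
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
    return V, pi

V, pi = value_iteration(P)
```

A standard stopping guarantee: once Δ < θ, the value estimate is within θγ/(1−γ) of v∗ in the max norm.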
Relation to Policy Iteration
Value Iteration is equivalent to Policy Iteration with only one sweep of policy evaluation between improvement steps.
6. Generalized Policy Iteration (GPI)
GPI is the general framework describing any interaction of policy evaluation and policy improvement.
Evaluation: Makes the value function consistent with the current policy.
Improvement: Makes the policy greedy with respect to the current value function.
The GPI Diagram
Imagine two lines (or manifolds) in the value-policy space:
v=vπ: where values are consistent with the policy.
π=greedy(v): where the policy is optimal for the values.
Evaluation pulls us toward line 1; Improvement pulls us toward line 2. They compete and eventually intersect at (v∗,π∗).
```mermaid
graph TD
    E[Policy Evaluation] --> |Estimate V| I[Policy Improvement]
    I --> |Greedy Policy| E
    style E fill:#f9f,stroke:#333
    style I fill:#ccf,stroke:#333
```
7. Asynchronous Dynamic Programming
A major drawback of classical DP is that it involves "sweeps" over the entire state set, updating every state once per sweep. Asynchronous DP instead updates states in any order whatsoever.
Key idea: Use whatever values are available.
Benefits: Can focus computation on “important” or frequently visited states. No need to wait for a full sweep to finish.
Requirement: Must continue to update all states (none can be ignored forever) for convergence to v∗.
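A minimal asynchronous sketch: states are updated one at a time in a random order, always using the latest values of the other states. The 2-state MDP and the uniform-random selection rule are both hypothetical choices for illustration:

```python
import random

# P[s][a] lists outcomes as (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def async_value_iteration(P, gamma=0.9, n_updates=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in P}
    states = list(P)
    for _ in range(n_updates):
        # Any selection rule works for convergence, as long as every state
        # keeps being selected; here we simply pick uniformly at random.
        s = rng.choice(states)
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                   for outcomes in P[s].values())
    return V

V = async_value_iteration(P)
```

In practice the selection rule is where asynchronous DP earns its keep, e.g. prioritizing states with large Bellman error or states the agent actually visits.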