RL Lecture 14 - Recap & Exam Preparation

Overview & Motivation

This is the final recap lecture covering all 13 previous lectures. The goal is to show how the lecture topics fit together and to provide a unified view of the methods studied throughout the course. This note is structured as an exam-preparation resource with comparison tables, convergence guarantees, and method taxonomies.

Warning

The recap cannot cover all 13 previous lectures in 90 minutes. Topics not covered in the recap can still appear on the exam, so study each lecture’s material independently.

Exam: What to Know

Know the advantages, disadvantages, and limitations of each method, and the situations in which a given method should be preferred. A cheat sheet with key update equations is provided during the exam. Many algorithms have variants (Q- and V-versions, importance weights, etc.) — the cheat sheet contains only the most important ones.


The Big Picture

The central problem in RL: find an optimal policy (or evaluate a given policy) from data.

When MDP dynamics are known, we use Dynamic Programming. When we only have data from the MDP, we use reinforcement learning.

Three Approaches to Learning Policies

| Approach | What We Learn | Methods |
|---|---|---|
| Value-based | V or Q; derive the policy from the values | MC, TD, Q-Learning, SARSA, DQN, CQL |
| Policy-based | π directly | REINFORCE, PGT, Actor-Critic, DPG/DDPG, SAC |
| Model-based | Transition model p and reward r, then plan | Dyna, MCTS, AlphaGo Zero |

Other approaches: Decision Transformer, Decision Diffuser.


Taxonomy of RL Methods

                        RL Methods
                            |
            ________________|________________
           |                |                |
      Value-Based      Policy-Based     Model-Based
           |                |                |
    MC, TD, Q-learning  REINFORCE       Dyna, MCTS
    SARSA, DQN, CQL    PGT, AC, DPG    AlphaGo Zero
                        SAC

Model-Free vs Model-Based

| | Model-Free | Model-Based |
|---|---|---|
| What it learns | Value function or policy directly | Transition model and reward |
| Planning | No explicit planning | Uses the model for planning / simulation |
| Sample efficiency | Lower | Higher (can generate synthetic data) |
| Model errors | No model bias | Model errors compound |
| Examples | Q-Learning, SARSA, REINFORCE, Actor-Critic | Dyna, MCTS, AlphaGo Zero |

Lecture-by-Lecture Summary

L1: MDPs, Bandits, Exploration vs Exploitation

L2: Dynamic Programming

L3: Monte Carlo Methods

L4: Temporal Difference Learning

L5: From Tabular to Approximation

L6: On-Policy TD with Approximation

L7: Off-Policy RL with Approximation

L8: Deep RL (Value-Based)

L9: Policy Gradient Methods

L10: Experimentation, Evaluation & Reproducibility

L11: SAC, Decision Transformer, Decision Diffuser

  • Soft Actor-Critic (SAC): maximum entropy RL, balances exploration and exploitation
  • Decision Transformer: sequence modeling approach to RL (uses transformers)
  • Decision Diffuser: diffusion models for decision-making
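
The maximum-entropy objective behind SAC can be written (standard form from the literature; the notation is not taken from the slides) as:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],$$

where the temperature $\alpha$ trades off return against policy entropy (exploration).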

L12: Model-Based RL

  • Dyna: integrates learning and planning, generates simulated experience
  • Monte Carlo Tree Search (MCTS): tree search with rollout-based evaluation
  • AlphaGo Zero: combines MCTS with deep neural network (policy + value)
  • Key questions: How to learn the model? When to update? How to update?

L13: POMDPs

  • Partially Observable MDPs: agent does not observe full state
  • Exact methods: full history (not compact), belief states (requires known model), predictive state representations (most compact, learnable)
  • Approximate methods: recent observations (easy, loses long-term info), end-to-end learning with RNNs (general but tricky, data-hungry)

Value-Based Methods: MC vs TD

MC vs TD Comparison

| Property | MC | TD |
|---|---|---|
| Bias | Unbiased | Biased (bootstrapping) |
| Variance | High | Low |
| Episode requirement | Must wait for episode end | Updates every step |
| Bootstrapping | No | Yes |
| Works in continuing tasks | No (episodic only) | Yes |
| Sensitivity to initial values | Less sensitive | More sensitive |
| Convergence (tabular) | Converges to $v_\pi$ | Converges to $v_\pi$ |
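
The timing and bootstrapping differences between MC and TD show up directly in their update rules. A minimal tabular sketch, with a made-up two-state episode and step size:

```python
# Toy comparison of tabular every-visit MC and TD(0) updates on a two-state
# chain (state 0 -> state 1 -> terminal). States, rewards, and the step size
# are illustrative, not from the lecture.
gamma, alpha = 1.0, 0.5

def mc_update(V, episode):
    """Every-visit MC: wait for the full episode, then update toward returns."""
    G = 0.0
    for s, r in reversed(episode):      # episode = [(state, reward), ...]
        G = r + gamma * G               # full return observed from state s
        V[s] += alpha * (G - V[s])
    return V

def td0_update(V, s, r, s_next):
    """TD(0): update immediately, bootstrapping on the current estimate."""
    target = r + gamma * V.get(s_next, 0.0)
    V[s] += alpha * (target - V[s])
    return V

V_mc = mc_update({0: 0.0, 1: 0.0}, [(0, 0.0), (1, 1.0)])
V_td = td0_update({0: 0.0, 1: 0.0}, 0, 0.0, 1)
print(V_mc)  # both states move toward the sampled return of 1
print(V_td)  # state 0 bootstraps on the (still zero) estimate of state 1
```

Note how the TD update for state 0 is pulled toward the current (initial) estimate of state 1, which is where TD's bias and initial-value sensitivity come from.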

On-Policy vs Off-Policy

| Property | On-Policy | Off-Policy |
|---|---|---|
| Complexity | Simpler | More complex |
| Generality | Special case | More general |
| Convergence | Often converges faster | Often higher variance or slower convergence |
| Data usage | Only data from the current policy | Can reuse data, use data from other sources |
| Policy type | Generally needs a non-greedy (exploring) policy | Allows a greedy target policy |
| Examples | SARSA, MC control | Q-Learning, DQN, DPG |
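
Off-policy data reuse typically relies on importance sampling. A minimal sketch with hypothetical two-action policy probabilities, which also shows where the large variance comes from:

```python
# Per-trajectory importance weights for off-policy evaluation: reweight
# returns collected under behavior policy b to estimate values under the
# target policy pi. The probabilities below are illustrative numbers.
target_pi = {"left": 0.9, "right": 0.1}   # target policy pi(a|s)
behavior_b = {"left": 0.5, "right": 0.5}  # behavior policy b(a|s)

def importance_ratio(actions):
    """Product of pi(a|s) / b(a|s) over the actions in a trajectory."""
    rho = 1.0
    for a in actions:
        rho *= target_pi[a] / behavior_b[a]
    return rho

# A trajectory the target policy favors is up-weighted...
print(importance_ratio(["left", "left"]))    # (0.9/0.5)^2 = 3.24
# ...one it avoids is down-weighted; the product over many steps can
# explode or vanish, which is the variance problem of off-policy MC.
print(importance_ratio(["right", "right"]))  # (0.1/0.5)^2 = 0.04
```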

Evaluation Methods (Value Prediction)

The unified view of evaluation methods along two axes (Sutton and Barto figure):

  • Width (sampling): from one sample (TD) to all samples (DP)
  • Depth (bootstrapping): from one-step (TD) to full return (MC)

Key methods: Gradient MC, semi-gradient TD(0), GTD2, LSTD


Control Methods

| | On-Policy | Off-Policy |
|---|---|---|
| Tabular | SARSA, MC Control | Q-Learning |
| With Approximation | Semi-gradient SARSA | DQN, CQL |
| DP (known model) | Policy Evaluation | Value Iteration |
| With importance sampling | | SARSA with importance weights |
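
The on-/off-policy split for tabular control comes down to the bootstrap target. A minimal sketch, with illustrative Q-values and step sizes:

```python
# Contrast between the tabular Q-Learning (off-policy) and SARSA (on-policy)
# updates. The toy Q-table and constants are illustrative.
alpha, gamma = 0.1, 0.9
Q = {("s", "a1"): 1.0, ("s", "a2"): 2.0, ("s0", "a1"): 0.0}

def q_learning(Q, s, a, r, s_next, actions):
    """Off-policy: bootstrap on the greedy (max) action in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap on the action a_next the policy actually takes."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_learning(Q, "s0", "a1", 0.0, "s", ["a1", "a2"])  # uses max_b Q(s, b) = 2.0
print(Q[("s0", "a1")])  # moved toward gamma * 2.0, i.e. roughly 0.18
```

SARSA with the same transition would bootstrap on Q(s, a1) = 1.0 if the (exploring) policy happened to pick a1, which is why its target follows the behavior policy rather than the greedy one.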

Convergence with Function Approximation

This table is critical for the exam. Know which combinations converge and which do not.

Convergence Guarantees

| Method | Tabular (On/Off) | Linear On-Policy | Nonlinear On-Policy | Linear Off-Policy | Nonlinear Off-Policy |
|---|---|---|---|---|---|
| Gradient MC | Yes | Yes (global) | Yes (local) | Yes (global) | Yes (local) |
| Semi-gradient TD | Yes | Yes (global) | No C! | No C! | No C! |
| Gradient TD (e.g. GTD2) | Yes | Yes (global) | Yes (local) | Yes (global) | Yes (local) |
| LSTD | Yes | Yes (global) | N.A. | Yes (global) | N.A. |

Notes:

  • * with an appropriate step-size schedule
  • ** linear convergence assumes independent features and a unique solution
  • “No C!” = no convergence guarantee
  • “local” = converges only to a local optimum (nonlinear case)
  • “global” = converges to the global optimum

Exam Pattern

Semi-gradient TD can diverge in off-policy settings with function approximation. This is the Deadly Triad: function approximation + bootstrapping + off-policy learning. Gradient-TD methods (e.g. GTD2) fix this by following true gradients of an objective (the projected Bellman error).

Convergence to Which Error?

| Method | Error Minimized |
|---|---|
| Gradient MC | VE (Value Error): $\overline{VE}(w) = \sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s,w)]^2$ |
| Semi-gradient TD | PBE (Projected Bellman Error) — when it converges |
| Gradient TD (GTD2) | PBE (Projected Bellman Error) |
| LSTD | PBE (Projected Bellman Error) |

Errors in Value Function Approximation

Error Hierarchy (Lecture 7)

  • Value Error (VE): $\overline{VE}(w) = \sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s,w)]^2$ — difference between the true and estimated value
  • Bellman Error (BE): $\bar{\delta}_w(s) = \mathbb{E}_\pi[R_{t+1} + \gamma \hat{v}(S_{t+1},w) \mid S_t = s] - \hat{v}(s,w)$ — the expected TD error in each state
  • TD Error: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},w) - \hat{v}(S_t,w)$ — a sample of the Bellman error
  • Projected Bellman Error (PBE): $\overline{PBE}(w) = \lVert \Pi \bar{\delta}_w \rVert_\mu^2$ — the BE projected onto the representable function space

Semi-Gradient Methods: Why “Semi”?

Semi-gradient methods take the gradient only through the estimate $\hat{v}(S_t, w)$, not through the bootstrapping target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$ (which also depends on $w$). This means they are not true gradient methods and lack the convergence guarantees of true gradient descent. They converge on-policy with linear function approximation but can diverge off-policy.
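
A minimal sketch of this, assuming linear features and made-up inputs:

```python
import numpy as np

# Semi-gradient TD(0) with linear features v(s, w) = w . x(s).
# The features, reward, and step sizes below are illustrative.
gamma, alpha = 0.9, 0.1

def semi_gradient_td0(w, x_s, r, x_next):
    v_s = w @ x_s                      # differentiated: grad_w v(s, w) = x_s
    target = r + gamma * (w @ x_next)  # bootstrapped target: held constant
    delta = target - v_s               # TD error
    # A true gradient method would also differentiate the target (it depends
    # on w); the semi-gradient keeps only the gradient of the estimate, x_s.
    return w + alpha * delta * x_s

w = semi_gradient_td0(np.zeros(2), np.array([1.0, 0.0]),
                      r=1.0, x_next=np.array([0.0, 1.0]))
print(w)  # only the active feature's weight moves: [alpha * delta, 0]
```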


Policy-Based Methods

Taxonomy of Policy Methods

| Category | Methods | Key Idea |
|---|---|---|
| Actor only | REINFORCE (original & v2), finite differences | Policy gradient without a critic |
| Actor-Critic | PGT-based Actor-Critic, DPG, SAC | Policy gradient with a learned value function |
| Critic only | Q-Learning, DQN, CQL | Derive the policy from a learned value function |
| Other | Decision Transformer, Decision Diffuser | Sequence modeling / generative approaches |

Methods and Action/Policy Types

| Method | Discrete Actions | Continuous Actions | Stochastic Policies | Deterministic Policies |
|---|---|---|---|---|
| Stochastic PG | Yes | Yes | Yes (behavior + target) | No |
| Deterministic PG | No (no gradients) | Yes | Only behavior policy | Yes (target policy) |
| Critic-only evaluation | Yes | Yes | Behavior or target | Only target policy |
| Critic-only control | Yes | No (how to extract a policy?) | | |

Key Distinction

Stochastic policy gradients work for both discrete and continuous actions but require stochastic policies. Deterministic policy gradients require continuous actions but avoid importance sampling and can learn a true greedy policy.
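
The stochastic side of this distinction uses the score function: weight the gradient of log π by the return. A minimal sketch with a hypothetical two-action softmax policy (a deterministic PG would instead backpropagate through Q at the actor's output):

```python
import numpy as np

# REINFORCE-style score-function gradient for a softmax policy over two
# discrete actions. The return G and learning rate are made-up numbers.
theta = np.zeros(2)  # one preference per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, action, G, lr=0.1):
    """theta += lr * G * grad log pi(a); for softmax, grad = onehot(a) - pi."""
    pi = softmax(theta)
    grad_log = -pi
    grad_log[action] += 1.0
    return theta + lr * G * grad_log

theta = reinforce_step(theta, action=0, G=1.0)
print(softmax(theta))  # probability of the rewarded action rises above 0.5
```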


RL Methods Landscape

The RL methods landscape can be organized along two axes (from Jan Peters):

  • x-axis: parametrized (or given) transition model (model-free to model-based)
  • y-axis: parametrized value function Q vs parametrized policy π

| | Model-Free | Model-Based |
|---|---|---|
| Parametrized Q | Q-Learning, DQN, … | |
| Both Q and π | Actor-Critic, SAC | Model-based policy search |
| Parametrized π | REINFORCE | |
| Planning | | Dyna, AlphaGo, pure planning |

Model-Based Learning (Lecture 12)

Key questions addressed:

  • Why do model-based RL? Sample efficiency, ability to plan ahead
  • How to learn the model? Supervised learning of transitions and rewards
  • When to update the policy? After each real step, after batches, etc.
  • How to update the policy? Using simulated experience (Dyna) or tree search (MCTS)
  • AlphaGo Zero: leverages planning both “ahead of” and “while” acting (MCTS during play, neural network training from self-play)
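
The Dyna loop above (direct RL, model learning, planning) can be sketched in a few lines, assuming a deterministic toy model and a single state-action pair:

```python
import random

# Minimal Dyna-Q sketch: after each real step, do one direct Q-update, store
# the transition in a deterministic model, then replay n simulated updates.
# The toy environment (one state-action pair) is illustrative.
alpha, gamma, n_planning = 0.1, 0.95, 5
Q, model = {}, {}

def q_update(s, a, r, s_next, actions):
    best = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best - old)

def dyna_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)       # direct RL from real experience
    model[(s, a)] = (r, s_next)              # model learning (deterministic)
    for _ in range(n_planning):              # planning: simulated experience
        ps, pa = random.choice(list(model))  # previously observed state-action
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next, actions)

dyna_step("s0", "a", 1.0, "s1", actions=["a"])
print(Q[("s0", "a")])  # six updates toward the target instead of one
```

One real transition yields 1 + n_planning updates, which is exactly the sample-efficiency argument for model-based RL.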

POMDPs (Lecture 13)

State update functions for internal states in Partially Observable MDPs:

| Method | Type | Compact? | Markovian? | Notes |
|---|---|---|---|---|
| Full history | Exact | No | Yes | Trivially Markovian but grows without bound |
| Belief state | Exact | Moderate | Yes | Easy to interpret, requires a known model |
| Predictive state | Exact | Most compact | Yes | Model learnable from data |
| Recent observation(s) | Approximate | Yes | No | Easy, but loses long-term dependencies |
| End-to-end (RNN) | Approximate | Yes | Learned | General, but RNN training is tricky and data-hungry |
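
The belief-state update from the exact methods can be sketched for a two-state example (all probabilities here are illustrative):

```python
import numpy as np

# Exact belief-state update for a tiny two-state POMDP:
# b'(s') is proportional to O(o|s') * sum_s T(s'|s,a) * b(s).
T = np.array([[0.7, 0.3],   # T[s, s'] = P(s' | s, a) for the action taken
              [0.2, 0.8]])
O = np.array([0.9, 0.4])    # O[s'] = P(o | s') for the observation received

def belief_update(b, T, O):
    predicted = b @ T             # predict: marginalize over the previous state
    unnorm = O * predicted        # correct: weight by observation likelihood
    return unnorm / unnorm.sum()  # renormalize so the belief sums to 1

b = belief_update(np.array([0.5, 0.5]), T, O)
print(b)  # the belief shifts toward the state that better explains o
```

This is exactly why belief states require a known model: both T and O must be given for the update to be computable.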

Key Recurring Themes

Themes That Run Through the Entire Course

| Theme | Description | Relevant Lectures |
|---|---|---|
| Bias-variance trade-off | MC is unbiased / high variance; TD is biased / low variance | L3, L4, L5, L6, L10 |
| On-policy vs off-policy | Learning from your own policy vs learning from other data | L3, L4, L7, L8, L10 |
| Exploration vs exploitation | Balancing trying new actions vs using known good ones | L1 and throughout |
| Tabular vs approximation | Exact methods vs generalization with function approximation | L5, L6, L7 |
| Model-free vs model-based | Learning from experience directly vs learning a model and planning | L3-L11 vs L12 |
| Experimentation & reproducibility | Proper experimental methodology in RL research | L10 |

Quick Reference: When to Use Which Method

Decision Guide for the Exam

| Situation | Recommended Method | Why |
|---|---|---|
| Known model, small state space | Dynamic Programming (Value Iteration) | Exact solution, no sampling needed |
| Unknown model, episodic, small state space | MC or TD (tabular) | Simple, guaranteed convergence |
| Unknown model, continuing tasks | TD methods | MC requires episode end |
| Large/continuous state space | Function approximation + TD/MC | Generalization needed |
| Off-policy with function approximation | GTD2 or LSTD | Avoids Deadly-Triad divergence |
| Continuous action space | Policy gradient methods or DPG | Q-Learning cannot easily maximize over continuous actions |
| Need stochastic policy | REINFORCE, Actor-Critic | Built-in exploration |
| Need deterministic optimal policy | DPG / DDPG | No importance sampling needed |
| Maximum entropy / robust exploration | SAC | Entropy-regularized objective |
| Sample efficiency critical | Model-based (Dyna, MCTS) | Can generate synthetic experience |
| Partial observability | POMDP methods (belief states, RNNs) | Full state not available |
| Offline data only | CQL, Decision Transformer | Cannot collect new data |

Other Important Topics

Do Not Forget

  • Exploration vs exploitation (throughout the course, especially Lecture 1)
  • Experimentation, evaluation & reproducibility (Lecture 10): proper methodology for comparing RL algorithms, statistical significance, hyperparameter sensitivity

References

  • Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
  • Lecture slides 1-14, Herke van Hoof, University of Amsterdam.