Reinforcement Learning — Overview

  • Course: Reinforcement Learning 2025/2026 (Period 4)
  • Programme: MSc Artificial Intelligence, UvA
  • Credits: 6 EC (168 hours total)
  • Instructors: Herke van Hoof, Christian Andersson Naesseth
  • Exam: March 27, 2026 (65% of final grade, must score ≥ 5.0)

Textbooks

  • Primary: Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (RL:AI)
    • Chapters covered: 1-6, 8-11, 13, 16, 17
    • Free PDF available online
  • Secondary: A Survey on Policy Search for Robotics — Deisenroth, Neumann, Peters
  • Additional papers distributed via Canvas

Assessment

| Component | Weight | Notes |
| --- | --- | --- |
| 5 Coding Assignments | 2% each (10% total) | Groups of 2, late = -1 point |
| 5 Homework Sets | 4% each (20% total) | Groups of 2, due Thursdays 17:00 |
| Empirical RL Assignment | 5% |  |
| Exam | 65% | Must score ≥ 5.0 |

Weekly Schedule

Week 1 — Foundations

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L1 | Intro, MDPs & Bandits | RL:AI 1.1-1.4, 1.6, 2.1-2.4, 2.6-2.7, 3.1-3.3 | RL-L01 - Intro, MDPs & Bandits |
| L2 | Dynamic Programming | RL:AI 2.5, 3.4-3.8, 4 | RL-L02 - Dynamic Programming |
| Book |  |  | RL-Book Ch1 - Introduction, RL-Book Ch2 - Multi-Armed Bandits, RL-Book Ch3 - Finite MDPs, RL-Book Ch4 - Dynamic Programming |
| Exercises |  | Ex 0.1-0.2, 1.1-1.3, 2.1-2.2 | RL-ES01 - Exercise Set Week 1 |
| Homework |  | HW 2.3-2.4 (due 12/2) | RL-HW01 - Homework 1 |
| Coding |  | CA1: Dynamic Programming | RL-CA01 - Dynamic Programming |

Week 2 — Tabular Methods

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L3 | Monte Carlo Methods | RL:AI 5.1-5.7 | RL-L03 - Monte Carlo Methods |
| L4 | Temporal Difference Methods | RL:AI 6.1-6.5 | RL-L04 - Temporal Difference Learning |
| Book |  |  | RL-Book Ch5 - Monte Carlo Methods, RL-Book Ch6 - Temporal-Difference Learning |
| Exercises |  | Ex 3.1-3.3, 4.1-4.2 | RL-ES02 - Exercise Set Week 2 |
| Homework |  | HW 3.4, 4.3 (due 19/2) | RL-HW02 - Homework 2 |
| Coding |  | CA2: Monte Carlo | RL-CA02 - Monte Carlo |

Week 3 — From Tabular to Approximation

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L5 | From Tabular Learning to Approximation | RL:AI 9.1-9.3 | RL-L05 - Tabular to Approximation |
| L6 | On-policy TD Learning with Approximation | RL:AI 9.3-9.4, 9.7-9.8 | RL-L06 - On-Policy TD with Approximation |
| Book |  |  | RL-Book Ch9 - On-Policy Prediction with Approximation |
| Exercises |  | Ex 5.1-5.2, 6.1-6.4 | RL-ES03 - Exercise Set Week 3 |
| Homework |  | HW 5.3-5.4, 6.5 (due 26/2) | RL-HW03 - Homework 3 |
| Coding |  | CA3: Temporal Difference | RL-CA03 - Temporal Difference |

Week 4 — Off-Policy Approximation & Deep RL (current)

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L7 | Off-policy RL with Approximation | RL:AI 10.1, 11.1-11.7 | RL-L07 - Off-Policy RL with Approximation |
| L8 | Deep RL (Value-Based Methods) | RL:AI 16.5; DQN & CQL papers | RL-L08 - Deep RL Value-Based |
| Book |  |  | RL-Book Ch10 - On-Policy Control with Approximation, RL-Book Ch11 - Off-Policy Methods with Approximation |
| Exercises |  | Ex 7.1-7.3, 8.1 |  |
| Homework |  | HW 7.4, 8.2-8.3 (due 5/3) |  |
| Coding |  |  |  |

Week 5 — Policy Gradient Methods

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L9 | REINFORCE | RL:AI 13.1-13.4, 13.7; Survey §2.4.1.2 |  |
| L10 | PGT, DPG & Evaluation | RL:AI 13.5 + papers |  |

Week 6 — Advanced Methods

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L11 | SAC, Decision Transformer, Decision Diffuser | Papers |  |
| L12 | Planning and Learning | RL:AI 8.1-8.2, 8.8, 8.10-8.11, 8.13, 16.6 |  |

Week 7 — Wrap-Up

|  | Topic | Literature | Notes |
| --- | --- | --- | --- |
| L13 | Partial Observability | RL:AI 17.3 |  |
| L14 | Recap & Exam FAQ |  |  |
| Exam | March 27, 2026 |  |  |

Exam Prep


Concept Index

Key concepts covered in this course (see Concepts folder):

Foundations: Markov Decision Process · Bellman Equation · Value Function · Policy · Return · Discount Factor · Multi-Armed Bandit
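
To make the Bellman equation, value function, and discount factor concrete, here is a minimal sketch of iterative policy evaluation on an invented three-state chain (the states, rewards, and γ = 0.9 are made up for illustration and are not a course assignment):

```python
import numpy as np

# Invented chain MDP: the policy always steps right; entering terminal
# state 2 from state 1 pays reward 1, all other rewards are 0.
gamma = 0.9                   # discount factor
next_state = {0: 1, 1: 2}     # deterministic transitions under the policy
reward = {0: 0.0, 1: 1.0}

V = np.zeros(3)               # V[2] stays 0 (terminal state)
for _ in range(100):          # repeat the Bellman expectation backup
    for s in (0, 1):
        V[s] = reward[s] + gamma * V[next_state[s]]
```

The iteration converges to V(1) = 1 and V(0) = γ · 1 = 0.9, matching the closed-form discounted return.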

Tabular Methods: Dynamic Programming · Policy Iteration · Value Iteration · Monte Carlo Methods · Temporal Difference Learning · SARSA · Q-Learning · Expected SARSA
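
As one concrete tabular example, here is a sketch of Q-learning on an invented five-state corridor (the environment, step sizes, and episode counts are illustrative choices, not course material):

```python
import random
random.seed(0)

n_states, alpha, gamma, eps = 5, 0.5, 0.95, 0.1
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[s][a], a: 0 = left, 1 = right

def step(s, a):
    # Deterministic corridor: reaching the last state pays reward 1.
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

def epsilon_greedy(s):
    if random.random() < eps or Q[s][0] == Q[s][1]:
        return random.randrange(2)            # explore, or break ties randomly
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(300):
    s = 0
    while s != n_states - 1:
        a = epsilon_greedy(s)
        s2, r = step(s, a)
        td_target = r + gamma * max(Q[s2])        # max makes this off-policy
        Q[s][a] += alpha * (td_target - Q[s][a])  # tabular TD update
        s = s2
```

After training, the greedy policy moves right in every non-terminal state; swapping `max(Q[s2])` for the value of the action actually chosen next would turn this into SARSA.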

Exploration: Exploration vs Exploitation · Epsilon-Greedy Policy · Exploring Starts · Upper Confidence Bound · Importance Sampling
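
The exploration-exploitation trade-off shows up most cleanly in bandits. Below is a sketch of ε-greedy with incremental sample-average estimates on an invented three-armed Bernoulli bandit (arm probabilities and ε are made up for illustration):

```python
import random
random.seed(1)

# Invented bandit: true success rates, unknown to the agent, used only
# to simulate rewards.
probs = [0.2, 0.5, 0.8]
Q = [0.0] * 3          # action-value estimates
N = [0] * 3            # pull counts
eps = 0.1              # exploration rate

for t in range(5000):
    # epsilon-greedy: explore with probability eps, otherwise act greedily
    a = random.randrange(3) if random.random() < eps else max(range(3), key=lambda i: Q[i])
    r = 1.0 if random.random() < probs[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample-average update
```

Over many pulls the best arm should dominate the counts while the other arms still receive occasional exploratory pulls.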

Approximation: Function Approximation · Stochastic Gradient Descent · Semi-Gradient Methods · Linear Function Approximation · Feature Construction · Tile Coding · LSTD
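
To illustrate semi-gradient methods with linear function approximation, here is a sketch of semi-gradient TD(0) on an invented symmetric random walk (the environment and one-hot features are illustrative; one-hot features make linear approximation coincide with the tabular case):

```python
import numpy as np
rng = np.random.default_rng(0)

n_states, gamma, alpha = 5, 1.0, 0.05

def phi(s):                        # one-hot feature vector; any map works
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)             # linear value function: v(s) = w @ phi(s)
for _ in range(2000):              # episodes of the random walk
    s = 2                          # start in the middle
    while True:
        s2 = s + rng.choice([-1, 1])
        if s2 < 0:
            r, done = 0.0, True        # exit left: reward 0
        elif s2 >= n_states:
            r, done = 1.0, True        # exit right: reward 1
        else:
            r, done = 0.0, False
        v_next = 0.0 if done else w @ phi(s2)
        # semi-gradient: only the gradient of v(s), not of the target
        w += alpha * (r + gamma * v_next - w @ phi(s)) * phi(s)
        if done:
            break
        s = s2
```

With these one-hot features the weights approach the tabular values (s+1)/6, increasing from left to right.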

Deep RL: Deep Q-Network (DQN) · Experience Replay · Target Network · Conservative Q-Learning (CQL)
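
The two DQN stabilizers, experience replay and a frozen target network, can be sketched without any deep-learning library. Below, plain matrices stand in for the online and target networks, and the replay buffer holds invented transitions; all names and shapes here are illustrative, not the course's assignment code:

```python
import numpy as np
rng = np.random.default_rng(0)

n_actions, gamma = 2, 0.99
W_online = rng.normal(size=(4, n_actions))   # stand-in for Q(s, .) = s @ W
W_target = W_online.copy()                   # frozen copy, synced periodically

# Replay buffer of invented (s, a, r, s', done) transitions.
buffer = [(rng.normal(size=4), int(rng.integers(n_actions)),
           float(rng.normal()), rng.normal(size=4), False)
          for _ in range(100)]

def dqn_targets(batch):
    # y = r + gamma * max_a' Q_target(s', a'): bootstrap from the frozen net
    return np.array([r + (0.0 if done else gamma * (s2 @ W_target).max())
                     for (_, _, r, s2, done) in batch])

idx = rng.choice(len(buffer), size=32, replace=False)  # uniform replay sample
y = dqn_targets([buffer[i] for i in idx])              # regression targets
```

The online weights would be regressed toward `y` and copied into `W_target` every few thousand steps; decorrelated minibatches plus the frozen target are what keep the bootstrapped regression stable.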

Off-Policy: On-Policy vs Off-Policy · Deadly Triad · Gradient-TD Methods

Policy Gradient (weeks 5-6): REINFORCE · Policy Gradient Theorem · Actor-Critic · Soft Actor-Critic (SAC)
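
As a preview of the policy-gradient weeks, here is a minimal REINFORCE sketch for a softmax policy on a one-step problem, where the return is just the immediate reward (the two-arm setup and all constants are invented for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

theta = np.zeros(2)                   # one preference parameter per action
alpha = 0.1                           # step size
mean_reward = np.array([0.0, 1.0])    # invented: arm 1 is better on average

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                      # sample an action from pi
    r = mean_reward[a] + rng.normal(scale=0.1)   # noisy reward = return G
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                        # grad of log softmax policy
    theta += alpha * r * grad_log_pi             # REINFORCE: G * grad log pi

pi = softmax(theta)   # the final policy should concentrate on the better arm
```

Subtracting a baseline from `r` (e.g. a running average of rewards) leaves the gradient unbiased while reducing its variance, which is the step toward actor-critic methods.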