RL-Book Ch8: Planning and Learning with Tabular Methods

Overview

This chapter unifies model-based RL (planning) and model-free RL (learning).

Model-based: Rely on a Model of the Environment to plan (e.g., Dynamic Programming).
Model-free: Rely on direct experience (e.g., Q-Learning, Temporal Difference Learning).

At the heart of both is the computation of value functions: each method looks ahead to future events, computes a backed-up value, and uses it as an update target.

8.1 Models and Planning

Model of the Environment

Anything an agent can use to predict how the environment will respond to its actions.

Distribution Models: Produce all possible next states and rewards with their probabilities (e.g., $p(s', r \mid s, a)$).

Sample Models: Produce just one possible transition, sampled according to the probabilities.
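The difference between the two kinds of model can be sketched in a few lines. This is a minimal illustration, not code from the book: the table `DIST_MODEL`, the state and action names, and both function names are hypothetical.

```python
import random

# Hypothetical distribution model: (state, action) -> list of (prob, next_state, reward).
DIST_MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
}

def distribution_model(state, action):
    """Return every possible (probability, next_state, reward) outcome."""
    return DIST_MODEL[(state, action)]

def sample_model(state, action, rng=random):
    """Return one (next_state, reward) drawn according to the probabilities."""
    outcomes = DIST_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

A distribution model can always be used to build a sample model, as here, but not the other way round, which is why sample models are often the easier ones to obtain.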

Planning

Planning is any computational process that takes a model as input and produces or improves a policy.

$\text{model} \to \text{planning} \to \text{policy}$

The unified view treats planning as using simulated experience to update value functions: $\text{model} \to \text{simulated experience} \to \text{backups} \to \text{values} \to \text{policy}$

8.2 Dyna: Integrated Planning, Acting, and Learning

Dyna-Q integrates Direct RL (learning from real experience) and Indirect RL (planning from simulated experience).

Dyna Architecture

graph TD
    Experience -- "Model Learning" --> Model
    Model -- "Planning (Search Control)" --> SimulatedExperience
    SimulatedExperience -- "Value Update" --> ValuesPolicy
    Experience -- "Direct RL" --> ValuesPolicy
    ValuesPolicy -- "Acting" --> Experience

Figure 8.1: The general Dyna Architecture.

Tabular Dyna-Q Algorithm

Tabular Dyna-Q

Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$

Loop forever:

(a) $S \leftarrow$ current (nonterminal) state

(b) $A \leftarrow \epsilon\text{-greedy}(S, Q)$

(c) Take action $A$; observe the resulting reward $R$ and state $S'$

(d) $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a} Q(S', a) - Q(S, A)]$ (Direct RL)

(e) $Model(S, A) \leftarrow R, S'$ (Model Learning, assuming a deterministic environment)

(f) Repeat $n$ times (Planning):

$S \leftarrow$ random previously observed state

$A \leftarrow$ random action previously taken in $S$

$R, S' \leftarrow Model(S, A)$

$Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a} Q(S', a) - Q(S, A)]$
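The steps above can be sketched in Python as follows. The corridor environment (`step`, `N_STATES`, `GOAL`), the helper names, and all hyperparameter values are assumptions chosen for illustration; only the loop structure (a) to (f) comes from the algorithm.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 6, 5          # hypothetical corridor: states 0..5, goal at 5
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Deterministic toy environment (an assumption for illustration)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return (1.0 if nxt == GOAL else 0.0), nxt

def epsilon_greedy(Q, state, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def q_update(Q, s, a, r, s2, alpha, gamma):
    """One-step Q-learning update, shared by direct RL and planning."""
    target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna_q(episodes=30, n=10, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)     # Q(s, a), initialised to 0
    model = {}                 # Model(s, a) -> (R, S')
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, eps, rng)          # (b)
            r, s2 = step(s, a)                          # (c)
            q_update(Q, s, a, r, s2, alpha, gamma)      # (d) direct RL
            model[(s, a)] = (r, s2)                     # (e) model learning
            for _ in range(n):                          # (f) planning
                ps = rng.choice(sorted({st for st, _ in model}))
                pa = rng.choice([ac for st, ac in model if st == ps])
                pr, ps2 = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2, alpha, gamma)
            s = s2
    return Q
```

Setting `n=0` turns this into plain one-step Q-learning, which makes it easy to compare the two on the same environment.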

Intuition

Planning lets the agent propagate value information much faster than real interaction alone. In the Dyna Maze example, an agent with $n = 50$ planning steps per real step reaches near-optimal performance in about 3 episodes, while a non-planning agent ($n = 0$) takes about 25.

8.3 When the Model Is Wrong

Models can be incorrect due to stochasticity, limited samples, or environment changes.

Dyna-Q+

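When the environment changes to become better (e.g., a shortcut opens), standard Dyna-Q might never find it: its model still shows no shortcut, so planning reinforces the old path and the greedy policy rarely leads the agent to where the change could be observed.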

Exploration Bonus: Encourage the agent to test long-untried actions.
Update planning reward as: $R + \kappa \sqrt{\tau}$, where $\tau$ is the number of time steps since the state-action pair was last tried in real interaction and $\kappa$ is a small positive constant.
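A minimal sketch of the bonus computation, using the square-root form $\kappa \sqrt{\tau}$ from Sutton and Barto's Dyna-Q+. The function name `bonus_reward`, the `last_tried` bookkeeping dictionary, and the value of `kappa` are assumptions for illustration.

```python
import math

def bonus_reward(model_reward, tau, kappa=1e-3):
    """Planning reward with the exploration bonus: r + kappa * sqrt(tau)."""
    return model_reward + kappa * math.sqrt(tau)

# Hypothetical bookkeeping: last_tried[(s, a)] is the time step of the last real visit.
last_tried = {("s0", "left"): 10, ("s0", "right"): 910}
now = 1010
for pair, t_last in last_tried.items():
    # The pair that has gone untried for longer receives the larger bonus.
    print(pair, bonus_reward(0.0, now - t_last))
```

The bonus is applied only to rewards used in planning updates, so a pair that has not been tried for a long time gradually looks more attractive until the agent tests it for real.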

8.4 Prioritized Sweeping

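Sampling state-action pairs uniformly at random, as Dyna-Q does during planning, is inefficient because many of the updates change nothing. Prioritized Sweeping keeps a priority queue of state-action pairs ordered by how much their values would change, and updates the most urgent ones first.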

Backward Focusing

Work backward from a state whose value changed significantly to its predecessor states.
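Backward focusing can be sketched with a priority queue. This is an illustrative version for a deterministic learned model, not a definitive implementation: the function name, the `model` and `predecessors` data layouts, and the default parameters are all assumptions. For brevity it allows duplicate queue entries rather than updating priorities in place.

```python
import heapq
from collections import defaultdict

def prioritized_sweeping_plan(Q, model, predecessors, actions, start_pair,
                              alpha=0.5, gamma=0.95, theta=1e-4, n=50):
    """Run up to n prioritized planning updates, seeded from start_pair.

    model[(s, a)] -> (r, s2) is a deterministic learned model (assumption).
    predecessors[s] holds the (s_prev, a_prev) pairs the model predicts lead to s.
    """
    def priority(s, a):
        r, s2 = model[(s, a)]
        return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

    queue = []                                   # max-heap via negated priorities
    p = priority(*start_pair)
    if p > theta:
        heapq.heappush(queue, (-p, start_pair))
    updates = 0
    while queue and updates < n:
        _, (s, a) = heapq.heappop(queue)
        r, s2 = model[(s, a)]
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        updates += 1
        # Backward focusing: re-examine every pair predicted to lead into s.
        for sp, ap in predecessors.get(s, ()):
            p = priority(sp, ap)
            if p > theta:
                heapq.heappush(queue, (-p, (sp, ap)))
    return updates
```

In a full agent this function would be called after each real step, with `start_pair` set to the state-action pair just experienced.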

8.5 Expected vs. Sample Updates

Expected Updates: Use the full distribution $p(s', r \mid s, a)$. Free of sampling error, but cost roughly $b$ times the computation of a sample update, where $b$ is the branching factor (the number of possible next states).
Sample Updates: Use a single sample $S', R$. Noisy, but about $b$ times cheaper.

Intuition

On problems with a large branching factor $b$, sample updates often reach a good value estimate faster than expected updates: in the time an expected update takes for one state-action pair, the agent can perform about $b$ sample updates, possibly spread over many different pairs.
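The two kinds of update can be compared side by side on a single state-action pair. This is a sketch under stated assumptions: the outcome format `(prob, next_state, reward)`, the constant `GAMMA`, and both function names are hypothetical, and the successor values `V` are taken as given.

```python
import random

GAMMA = 0.9

def expected_update(outcomes, V):
    """Full expectation over all b outcomes: sum of p * (r + gamma * V[s'])."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)

def sample_update(q, outcomes, V, alpha, rng):
    """Move the estimate q toward the target from one sampled outcome."""
    weights = [p for p, _, _ in outcomes]
    _, s2, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return q + alpha * (r + GAMMA * V[s2] - q)

# b = 4 equally likely successors with known (hypothetical) values.
V = [0.0, 1.0, 2.0, 3.0]
outcomes = [(0.25, s2, 0.0) for s2 in range(4)]

exact = expected_update(outcomes, V)   # touches all 4 successors once
rng = random.Random(0)
q = 0.0
for t in range(1, 5001):               # each sample update touches 1 successor
    q = sample_update(q, outcomes, V, 1.0 / t, rng)
```

With step size $1/t$ the sample updates form a running average of sampled targets, so `q` drifts toward `exact` as more samples arrive, while each individual update costs only one successor lookup.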

8.7 Real-time Dynamic Programming (RTDP)

RTDP is an on-policy trajectory-sampling version of value iteration.

It can skip Irrelevant States (states that cannot be reached from any start state under any optimal policy).
Under suitable conditions, it converges to a policy that is optimal on the relevant states without visiting every state infinitely often.
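A small sketch of the idea, assuming a known distribution model and an undiscounted shortest-path task. The three-state MDP in `MODEL`, the function names, and the trial count are hypothetical. Only states on the greedy trajectory are updated, so the detour state `X` is tried once and then abandoned with an inaccurate value.

```python
import random

# Hypothetical stochastic shortest-path MDP: every step costs 1, "G" is the goal.
MODEL = {
    "A": {"short": [(0.9, "G", -1.0), (0.1, "A", -1.0)],
          "long":  [(1.0, "X", -1.0)]},
    "X": {"on": [(1.0, "Y", -1.0)]},
    "Y": {"on": [(1.0, "G", -1.0)]},
}

def backup(V, s, a):
    """Expected (value-iteration style) update target for action a in state s."""
    return sum(p * (r + V[s2]) for p, s2, r in MODEL[s][a])

def rtdp(trials=50, start="A", goal="G", seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in list(MODEL) + [goal]}   # optimistic initial values
    for _ in range(trials):
        s = start
        while s != goal:
            a = max(MODEL[s], key=lambda act: backup(V, s, act))  # greedy action
            V[s] = backup(V, s, a)               # update only the visited state
            outcomes = MODEL[s][a]
            weights = [p for p, _, _ in outcomes]
            _, s, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return V
```

Here the value of the start state `A` converges to its optimal value, while `X` keeps the estimate from its single visit: RTDP spends no further computation on a state that no optimal policy reaches.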

8.8 Planning at Decision Time

Background Planning: (e.g., Dyna) Plan continuously to improve a global policy/value function.
Decision-time Planning: Begin planning after encountering a state $S_{t}$ to pick a single action $A_{t}$ (e.g., Heuristic Search).

8.11 Monte Carlo Tree Search (MCTS)

MCTS is a rollout algorithm that accumulates value estimates from its Monte Carlo simulations and uses them to direct later simulations toward the most promising parts of the search tree.

The 4 Phases of MCTS

Selection: Use a Tree Policy (e.g., UCB1) to traverse the tree from the root to a leaf.
Expansion: Add one or more child nodes to the tree.
Simulation: Perform a Rollout (using a simple policy) from the new node to the end of the episode.
Backup: Propagate the return of the simulation back up the tree to update node statistics.
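The four phases can be sketched on a toy single-agent task. Everything specific here is an assumption for illustration: the bit-picking task (`DEPTH`, `reward`), the `Node` class, the UCB1 exploration constant, and the choice to return the most-visited root action.

```python
import math
import random

DEPTH = 4                      # hypothetical toy task: pick DEPTH bits, one per step
ACTIONS = (0, 1)

def is_terminal(state):
    return len(state) == DEPTH

def reward(state):
    return sum(state) / DEPTH  # return is the fraction of 1-bits chosen

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}     # action -> Node
        self.visits, self.total = 0, 0.0

def ucb1(parent, child, c=1.4):
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_state=(), iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: ucb1(node, ch))
        # 2. Expansion: add one untried child.
        if not is_terminal(node.state):
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: random rollout policy to the end of the episode.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        ret = reward(state)
        # 4. Backup: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total += ret
            node = node.parent
    # Act on the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)
```

Calling `mcts()` returns the action chosen at the root. In decision-time use, the search would be rerun from the next state after that action is taken.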

MCTS in Go

MCTS allows evaluating moves in games with massive branching factors where global value approximation is difficult.

Summary: Dimensions of RL

The space of RL methods is defined by:

Depth of Update: One-step (TD) $⟷$ Full return (Monte Carlo).
Width of Update: Sample updates $⟷$ Expected updates (Dynamic Programming).
On-policy vs. Off-policy.
Real vs. Simulated Experience.
