Hierarchical Reinforcement Learning

Definition

Hierarchical Reinforcement Learning (HRL)

HRL decomposes a control problem into a hierarchy of policies operating at different levels of temporal abstraction. A high-level policy selects subgoals or temporally extended actions (skills), and a low-level policy executes primitive actions to fulfil them. The central object is the option $ω = ⟨ I_{ω}, π_{ω}, β_{ω} ⟩$: an action that, once invoked, can run for many primitive time steps before returning control. This turns a flat MDP over primitive actions into a Semi-Markov Decision Process (SMDP) over options.

Intuition

Why decompose into a hierarchy

A flat agent must learn long, brittle sequences of primitive actions, and exploration via ε-greedy jitter rarely strings together the hundreds of correct micro-decisions needed to reach a distant reward. HRL attacks this with temporal abstraction: the high level reasons in coarse, reusable chunks (“walk to the door”, “navigate to city X”), so a single high-level decision commits the agent to a consistent multi-step behaviour. This shortens the effective horizon the top level sees, makes exploration directed (the agent jumps between subgoals rather than wiggling), and lets learned skills transfer across tasks that share sub-behaviours.

In the driving example from RL-ES01 - Exercise Set Week 1, a low-level controller learns “how to drive” (accelerator/brake), while a high-level controller learns “where to go”; combining the two is exactly HRL.

Mathematical Formulation

An option $ω$ over an MDP is the triple

$ω = ⟨ I_{ω}, π_{ω}, β_{ω} ⟩$

where:

$I_{ω} \subseteq S$ — initiation set, the states in which $ω$ may be started
$π_{ω} (a ∣ s)$ — the option’s internal (low-level) policy over primitive actions
$β_{ω} (s) \in [0, 1]$ — termination condition, the probability the option ends in state $s$
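
The triple above maps directly onto a small data structure. The following is a minimal sketch, assuming finite hashable states and a deterministic internal policy for brevity (the formal definition allows a stochastic $π_{ω}$). The names `Option`, `can_start`, `should_terminate` and `primitive_as_option` are illustrative and not taken from any library.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class Option:
    name: str
    initiation_set: FrozenSet[State]        # I_ω: states where the option may start
    policy: Callable[[State], Action]       # π_ω: deterministic here (an assumption)
    termination: Callable[[State], float]   # β_ω(s): probability of ending in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

    def should_terminate(self, s: State, rng: random.Random) -> bool:
        # Sample the Bernoulli termination event with probability β_ω(s).
        return rng.random() < self.termination(s)


def primitive_as_option(action: Action, states) -> Option:
    """Wrap a primitive action as a one-step option: available everywhere, β ≡ 1."""
    return Option(
        name=f"prim:{action}",
        initiation_set=frozenset(states),
        policy=lambda s: action,
        termination=lambda s: 1.0,
    )


# Hypothetical corridor option: keep moving right until state 3 is reached.
go_right = Option(
    name="go_right",
    initiation_set=frozenset({0, 1, 2}),
    policy=lambda s: "right",
    termination=lambda s: 1.0 if s == 3 else 0.0,
)
```

`primitive_as_option` makes the "primitive action = one-step option" view concrete: the wrapped action can start anywhere and always terminates after a single step.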

The high-level policy $μ (ω ∣ s)$ chooses options. Because an option runs for a random number of steps $k$, the system is an SMDP. The Bellman equation for the option-value function $Q_{μ} (s, ω)$ uses multi-step, discounted option models:

$Q_{μ} (s, ω) = \sum_{s^{'}, k} P (s^{'}, k ∣ s, ω) [r (s, ω) + γ^{k} \sum_{ω^{'}} μ (ω^{'} ∣ s^{'}) Q_{μ} (s^{'}, ω^{'})]$

where:

$r (s, ω) = E [R_{t + 1} + γ R_{t + 2} + \dots + γ^{k - 1} R_{t + k} ∣ s, ω]$ — expected accumulated reward while $ω$ executes
$k$ — (random) number of primitive steps until $β_{ω}$ triggers termination
$γ^{k}$ — discount applied across the whole option duration, not a single step
$P (s^{'}, k ∣ s, ω)$ — joint probability of terminating in $s^{'}$ after exactly $k$ steps
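
To see how the duration-dependent discount $γ^{k}$ enters, the backup can be evaluated by hand for a tiny option model. This is only a sketch with made-up numbers: the function name `option_backup` and the dictionary layout of the model are assumptions for illustration.

```python
def option_backup(model, r, gamma, mu, Q):
    """One Bellman backup for Q_mu(s, ω) from an explicit option model.

    model: {(s_next, k): P(s_next, k | s, ω)}
    r:     expected discounted reward r(s, ω) collected while ω runs
    mu:    {s_next: {option: probability}}  high-level policy
    Q:     {(s_next, option): value}        current option-value estimates
    """
    total = 0.0
    for (s_next, k), prob in model.items():
        # Value of the state where the option ends, under the high-level policy.
        v_next = sum(mu[s_next][o] * Q[(s_next, o)] for o in mu[s_next])
        # Discount by gamma**k because k primitive steps elapsed.
        total += prob * (r + gamma ** k * v_next)
    return total


# Toy numbers: the option ends in state "B" after 2 or 4 steps with equal probability.
model = {("B", 2): 0.5, ("B", 4): 0.5}
mu = {"B": {"x": 1.0}}
Q = {("B", "x"): 8.0}
q = option_backup(model, r=1.0, gamma=0.5, mu=mu, Q=Q)
print(q)  # 2.25
```

The two outcomes contribute $0.5 (1 + 0.5^{2} \cdot 8) = 1.5$ and $0.5 (1 + 0.5^{4} \cdot 8) = 0.75$: the longer execution is discounted more heavily even though it ends in the same state.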

SMDP Q-learning update, applied to the high level once each time an option terminates (intra-option methods instead update at every primitive step while the option runs):

$Q (s_{t}, ω) \leftarrow Q (s_{t}, ω) + α [\underbrace{r + γ^{k} \max_{ω^{'}} Q (s_{t + k}, ω^{'})}_{\text{SMDP TD target}} - Q (s_{t}, ω)]$

where $r$ is the accumulated discounted reward over the $k$ steps the option ran and $s_{t + k}$ is the state at termination. The low-level $π_{ω}$ is trained separately, typically on an intrinsic/subgoal reward $r^{int}$ rather than the environment reward.
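
A single update can be traced numerically. The sketch below assumes a tabular $Q$ stored in a dictionary and takes the list of rewards observed while the option ran; `smdp_q_update` and its parameters are hypothetical names chosen for this example.

```python
from collections import defaultdict


def smdp_q_update(Q, s, option, rewards, s_end, options_at_end,
                  alpha, gamma, terminal=False):
    """Apply one SMDP Q-learning update after `option` ran from s to s_end."""
    k = len(rewards)                                        # option duration in steps
    r = sum(gamma ** i * R for i, R in enumerate(rewards))  # discounted in-option return
    # No bootstrapping from a terminal state.
    bootstrap = 0.0 if terminal else max(Q[(s_end, o)] for o in options_at_end)
    target = r + gamma ** k * bootstrap
    Q[(s, option)] += alpha * (target - Q[(s, option)])
    return target


Q = defaultdict(float)
Q[("s3", "a")] = 8.0
Q[("s3", "b")] = 2.0
target = smdp_q_update(Q, "s0", "go", rewards=[0.0, 0.0, 4.0], s_end="s3",
                       options_at_end=["a", "b"], alpha=0.5, gamma=0.5)
print(target, Q[("s0", "go")])  # 2.0 1.0
```

Here $r = 0.5^{2} \cdot 4 = 1$, the bootstrap term is $0.5^{3} \cdot 8 = 1$, so the target is 2 and the estimate moves halfway from 0 to 1.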

Key Properties / Variants

Options framework (Sutton, Precup & Singh): the canonical formalism above; a primitive action is just an option with $β \equiv 1$ that lasts one step, so HRL strictly generalizes flat RL.
Feudal / goal-conditioned HRL (FeUdal Networks, HIRO): the high-level policy emits a goal vector $g_{t}$ every $c$ steps; the low-level policy is goal-conditioned $π (a ∣ s, g)$ and rewarded for reaching $g_{t}$. HIRO relabels the goals in stored manager transitions (an off-policy correction) so that they remain valid as the worker changes.
Option-Critic: learns option policies $π_{ω}$ and terminations $β_{ω}$ end-to-end with Policy Gradients — no hand-designed subgoals.
Benefits: directed exploration over a shorter effective horizon; transfer and reuse of skills across tasks; easier learning under sparse rewards, and some relief from the curse of dimensionality because each level solves a smaller problem.
Difficulties: non-stationarity (the low level shifts under the high level during joint training); discovering useful subgoals/options automatically is hard; defining good intrinsic rewards and termination is delicate.
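
For the goal-conditioned variant, the worker's intrinsic reward needs a concrete form. One common choice, assumed here in the style of HIRO's relative goals, treats $g_{t}$ as a desired change in state, rewards the worker with the negative distance to the implied target, and re-expresses the goal after each step so that the absolute target stays fixed between manager decisions. The function names are illustrative, and states are plain tuples of floats.

```python
import math


def intrinsic_reward(s, g, s_next):
    """Negative Euclidean distance between the target s + g and the reached s_next."""
    return -math.sqrt(sum((si + gi - ni) ** 2 for si, gi, ni in zip(s, g, s_next)))


def goal_transition(s, g, s_next):
    """Re-express the goal relative to s_next so the absolute target s + g is unchanged."""
    return tuple(si + gi - ni for si, gi, ni in zip(s, g, s_next))


s, g = (0.0, 0.0), (3.0, 4.0)          # target is the point (3, 4)
print(intrinsic_reward(s, g, s))        # -5.0: the worker did not move
print(goal_transition(s, g, (1.0, 1.0)))  # (2.0, 3.0): remaining displacement
```

Because the reward depends only on the worker's own progress towards $g_{t}$, the low level can learn even when the environment reward is sparse.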

Algorithm: SMDP Q-Learning over Options (high-level control)
────────────────────────────────────────────────────────────
Initialize Q(s, ω) for all states s and options ω
 
Loop for each episode:
  Initialize S
  Loop until S terminal:
    Choose option ω from S using policy from Q   (e.g. ε-greedy over ω ∈ available(S))
    S_start ← S                                    # state in which the option was invoked
    r ← 0;  τ ← 0                                  # accumulated discounted reward, elapsed steps
    Loop (execute the option):
      Choose primitive A ~ π_ω(·|S)
      Take A, observe R, S'
      r ← r + γ^τ · R
      τ ← τ + 1
      S ← S'
    until terminate with prob. β_ω(S)  or  S terminal
    # high-level (SMDP) update spanning the whole option; use max_ω' Q(S, ω') = 0 if S is terminal
    Q(S_start, ω) ← Q(S_start, ω) + α [ r + γ^τ · max_ω' Q(S, ω') − Q(S_start, ω) ]
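
The loop above can be run end to end on a toy problem. The sketch below assumes a hypothetical seven-state corridor (states 0 to 6, reward 1 on reaching state 6) with two hand-written options that walk left or right and terminate at states divisible by 3; the environment, option set and hyperparameters are all illustrative choices.

```python
import random
from collections import defaultdict

N = 6                                  # corridor states 0..N; N is the terminal goal
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1


def step(s, a):
    """Primitive dynamics: a is -1 (left) or +1 (right); reward 1 on reaching N."""
    s_next = min(max(s + a, 0), N)
    return (1.0 if s_next == N else 0.0), s_next


# Each option is a pair (internal policy π_ω, termination probability β_ω).
OPTIONS = {
    "left": (lambda s: -1, lambda s: 1.0 if s % 3 == 0 else 0.0),
    "right": (lambda s: +1, lambda s: 1.0 if s % 3 == 0 else 0.0),
}


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        while s != N:
            # ε-greedy choice over options (both are available in every state here).
            if rng.random() < EPS:
                w = rng.choice(list(OPTIONS))
            else:
                w = max(OPTIONS, key=lambda o: Q[(s, o)])
            pi, beta = OPTIONS[w]
            s_start, r, tau = s, 0.0, 0
            # Execute the option until it terminates or the episode ends.
            while True:
                R, s = step(s, pi(s))
                r += GAMMA ** tau * R
                tau += 1
                if s == N or rng.random() < beta(s):
                    break
            # SMDP update spanning the whole option; no bootstrap at the goal.
            bootstrap = 0.0 if s == N else max(Q[(s, o)] for o in OPTIONS)
            Q[(s_start, w)] += ALPHA * (r + GAMMA ** tau * bootstrap - Q[(s_start, w)])
    return Q


Q = train()
greedy = {s: max(OPTIONS, key=lambda o: Q[(s, o)]) for s in (0, 3)}
print(greedy)
```

Only states 0 and 3 ever act as decision points, since both options terminate there or at the goal. With these settings the greedy option in both states should come out as `right`, and the high level needs just two decisions to cross a six-step corridor.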

Connections

Generalizes: Markov Decision Process — flat MDP becomes an SMDP over options (primitive action = 1-step option)
Builds on: Q-Learning / Temporal Difference Learning (the SMDP Q-update), Discount Factor (applied over option durations)
Low-level skills trained via: Policy Gradient / Actor-Critic (e.g. Option-Critic)
Addresses: Exploration vs Exploitation under sparse, long-horizon rewards; the curse of dimensionality
Contrast: a flat Optimal Policy over primitive actions vs. a hierarchy of policies

Appears In

RL-ES01 - Exercise Set Week 1
