Hierarchical Reinforcement Learning
Definition
Hierarchical Reinforcement Learning (HRL)
HRL decomposes a control problem into a hierarchy of policies operating at different levels of temporal abstraction. A high-level policy selects subgoals or temporally-extended actions (skills), and a low-level policy executes primitive actions to fulfil them. The central object is the option : an action that, once invoked, runs for many primitive time steps before returning control. This turns a flat MDP over primitive actions into a Semi-Markov Decision Process (SMDP) over options.
Intuition
Why decompose into a hierarchy
A flat agent must learn long, brittle sequences of primitive actions, and exploration via ε-greedy jitter rarely strings together the hundreds of correct micro-decisions needed to reach a distant reward. HRL attacks this with temporal abstraction: the high level reasons in coarse, reusable chunks (“walk to the door”, “navigate to city X”), so a single high-level decision commits the agent to a consistent multi-step behaviour. This shortens the effective horizon the top level sees, makes exploration directed (the agent jumps between subgoals rather than wiggling), and lets learned skills transfer across tasks that share sub-behaviours.
In the RL-ES01 - Exercise Set Week 1 driving example: a low-level controller learns “how to drive” (accelerator/brake), while a high-level controller learns “where to go” — the hybrid is exactly HRL.
Mathematical Formulation
An option over an MDP is the triple
where:
- — initiation set, the states in which may be started
- — the option’s internal (low-level) policy over primitive actions
- — termination condition, the probability the option ends in state
The high-level policy chooses options. Because an option runs for a random number of steps , the system is an SMDP. The Bellman equation for the option-value function uses multi-step, discounted option models:
where:
- — expected accumulated reward while executes
- — (random) number of primitive steps until triggers termination
- — discount applied across the whole option duration, not a single step
- — joint probability of terminating in after exactly steps
Intra-option / SMDP Q-learning update (learning the high level while options run):
where is the accumulated discounted reward over the steps the option ran and is the state at termination. The low-level is trained separately, typically on an intrinsic/subgoal reward rather than the environment reward.
Key Properties / Variants
- Options framework (Sutton, Precup & Singh): the canonical formalism above; a primitive action is just an option with that lasts one step, so HRL strictly generalizes flat RL.
- Feudal / goal-conditioned HRL (FeUdal Networks, HIRO): the high-level policy emits a goal vector every steps; the low-level policy is goal-conditioned and rewarded for reaching . HIRO uses off-policy goal relabelling to make manager transitions valid as the worker changes.
- Option-Critic: learns option policies and terminations end-to-end with Policy Gradients — no hand-designed subgoals.
- Benefits: directed exploration over a shorter effective horizon; transfer and reuse of skills across tasks; mitigates the curse of dimensionality / sparse rewards.
- Difficulties: non-stationarity (the low level shifts under the high level during joint training); discovering useful subgoals/options automatically is hard; defining good intrinsic rewards and termination is delicate.
Algorithm: SMDP Q-Learning over Options (high-level control)
────────────────────────────────────────────────────────────
Initialize Q(s, ω) for all states s and options ω
Loop for each episode:
Initialize S
Loop until S terminal:
Choose option ω from S using policy from Q (e.g. ε-greedy over ω ∈ available(S))
r ← 0; τ ← 0 # accumulated reward, elapsed steps
Loop (execute the option):
Choose primitive A ~ π_ω(·|S)
Take A, observe R, S'
r ← r + γ^τ · R
τ ← τ + 1
S ← S'
until terminate with prob. β_ω(S) or S terminal
# high-level (SMDP) update spanning the whole option
Q(S_start, ω) ← Q(S_start, ω) + α [ r + γ^τ · max_ω' Q(S, ω') − Q(S_start, ω) ]
S_start ← SConnections
- Generalizes: Markov Decision Process — flat MDP becomes an SMDP over options (primitive action = 1-step option)
- Builds on: Q-Learning / Temporal Difference Learning (the SMDP Q-update), Discount Factor (applied over option durations)
- Low-level skills trained via: Policy Gradient / Actor-Critic (e.g. Option-Critic)
- Addresses: Exploration vs Exploitation under sparse, long-horizon rewards; the curse of dimensionality
- Contrast: a flat Optimal Policy over primitive actions vs. a hierarchy of policies