RL-Book Chapter 17: Frontiers

Overview

Chapter 17 explores the “frontiers” of Reinforcement Learning, touching on topics that extend beyond the standard Markov Decision Process framework. It covers general value functions, temporal abstraction (options), the critical transition from states to observations (partial observability), reward design, and the open challenges facing the field.


17.1 General Value Functions and Auxiliary Tasks

The concept of a Value Function is generalized to General Value Functions (GVFs), which predict arbitrary signals rather than just reward.

Cumulant

The signal being predicted in a GVF is called the cumulant, denoted C_t.

General Value Function (GVF)

v(s) ≐ E[ Σ_{k=t}^{∞} ( Π_{i=t+1}^{k} γ(S_i) ) C_{k+1} | S_t = s, actions follow π ]

where γ : S → [0, 1] is a state-dependent termination/Discount Factor function.

  • Auxiliary Tasks: Learning to predict multiple signals (e.g., pixel changes, future rewards) forces the agent to develop robust internal representations that often accelerate learning on the main task.
  • Pavlovian Control: Built-in reflexes triggered by learned predictions (e.g., a robot heading to a charger when it predicts low battery).
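
As a concrete illustration, a GVF can be learned with the same TD machinery as an ordinary value function, with the cumulant in place of the reward and a state-dependent γ(s) in place of a fixed discount. A minimal tabular sketch; the three-state chain, cumulant values, and gamma function below are hypothetical:

```python
# Sketch: learning a GVF prediction with tabular TD(0).
# The cumulant signal and the state-dependent discount gamma(s) are
# hypothetical stand-ins; the chapter defines the GVF abstractly.

def td0_gvf_update(v, s, s_next, cumulant, gamma, alpha=0.1):
    """One TD(0) update toward the GVF target C + gamma(s') * v(s')."""
    target = cumulant + gamma(s_next) * v[s_next]
    v[s] += alpha * (target - v[s])
    return v

# Example: predict an arbitrary "battery drain" signal on a 3-state chain.
v = {0: 0.0, 1: 0.0, 2: 0.0}
gamma = lambda s: 0.0 if s == 2 else 0.9   # state 2 terminates the prediction
for _ in range(1000):
    for s, s_next, c in [(0, 1, 1.0), (1, 2, 2.0)]:
        td0_gvf_update(v, s, s_next, c, gamma)
```

With the cumulant set to the reward and γ constant, this reduces to the ordinary TD(0) value update.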

17.2 Temporal Abstraction via Options

The MDP framework can be applied at multiple time scales. Options formalize higher-level actions that persist over multiple time steps.

Option

An option ω is a pair ⟨π_ω, γ_ω⟩ consisting of a Policy π_ω and a state-dependent termination function γ_ω.

  • Option Models: Consist of two parts:
    1. Reward part r(s, ω): Expected cumulative reward during execution.
    2. State-transition part p(s′ | s, ω): The discounted probability of terminating in a specific state.
  • Bellman Equation for Options:

    v(s) = max_ω [ r(s, ω) + Σ_{s′} p(s′ | s, ω) v(s′) ]
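
Given option models, the Bellman equation for options can be solved by ordinary value iteration. A minimal sketch, assuming (as in the chapter) that discounting is folded into the transition part p; the two-state models and option names are made up:

```python
# Sketch: value iteration with option models, following the Bellman
# equation for options  v(s) = max_w [ r(s,w) + sum_s' p(s'|s,w) v(s') ].
# The tiny two-state models are hypothetical; discounting is assumed
# to be folded into the transition part p.

def option_value_iteration(states, options, r, p, sweeps=100):
    """r[s][w]: expected reward during the option;
    p[s][w][s']: discounted probability of terminating in s'."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            v[s] = max(r[s][w] + sum(p[s][w][sp] * v[sp] for sp in states)
                       for w in options)
    return v

states, options = [0, 1], ["stay", "go"]
r = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.0}}
p = {0: {"stay": {0: 0.9, 1: 0.0}, "go": {0: 0.0, 1: 0.9}},
     1: {"stay": {0: 0.0, 1: 0.9}, "go": {0: 0.9, 1: 0.0}}}
v = option_value_iteration(states, options, r, p)
```

Because each option's model already accounts for its duration, the planning loop is identical to one-step value iteration over the higher-level actions.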

17.3 Observations and State (Detailed)

Throughout the book, it was assumed the agent perceives the environment’s state directly. In reality, agents receive observations which provide only partial information.

Partial Observability

While Function Approximation can implicitly handle some partial observability (by choosing parameters that don’t depend on hidden variables), an explicit treatment is required for complex environments.

History and State

  • History (): The sequence of all past actions and observations: .
  • State (): A compact summary of the history, .

Markov State

A state is Markov if it summarizes all information in the history necessary for predicting future observations. Formally, f(h) = f(h′) implies P(τ | h) = P(τ | h′) for any future test τ (a hypothetical sequence of future actions and observations).

The State-Update Function

To remain compact and computable, the state must be updated incrementally:

S_{t+1} ≐ u(S_t, A_t, O_{t+1})

where u is the state-update function.
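
The incremental form S_{t+1} = u(S_t, A_t, O_{t+1}) can be sketched in a few lines. The exponential trace of observations used here is one hypothetical choice of u; in general u could be a belief update or a recurrent network, and would typically use the action as well:

```python
# Sketch: an incremental state-update function S_{t+1} = u(S_t, A_t, O_{t+1}).
# Here the state is a simple exponential trace of observations -- a
# hypothetical choice; this particular u ignores the action.

def u(state, action, obs, decay=0.8):
    """Fold the newest observation into a fixed-size state vector."""
    return tuple(decay * s + (1 - decay) * o for s, o in zip(state, obs))

state = (0.0, 0.0)
for action, obs in [(0, (1.0, 0.0)), (1, (0.0, 1.0))]:
    state = u(state, action, obs)   # constant memory, one pass per step
```

The key property is that the state stays a fixed size no matter how long the history grows.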

Two Approaches to Representation

  1. POMDPs (Partially Observable MDPs):

    • Assumes a “latent” environment state that is never directly seen.
    • The agent maintains a Belief State: a probability distribution over possible latent states.
    • Updated via Bayes’ Rule:

    Belief State Update

    b_{t+1}(s′) ∝ p(O_{t+1} | s′, A_t) Σ_s p(s′ | s, A_t) b_t(s)

    • Critique: Scales poorly and relies on unobservable semantics (the latent states are never observed).
  2. PSRs (Predictive State Representations):

    • Grounded in observable data.
    • State is defined as a vector of probabilities for “core tests” (specific future action-observation sequences).
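
The POMDP belief update follows directly from Bayes' rule: predict the next latent state through the transition model, then reweight by how well each candidate explains the new observation. A minimal sketch; the two-latent-state transition and observation tables below are hypothetical:

```python
# Sketch: one Bayes'-rule belief update for a two-latent-state POMDP,
#   b'(s') ∝ p(o | s', a) * sum_s p(s' | s, a) * b(s).
# The transition and observation tables are made up for illustration.

def belief_update(b, a, o, trans, obs):
    """trans[a][s][s']: latent transition; obs[s'][o]: observation model."""
    unnorm = {sp: obs[sp][o] * sum(trans[a][s][sp] * b[s] for s in b)
              for sp in b}
    z = sum(unnorm.values())            # normalizing constant
    return {sp: p / z for sp, p in unnorm.items()}

trans = {"a": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}}
obs = {0: {"beep": 0.8, "quiet": 0.2}, 1: {"beep": 0.3, "quiet": 0.7}}
b = belief_update({0: 0.5, 1: 0.5}, "a", "beep", trans, obs)
```

The scaling critique is visible even here: the belief is a distribution over all latent states, so the update touches every (s, s′) pair at every step.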

Approximate State

In practice, S_t = f(H_t) is rarely perfectly Markov. Common heuristics:

  • Immediate Observation: S_t ≐ O_t.
  • k-th Order History: S_t ≐ (O_t, A_{t−1}, O_{t−1}, …, A_{t−k}, O_{t−k}).
  • Feature-based: Multiple GVFs/auxiliary tasks provide features for the state representation.
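
The k-th order history heuristic amounts to keeping a sliding window of the most recent (action, observation) pairs. A minimal sketch (the class and names are illustrative):

```python
# Sketch: the k-th order history heuristic -- the approximate state is
# the last k (action, observation) pairs.  Names here are illustrative.
from collections import deque

class KthOrderHistory:
    def __init__(self, k):
        self.window = deque(maxlen=k)   # older pairs fall off automatically

    def update(self, action, obs):
        self.window.append((action, obs))
        return tuple(self.window)       # hashable approximate state

h = KthOrderHistory(k=2)
h.update("left", "wall")
state = h.update("right", "door")      # keeps only the last 2 pairs
```

With k = 1 and the action dropped, this degenerates to the immediate-observation heuristic.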

The Heuristic of Representation

A state that is good for predicting many different things (auxiliary tasks) is likely to contain the information necessary for predicting reward and making optimal decisions.


17.4 Designing Rewards

Reinforcement Learning depends heavily on the reward signal, which is the designer’s way of communicating the goal.

  • Sparse Rewards: The “plateau problem” where the agent wanders without feedback.
  • Value Function Initialization: A way to guide learning without changing the reward signal: initialize the approximate value function to an informed guess v_0(s) rather than to zero.
  • Shaping: Gradually changing the reward signal or task dynamics to lead the agent toward the goal (Skinner).
  • Inverse RL: Learning the reward signal by observing an expert’s behavior.
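
One well-known way to add guidance of the shaping kind while provably leaving the set of optimal policies unchanged is potential-based shaping, where the bonus is a potential difference γΦ(s′) − Φ(s). A minimal sketch with a hypothetical distance-based potential:

```python
# Sketch: potential-based reward shaping -- the agent receives the
# original reward plus gamma*phi(s') - phi(s).  The potential phi
# (a guess at distance-to-goal) is a hypothetical example.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Original reward plus the potential difference gamma*phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)

phi = lambda s: -abs(10 - s)            # guess: closer to state 10 is better
r = shaped_reward(0.0, s=4, s_next=5, phi=phi, gamma=1.0)   # step toward goal
```

Because the bonuses telescope along any trajectory, the shaped problem ranks policies the same way as the original one, which sidesteps the loophole risk described above for ad-hoc reward edits.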

Reward Design Risks

Agents may find “loopholes” or unexpected ways to maximize reward that violate the designer’s intent.


17.5 Remaining Issues

  1. Incremental Deep Learning: Overcoming “catastrophic interference” in online settings.
  2. Representation Learning: Automating the construction of the state-update function.
  3. Planning with Learned Models: Scaling Dyna-like architectures to complex function approximation.
  4. Automated Task Selection: How agents can choose their own subgoals/auxiliary tasks.
  5. Curiosity: Using “intrinsic reward” to drive exploration and learning progress.

17.6 Future of AI

  • Complete Agents: Transitioning from superhuman performance in narrow domains to interactive, generalist agents.
  • Safety: Ensuring optimization doesn’t lead to dangerous unintended consequences.
  • Prometheus vs. Pandora: AI as a tool that can either solve global challenges (fire) or release new perils (the box).

Summary

Chapter 17 shifts from how to learn given a state, to what to learn (GVFs, Options) and how to represent the environment (Observations, State-Update). The fundamental message is that the future of RL lies in an agent-centric view where representation, tasks, and goals are discovered and curated by the agent itself.