RL Lecture 1: Introduction, MDPs & Bandits

1. Introduction to Reinforcement Learning

Reinforcement Learning (RL)

Reinforcement Learning is a computational approach to learning from interaction. It focuses on goal-directed learning where an agent learns what to do—how to map situations to actions—so as to maximize a numerical reward signal.

1.1 Key Characteristics

Unlike other machine learning paradigms, RL is distinguished by:

  1. Trial-and-error search: The learner is not told which actions to take but must discover which yield the most reward by trying them.
  2. Delayed reward: Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

1.2 RL vs. Other Paradigms

  • Supervised Learning: Learning from a training set of labeled examples provided by a supervisor. In RL, the agent must learn from its own experience in uncharted territory.
  • Unsupervised Learning: Finding hidden structure in unlabeled data. RL is focused on maximizing a reward signal rather than finding structure.

1.3 Elements of Reinforcement Learning

Beyond the agent and environment, we identify four main sub-elements:

  1. Policy: A mapping from perceived states of the environment to actions to be taken (the agent’s behavior).
  2. Reward Signal: Defines the goal; a single number sent by the environment at each time step that the agent seeks to maximize in the long run.
  3. Value Function: Specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
  4. Model of the Environment: (Optional) Something that mimics the behavior of the environment, used for planning.

Rewards vs. Values

Rewards are immediate (like pleasure/pain), while values are farsighted (long-term desirability). We seek actions that lead to states of highest value, not necessarily highest immediate reward.


2. Multi-Armed Bandits

The k-armed bandit problem is a simplified RL setting that involves only a single state (nonassociative), focusing purely on the evaluative aspect of feedback.

2.1 The Problem Formulation

You are faced repeatedly with a choice among different actions. After each choice, you receive a numerical reward from a stationary probability distribution.

Action Value

The true value of an action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$

where:

  • $q_*(a)$: True value of action $a$.
  • $A_t$: Action selected at time step $t$.
  • $R_t$: Reward received at time step $t$.
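As a concrete illustration, a stationary k-armed bandit environment can be sketched as follows (the class name and structure are my own, not from the lecture; each arm's true value is hidden from the agent):

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each arm's reward is drawn from a
    normal distribution centered on a fixed, hidden true value q_*(a)."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        # True action values q_*(a), unknown to the agent.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Select arm a; return a noisy reward R_t around q_*(a)."""
        return self.rng.gauss(self.q_star[a], 1.0)

bandit = KArmedBandit(k=10)
reward = bandit.pull(3)  # one sample of R_t given A_t = 3
```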

2.2 Action-Value Methods

We estimate $q_*(a)$ with $Q_t(a)$ using the sample-average method:

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$
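The sample-average estimate can be computed directly from a log of past actions and rewards — a minimal sketch with hypothetical variable names, returning 0.0 for never-tried actions (a common default):

```python
def sample_average(actions, rewards, a):
    """Q_t(a): mean of the rewards received on steps where action a
    was selected; 0.0 if a has never been selected."""
    selected = [r for act, r in zip(actions, rewards) if act == a]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)

# Action 0 was taken twice, earning rewards 1.0 and 3.0.
q = sample_average([0, 1, 0], [1.0, 5.0, 3.0], a=0)  # → 2.0
```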

2.3 Incremental Implementation

To avoid storing all rewards, we update the average incrementally:

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

Incremental Update Rule

General form:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
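In code, the incremental rule needs only the current estimate and a count — a sketch (function name mine) showing that it reproduces the plain sample average:

```python
def incremental_update(q, n, reward):
    """One step of the sample-average update:
    Q_{n+1} = Q_n + (1/n) * (R_n - Q_n),
    where n counts selections of this action, including the current one."""
    return q + (reward - q) / n

# Feeding rewards 2.0, 4.0, 6.0 one at a time recovers their mean.
q = 0.0
for n, r in enumerate([2.0, 4.0, 6.0], start=1):
    q = incremental_update(q, n, r)
# q == 4.0, the sample average, without storing the reward list
```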

2.4 Exploration vs. Exploitation

  • Exploitation: Selecting the greedy action (the one with the highest $Q_t(a)$) to maximize immediate reward.
  • Exploration: Selecting non-greedy actions to improve estimates of their values.

Epsilon-Greedy Policy

An $\epsilon$-greedy policy selects the greedy action with probability $1 - \epsilon$, and a uniformly random action with probability $\epsilon$.
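A minimal $\epsilon$-greedy action selector might look like this (function name and tie-breaking rule are my own choices):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action;
    otherwise pick the greedy action argmax_a Q(a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties broken by the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])

a = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.1)  # usually 1, sometimes random
```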

Optimistic Initial Values

Setting initial estimates to a high value (higher than any likely reward) encourages exploration. The agent is “disappointed” by initial rewards and tries all actions before settling.
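Optimistic initialization changes only the starting estimates; the update rule stays the same. A sketch (the value +5.0 is my illustrative choice of "higher than any likely reward"):

```python
k = 10
q_optimistic = [5.0] * k   # far above any realistic reward
counts = [0] * k

def update(a, reward):
    """Sample-average update. Early rewards pull the optimistic
    estimate down, so untried (still-optimistic) arms look best next."""
    counts[a] += 1
    q_optimistic[a] += (reward - q_optimistic[a]) / counts[a]

update(0, 1.0)  # arm 0 is now "disappointing" relative to untried arms
best = max(range(k), key=lambda a: q_optimistic[a])  # an untried arm
```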

Upper Confidence Bound (UCB)

UCB selects actions based on both their estimated value and the uncertainty in that estimate.

UCB Action Selection

$$A_t \doteq \underset{a}{\arg\max}\left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where:

  • $t$: Total number of time steps so far.
  • $N_t(a)$: Number of times action $a$ has been selected prior to time $t$.
  • $c > 0$: Confidence level (controls the degree of exploration).
  • The square-root term measures the uncertainty (variance) in the estimate of $a$'s value.
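A direct translation of the UCB rule into code might look like this (treating a never-tried arm as maximally uncertain, a common convention; names are mine):

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].
    An arm with N(a) == 0 is treated as maximally uncertain
    and selected immediately."""
    best_a, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_values, counts)):
        if n == 0:
            return a  # untried arms first
        score = q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# The rarely tried arm (count 1) wins despite a middling estimate.
a = ucb_action([0.4, 0.6, 0.5], counts=[10, 10, 1], t=21)  # → 2
```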

3. Markov Decision Processes (MDPs)

Markov Decision Process (MDP)

A formalization of sequential decision-making where actions influence not just immediate rewards, but also subsequent states. An MDP is defined by its states, actions, rewards, and dynamics.

3.1 The Agent-Environment Interface

  • At each time step $t$, the agent receives a representation of the environment's state $S_t$.
  • The agent selects an action $A_t$.
  • The environment responds with a reward $R_{t+1}$ and a new state $S_{t+1}$.

Agent-Environment Interaction

The process generates a trajectory:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

```mermaid
graph LR
    Agent -- Action A_t --> Environment
    Environment -- Reward R_{t+1} --> Agent
    Environment -- State S_{t+1} --> Agent
```

3.2 Transition Dynamics

The dynamics of a finite MDP are completely defined by the probability:

Dynamics Function

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

This function defines the probability of transitioning to state $s'$ with reward $r$, given the current state $s$ and action $a$.
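For a small finite MDP, the dynamics function can be stored as a table mapping $(s, a)$ to a list of (probability, next state, reward) outcomes — a toy two-state example of my own making:

```python
# p(s', r | s, a) for a toy 2-state MDP:
# dynamics[(s, a)] = list of (probability, next_state, reward).
dynamics = {
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
}

def check_valid(dyn):
    """Outcome probabilities for each (s, a) pair must sum to 1."""
    return all(abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
               for outcomes in dyn.values())

assert check_valid(dynamics)
```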

3.3 Goals and Returns

The agent’s goal is to maximize the expected return $G_t$.

Discounted Return

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma \in [0, 1]$ is the discount factor.

  • $\gamma = 0$: Agent is “myopic” (maximizes only the immediate reward).
  • $\gamma \to 1$: Agent is “farsighted” (weights future rewards heavily).
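For a finite reward sequence, the discounted return can be accumulated backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ — a small sketch (names mine):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma = 0$ only the first reward survives, matching the "myopic" case above.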

4. Value Functions & Bellman Equations

4.1 Policies

A policy $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.

4.2 State-Value and Action-Value Functions

  • State-value function $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$: Expected return starting from state $s$ and following policy $\pi$ thereafter.
  • Action-value function $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$: Expected return starting from $s$, taking action $a$, then following $\pi$.

4.3 The Bellman Equation for $v_\pi$

The Bellman Equation expresses a recursive relationship between the value of a state and its successor states.

Bellman Equation

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$

Intuition: The value of the current state is the expected immediate reward plus the discounted value of the next state, averaged over all possible actions and outcomes.
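The Bellman equation translates directly into iterative policy evaluation: repeatedly replace each $v(s)$ by its right-hand side until the values stop changing. A sketch over a tabular MDP, with dynamics stored as `(probability, next_state, reward)` outcome lists (the function signature and toy MDP are my own):

```python
def policy_evaluation(states, actions, dynamics, policy, gamma, tol=1e-8):
    """Iterative policy evaluation via the Bellman equation:
    v(s) <- sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                policy[(s, a)] * sum(p * (r + gamma * v[s2])
                                     for p, s2, r in dynamics[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Toy 1-state MDP: always receive reward 1 and stay put.
dyn = {("s", "a"): [(1.0, "s", 1.0)]}
v = policy_evaluation(["s"], ["a"], dyn, {("s", "a"): 1.0}, gamma=0.9)
# v["s"] converges to 1 / (1 - 0.9) = 10
```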

4.4 Bellman Optimality Equations

The optimal value functions $v_*$ and $q_*$ satisfy the Bellman Optimality Equations, which don’t depend on a specific policy but assume the best action is always taken.

Bellman Optimality Equation for $v_*$

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$

Bellman Optimality Equation for $q_*$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$
Gridworld

A simple 5x5 grid where certain cells (A, B) provide large rewards and transport the agent to other locations (A’, B’). Moving off the grid results in a reward of -1 and keeps the agent in place. Under an equiprobable random policy (north, south, east, west), states near the center have higher values, while those near edges have lower (often negative) values due to the risk of hitting the boundary.

Golf

The state is the location of the ball.

  • Rewards: -1 for each stroke until the ball is in the hole.
  • State Value $v_{\text{putt}}(s)$: Negative of the number of strokes needed to hole out from $s$ using only the putter. Contours mark regions from which 1, 2, or 3 putts are needed.
  • Action Value $q_*(s, \text{driver})$: Value of taking the first shot with the driver from $s$, then following an optimal policy (driver or putter as appropriate). Driving allows reaching the green faster but with more risk/uncertainty.

5. Key Figures & Diagrams

5.1 Backup Diagrams

Backup diagrams are graphical summaries of value function updates.

  • State-value backup ($v_\pi$): From a state (open circle), branch to the possible actions (solid circles) via $\pi(a \mid s)$, then to next states (open circles) via $p(s', r \mid s, a)$.
  • Action-value backup ($q_\pi$): From a state–action pair (solid circle), branch to the possible next states (open circles), then to the next actions (solid circles).

5.2 The 10-Armed Testbed

This figure (Sutton & Barto, Fig. 2.1) visualizes the reward distributions of a set of 10 arms. Each arm has a mean reward $q_*(a)$ (sampled from a standard normal distribution), and actual rewards are sampled around that mean. This testbed is used to compare $\epsilon$-greedy methods ($\epsilon = 0$, $0.01$, $0.1$), showing that a non-zero $\epsilon$ performs better in the long run by avoiding premature convergence to a suboptimal action.


Summary Takeaways

  • RL Problem: Goal-directed interaction with an environment to maximize cumulative reward.
  • Bandits: Simplest case; one state, focus on exploration ($\epsilon$-greedy, UCB, optimistic initial values).
  • MDPs: Full case; actions affect state transitions. Handled via value functions and policies.
  • Bellman Equation: The core recursive tool for evaluating policies by linking current and future values.
  • Exploration/Exploitation: The central dilemma of RL.