POMDP (Partially Observable Markov Decision Process)

A Partially Observable Markov Decision Process extends the MDP framework to settings where the agent cannot directly observe the true state of the environment. Instead, the agent receives observations that provide partial or noisy information about the underlying state.

Formal Definition

A POMDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, \mathcal{O}, T, R, Z, \gamma)$:

| Component | Description |
|---|---|
| $\mathcal{X}$ | Set of (hidden/latent) states |
| $\mathcal{A}$ | Set of actions |
| $\mathcal{O}$ | Set of observations |
| $T(x' \mid x, a)$ | Transition function |
| $R(x, a)$ | Reward function |
| $Z(o \mid x', a)$ | Observation function |
| $\gamma$ | Discount factor |
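To make the tuple concrete, here is a minimal sketch in Python using the classic two-state tiger problem as a stand-in example (the problem and all numbers are illustrative, not from this note):

```python
# A minimal POMDP as plain Python data: the classic two-state "tiger"
# problem, used here only as an illustrative stand-in.
STATES = ["tiger-left", "tiger-right"]           # hidden states X
ACTIONS = ["listen", "open-left", "open-right"]  # actions A
OBSERVATIONS = ["hear-left", "hear-right"]       # observations O
GAMMA = 0.95                                     # discount factor

# T(x' | x, a): listening leaves the tiger in place; opening a door
# resets the problem to a uniform distribution over states.
def T(x_next, x, a):
    if a == "listen":
        return 1.0 if x_next == x else 0.0
    return 0.5

# Z(o | x', a): listening is 85% accurate; other actions are uninformative.
def Z(o, x_next, a):
    if a == "listen":
        correct = (x_next == "tiger-left") == (o == "hear-left")
        return 0.85 if correct else 0.15
    return 0.5

# R(x, a): small cost to listen, large penalty for opening the tiger's door.
def R(x, a):
    if a == "listen":
        return -1.0
    opened_tiger = (a == "open-left") == (x == "tiger-left")
    return -100.0 if opened_tiger else 10.0
```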

The Challenge

Why Partial Observability Is Hard

In an MDP, the current state tells you everything you need to make an optimal decision. In a POMDP, the observation doesn’t — two different underlying states might produce the same observation. The agent must reason about what state it might be in based on its history of observations and actions.
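As a toy illustration of this aliasing (hypothetical names and values): two hidden states can emit the same observation yet demand different actions, so no policy that maps observations directly to actions can be optimal in both.

```python
# State aliasing: two hidden states emit the same observation, but the
# optimal action differs between them (illustrative values only).
EMISSION = {"tiger-left": "growl", "tiger-right": "growl"}  # Z collapses the states
BEST_ACTION = {"tiger-left": "open-right", "tiger-right": "open-left"}

# The observation alone is ambiguous...
assert EMISSION["tiger-left"] == EMISSION["tiger-right"]
# ...yet the optimal actions differ, so a memoryless obs -> action policy
# cannot act optimally in both states.
assert BEST_ACTION["tiger-left"] != BEST_ACTION["tiger-right"]
```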

  • History: $h_t = (o_1, a_1, o_2, a_2, \ldots, o_t)$ contains all available information
  • The history grows without bound — we need a compact sufficient statistic

Approaches to Handle Partial Observability

1. Belief State

Maintain a probability distribution over hidden states: $b_t(x) = P(x_t = x \mid h_t)$. The belief-state MDP is fully observable.
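A minimal sketch of the Bayes-filter belief update, $b'(x') \propto Z(o \mid x', a) \sum_x T(x' \mid x, a)\, b(x)$, assuming `T` and `Z` are callables with the signatures $T(x' \mid x, a)$ and $Z(o \mid x', a)$ from the definition above:

```python
def belief_update(b, a, o, states, T, Z):
    """Bayes-filter update: b'(x') ∝ Z(o|x',a) * sum_x T(x'|x,a) * b(x).

    b is a dict state -> probability; T and Z are the POMDP's transition
    and observation functions (assumed signatures T(x', x, a), Z(o, x', a)).
    """
    unnormalized = {
        x_next: Z(o, x_next, a) * sum(T(x_next, x, a) * b[x] for x in states)
        for x_next in states
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {x: p / total for x, p in unnormalized.items()}
```

For example, starting from a uniform belief and receiving an 85%-accurate observation, a single update shifts the belief to 0.85 on the observed side.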

2. Predictive State Representation

Define internal state as predictions about future observations rather than beliefs about hidden states.

3. Approximate Methods

Use a window of recent observations as the state (frame stacking), or summarize the history with a recurrent network (Deep Recurrent Q-Learning, DRQN).
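The frame-stacking idea can be sketched with a small, hypothetical wrapper (deep RL libraries ship their own versions of this):

```python
from collections import deque

class FrameStack:
    """Approximate the state as the k most recent observations.

    A hypothetical minimal wrapper: reset pads the stack with the first
    observation; step appends the newest one, evicting the oldest.
    """
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # maxlen drops the oldest frame for us

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):        # pad with the first observation
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)        # oldest frame falls off automatically
        return tuple(self.frames)
```

The stacked tuple is only an approximation of a Markov state: it suffices exactly when k recent observations determine the next-observation distribution.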

Markov Criterion for Internal State

An internal state representation $s_t = f(h_t)$ is Markov if:

$$f(h) = f(h') \;\Rightarrow\; P(o \mid h, a) = P(o \mid h', a) \quad \text{for all } a, o$$

That is, if two histories map to the same internal state, they must predict the same future observations.
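On small finite models this criterion can be checked mechanically; `predict(h, a)` below is an assumed helper returning $P(o \mid h, a)$ as a dict:

```python
def same_predictions(h1, h2, predict, actions, tol=1e-9):
    """Check the Markov criterion for a pair of histories: if f(h1) == f(h2),
    the two histories must induce the same next-observation distribution
    for every action.

    `predict(h, a)` is an assumed helper returning a dict o -> P(o | h, a).
    """
    return all(
        abs(predict(h1, a)[o] - predict(h2, a)[o]) <= tol
        for a in actions
        for o in predict(h1, a)
    )
```

Running this over all history pairs that share an internal state verifies (or refutes) that the representation is Markov for a given model.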

Key Properties

  • POMDPs are strictly harder than MDPs (optimal POMDP policies may be stochastic even when MDP optimal policies are deterministic)
  • The belief state MDP converts a POMDP into a (continuous-state) MDP
  • In practice, many RL systems ignore partial observability and treat observations as states

Connections

Appears In