POMDP (Partially Observable Markov Decision Process)

A Partially Observable Markov Decision Process extends the MDP framework to settings where the agent cannot directly observe the true state of the environment. Instead, the agent receives observations that provide partial or noisy information about the underlying state.

Formal Definition

A POMDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, \mathcal{O}, T, R, Z, \gamma)$:

| Component | Description |
|---|---|
| $\mathcal{X}$ | Set of (hidden/latent) states |
| $\mathcal{A}$ | Set of actions |
| $\mathcal{O}$ | Set of observations |
| $T(x' \mid x, a)$ | Transition function |
| $R(x, a)$ | Reward function |
| $Z(o \mid x', a)$ | Observation function |
| $\gamma$ | Discount factor |
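To make the tuple concrete, here is a minimal sketch in Python using the classic two-state tiger problem as a stand-in example (the problem and all numbers are illustrative, not from this note):

```python
# A minimal POMDP as plain Python data: the classic two-state "tiger"
# problem, used here only as an illustrative stand-in.
STATES = ["tiger-left", "tiger-right"]           # hidden states X
ACTIONS = ["listen", "open-left", "open-right"]  # actions A
OBSERVATIONS = ["hear-left", "hear-right"]       # observations O
GAMMA = 0.95                                     # discount factor

# T(x' | x, a): listening leaves the tiger in place; opening a door
# resets the problem to a uniform distribution over states.
def T(x_next, x, a):
    if a == "listen":
        return 1.0 if x_next == x else 0.0
    return 0.5

# Z(o | x', a): listening is 85% accurate; other actions are uninformative.
def Z(o, x_next, a):
    if a == "listen":
        correct = (x_next == "tiger-left") == (o == "hear-left")
        return 0.85 if correct else 0.15
    return 0.5

# R(x, a): small cost to listen, large penalty for opening the tiger's door.
def R(x, a):
    if a == "listen":
        return -1.0
    opened_tiger = (a == "open-left") == (x == "tiger-left")
    return -100.0 if opened_tiger else 10.0
```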

The Challenge

Why Partial Observability Is Hard

In an MDP, the current state tells you everything you need to make an optimal decision. In a POMDP, the observation doesn’t — two different underlying states might produce the same observation. The agent must reason about what state it might be in based on its history of observations and actions.
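As a toy illustration of this aliasing (hypothetical names and values): two hidden states can emit the same observation yet demand different actions, so no policy that maps observations directly to actions can be optimal in both.

```python
# State aliasing: two hidden states emit the same observation, but the
# optimal action differs between them (illustrative values only).
EMISSION = {"tiger-left": "growl", "tiger-right": "growl"}  # Z collapses the states
BEST_ACTION = {"tiger-left": "open-right", "tiger-right": "open-left"}

# The observation alone is ambiguous...
assert EMISSION["tiger-left"] == EMISSION["tiger-right"]
# ...yet the optimal actions differ, so a memoryless obs -> action policy
# cannot act optimally in both states.
assert BEST_ACTION["tiger-left"] != BEST_ACTION["tiger-right"]
```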

  • History: $h_t = (o_1, a_1, o_2, a_2, \ldots, o_t)$ contains all available information
  • The history grows without bound — we need a compact sufficient statistic

Approaches to Handle Partial Observability

1. Belief State

Maintain a probability distribution over hidden states: $b_t(x) = P(x_t = x \mid h_t)$. The belief-state MDP is fully observable.
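A minimal sketch of the Bayes-filter belief update, $b'(x') \propto Z(o \mid x', a) \sum_x T(x' \mid x, a)\, b(x)$, assuming `T` and `Z` are callables with the signatures $T(x' \mid x, a)$ and $Z(o \mid x', a)$ from the definition above:

```python
def belief_update(b, a, o, states, T, Z):
    """Bayes-filter update: b'(x') ∝ Z(o|x',a) * sum_x T(x'|x,a) * b(x).

    b is a dict state -> probability; T and Z are the POMDP's transition
    and observation functions (assumed signatures T(x', x, a), Z(o, x', a)).
    """
    unnormalized = {
        x_next: Z(o, x_next, a) * sum(T(x_next, x, a) * b[x] for x in states)
        for x_next in states
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {x: p / total for x, p in unnormalized.items()}
```

For example, starting from a uniform belief and receiving an 85%-accurate observation, a single update shifts the belief to 0.85 on the observed side.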

2. Predictive State Representation

Define internal state as predictions about future observations rather than beliefs about hidden states.

3. Approximate Methods

Use a window of recent observations as the state (frame stacking), or summarize the history with a recurrent network (Deep Recurrent Q-Learning, DRQN).
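The frame-stacking idea can be sketched with a small, hypothetical wrapper (deep RL libraries ship their own versions of this):

```python
from collections import deque

class FrameStack:
    """Approximate the state as the k most recent observations.

    A hypothetical minimal wrapper: reset pads the stack with the first
    observation; step appends the newest one, evicting the oldest.
    """
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # maxlen drops the oldest frame for us

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):        # pad with the first observation
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)        # oldest frame falls off automatically
        return tuple(self.frames)
```

The stacked tuple is only an approximation of a Markov state: it suffices exactly when k recent observations determine the next-observation distribution.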

Markov Criterion for Internal State

An internal state representation $s_t = f(h_t)$ is Markov if:

$$f(h) = f(h') \;\Rightarrow\; P(o \mid h, a) = P(o \mid h', a) \quad \text{for all } a, o$$

That is, if two histories map to the same internal state, they must predict the same future observations.
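On small finite models this criterion can be checked mechanically; `predict(h, a)` below is an assumed helper returning $P(o \mid h, a)$ as a dict:

```python
def same_predictions(h1, h2, predict, actions, tol=1e-9):
    """Check the Markov criterion for a pair of histories: if f(h1) == f(h2),
    the two histories must induce the same next-observation distribution
    for every action.

    `predict(h, a)` is an assumed helper returning a dict o -> P(o | h, a).
    """
    return all(
        abs(predict(h1, a)[o] - predict(h2, a)[o]) <= tol
        for a in actions
        for o in predict(h1, a)
    )
```

Running this over all history pairs that share an internal state verifies (or refutes) that the representation is Markov for a given model.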

Key Properties

  • POMDPs are strictly harder than MDPs (optimal POMDP policies may be stochastic even when MDP optimal policies are deterministic)
  • The belief state MDP converts a POMDP into a (continuous-state) MDP
  • In practice, many RL systems ignore partial observability and treat observations as states

Connections

Appears In