Markov Chain
Definition
Markov Chain (for Sequential Recommendation)
A Markov Chain models the probability of the next item in a chronologically ordered sequence of interactions conditioned on the recent history. A k-order Markov Chain assumes the next item depends only on the last items (the Markov assumption: the future is independent of the distant past given the recent past). In its simplest form (first-order, ) the model is fully described by a Transition Matrix whose entry is the probability of transitioning from item to item . This is the earliest family of Sequential Recommendation methods [Shani et al., 2005], motivated by the fact that Matrix Factorization ignores interaction order.
Intuition
Order matters; MF throws it away
Standard Collaborative Filtering / Matrix Factorization treats a user’s interactions as an unordered set — it cannot tell that “phone → phone case” is sensible but “phone case → phone” is not, nor that Star Wars IV precedes V precedes VI. A Markov Chain fixes this by directly modeling item-to-item transitions: if many users who interacted with next interacted with , then is high and is recommended after . The “memory” is deliberately short ( items) because conditioning on the entire history is statistically infeasible — there are too few sequences that match a long context exactly.
Mathematical Formulation
The general next-item objective over the full history is
The k-order Markov assumption truncates the conditioning context to the last items:
For a first-order chain () the model reduces to a single Transition Matrix estimated by counting consecutive co-occurrences (an n-gram / maximum-likelihood estimate):
where:
- — the chronologically ordered interaction (or basket) sequence
- — the order (memory length); gives a first-order chain, the standard case
- — the global transition matrix over the item catalog ; rows = “from” item, columns = “to” item; each row sums to 1
- — number of observed transitions from item directly to item in the training data
- the denominator (the support, often shown as a ”#” column) is the total number of transitions observed out of item
The chain is non-parametric: it stores raw transition counts rather than learned embeddings, so it does not scale capacity with a dimensionality hyperparameter (unlike factorized models).
Key Properties / Variants
- Global, non-personalized. A single matrix is shared by all users — the same transition behavior applies to everyone. This is the chief limitation that Factorized Personalized Markov Chains (FPMC) addresses with per-user matrices.
- First-order only by default. Captures only the immediately preceding item’s influence; higher increases context but worsens sparsity exponentially.
- Data sparsity is the central problem — most item pairs are never observed consecutively, so for almost all . Mitigations:
- Skipping — lends some likelihood to (allow gaps in the sequence).
- Clustering — treat similar contexts as equivalent, .
- Mixture modeling — interpolate transition estimates across several orders (mix n-gram models).
- Personalization via factorization. Per-user transition matrices are far too sparse to estimate directly; FPMC factorizes the user × from-item × to-item transition cube into shared latent factors, combining a long-term Matrix Factorization term with a short-term factorized first-order transition.
- Where it wins / loses. On sparse data or with a tight compute budget, simple Markov-chain models (and FPMC) can outperform deep models like GRU4Rec; with abundant data, deep Sequential Recommendation models dominate because Markov chains see only first-order dependencies.
Estimating the first-order transition matrix:
Algorithm: Fully-Parametrized First-Order Markov Chain
──────────────────────────────────────────────────────
Input: training sequences {<i_1,...,i_n>} for all users
Initialize count[i][j] = 0 for all item pairs (i,j)
For each user sequence <i_1, ..., i_n>:
For t = 1 .. n-1:
count[i_t][i_{t+1}] += 1 # observe transition i_t -> i_{t+1}
For each from-item i:
total = sum_j count[i][j] # support (# of transitions out of i)
For each to-item j:
T[i][j] = count[i][j] / total # MLE transition probability (0 if total = 0)
Recommend after observing item i_n:
return argmax_j T[i_n][j] # (or top-K ranking of row i_n)Connections
- Special case / motivates: FPMC — personalized + factorized first-order Markov chain (adds a long-term Matrix Factorization term)
- Core object: Transition Matrix
- Family: Sequential Recommendation / Next-Item Prediction
- Contrasts with: Matrix Factorization and Collaborative Filtering (order-agnostic; treat interactions as an unordered set)
- Superseded by (deep methods): GRU4Rec, SASRec, BERT4Rec
- Same name, different setting: the Markov Decision Process / Markov Chain state-transition formalism in Reinforcement Learning