Next-Item Prediction
Definition
Next-Item Prediction
Next-item prediction is the core objective of Sequential Recommendation: given a chronologically ordered history of a user’s past interactions , predict the next item the user will interact with (click, watch, purchase, listen to).
It is the recommendation analogue of next-token prediction in language modelling: use the past sequence to predict the next output. Two solution families share this objective:
Intuition
User behaviour is already a sequence
A user’s interaction log is a temporal sequence, exactly like a sentence is a sequence of tokens. So next-item prediction inherits the entire sequence-modelling toolkit:
- the classical view encodes the history into a state and dots it against every candidate item embedding (a softmax over the whole catalogue);
- the generative view instead produces the answer directly, sidestepping the per-candidate scan.
The objective is the same; what differs is the output space. Classical models output a distribution over catalogue items; generative models output a sequence of identifier tokens that must map back to a real item.
Mathematical Formulation
The classical (discriminative) formulation scores each catalogue item against the encoded history. For SASRec, the encoder produces a state and the score for item is the inner product with its embedding row :
where:
- — chronological user interaction history
- — encoded history state from a causal self-attention encoder
- — learned embedding (row of the shared item table ) for candidate item ; vocabulary (one atomic id per item)
- — score; higher means is a more likely next interaction
The generative formulation replaces direct scoring with autoregressive decoding of an item identifier (e.g. a semantic id):
where:
- — user history (each , expanded to its id tokens)
- — fixed-length identifier of item ( tokens; atomic ids are the special case )
- — identifier tokens already generated (left context)
- — the identifier’s log-likelihood, reused as the item score for ranking
Both families are trained the same way: next-token cross-entropy with teacher forcing over the target id (for atomic ids this is a single position, ):
where the loss is averaged over all positions and all items in the batch. The tokens are item codes from a small learned codebook (–), not a natural-language vocabulary.
Key Properties / Variants
- Discriminative skeleton (encode → score → rank). SASRec (causal self-attention), BERT4Rec (masked, bidirectional), GRU4Rec (GRU) differ only in how they encode ; all keep the score-and-rank step. Output space = catalogue size, so the final layer is a softmax over millions of items.
- Why move beyond atomic ids. An atomic id like
item_3487carries no semantics, the embedding table grows linearly with the catalogue, and every new item needs a fresh id and a trained embedding (strict Cold Start). - Generative variant — decode an identifier. Reframes next-item prediction as autoregressive id generation. Encoder–decoder (T5-style: TIGER, LETTER) reads the history fully then writes the id; decoder-only (GPT-style: HSTU, OneRec) treats
[history || target]as one stream. Scales to long histories and unifies retrieval + ranking. - Item tokenization is a modelling choice. Semantic IDs built by residual-quantized VAE give ids from only tokens (e.g. ), share prefixes for related items (cold-start generalisation), and decouple capacity from vocabulary size.
- Validity constraint (generative only). Most of the codes are not real items, so decoding is trie-constrained: at each step a logit mask permits only tokens lying on a valid catalogue path. Alternatively, validity is rewarded during RL fine-tuning (GRPO).
- Inference = beam search. Greedy decoding yields one item; beam search keeps partial ids to produce a ranked list. Pathologies: popularity-prefix amplification, homogeneous (look-alike) lists, local optima from greedy first tokens, and -step latency.
- Optional reward fine-tuning. Cross-entropy only rewards copying the exact next click; RL (GRPO / DPO) lets the model generate a group of candidates, score them for validity / relevance / diversity, and push above-average ones up.
Algorithm: Generative Next-Item Prediction (SID-based, inference)
─────────────────────────────────────────────────────────────────
Given: trained model p_θ, frozen tokenizer, trie of valid catalogue SIDs
Input: user history H = (i_1, ..., i_t)
1. Look up each item's SID: x ← flatten[ SID(i_1), ..., SID(i_t) ]
2. Initialize beam B with the empty prefix
3. For ℓ = 1 .. L: # L codebook positions per item
For each partial SID b in beam:
allowed ← trie.children(b) # validity constraint
score next tokens with p_θ(z_ℓ | x, b), masking ∉ allowed
beam ← top-B partial SIDs by cumulative log-prob
4. Map each completed SID → catalogue item (id-to-item lookup)
5. Filter (drop items already in H), dedup, apply business rules
6. Return ranked listConnections
- Core task of: Sequential Recommendation · Session-based Recommendation
- Classical encoders: SASRec · BERT4Rec · GRU4Rec
- Generative route: Generative Recommendation · Generative Retrieval · Autoregressive Generation
- Identifier design: Atomic Item IDs · Semantic IDs · Item Tokenization · RQ-VAE
- Decoding: Beam Search · Trie-Constrained Decoding
- Objectives & tuning: Supervised Fine-Tuning (SFT) · GRPO · Direct Preference Optimization (DPO)
- Scaling-native architectures: HSTU · Large Recommendation Models (LRM)
- Failure mode for new items: Cold Start
- Evaluation: Top-K Recommendation · NDCG · Recall