Next-Item Prediction

Definition

Next-Item Prediction

Next-item prediction is the core objective of Sequential Recommendation: given a chronologically ordered history of a user’s past interactions , predict the next item the user will interact with (click, watch, purchase, listen to).

It is the recommendation analogue of next-token prediction in language modelling: use the past sequence to predict the next output. Two solution families share this objective:

  • Discriminative (score-and-rank): learn a score for each candidate item and rank — e.g. SASRec, BERT4Rec, GRU4Rec.
  • Generative: decode an item identifier token-by-token and look it up in the catalogue — e.g. P5, TIGER, OneRec.

Intuition

User behaviour is already a sequence

A user’s interaction log is a temporal sequence, exactly like a sentence is a sequence of tokens. So next-item prediction inherits the entire sequence-modelling toolkit:

  • the classical view encodes the history into a state and dots it against every candidate item embedding (a softmax over the whole catalogue);
  • the generative view instead produces the answer directly, sidestepping the per-candidate scan.

The objective is the same; what differs is the output space. Classical models output a distribution over catalogue items; generative models output a sequence of identifier tokens that must map back to a real item.

Mathematical Formulation

The classical (discriminative) formulation scores each catalogue item against the encoded history. For SASRec, the encoder produces a state and the score for item is the inner product with its embedding row :

where:

  • — chronological user interaction history
  • — encoded history state from a causal self-attention encoder
  • — learned embedding (row of the shared item table ) for candidate item ; vocabulary (one atomic id per item)
  • — score; higher means is a more likely next interaction

The generative formulation replaces direct scoring with autoregressive decoding of an item identifier (e.g. a semantic id):

where:

  • — user history (each , expanded to its id tokens)
  • — fixed-length identifier of item ( tokens; atomic ids are the special case )
  • — identifier tokens already generated (left context)
  • — the identifier’s log-likelihood, reused as the item score for ranking

Both families are trained the same way: next-token cross-entropy with teacher forcing over the target id (for atomic ids this is a single position, ):

where the loss is averaged over all positions and all items in the batch. The tokens are item codes from a small learned codebook (), not a natural-language vocabulary.

Key Properties / Variants

  • Discriminative skeleton (encode → score → rank). SASRec (causal self-attention), BERT4Rec (masked, bidirectional), GRU4Rec (GRU) differ only in how they encode ; all keep the score-and-rank step. Output space = catalogue size, so the final layer is a softmax over millions of items.
  • Why move beyond atomic ids. An atomic id like item_3487 carries no semantics, the embedding table grows linearly with the catalogue, and every new item needs a fresh id and a trained embedding (strict Cold Start).
  • Generative variant — decode an identifier. Reframes next-item prediction as autoregressive id generation. Encoder–decoder (T5-style: TIGER, LETTER) reads the history fully then writes the id; decoder-only (GPT-style: HSTU, OneRec) treats [history || target] as one stream. Scales to long histories and unifies retrieval + ranking.
  • Item tokenization is a modelling choice. Semantic IDs built by residual-quantized VAE give ids from only tokens (e.g. ), share prefixes for related items (cold-start generalisation), and decouple capacity from vocabulary size.
  • Validity constraint (generative only). Most of the codes are not real items, so decoding is trie-constrained: at each step a logit mask permits only tokens lying on a valid catalogue path. Alternatively, validity is rewarded during RL fine-tuning (GRPO).
  • Inference = beam search. Greedy decoding yields one item; beam search keeps partial ids to produce a ranked list. Pathologies: popularity-prefix amplification, homogeneous (look-alike) lists, local optima from greedy first tokens, and -step latency.
  • Optional reward fine-tuning. Cross-entropy only rewards copying the exact next click; RL (GRPO / DPO) lets the model generate a group of candidates, score them for validity / relevance / diversity, and push above-average ones up.
Algorithm: Generative Next-Item Prediction (SID-based, inference)
─────────────────────────────────────────────────────────────────
Given: trained model p_θ, frozen tokenizer, trie of valid catalogue SIDs
Input: user history H = (i_1, ..., i_t)
 
1. Look up each item's SID:  x ← flatten[ SID(i_1), ..., SID(i_t) ]
2. Initialize beam B with the empty prefix
3. For ℓ = 1 .. L:                      # L codebook positions per item
     For each partial SID b in beam:
       allowed ← trie.children(b)        # validity constraint
       score next tokens with p_θ(z_ℓ | x, b), masking ∉ allowed
     beam ← top-B partial SIDs by cumulative log-prob
4. Map each completed SID → catalogue item (id-to-item lookup)
5. Filter (drop items already in H), dedup, apply business rules
6. Return ranked list

Connections

Appears In