Item Tokenization
Lecture context
Turning items into discrete identifiers/tokens an LLM (or any autoregressive decoder) can read and generate.
Definition
Item Tokenization
Item tokenization is the design choice of how each catalogue item is mapped to a fixed-length sequence of discrete tokens $(c_1, \dots, c_L)$, so that a generative model can produce items the same way a language model produces text. It is the recommendation analogue of subword tokenization: language models generate text tokens, generative recommenders generate item tokens.
It is the third of the three alignment paradigms for adapting an LLM to recommendation (after text prompting and injecting collaborative signal), and it answers the central question of Generative Recommendation: how do items become tokens an LLM can generate? The tokenizer defines the output space the model must learn to decode, so it is a modelling choice, not mere preprocessing.
Intuition
Why not just keep the item ID?
Classical sequential recommenders (SASRec, BERT4Rec, GRU4Rec) treat each item as one atomic ID with its own learned embedding and score it against the user's sequence representation, typically as a dot product $s(u, i) = \mathbf{h}_u^\top \mathbf{v}_i$. That breaks down for generation:
- Scale: the output space equals the catalogue, so every decoding step is a softmax over the whole catalogue, which can run to millions of items.
- Arbitrary: `item_3487` carries no information; two similar films get two unrelated tokens.
- Cold start: every new item needs a brand-new token and a freshly trained embedding before it is recommendable.
The opposite extreme — use the item’s full text/description as the ID — is meaningful but produces very long sequences that are expensive to decode and hard to constrain to real items.
Semantic IDs are the middle ground: a short tuple of tokens drawn from small shared codebooks, where related items share a prefix. This separates capacity ($K^L$ possible codes) from vocabulary size ($L \cdot K$ tokens, with one codebook of $K$ codewords per position), giving compact, structured, generable identifiers.
Mathematical Formulation
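A semantic ID uses $L$ token positions, each chosen from a codebook of size $K$. The representational capacity is exponential in length while the vocabulary stays tiny:
$$\text{capacity} = K^{L}, \qquad \text{vocabulary size} = L \cdot K$$
where:
- $L$: identifier length (number of codebook levels), e.g. 3 quantization levels in TIGER, plus one collision token
- $K$: codebook size per position, e.g. 256 in TIGER
- $\mathrm{SID}(i) = (c_1, \dots, c_L)$: the semantic ID of item $i$; each $c_\ell \in \{1, \dots, K\}$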
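The gap between capacity and vocabulary is easiest to see with concrete numbers. The sketch below assumes one separate codebook per position, so the generator's vocabulary is the identifier length times the codebook size. The function names and the example values (3 or 4 levels, 256 codewords) are illustrative choices, not taken from the lecture.

```python
def sid_capacity(num_levels: int, codebook_size: int) -> int:
    """Number of distinct identifiers a semantic ID can express: K ** L."""
    return codebook_size ** num_levels


def sid_vocab_size(num_levels: int, codebook_size: int) -> int:
    """Tokens the generator must embed, assuming one codebook per level: L * K."""
    return num_levels * codebook_size


# Illustrative settings only: (identifier length L, codebook size K).
for num_levels, codebook_size in [(3, 256), (4, 256)]:
    capacity = sid_capacity(num_levels, codebook_size)
    vocab = sid_vocab_size(num_levels, codebook_size)
    print(f"L={num_levels} K={codebook_size}: capacity={capacity:,} vocabulary={vocab}")
```

With three levels of 256 codewords, 768 tokens already address more than 16 million distinct identifiers, whereas an atomic-ID vocabulary would need one token per item.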
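The canonical learned tokenizer (TIGER) is RQ-VAE, a residual-quantized VAE. An encoder maps the item content embedding $\mathbf{x}$ (e.g. a Sentence-T5 vector of title/brand/category) to a latent $\mathbf{z}$, then residual quantization runs over $L$ codebooks: at each level it picks the nearest codeword, subtracts it, and passes the residual to the next level:
$$\mathbf{r}^{(0)} = \mathbf{z}, \qquad c_\ell = \arg\min_{k} \lVert \mathbf{r}^{(\ell-1)} - \mathbf{e}^{(\ell)}_{k} \rVert^2, \qquad \mathbf{r}^{(\ell)} = \mathbf{r}^{(\ell-1)} - \mathbf{e}^{(\ell)}_{c_\ell}, \qquad \ell = 1, \dots, L$$
The selected codewords sum to the quantized latent $\hat{\mathbf{z}} = \sum_{\ell=1}^{L} \mathbf{e}^{(\ell)}_{c_\ell}$, which a decoder reconstructs back to $\hat{\mathbf{x}}$. The semantic ID is the tuple of chosen indices $(c_1, \dots, c_L)$. The tokenizer is trained with reconstruction plus a quantization (commitment + codebook) term:
$$\mathcal{L} = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2 + \mathcal{L}_{\text{rqvae}}$$
where:
- $\mathbf{e}^{(\ell)}_{k}$: codeword $k$ in the codebook at level $\ell$
- $\mathbf{r}^{(\ell)}$: residual after subtracting the level-$\ell$ codeword (drives the coarse→fine hierarchy)
- $\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2$: squared reconstruction error of the item embedding
- $\mathcal{L}_{\text{rqvae}}$: quantization loss pulling residuals toward their nearest codewords; in the standard RQ-VAE form, $\sum_{\ell=1}^{L} \big( \lVert \mathrm{sg}[\mathbf{r}^{(\ell-1)}] - \mathbf{e}^{(\ell)}_{c_\ell} \rVert^2 + \beta \lVert \mathbf{r}^{(\ell-1)} - \mathrm{sg}[\mathbf{e}^{(\ell)}_{c_\ell}] \rVert^2 \big)$, with stop-gradient $\mathrm{sg}[\cdot]$ and commitment weight $\beta$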
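The quantization step itself (nearest codeword, subtract, repeat) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the TIGER implementation: the codebooks are random rather than learned, the encoder, decoder, and training loss are omitted, and `residual_quantize` is a hypothetical helper name. Indices are 0-based in the code.

```python
import numpy as np


def residual_quantize(z, codebooks):
    """Quantize one latent vector z with a stack of codebooks.

    codebooks has shape (L, K, d): L levels, K codewords per level, dimension d.
    Returns the index tuple, the quantized latent, and the final residual.
    """
    residual = z.copy()
    z_hat = np.zeros_like(z)
    indices = []
    for codebook in codebooks:  # one codebook per level, coarse to fine
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))  # nearest codeword at this level
        indices.append(k)
        z_hat += codebook[k]  # selected codewords sum to the quantized latent
        residual = residual - codebook[k]  # pass what is left to the next level
    return tuple(indices), z_hat, residual


# Toy setup (assumed values): 3 levels, 8 codewords each, 4-dimensional latents.
rng = np.random.default_rng(0)
L, K, d = 3, 8, 4
level_scale = np.array([1.0, 0.5, 0.25])[:, None, None]  # finer codewords at deeper levels
codebooks = rng.normal(size=(L, K, d)) * level_scale
z = rng.normal(size=d)

sid, z_hat, residual = residual_quantize(z, codebooks)
print("semantic ID:", sid)
print("quantization error:", float(np.sum(residual ** 2)))
```

By construction the quantized latent plus the final residual equals the original latent, so the squared norm of the last residual is exactly the quantization error that training would try to shrink.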
After tokenization the IDs are usually frozen: the downstream generator predicts the indices, not the continuous vectors, autoregressively:
$$p(\mathrm{SID}(i) \mid \mathcal{H}_u) = \prod_{\ell=1}^{L} p(c_\ell \mid c_{<\ell}, \mathcal{H}_u)$$
where $\mathcal{H}_u$ is the user history and the identifier likelihood doubles as the item’s recommendation score.
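To make the "likelihood as score" idea concrete, the sketch below scores every possible identifier by summing per-level log-probabilities. The generator is replaced by a hypothetical stand-in, `toy_next_token_logits`, that returns deterministic pseudo-random logits; a real system would call a trained Transformer there. The sizes (4 codewords, 2 levels) and the history are made up for illustration.

```python
import itertools

import numpy as np


def log_softmax(logits):
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def toy_next_token_logits(history, prefix, codebook_size):
    # Hypothetical stand-in for a trained generator: deterministic logits
    # seeded by the user history and the SID prefix decoded so far.
    seed = [len(prefix), *prefix, *(tok for sid in history for tok in sid)]
    return np.random.default_rng(seed).normal(size=codebook_size)


def sid_log_likelihood(history, sid, codebook_size):
    """Sum of log p(c_l | c_<l, history) over the levels of one identifier."""
    total = 0.0
    for level, token in enumerate(sid):
        logits = toy_next_token_logits(history, sid[:level], codebook_size)
        total += log_softmax(logits)[token]
    return float(total)


K, L = 4, 2                      # assumed toy sizes
history = [(1, 3), (0, 2)]       # SIDs of previously consumed items (made up)
scores = {
    sid: sid_log_likelihood(history, sid, K)
    for sid in itertools.product(range(K), repeat=L)
}
best_sid = max(scores, key=scores.get)
total_prob = sum(np.exp(s) for s in scores.values())
print("top identifier:", best_sid, "log-likelihood:", scores[best_sid])
```

Because each level is a proper conditional distribution, the probabilities of all identifiers sum to one, so ranking items by log-likelihood is a ranking under a single normalized distribution. In practice the full enumeration is replaced by beam search.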
Key Properties / Variants
- The L1–L5 ladder of item identifiers (RS-L03b):
  - L1 (Atomic ID; P5, CLLM4Rec): one special token per item. Simple lookup, but vocabulary blows up and tokens carry no semantics.
  - L2 (Text-based; BIGRec, M6): use the item title/description. Meaningful but very long sequences, no collaborative info.
  - L3 (Codebook-based; TIGER, LC-Rec): discrete semantic IDs from RQ-VAE. Compact and semantic: the canonical sweet spot.
  - L4 (Codebook + CF; LETTER, TokenRec, CCFRec): inject collaborative signal into the quantizer so one ID carries both language and behaviour.
  - L5 (Adaptive; SIIT): the LLM refines identifiers during training; tokens evolve with the model.
- Hierarchical prefixes = coarse-to-fine semantics. Earlier indices are broad (e.g. a category like “Sports”), later ones refine the residual; items sharing $c_1$ are coarsely similar and diverge at later levels. This prefix structure is what enables cold-start generalization and controllable/diverse decoding.
- Collision handling. Distinct items can map to the same tuple, so an extra disambiguating token is appended: $(c_1, \dots, c_L, c_{L+1})$, guaranteeing each final ID maps to exactly one item.
- Construction families (RS-L04): Residual Quantization (RQ-VAE, RQ-KMeans, R-VQ — ordered coarse→fine); Product Quantization (split the embedding, quantize subspaces — VQ-Rec); Hierarchical Clustering (tree-path IDs — P5-CID, RecForest); LM/Textual IDs (language tokens — LMIndexer, IDGenRec). No family is universally best — it depends on the embedding space, catalogue, and downstream task.
- What shapes a semantic ID. Two axes: what representation we quantize (text, multimodal, categorical, or raw) and what objective we learn it with. The field is moving from static content-only IDs toward behaviour-aware, context-aware, task-aware IDs:
  - CoST adds a contrastive objective so quantized codes preserve neighbourhood structure, not just reconstruction.
  - LETTER adds three regularizers: semantic hierarchy, collaborative (CF) alignment, and diversity for balanced code usage.
  - ActionPiece makes tokens context-dependent: the same action receives different tokens depending on surrounding actions (a subword-style merge over feature sets).
- Content vs collaborative signal. Content IDs capture what items are (title, brand, image); collaborative signal captures how users use items together (co-consumption). “A semantic ID is only as good as the representation it quantizes.”
- Why it is harder than text tokenization: no natural reusable subwords, millions of long-tail items, sparse supervision per item, and a hard validity constraint — every generated ID must map to a real catalogue item (handled downstream by Trie-Constrained Decoding).
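Two of the points above, collision suffixes and the validity constraint, can be sketched together. This is a minimal illustration under assumptions: every item gets a suffix token (0 for the first item with a given tuple) so that all identifiers have the same length, the trie is a plain nested dict, and the item names and codes are invented. The helper names are hypothetical.

```python
from collections import defaultdict


def add_collision_suffix(base_sids):
    """Append a counter token so items sharing the same code tuple stay distinct."""
    seen = defaultdict(int)
    final = {}
    for item, codes in base_sids.items():
        final[item] = codes + (seen[codes],)  # 0 for the first item, 1 for the next, ...
        seen[codes] += 1
    return final


def build_trie(sids):
    """Nested-dict trie over identifiers; the key None at a leaf stores the item."""
    root = {}
    for item, sid in sids.items():
        node = root
        for token in sid:
            node = node.setdefault(token, {})
        node[None] = item
    return root


def allowed_next_tokens(trie, prefix):
    """Tokens that keep the decoded prefix on a path to a real catalogue item."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []  # prefix already left the catalogue
    return sorted(t for t in node if t is not None)


# Invented toy catalogue: two films collide on the same three codes.
base = {
    "film_a": (5, 2, 7),
    "film_b": (5, 2, 7),
    "film_c": (5, 3, 1),
    "book_d": (9, 0, 4),
}
sids = add_collision_suffix(base)
trie = build_trie(sids)
print(sids)
print(allowed_next_tokens(trie, (5,)))
```

At decoding time, the generator's logits for every token outside `allowed_next_tokens(trie, prefix)` would be masked out, so each finished identifier is guaranteed to map back to one catalogue item. Note that the two colliding films still share their first three tokens, which preserves the coarse-to-fine prefix structure.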
RQ-VAE residual-quantization tokenizer (offline, then frozen):
    Algorithm: RQ-VAE Item Tokenization (build item→SID lookup)
    ──────────────────────────────────────────────────────────
    Train phase (over item content embeddings x_i):
      for each item i:
        z ← Encoder(x_i)                        # latent vector
        r ← z                                   # residual r^(0)
        for level ℓ = 1 .. L:
          c[ℓ] ← argmin_k ‖ r - e[ℓ][k] ‖²      # nearest codeword index
          r ← r - e[ℓ][c[ℓ]]                    # subtract → next residual
        ẑ ← Σ_ℓ e[ℓ][c[ℓ]]                      # quantized latent
        x̂ ← Decoder(ẑ)
        minimize ‖x_i - x̂‖² + L_rqvae           # update encoder, decoder, codebooks
      resolve collisions: append a unique suffix token to duplicate tuples
      SID(i) ← (c[1], ..., c[L][, suffix])      # store frozen item→SID table
    Inference (downstream generator):
      generator decodes SID tokens autoregressively, one codebook level at a time
      constrain each step to valid catalogue paths (trie); map SID back to item i
Connections
- Is one of the three alignment paradigms in LLM-based Generative Recommendation (with text prompting and injecting collaborative signal)
- Core enabling step for Generative Recommendation / Generative Retrieval (and parallels document-ID generation in generative IR via the Differentiable Search Index)
- Builds on Residual-Quantized VAE / Product Quantization to produce Semantic IDs
- Alternative to Atomic Item IDs (the $L = 1$, codebook = catalogue special case)
- Feeds Autoregressive Decoding with Beam Search and Trie-Constrained Decoding for Next-Item Prediction
- Trained downstream with Supervised Fine-Tuning (SFT) (next-token cross-entropy) and optionally GRPO / Direct Preference Optimization (DPO)
- Contrasts with the score-and-rank skeleton of Sequential Recommendation models (SASRec, BERT4Rec, GRU4Rec)
- Affects Cold Start handling, Diversity, and Popularity Bias in the generated list