Constrained Decoding
Definition
Constrained Decoding
Constrained decoding restricts an autoregressive generative recommender so that it can only emit token sequences corresponding to real catalogue items. In Generative Recommendation, an item is represented by a Semantic ID , and the model decodes one code token at a time. The semantic-ID space has possible sequences (e.g. ) but only a tiny fraction (e.g. ) are valid items, so unconstrained generation overwhelmingly produces invalid IDs. The standard solution stores all valid catalogue IDs in a trie and, at each step, masks the logits of every token that does not continue a valid prefix path.
Intuition
You Cannot Hallucinate an Item
In language generation any token sequence is a valid (if weird) sentence. In recommendation, most semantic-ID sequences point to no item at all — a code like may simply not exist in the catalogue. A free-decoding model would happily generate such phantom IDs, and the lookup would fail.
The trie is a road map of every legal path. Sitting at prefix , the trie says “from here you may only go to ” — every other token is a dead end and is forbidden. Walking the trie from root to a leaf is therefore guaranteed to spell out a real item. Validity is enforced by construction, not hoped for.
Mathematical Formulation
Logit Masking over Trie-Allowed Tokens
At decoding step , let be the codes generated so far and let be the set of tokens that extend this prefix along some valid path in the trie. The model’s distribution is masked and renormalized over only the allowed tokens:
where:
- — the raw logit (pre-softmax score) the model assigns to code token at this step
- — children of the trie node reached by following prefix (the valid continuations)
- — indicator; tokens not in get probability (logit set to before softmax)
- — the user history conditioning the generation
Equivalently the full autoregressive item probability is the masked product which is non-zero only when is a complete root-to-leaf path. The mask is the only change versus standard next-token decoding; the model weights are untouched.
Key Properties / Variants
- Guaranteed validity: every generated sequence is a real item — there is no post-hoc filtering of phantom IDs needed.
- Implementation = logit mask: the trie produces a per-step allowed-token set; tokens outside it have their logits set to , then softmax renormalizes. Cheap to apply on top of any decoder.
- Composes with Beam Search: the standard inference path is constrained beam search — maintain partial candidates, but only expand children that the trie permits, yielding valid ranked items after steps.
- Catalogue-sync cost: the trie must be rebuilt/updated whenever items are added or removed. In fast-moving systems this upkeep is non-trivial, and stale paths can keep surfacing dead items.
- Per-request filtering for safety: the trie is the only hard safety net in Generative Recommendation — it can be masked per user/locale to block NSFW, region-locked, recalled, or already-seen items before generation.
- Reward-based alternative (complementary): instead of masking, make “is this a real item?” part of an RL reward (GRPO): generate freely, reward valid IDs, penalize invalid ones. This needs no live trie but only makes validity likely, not guaranteed. Often the two are combined (trie at inference, validity reward in training).
- Used by: TIGER and most semantic-ID generative recommenders; the same idea drives generative IR document-ID decoding (GENRE, DSI).
Algorithm: Trie-Constrained Beam Search Decoding
─────────────────────────────────────────────────
Build trie T from all valid catalogue semantic IDs (offline / on catalogue change)
Input: user history x, beam size B, ID length L
beams ← [ (prefix=[], score=0, node=root(T)) ]
for ℓ = 1 .. L:
candidates ← []
for (prefix, score, node) in beams:
allowed ← children(node) # A(z_<ℓ): valid next codes
logits ← model(x, prefix) # raw scores over K codes
mask logits[k] ← -∞ for all k ∉ allowed
logp ← log_softmax(logits) # renormalize over allowed only
for k in allowed:
candidates.append( (prefix+[k], score+logp[k], child(node,k)) )
beams ← top-B candidates by score
return B complete semantic IDs → id-to-item lookup → ranked item listConnections
- Mechanism for: Generative Recommendation, Generative Retrieval (item/document-ID decoding)
- Operates over: Semantic IDs produced by the (frozen) RQ-VAE tokenizer
- Combined with: Beam Search (constrained beam search), Autoregressive Decoding
- Trained objective it sits on top of: next-token cross-entropy; optionally GRPO / DPO for a validity reward
- Contrast: Atomic Item IDs make every token trivially valid (no trie needed), but lose cold-start generalization
- Downside it does not fix: decoding pathologies like popularity amplification and homogeneity (see Beam Search, Maximal Marginal Relevance (MMR) for diversity)