Beam Search
Definition
Beam Search
Beam search is an approximate decoding algorithm for autoregressive sequence models that keeps the highest-scoring partial sequences (the beam) at every generation step, rather than committing to the single best token like greedy decoding. In GenRec, the model decodes an item’s Semantic ID one code at a time; beam search expands all live prefixes by every candidate next code, scores the resulting continuations by cumulative log-probability, and prunes back to the top . After steps it emits complete SIDs, which become a ranked top- list of recommended items.
Intuition
A Ranked List, Not Just One Answer
Greedy decoding keeps only the top-1 next code at each step. That is fine if you want the single next item, but recommendation needs a ranked list. Beam search is the cheap middle ground between greedy (track 1 path) and an exhaustive search over all possible code sequences (intractable). By carrying hypotheses forward, an item whose first code was only the 2nd- or 3rd-best can still survive to the end and end up ranked highly — something greedy can never recover. The width trades compute and list size against the risk of pruning a good item too early.
Mathematical Formulation
The generator factorizes the SID likelihood autoregressively (the same chain rule as a language model):
A partial hypothesis is scored by its cumulative log-probability, and at each step the beam is the arg-top- over all one-code extensions of the previous beam:
where:
- — encoded user interaction history (the conditioning context)
- — the item identifier (Semantic ID) being decoded
- — identifier length (number of autoregressive steps per item)
- — beam width (number of partial hypotheses kept per step; also the size of the final list)
- — codebook vocabulary of allowed next codes (size ); in constrained decoding this is restricted by the trie
- — cumulative log-probability score of a partial sequence
- arg-top- — selection of the highest-scoring continuations; the rest are pruned
Final output: the completed sequences in , sorted by , give the ranked recommendation list (e.g., ranked items).
Key Properties / Variants
- Per-step cost. Each item costs sequential decoder steps; at each step all beams are expanded over codes and re-pruned. Inference cost is roughly plus a trie lookup per step — at catalogue scale this dominates latency (real systems target <50 ms).
- Length normalization. Raw cumulative log-prob favors shorter sequences; when comparing variable-length hypotheses, scores are typically divided by length. In GenRec all SIDs are fixed length , so this is usually unnecessary.
- Constrained beam search. Most of the code combinations are not real items (the validity problem). Beam search is paired with Trie-Constrained Decoding: a trie stores all valid catalogue SIDs, and at each step a logit mask zeros out codes that do not extend a valid prefix, renormalizing the distribution over allowed codes only. Every emitted sequence is then guaranteed to be a real item. See also Constrained Decoding.
- Greedy = . Greedy decoding is beam search with width one. Atomic-ID generation is the degenerate case , so the whole item is decoded in a single step and “beam search” reduces to taking the top- codes directly.
- Decoding pathologies (hit GenRec hard).
- Amplification / popularity bias — popular SID prefixes (e.g. ) dominate the beam, so long-tail items get pruned early (see Popularity Bias, Long Tail).
- Homogeneity — surviving beams share long prefixes, so the top- items are near-duplicates (e.g. five similar action films); hurts Diversity and Novelty.
- Local optima — a greedy-ish first-code choice locks in a region of SID space; a better item under a different prefix is unreachable.
- Diversity-aware variants. Diverse beam search splits hypotheses into groups and penalizes a group for repeating earlier choices; sampling / temperature injects randomness so a less-likely opening code can survive; post-hoc re-ranking with MMR greedily prefers items unlike those already chosen. RL fine-tuning (GRPO) or tokenizer design (LETTER) can also attack homogeneity upstream.
Algorithm: Constrained Beam Search over Semantic IDs
─────────────────────────────────────────────────────
Input: history encoding h, beam width B, id length L,
validity trie T over catalogue SIDs
beam ← { (prefix=∅, score=0) } # one empty hypothesis
for ℓ = 1 .. L:
cand ← ∅
for (prefix, score) in beam:
allowed ← T.next_tokens(prefix) # trie mask: valid codes only
logits ← model(h, prefix)
for z in allowed:
cand ← cand ∪ { (prefix·z, score + log softmax(logits)[z]) }
beam ← top-B of cand by score # prune to width B
return sort(beam by score) # B complete SIDs = ranked list
# then filter seen / dedup / business rulesConnections
- Operates on: Autoregressive Generation / Autoregressive Decoding of a generative recommender
- Decodes: Semantic IDs (or Atomic Item IDs in the case)
- Paired with: Trie-Constrained Decoding, Constrained Decoding to guarantee valid items
- Special case: greedy decoding (); contrasted with full search over codes
- Suffers from: Popularity Bias, Long Tail pruning, low Diversity / Novelty (homogeneous beams)
- Diversity remedies: Maximal Marginal Relevance (MMR), diverse beam search, sampling; upstream via GRPO RL or LETTER tokenizer
- Produces: a top-K ranked list for the next-item task