Autoregressive Generation

Definition

Autoregressive Generation

Autoregressive generation factorizes the probability of an output sequence given an input into a product of next-token conditionals, decoding one token at a time, each conditioned on the input and all tokens already produced. In Generative Recommendation, the output is a fixed-length item identifier (e.g. a Semantic ID) and the input is the user’s interaction history. Instead of scoring a fixed candidate set, the model generates the next item’s id token by token and maps it back to a real catalogue item.

Intuition

Recommendation as Next-Token Prediction

User behaviour is already a sequence , so next-item prediction is structurally identical to a language model’s next-token prediction. The chain rule lets us write any joint distribution over a sequence as a left-to-right product of conditionals — so we never need a model over the whole catalogue at once, only a model of “what comes next given everything so far.” Each emitted token narrows the item down: with hierarchical ids the first token picks a coarse region (e.g. genre), and later tokens refine toward one specific item. This is exactly why the whole LLM decoding toolkit (Beam Search, sampling, Constrained Decoding) transfers to RecSys.

Mathematical Formulation

Autoregressive Factorization of the Item Identifier

where:

  • — user interaction history (the conditioning context)
  • — fixed-length identifier of item (atomic id is the special case )
  • — tokens already generated (the autoregressive dependence)
  • — identifier length; each position drawn from a codebook of size , giving possible codes
  • — Transformer parameters (encoder–decoder or decoder-only)

The identifier likelihood doubles as a score for the item, recovering the classical score-and-rank view:

Likelihood-as-Score

Higher likelihood = more compatible with the history. Recommendations are produced by decoding valid ids, not by scoring a fixed candidate set.

Training uses standard next-token cross-entropy with teacher forcing (the target id is known, so each position conditions on the true prefix):

Next-Token Cross-Entropy Loss

Identical to language-model training in every respect except that tokens are item codes from a small learned codebook () rather than a large subword vocabulary.

Key Properties / Variants

  • Two-direction inference vs. teacher-forced training. At training, the full target id is available, so all positions are scored in parallel against the true prefix. At inference, token depends on the model’s own previous predictions , making decoding inherently sequential ( steps per item).
  • Decoding strategies: greedy (top-1 each step) yields a single item; Beam Search keeps partial candidates per step to return a ranked top- list after steps; sampling / temperature injects diversity.
  • Validity problem: the code space has sequences but the catalogue uses only a tiny fraction, so most decoded sequences are not real items. Fixed by Trie-Constrained Decoding (logit mask restricting each step to tokens on a valid trie path, then renormalize) and/or GRPO reward shaping that rewards valid ids.
  • Architecture choices (both produce the code one token at a time): encoder–decoder (T5-style; TIGER, LETTER) reads the history fully then writes the id; decoder-only (GPT-style; HSTU, OneRec) treats [history || target] as one continuous stream and predicts the next token throughout.
  • Beyond CE: preference / RL fine-tuning (GRPO, DPO) optimizes list-level reward (validity, diversity, freshness) that token-level CE cannot express.
  • Decoding pathologies: amplification bias (popular prefixes dominate the beam, long-tail pruned early), homogeneity (top- share a prefix → near-duplicate list), local optima (greedy first token locks a region), and latency ( sequential steps + trie lookup per recommendation).
Algorithm: Constrained Autoregressive Decoding (one ranked list)
─────────────────────────────────────────────────────────────────
Input: history x, beam size B, id length L, valid-id trie T
Initialize beams ← { (prefix=<BOS>, score=0) }
for ℓ = 1 .. L:
  candidates ← {}
  for each beam (prefix, score) in beams:
    allowed ← T.children(prefix)            # validity mask
    for token z in allowed:
      lp ← log p_θ(z | x, prefix)           # renormalized over allowed
      candidates ← candidates ∪ { (prefix·z, score + lp) }
  beams ← top-B candidates by score          # prune the rest
return decode-ids(beams)                      # B complete, valid item ids

Connections

Appears In