OneRec
Definition
OneRec (Deng et al., Kuaishou, arXiv:2502.18965, 2025) is an end-to-end generative recommender that replaces the entire cascaded retrieve → pre-rank → rank pipeline with a single encoder–decoder model. It encodes a user’s interaction history and autoregressively generates the next item’s identifier as a sequence of hierarchical semantic-ID tokens (multi-codeword codes of the form <item_a><item_b><item_c>), which are then mapped back to real catalogue items (short videos). Candidate generation and ranking are learned jointly rather than as separate stages.
Intuition
Generate the item, don't score the catalogue
A classical cascaded recommender funnels a corpus of videos down through retrieval → coarse ranking → fine ranking → a handful of recommendations. Each handoff loses information: a recall error in an early stage can never be recovered by a later ranker, and the overall Model FLOPs Utilization (MFU) of such systems is only ~0.1–1%.
OneRec instead treats recommendation like a language-modelling task: the user history is a “prompt”, and the answer is the identifier of the next item, decoded token by token. Because items are tokenized into a small shared codebook, the model never needs a softmax over millions of atomic items, and it can be scaled (data + compute + parameters) like an LLM, so that a recommendation-native scaling law emerges. This is the “LLM-style end-to-end scaling” route to the Large Recommendation Model (LRM) goal, as opposed to merely scaling one stage of the old pipeline.
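The tokenization idea can be made concrete with a toy residual quantizer. The sketch below is an illustration under stated assumptions, not OneRec's tokenizer: the codebooks are random rather than learned, and the sizes `L`, `K`, `DIM` and the helper names `nearest` and `residual_quantize` are hypothetical. It shows how one item embedding becomes a short coarse-to-fine code, and why the per-level vocabulary stays small while the number of addressable codes grows multiplicatively.

```python
import random


def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def residual_quantize(vec, codebooks):
    """Return one code per level; each level quantizes the residual
    that the previous levels left unexplained."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


# Hypothetical toy sizes: 3 levels, 8 codewords per level, 4-dim embeddings.
L, K, DIM = 3, 8, 4
rng = random.Random(0)

# Random codebooks stand in for a trained tokenizer (assumption);
# deeper levels use a smaller scale, mimicking shrinking residuals.
codebooks = [
    [[rng.gauss(0, 1.0 / (level + 1)) for _ in range(DIM)] for _ in range(K)]
    for level in range(L)
]

item_embedding = [rng.gauss(0, 1) for _ in range(DIM)]
sid = residual_quantize(item_embedding, codebooks)

print("semantic ID:", sid)                # three integers in [0, K)
print("addressable codes:", K ** L)       # 8^3 = 512
print("token embeddings needed:", L * K)  # 3 * 8 = 24
```

The same arithmetic with, say, 8192 codewords and 3 levels (again hypothetical values) gives 8192^3 ≈ 5.5 × 10^11 addressable codes from only 24,576 token embeddings.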
Mathematical Formulation
The core mechanism is autoregressive decoding over a fixed-length semantic identifier. Each catalogue item is mapped (offline, by a quantization tokenizer such as RQ-VAE) to an $L$-level semantic ID $z = (z_1, \dots, z_L)$ drawn from per-level codebooks. Given a user history $H$, the encoder produces a context $c$ and the decoder factorizes the next-item identifier:
$$p_\theta(z \mid H) = \prod_{\ell=1}^{L} p_\theta(z_\ell \mid z_{<\ell}, c), \qquad c = \mathrm{Encoder}(H)$$
where:
- $H = (i_1, \dots, i_t)$: the user’s chronological interaction history (each $i_j$ an item, expanded into its SID tokens, so the encoder reads an $L\times$ longer flat token stream)
- $z = (z_1, \dots, z_L)$: the hierarchical semantic ID of the target item; earlier codes are coarse (broad category), later codes refine the residual
- $z_{<\ell}$: the SID tokens already generated for this item (teacher-forced at training time)
- $L$: identifier length (number of codebook levels); $K$: codebook size per level, so $K^L$ codes can index billions of items with only $L \cdot K$ token embeddings
Training uses standard next-token cross-entropy over the SID positions:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{\ell=1}^{L} \log p_\theta(z_\ell \mid z_{<\ell}, c)$$
The identifier likelihood doubles as a relevance score for the item, $\log p_\theta(z \mid H)$, so recommendation is “decode valid identifiers” rather than “score a fixed candidate set”. On top of CE, OneRec adds a reward / preference fine-tuning stage (e.g. GRPO / DPO) to reward whole-list quality (validity, relevance, freshness, diversity) that pure next-token CE cannot express.
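The relationship between the per-level conditionals, the cross-entropy loss, and the relevance score can be checked numerically. The following is a minimal sketch, assuming the decoder has already produced one logit vector per SID level under teacher forcing; the logits, the target identifier, and the function names `log_softmax` and `sid_log_likelihood` are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one codebook level."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sid_log_likelihood(step_logits, target_sid):
    """Sum over levels of the log-probability of the target code.

    step_logits[l] is assumed to be the decoder output at level l when the
    true prefix target_sid[:l] was fed in (teacher forcing).
    """
    return sum(
        log_softmax(logits)[code]
        for logits, code in zip(step_logits, target_sid)
    )


# Hypothetical decoder outputs for one example: 3 levels, 4 codes per level.
step_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 1.5, 0.3, 0.2],
    [0.1, 0.1, 3.0, 0.1],
]
target_sid = [0, 1, 2]

score = sid_log_likelihood(step_logits, target_sid)  # relevance score of the item
loss = -score                                        # next-token cross-entropy
print(f"log-likelihood = {score:.4f}, CE loss = {loss:.4f}")
```

Minimizing the loss and maximizing the score are the same objective here, which is why the trained generator can rank items by the likelihood of their identifiers without a separate scoring head.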
Key Properties / Variants
- Three-phase LLM-style pipeline: Data ⇒ Pre-Training ⇒ (Mid-Training) ⇒ Post-Training ⇒ Test-Time. In the “Open OneRec” framing a Qwen LLM decoder drives all phases; in evaluation the generated itemic tokens are mapped back to actual short-video items.
- Itemic + text tokens interleaved: each short video is passed through a tokenizer that emits itemic tokens <item_a_1><item_b_1><item_c_1> (a hierarchical multi-codeword semantic ID, three codebook levels a/b/c) interleaved with text tokens describing user behaviour.
- Architecture: an encoder–decoder Transformer (the variant OneRec-V2 moves toward decoder-only); supports long histories (~1000+) and is multi-modal.
- Unifies retrieve + rank: one model does candidate generation and ranking jointly, removing the cascade’s recall-error bottleneck and raising MFU toward LLM-level utilization.
- Decoding must stay valid: the SID space is far larger than the catalogue, so most code combinations correspond to no real item. Generation is paired with trie-constrained beam search (a per-step logit mask over valid catalogue paths) and/or a validity reward in GRPO so emitted IDs are grounded in the catalogue.
- Cold-start path: a new item is run through the frozen tokenizer to obtain its SID and its path is added to the trie — it becomes decodable immediately (sub-tokens already exist in the codebook), though being decodable is not the same as being recommended (the generator was trained on clicked items).
- Variants: OneRec-V2 (decoder-only scaling) and OneRec-Think (test-time reasoning before recommending). Kuaishou reports it serving the production main feed end-to-end.
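The validity and cold-start properties above both rest on a prefix tree over the semantic IDs that exist in the catalogue. The sketch below is one plausible way to build such a structure, not the production implementation: the class `SidTrie`, the helper `mask_logits`, and the item names are hypothetical.

```python
import math


class SidTrie:
    """Prefix tree over valid semantic IDs; the end of a path stores the item."""

    def __init__(self):
        self.root = {}

    def insert(self, sid, item_id):
        node = self.root
        for token in sid:
            node = node.setdefault(token, {})
        node["item"] = item_id  # string key cannot collide with integer tokens

    def allowed(self, prefix):
        """Tokens that extend `prefix` toward at least one catalogue item."""
        node = self.root
        for token in prefix:
            node = node.get(token)
            if node is None:
                return set()
        return {t for t in node if t != "item"}

    def lookup(self, sid):
        """Catalogue item for a complete SID, or None if the path is invalid."""
        node = self.root
        for token in sid:
            node = node.get(token)
            if node is None:
                return None
        return node.get("item")


def mask_logits(logits, allowed_tokens):
    """Per-step validity mask: tokens off the trie get -inf and cannot be picked."""
    return [x if k in allowed_tokens else -math.inf for k, x in enumerate(logits)]


trie = SidTrie()
trie.insert((0, 1, 2), "video_A")  # hypothetical catalogue entries
trie.insert((0, 3, 1), "video_B")

print(sorted(trie.allowed((0,))))                              # [1, 3]
print(mask_logits([0.2, 1.0, 5.0, 0.4], trie.allowed((0,))))   # [-inf, 1.0, -inf, 0.4]

# Cold start: the new item's SID comes from the frozen tokenizer (assumed given
# here); adding its path makes it decodable without retraining the generator.
trie.insert((0, 1, 3), "video_new")
print(sorted(trie.allowed((0, 1))))                            # [2, 3]
print(trie.lookup((0, 1, 3)))                                  # video_new
```

In the masked example the highest raw logit (5.0 for token 2) is suppressed because no catalogue item continues the prefix with that token. The mask guarantees validity only; whether the new item is actually ranked highly still depends on the probabilities the generator assigns to its path.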
Sketch of the inference flow (constrained beam search over SIDs):
Algorithm: OneRec inference (constrained generation)
─────────────────────────────────────────────────────
Input: user history H = (i_1,...,i_t); beam width B; SID length L; valid-SID trie T
  Look up SID(i_j) for each i_j in H → flat token stream
  context ← Encoder(token stream)
  beams ← { empty prefix }
  for level ℓ = 1..L:
      candidates ← ∅
      for each prefix p in beams:
          allowed ← children of p in trie T                 # validity mask
          for each token z in allowed:
              score ← logp(p) + log p_θ(z | context, p)     # autoregressive step
              candidates ← candidates ∪ {(p·z, score)}
      beams ← top-B candidates by score                     # prune the rest
  SIDs ← B complete identifiers in beams
  items ← trie/catalogue lookup of each SID
  return filter(items)    # drop history items, dedup, business rules → ranked list
Connections
- Route to: Large Recommendation Models (LRM) — the “LLM-style end-to-end scaling” branch, contrasted with stage-wise cascade scaling
- Instance of: Generative Recommendation / Generative Retrieval — generate the item identifier instead of scoring candidates
- Item tokenization via: Semantic IDs built with RQ-VAE / Residual Quantization (the TIGER lineage)
- Decoding grounded by: Trie-Constrained Decoding / Constrained Decoding + Beam Search
- Fine-tuned with: GRPO / DPO reward optimization on top of next-token cross-entropy
- Builds on: Next-Item Prediction from Sequential Recommendation (SASRec, BERT4Rec, GRU4Rec)
- Related LRM backbones: HSTU (decoder-only, industrial scale), Scaling Laws for recommendation