P5

Definition

P5 (Pretrain, Personalized Prompt & Predict Paradigm)

P5 is a unified text-to-text recommendation framework (Geng et al., RecSys 2022) that casts every recommendation task as a natural-language prompt fed to a single encoder–decoder language model. Instead of a task-specific scoring head, P5 reformulates rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization as conditional text generation over a shared T5-style backbone. One model, one objective (next-token generation), many tasks — trained jointly via multitask personalized prompt learning and capable of zero-shot generalization to unseen prompts.

Intuition

"If everything is text, recommendation is just language modeling"

Classical recommenders learn a scoring function $f(\text{user}, \text{item})$ and rank a fixed candidate pool. P5 makes a different bet: if you can describe a recommendation problem in words (“User_23 has bought item_7391, item_882 … what will they buy next?”), then a sufficiently strong sequence-to-sequence model can simply answer it in words.

This collapses the usual zoo of task-specific architectures into one shared model. The input prompt carries the user ID, item IDs, and history as plain text; the output is the answer (a predicted item ID, a rating like “4”, a yes/no, or a free-text explanation). Because the only objective is “predict the next token,” adding a new task means adding a new prompt template, not a new network. P5 is the bridge between Sequential Recommendation models and full Generative Recommendation: it treats recommendation as language, but its outputs are unconstrained text (ID strings, ratings, explanations) rather than identifiers guaranteed to exist in the catalogue.

Mathematical Formulation

P5 is trained with a single conditional language-modeling (negative log-likelihood / cross-entropy) objective. Each training instance is a prompt-formatted input token sequence $x = (x_{1}, \dots, x_{n})$ paired with a target token sequence $y = (y_{1}, \dots, y_{m})$, and the model factorizes the target autoregressively:

$L_{\theta} = -\sum_{j=1}^{m} \log p_{\theta}(y_{j} \mid y_{<j}, x)$

where:

$\theta$: shared parameters of the encoder–decoder (T5 backbone); no task-specific heads.
$x$ — a personalized prompt: a natural-language template with the user/item fields filled in (e.g., user ID, item IDs, interaction history, target candidate).
$y$ — the target text answer (an item ID token sequence, a numeric rating, “yes”/“no”, or an explanation sentence).
$y_{<j}$: the target tokens preceding position $j$ (ground-truth tokens under teacher forcing at training; the model's own generated tokens at inference).
$p_{\theta}(y_{j} \mid \cdot)$: softmax over the shared text vocabulary.
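
As a small worked illustration of the objective, the sketch below computes the loss for a single target sequence from per-step probabilities. The probabilities are made-up numbers standing in for $p_{\theta}(y_{j} \mid y_{<j}, x)$, and the three-way split of the item ID into pieces is purely illustrative; a real model would produce these values from a softmax over its vocabulary.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one target sequence.

    step_probs[j] is the (assumed) probability the model assigns to the
    correct token y_j given the prompt x and the earlier targets y_<j.
    """
    return -sum(math.log(p) for p in step_probs)

# Hypothetical probabilities for a three-token target, e.g. three
# sub-word pieces of an item ID (the split is illustrative only).
probs = [0.9, 0.8, 0.5]
loss = sequence_nll(probs)

# Summing log-probabilities equals the log of the product 0.9 * 0.8 * 0.5 = 0.36.
print(round(loss, 4))  # 1.0217
```

Because the loss is a sum over target positions, a single low-probability token (0.5 here) dominates it, which is why multi-piece ID targets are only as likely as their least likely piece.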

Mechanism (how the model works):

Personalized prompt collection. Hand-design a family of prompt templates, one or more per task (rating, sequential, direct, explanation, review). Slots are filled with the user’s and items’ raw fields. User and item are referenced by atomic IDs rendered as text tokens (e.g., user_23, item_7391).
Tokenize as text. The filled prompt is sub-word tokenized like any sentence (T5 uses a SentencePiece vocabulary), so an ID such as item_7391 may be split into several pieces that share the same vocabulary and embedding space as ordinary words. This is the L1 ID-based item tokenization rung: one (multi-piece) token group per item, with no shared semantic structure.
Multitask pretraining. All tasks are mixed into one corpus and trained with the single objective above; the encoder reads the prompt, the decoder generates the answer.
Inference. For a held-out (possibly unseen) prompt, the decoder generates the answer text; for ranking tasks, candidate item IDs are scored by their generation likelihood (and/or generated under beam search).
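
The ranking step can be sketched as follows: each candidate's ID token sequence is scored by summing the log-probabilities of its tokens given the prompt, and candidates are sorted by that score. This is a minimal sketch under stated assumptions: `TOY_TABLE` and `toy_step_logprob` are hypothetical stand-ins for a trained model (the toy ignores the prompt), and the token splits of the IDs are invented for illustration.

```python
import math

def score_candidate(prompt, target_tokens, step_logprob):
    """Sum of log p(token | prefix, prompt) over the candidate's tokens."""
    total, prefix = 0.0, []
    for tok in target_tokens:
        total += step_logprob(prompt, prefix, tok)
        prefix.append(tok)
    return total

def rank_candidates(prompt, candidates, step_logprob):
    """Return (item_id, log-likelihood) pairs, most likely first."""
    scored = [(item_id, score_candidate(prompt, toks, step_logprob))
              for item_id, toks in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical next-token probabilities keyed by the decoded prefix.
TOY_TABLE = {
    (): {"item_": 1.0},
    ("item_",): {"73": 0.6, "88": 0.4},
    ("item_", "73"): {"91": 0.7, "05": 0.3},
    ("item_", "88"): {"2": 1.0},
}

def toy_step_logprob(prompt, prefix, token):
    # Toy model: ignores the prompt; unseen tokens get a tiny probability.
    p = TOY_TABLE.get(tuple(prefix), {}).get(token, 1e-9)
    return math.log(p)

candidates = {
    "item_7391": ["item_", "73", "91"],
    "item_7305": ["item_", "73", "05"],
    "item_882": ["item_", "88", "2"],
}
ranking = rank_candidates("What will user_23 buy next?", candidates, toy_step_logprob)
for item_id, logp in ranking:
    print(item_id, round(math.exp(logp), 2))
```

With these toy numbers the sequence probabilities are 0.42, 0.40, and 0.18, so item_7391 ranks first. One design caveat: summed log-likelihood tends to favor candidates with fewer tokens, so implementations sometimes length-normalize the score; whether to do so is a choice left open here.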

Key Properties / Variants

Unified text-to-text paradigm. “Pretrain, Personalized Prompt & Predict” — the five named task families (sequential recommendation, rating prediction, direct recommendation, explanation generation, review summarization) all share one model and one loss.
Personalized prompts. Prompts embed user/item identity, so the same template yields per-user-specialized inputs; this is the “personalized” P in P5.
Zero-shot generalization. Because tasks are expressed in language, P5 can respond to new prompt phrasings/tasks not seen at training time — a property inherited from instruction-style multitask training.
Training objective = SFT. P5 is trained by Supervised Fine-Tuning (SFT)-style next-token cross-entropy on positive sequences only; it has no explicit negatives, so the ranking margin is hard to learn (a noted limitation versus pairwise/RL/DPO objectives).
Item ID design = Atomic IDs (L1). Every item is its own ID token group: simple lookup, but the vocabulary grows with the catalogue, IDs carry no semantics, and similar items share no structure. This is precisely the weakness that motivates Semantic IDs (RQ-VAE codes) in later work like TIGER.
Generates text, not grounded IDs. P5’s output is free text (ID strings, ratings, explanations); nothing in the decoder guarantees that a generated ID string maps to a real catalogue item. The shift to catalogue-grounded item-identifier generation (with Trie-Constrained Decoding) is what defines GenRec proper (TIGER, OneRec).
Backbone. Encoder–decoder T5; contrasts with later decoder-only generative recommenders (HSTU, OneRec-V2).
Position in taxonomy. In the alignment taxonomy P5 sits under Item Tokenization (paradigm ③) as the L1/ID-based anchor; it is the canonical “recommendation as a language task” example.
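
To make the "not grounded" property concrete, the sketch below checks free-form decoder outputs against a catalogue after the fact. It is not part of P5 and it is not Trie-Constrained Decoding (which prevents invalid IDs during generation rather than filtering them afterwards); the catalogue and the decoded strings are invented for illustration.

```python
def split_valid_invalid(generated, catalogue):
    """Partition decoded strings into catalogue hits and misses."""
    valid, invalid = [], []
    for text in generated:
        item_id = text.strip()  # decoders often emit stray whitespace
        (valid if item_id in catalogue else invalid).append(item_id)
    return valid, invalid

# Hypothetical catalogue and hypothetical beam-search outputs.
catalogue = {"item_7391", "item_882", "item_15"}
beam_outputs = ["item_7391", "item_9999", "item_88", " item_882 "]

valid, invalid = split_valid_invalid(beam_outputs, catalogue)
print(valid)    # ['item_7391', 'item_882']
print(invalid)  # ['item_9999', 'item_88']
```

Note that "item_88" is a plausible-looking string assembled from valid pieces yet matches no item, which is the failure mode that unconstrained generation over atomic ID text permits.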

Algorithm: P5 — Pretrain, Personalized Prompt & Predict
────────────────────────────────────────────────────────
Input: interaction data; prompt template set T (multitask)
Backbone: shared T5 encoder-decoder θ
 
# 1. Build multitask prompt corpus
Corpus = []
for each task t in {rating, sequential, direct, explanation, review}:
  for each user-item instance:
    template ← sample prompt from T[t]
    x ← fill template with user_id, item_ids, history   # personalized, as TEXT
    y ← target answer text (item_id / rating / yes-no / sentence)
    Corpus.append((x, y))
 
# 2. Pretrain (single objective over all tasks)
for batch (x, y) in Corpus:
  loss ← - Σ_j log p_θ(y_j | y_<j, x)      # next-token cross-entropy (SFT)
  θ ← θ - α ∇_θ loss
 
# 3. Inference (incl. zero-shot unseen prompts)
x* ← fill (possibly new) template with target user/items
y* ← decode_autoregressive(θ, x*)          # generated answer text
# for ranking: score candidates by generation likelihood / beam search
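
Step 1 of the algorithm above (building the multitask prompt corpus) can be sketched in plain Python. The template wording, the record fields (`target`, `rating`, `liked`), and the helper `build_corpus` are all hypothetical and chosen for illustration; they are not P5's actual prompt set, and the explanation and review tasks are omitted for brevity.

```python
import random

# Hypothetical templates; the wording is illustrative only.
TEMPLATES = {
    "rating": [
        "What star rating will {user} give {item}?",
        "How will {user} rate {item} on a scale of 1 to 5?",
    ],
    "sequential": [
        "{user} has purchased {history}. What will they buy next?",
    ],
    "direct": [
        "Should {item} be recommended to {user}? Answer yes or no.",
    ],
}

def build_corpus(records, rng):
    """Turn interaction records into (prompt, target) text pairs, one per task."""
    corpus = []
    for rec in records:
        fields = {
            "user": rec["user"],
            "item": rec["target"],
            "history": ", ".join(rec["history"]),
        }
        targets = {
            "rating": str(rec["rating"]),             # numeric rating as text
            "sequential": rec["target"],              # next item ID as text
            "direct": "yes" if rec["liked"] else "no",
        }
        for task, templates in TEMPLATES.items():
            template = rng.choice(templates)          # sample one phrasing per task
            corpus.append((template.format(**fields), targets[task]))
    return corpus

records = [{
    "user": "user_23",
    "history": ["item_7391", "item_882"],
    "target": "item_15",
    "rating": 4,
    "liked": True,
}]
corpus = build_corpus(records, random.Random(0))
for x, y in corpus:
    print(x, "->", y)
```

Every pair has the same shape regardless of task, so a single sequence-to-sequence training loop can consume the whole corpus without task-specific branches.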

Connections

Instance of: Generative Recommendation — the “recommendation as a language task” entry point
Item identifier scheme: Atomic Item IDs (L1 rung of Item Tokenization); contrasted with Semantic IDs / RQ-VAE
Training objective: Supervised Fine-Tuning (SFT) (next-token cross-entropy, positives only)
Backbone family: Transformer Model encoder–decoder; relates to Next-Item Prediction framing
Successor / contrast: catalogue-grounded ID generation in TIGER and OneRec; needs Trie-Constrained Decoding for validity (P5 itself does not)
Builds on / sits beside: Sequential Recommendation models (SASRec, BERT4Rec, GRU4Rec) used as baselines
Broader route: LLM-based Recommendation / LLM4Rec — borrowing language-model scaling for recommendation
Sibling paradigm: LLM-as-Recommender / LLM-as-RS (text-title output) vs P5’s text-to-text formulation
