BERT4Rec

Definition

BERT4Rec (Sun et al., CIKM 2019) is a bidirectional Transformer recommender for Sequential Recommendation. Unlike causal/left-to-right models such as SASRec and GRU4Rec, it uses bidirectional self-attention so every position can attend to both left and right context. It is trained with a Cloze (masked-item) task borrowed from BERT: randomly mask a fraction of items in a user’s interaction sequence and predict them from the surrounding items. The motivation is that causal (unidirectional) attention may miss patterns in loosely ordered interaction data.

Intuition

Why mask instead of predict-the-next?

A strict left-to-right model (SASRec, GRU4Rec) only ever conditions on the past, so it learns transitions in one direction. But user histories are often only loosely ordered — what comes after an item is as informative as what came before. By masking interior items and forcing the model to reconstruct them from both sides, BERT4Rec learns richer item representations.

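The Cloze objective also multiplies training signal: a single sequence of length $n$ with several masks yields several prediction targets, and re-sampling the masks yields new targets from the same sequence, instead of the single target obtained when only the final next item is predicted.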
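
To make the "several targets per sequence" point concrete, here is a minimal sketch of Cloze masking in plain Python. The `MASK` id, the helper name `cloze_mask`, and the choice to mask each position independently with probability `mask_prob` (rather than masking an exact fraction) are illustrative assumptions, not details taken from the paper.

```python
import random

MASK = 0  # hypothetical id reserved for the [MASK] token; real items start at 1


def cloze_mask(sequence, mask_prob, rng):
    """Return (masked_sequence, targets) for one user sequence.

    targets maps each masked position to the true item id at that position.
    """
    masked = list(sequence)
    targets = {}
    for pos, item in enumerate(sequence):
        if rng.random() < mask_prob:
            masked[pos] = MASK
            targets[pos] = item
    if not targets:
        # Assumption: force at least one mask so the loss is always defined.
        pos = rng.randrange(len(sequence))
        masked[pos] = MASK
        targets[pos] = sequence[pos]
    return masked, targets


rng = random.Random(0)
history = [12, 7, 33, 5, 19, 42, 8, 27]
masked, targets = cloze_mask(history, mask_prob=0.3, rng=rng)
print(masked)   # history with some items replaced by MASK
print(targets)  # one prediction target per masked position
```

Calling `cloze_mask` again with a different random state masks different positions, which is how the same sequence can contribute new targets in later epochs.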

Mathematical Formulation

Training (Cloze / masked-item task). Given a sequence $S = [v_1, \dots, v_n]$, randomly replace a fraction $\rho$ of its items with a special [MASK] token:

$$S_{\text{masked}} = [v_1, \dots, [\text{MASK}], \dots, v_n]$$

The input representation at each position $i$ is the sum of an item embedding and a positional embedding, $h_i^0 = e(v_i) + p_i$. The sequence passes through $L$ stacked Transformer encoder blocks (Multi-Head Attention → Add & Norm → position-wise Feed-Forward → Add & Norm, with dropout), each with bidirectional connections (no causal mask). A projection head over the final layer produces, for each masked position, a full-vocabulary softmax over items.

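Loss (masked LM / cross-entropy over masked positions):

$$\mathcal{L}_{\text{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log P(v_i \mid S_{\text{masked}}; \theta)$$

where:

  • $M$: set of masked positions in the sequence
  • $v_i$: the true item at masked position $i$
  • $P(v_i \mid S_{\text{masked}}; \theta)$: predicted probability from the Transformer + softmax over the entire item vocabulary
  • $\theta$: model parameters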
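
The loss above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the logits are hand-written stand-ins for the projection-head output, and the helper names `log_softmax` and `masked_item_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_item_loss(logits_per_position, targets):
    """Mean negative log-likelihood over masked positions only.

    logits_per_position: dict position -> list of scores, one per vocabulary item
    targets: dict position -> index of the true item at that masked position
    """
    total = 0.0
    for pos, true_item in targets.items():
        total -= log_softmax(logits_per_position[pos])[true_item]
    return total / len(targets)


# Toy vocabulary of 4 items with two masked positions (1 and 3).
logits = {1: [2.0, 0.5, 0.1, -1.0], 3: [0.0, 0.0, 0.0, 0.0]}
targets = {1: 0, 3: 2}
print(masked_item_loss(logits, targets))
```

Unmasked positions never enter the sum, and each masked position is scored against the whole vocabulary rather than against a handful of sampled negatives.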

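Inference (“mask at the end”). Next-item recommendation requires predicting position $n+1$, but random masking during training seldom produces exactly this pattern: a single [MASK] at the very end with the full history visible. So at inference a [MASK] is appended to the history, and the predicted distribution at that position is used to rank candidates for the next item:

$$P(\cdot \mid [v_1, \dots, v_n, [\text{MASK}]]) = \text{softmax}(\text{Projection}(H_{n+1}))$$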

Adding this “mask-at-the-end” as a second training stage raises performance, because it closes the train/inference mismatch between random-position masking and last-position prediction.
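
A minimal sketch of the mask-at-the-end inference step follows. The trained model is replaced by a hypothetical `score_fn` (a toy scorer is supplied so the example runs), and `MASK`, `max_len`, and the decision not to filter already-seen items are assumptions made for illustration.

```python
import heapq

MASK = 0  # hypothetical id reserved for the [MASK] token


def recommend_next(history, score_fn, k, max_len):
    """Append [MASK] to the history and return the top-k items at that position.

    score_fn stands in for the trained encoder plus projection head: it takes a
    token sequence and returns, per position, a dict item_id -> score.
    """
    # Keep the most recent items so that history + [MASK] fits in max_len.
    start = max(0, len(history) - (max_len - 1))
    tokens = list(history[start:]) + [MASK]
    scores_at_mask = score_fn(tokens)[-1]
    # Assumption: items already in the history are not filtered out here.
    return heapq.nlargest(k, scores_at_mask, key=scores_at_mask.get)


def toy_score_fn(tokens):
    """Hypothetical scorer: prefers the item id right after the last real item."""
    last_item = tokens[-2]
    per_item = {item: -abs(item - (last_item + 1)) for item in range(1, 11)}
    return [per_item for _ in tokens]


print(recommend_next([3, 4, 5], toy_score_fn, k=3, max_len=50))
```

Only the scores at the final position are read; the outputs at the other positions are ignored at inference time.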

Key Properties / Variants

  • Bidirectional vs. causal. BERT4Rec is the bidirectional member of the Transformer-recommender family: SASRec uses a causal mask (the output at position $t$ is computed only from $v_1, \dots, v_t$); RNN methods (GRU4Rec) chain left-to-right; BERT4Rec lets every position see every other.
  • Item + positional embeddings. Like SASRec, the input is an item embedding plus a positional embedding; the difference is the attention pattern and the training objective.
  • Full-vocabulary softmax. The Cloze loss uses cross-entropy over all items, not pairwise/sampled negatives — this is a key contrast with SASRec’s original BCE-with-negative-sampling setup.
  • Strength / limitation (course comparison table). Strength: leverages bidirectional context, outperforming SASRec on multiple datasets in the original paper. Limitation: can be slower to train, and the gains may vary.
  • The loss-vs-architecture caveat (very important for the exam). [Klenitskiy & Vasilev, 2023] (“Turning Dross Into Gold Loss: Is BERT4Rec Really Better Than SASRec?”) show that when SASRec is retrained with a full cross-entropy loss or BCE with many (3000) negatives (“SASRec+”), it beats BERT4Rec on all metrics on ML-1M. The takeaway: BERT4Rec’s apparent edge comes largely from its loss function and the number of negatives (full softmax avoids the overconfidence caused by too few negatives), not from bidirectionality per se. Losses (BPR / BCE / CE) are model-agnostic: any of these architectures can be trained with any of them.
  • Role in the generative-recommendation recap. BERT4Rec is one of the classical “score-and-rank” sequential models: it differs from SASRec/GRU4Rec only in how it encodes the interaction history, but keeps the same skeleton of encoding history → scoring catalogue items. Generative recommenders (TIGER, OneRec) instead decode an item identifier rather than scoring a fixed candidate set. BERT4Rec also appears as a baseline for test-time-reasoning methods (Think Before Recommend reports ~+6% NDCG@20 on top of it).
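
The bidirectional-versus-causal contrast in the list above comes down to which attention entries are allowed. The sketch below builds both patterns as boolean matrices in plain Python; the helper names are hypothetical and padding masks are ignored for simplicity.

```python
def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask with 'x' for allowed and '.' for blocked attention."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("causal (SASRec-style):")
print(show(causal_mask(4)))
print("bidirectional (BERT4Rec-style):")
print(show(bidirectional_mask(4)))
```

The causal pattern is lower-triangular, so row `i` only sees columns `0..i`, while the bidirectional pattern is all `x`. This is why BERT4Rec needs the [MASK] token: without it, a position could read its own target directly.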

Algorithm: BERT4Rec (Cloze training + mask-at-the-end inference)
────────────────────────────────────────────────────────────────
Embeddings: each item v_i -> e(v_i) + p_i  (item + positional)
 
# --- Training (masked-item / Cloze) ---
Loop over user sequences S = [v_1, ..., v_n]:
  M <- sample a fraction ρ of positions to mask
  S_masked <- replace S[i] with [MASK] for i in M
  H <- L stacked bidirectional Transformer encoder blocks(S_masked)
  for i in M:
    P(· | S_masked) <- softmax(Projection(H_i))   # over full item vocab
  L_MLM <- -(1/|M|) Σ_{i∈M} log P(v_i | S_masked)
  update θ by gradient descent on L_MLM
 
# (optional) second stage: mask only the last position to match inference
 
# --- Inference (next-item) ---
S_masked <- [v_1, ..., v_n, [MASK]]
H <- encoder(S_masked)
scores <- softmax(Projection(H_{n+1}))     # distribution over items
recommend top-k items by score
