Negative Sampling

Definition

Negative Sampling

Negative sampling is a training trick for Implicit Feedback recommenders: instead of treating every un-interacted user–item pair as a negative, the model draws a small random subset of unobserved (user, item) pairs to serve as negatives for each observed positive. It converts an intractable “score-all-items” objective into a cheap mini-batch objective while still teaching the model to rank positives above negatives.

Intuition

Why we sample negatives

With Implicit Feedback (clicks, plays, purchases) we only observe positives — the items a user interacted with. The interaction matrix is enormous and almost entirely “missing,” so naively labelling all items a user did not touch as 0 gives a hugely imbalanced classification problem with training instances. Computing a full-vocabulary softmax/loss over millions of items is too expensive.

The trick: for each observed positive, sample only a handful of unobserved items as “negatives.” The gradient still pushes the positive’s score up and the sampled negatives’ scores down, so on average the model learns the same ranking — at a tiny fraction of the cost. This is the move behind BPR, GRU4Rec, SASRec, and Neural Collaborative Filtering.

Mathematical Formulation

For an observed positive interaction , sample a set of negatives from the items has not interacted with, and optimize a contrast between the positive score and the negative scores .

Pointwise form — Binary Cross-Entropy (Implicit Feedback, used by SASRec / Neural Collaborative Filtering):

where:

  • — number of samples per positive (1 positive + several sampled negatives)
  • — label: for the observed positive, for each sampled negative
  • — predicted relevance probability (sigmoid of the model score)

Pairwise form — BPR / GRU4Rec loss (rank positive above sampled negative):

where:

  • — observed positive (true next/relevant item), — sampled negative
  • — model scores for positive and negative
  • — sigmoid; the loss is minimized when

Cost contrast. A full CE objective normalizes over the whole catalog, , costing per example; negative sampling replaces the with a sum over sampled items.

Key Properties / Variants

  • Sampling distribution :
    • Uniform — every non-interacted item equally likely; simplest, cheap.
    • Popularity-based — sample proportional to item frequency (or frequency, the word2vec choice); harder, more informative negatives.
    • In-batch negatives — reuse other positives in the same mini-batch as negatives (free, common in two-tower / Dense Retrieval training).
  • Hard negatives: deliberately sample high-scoring (currently “confusable”) negatives to sharpen the decision boundary, rather than easy random ones.
  • Number of negatives matters a lot. Too few negatives can cause overconfidence and weak ranking; reproducibility work on sequential models (Klenitskiy & Vasilev, 2023) showed that SASRec trained with BCE + ~3000 negatives (or full CE) beats BERT4Rec — i.e. the loss + negative count, not the architecture, drove the reported gap.
  • Bias trade-off: sampled losses approximate the full softmax/CE objective; more (and harder) negatives reduce bias toward the full-catalog ranking at higher compute cost.
  • Where it appears in training: used by Neural Collaborative Filtering (slide 45: “reduce the number of training unobserved instances”), BPR, GRU4Rec, and SASRec (one positive + several negatives per position with a causal mask).
Algorithm: Training with Negative Sampling (per positive interaction)
─────────────────────────────────────────────────────────────────────
For each observed positive (u, i):           # i = item u interacted with
  Draw N_S negatives {j_1, ..., j_{N_S}} ~ P_n,  j ∉ items(u)
  Compute scores  r_hat(u, i) and r_hat(u, j_k) for all k
  Pointwise:  loss = BCE(label=1, r(u,i)) + Σ_k BCE(label=0, r(u,j_k))
  -- or --
  Pairwise:   loss = -Σ_k log σ( r(u,i) - r(u,j_k) )
  Backpropagate; update embeddings of u, i, and sampled j_k only

Connections

Appears In