Collaborative Filtering

Definition

Collaborative Filtering (CF)

Collaborative filtering is a recommendation technique that predicts a user’s interest in items by leveraging the collective knowledge of a large pool of users — i.e. the observed user–item interaction data — rather than item content. The core assumption: users who agreed in the past will agree in the future. If a user is similar to a user , and liked an item has not yet seen, that item is a good recommendation for .

Formally, given users and items , CF frames recommendation as completing the sparse user–item interaction matrix : predict the missing entry (e.g. a rating, click, or purchase) for a (user, item) pair that has not been observed.

Intuition

Wisdom of the crowd, not content

Content-based methods look at what an item is (text, genre, metadata). Collaborative filtering ignores content entirely and looks only at how users behaved: “users who like artist X also like Y.” The signal comes from patterns of agreement across many users.

Picture the interaction matrix: rows = users, columns = items, most cells empty. CF fills a missing cell ? in one of two ways:

  • User-based: “How do users similar to me rate this item?”
  • Item-based: “How do I rate items similar to this one?” (usually preferred, because item–item relationships are more stable over time than user profiles.)

Mathematical Formulation

CF splits into two families. Neighborhood (memory-based) CF makes predictions directly from similarity; model-based CF trains a model (e.g. matrix factorization) from the data.

User-based rating prediction (neighborhood CF)

Predict the rating of user for item by averaging the ratings that ‘s nearest neighbors gave to :

where:

  • — set of the nearest neighbors of who have rated item (the -NN)
  • — rating that neighbor gave to item
  • similarity (defining “nearest”) is computed from shared interactions, e.g. cosine or Pearson correlation between rating vectors

(More general weighted variants weight each neighbor by its similarity and mean-center to correct for user rating bias.)

Matrix Factorization (model-based CF)

Approximately factorize the interaction matrix into a low-rank product of a user-factor matrix and an item-factor matrix:

where:

  • — each row is a user factor (preferences over latent concepts)
  • — each row is an item factor (properties over the same concepts)
  • — number of latent factors (the rank);
  • the predicted preference is the dot product of the user and item factors

Latent factors can be interpretable: a rank-2 movie example learns a history axis and a romance axis, and a rating reconstructs as (user’s affinity to history item’s history-ness) + (user’s affinity to romance item’s romance-ness). The general recipe is: define a model → define an objective → optimize (e.g. minimize regularized squared error over observed entries via SGD/ALS).

Key Properties / Variants

  • Two CF families:
    • Neighborhood / memory-based: no model trained in advance; relies on the similarity of two entities (-NN). Pros: simple, efficient, transparent. Cons: data sparsity, noise, scalability. Sub-types: User-based Collaborative Filtering and Item-based Collaborative Filtering (item-based preferred for stability).
    • Model-based: train a model from data (e.g. Matrix Factorization). Pros: scalability, generalization. Cons: complexity, black-box, overfitting with insufficient data.
  • Implicit vs explicit feedback: explicit = ratings; implicit = clicks/views/purchases (treat observed = positive, unobserved = candidate negatives, often with negative sampling).
  • Bayesian Personalized Ranking (BPR): a pairwise ranking objective for implicit feedback; optimizes that observed items are scored above unobserved ones — a standard CF baseline.
Algorithm: User-based Neighborhood CF (rating prediction)
──────────────────────────────────────────────────────────
Input: interaction matrix R, target (u, i), neighborhood size k
1. For every other user v that has rated item i:
     compute sim(u, v)   # e.g. cosine / Pearson over co-rated items
2. N_i(u) ← the k users with highest sim(u, v) that rated i
3. r_hat(u,i) ← (1 / |N_i(u)|) * Σ_{v in N_i(u)} r(v,i)
     # (weighted variant: Σ sim(u,v)·r(v,i) / Σ sim(u,v), mean-centered)
4. return r_hat(u,i)        # rank items by r_hat for top-N recommendation
  • Neural Collaborative Filtering (NCF): replaces the fixed dot product with a neural network over user/item embeddings, capturing non-linear interactions. He et al. (2017) showed MF is a special case of NCF (replace the neural layers with an element-wise multiplication layer, a fixed all-ones output weight, and identity activation recovers the dot product). NCF treats prediction as binary classification: weighted square loss (explicit) or binary cross-entropy (implicit), with negative sampling.
  • When CF fails: standard CF/MF treats interactions as an unordered set and ignores temporal order — motivating Sequential Recommendation (e.g. FPMC, GRU4Rec, SASRec). It also suffers the Cold Start Problem: new users/items have no interactions, so no collaborative signal exists.
  • Beyond accuracy: purely accuracy-optimal CF tends to recommend popular/similar items, hurting Diversity, Novelty, Coverage, and item-side Fairness in Recommendation (popularity bias, long-tail under-exposure, filter bubbles).
  • No universal winner: simple, well-tuned neighborhood CF often matches complex neural models (a recurring reproducibility finding); the best method depends on problem formulation, domain, and available data — hybrids frequently win.

Connections

Appears In