BM25

BM25 (Best Matching 25)

BM25 is a probabilistic ranking function based on the binary independence model. It is one of the most widely used unsupervised retrieval models and the standard baseline in information retrieval.

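BM25 Scoring Function

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:

  • $\mathrm{tf}(t, d)$: term frequency of term $t$ in document $d$
  • $|d|$: length of document $d$ (in words)
  • $\mathrm{avgdl}$: average document length in the collection
  • $k_1$: term frequency saturation parameter (typically 1.2-2.0)
  • $b$: length normalization parameter (typically 0.75)
  • $\mathrm{IDF}(t)$: inverse document frequency of term $t$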
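
As a rough illustration, the scoring function can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the names `idf` and `bm25_score` are hypothetical, documents are assumed to be pre-tokenized lists of words, and the IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)) is one common choice that the text here does not specify.

```python
import math
from collections import Counter

def idf(term, docs):
    """Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)); always positive."""
    n_docs = len(docs)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(1.0 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    tf = Counter(doc)                              # term frequencies in this document
    score = 0.0
    for term in query:
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        norm = 1.0 - b + b * len(doc) / avgdl      # length normalization factor
        score += idf(term, docs) * f * (k1 + 1.0) / (f + k1 * norm)
    return score

# Toy collection (made up for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the yard".split(),
    "stock markets fell sharply today".split(),
]
query = "cat mat".split()
scores = [bm25_score(query, d, docs) for d in docs]
```

With this toy collection, the first document matches both query terms and ranks highest, the second matches only "cat", and the third matches nothing and scores zero. A real system would precompute document frequencies and `avgdl` in an index rather than rescanning the collection per query.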

What Each Part Does

  • IDF: Rare terms are more informative → higher weight
  • TF saturation: First occurrence matters most; additional occurrences have diminishing returns (controlled by $k_1$). As $k_1 \to \infty$, the TF component approaches raw TF; as $k_1 \to 0$, it reduces to binary presence/absence.
  • Length normalization: Longer documents naturally have higher TF. Parameter $b$ controls how much to penalize length. $b = 1$: full normalization; $b = 0$: no normalization.
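
The saturation behavior can be checked numerically. The sketch below isolates the TF factor tf·(k1 + 1) / (tf + k1·norm) for a document of average length (norm = 1); the helper name `tf_component` is hypothetical and the values are chosen only for illustration.

```python
def tf_component(tf, k1, norm=1.0):
    """Saturating TF factor tf*(k1+1) / (tf + k1*norm).

    norm=1.0 corresponds to a document of exactly average length.
    """
    return tf * (k1 + 1.0) / (tf + k1 * norm)

# Diminishing returns at k1 = 1.2: the gain from the (n+1)-th occurrence
# shrinks as n grows.
gains = [tf_component(n + 1, 1.2) - tf_component(n, 1.2) for n in range(5)]

# Limiting behavior for a term that occurs 5 times:
near_binary = tf_component(5, 1e-9)  # k1 -> 0: factor tends to 1 (present/absent)
near_raw = tf_component(5, 1e9)      # k1 -> infinity: factor tends to tf itself
```

For fixed k1 and norm = 1, the factor never exceeds k1 + 1 however often the term repeats, which is what keeps a single heavily repeated term from dominating the score.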

Parameter Effects

Parameter | Low | High
$k_1$ | Binary (term present/absent) | Raw term frequency
$b$ | No length normalization | Full length normalization
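
To make the effect of the length normalization parameter concrete, the following sketch compares the TF factor for a short and a long document containing the same number of term occurrences. The helper name `bm25_tf` and the lengths (50 and 400 words against an average of 100) are assumptions chosen for illustration.

```python
def bm25_tf(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """TF factor of BM25 including length normalization."""
    norm = 1.0 - b + b * doc_len / avgdl
    return tf * (k1 + 1.0) / (tf + k1 * norm)

AVGDL = 100
SHORT_LEN, LONG_LEN = 50, 400

# For each b, the factor for (short document, long document), both with tf = 3.
results = {
    b: (bm25_tf(3, SHORT_LEN, AVGDL, b=b), bm25_tf(3, LONG_LEN, AVGDL, b=b))
    for b in (0.0, 0.75, 1.0)
}

# Gap between the short and long document at each setting of b.
gaps = {b: short - long_ for b, (short, long_) in results.items()}
```

At b = 0 the two documents receive the same factor because length is ignored; as b rises toward 1 the long document is penalized more and the short one is boosted, so the gap widens.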
