TF-IDF
TF-IDF (Term Frequency–Inverse Document Frequency) is a term weighting scheme that reflects how important a word is to a document in a collection. It combines two intuitions: terms that appear frequently in a document are important (TF), and terms that appear in few documents are more discriminative (IDF).
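TF-IDF Weight
$$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$
where:
- $\mathrm{tf}_{t,d}$: term frequency (count of term $t$ in document $d$, or log-scaled: $1 + \log \mathrm{tf}_{t,d}$)
- $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$: inverse document frequency ($N$ = total docs, $\mathrm{df}_t$ = docs containing $t$)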
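
As a rough illustration of the weight defined above, the following minimal sketch computes raw-count TF times unsmoothed IDF for a toy corpus. The function name `tfidf_weights`, the sample documents, and the choice of natural logarithm are assumptions made for this example; real systems often smooth the IDF or use a different log base.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw-count TF and idf = ln(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]
weights = tfidf_weights(docs)
```

Here "the" occurs in every document, so its IDF is $\log(3/3) = 0$ and its weight vanishes even in the third document, where it appears twice. "chased" occurs in only one document and receives the largest IDF in the corpus.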
TF Variants
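| Variant | Formula | Behavior |
|---|---|---|
| Raw | $\mathrm{tf}_{t,d}$ | Linear with count |
| Log-scaled | $1 + \log \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, else $0$ | Sublinear: diminishing returns |
| Boolean | $1$ if $\mathrm{tf}_{t,d} > 0$, else $0$ | Presence/absence only |
| Augmented | $0.5 + 0.5 \cdot \frac{\mathrm{tf}_{t,d}}{\max_{t'} \mathrm{tf}_{t',d}}$ | Normalized by max TF |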
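
The four variants can be sketched as small functions to make their behavior concrete. The function names are hypothetical, and the formulas are the common textbook forms, which is an assumption here. In particular, the smoothing constant `k=0.5` in the augmented variant is one conventional choice (other values are also used), and the natural logarithm is assumed.

```python
import math


def tf_raw(count):
    """Raw count: grows linearly with the number of occurrences."""
    return float(count)


def tf_log(count):
    """Log-scaled: 1 + ln(count) for present terms, 0 for absent ones."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def tf_boolean(count):
    """Boolean: records presence or absence only."""
    return 1.0 if count > 0 else 0.0


def tf_augmented(count, max_count, k=0.5):
    """Augmented: count scaled by the document's largest term count.

    k is an assumed smoothing constant; max_count must be positive.
    """
    return k + (1.0 - k) * count / max_count


# Compare how each variant responds as the count grows from 1 to 100.
for count in (1, 10, 100):
    print(count, tf_raw(count), round(tf_log(count), 3), tf_boolean(count),
          tf_augmented(count, max_count=100))
```

Going from 1 to 100 occurrences multiplies the raw weight by 100, while the log-scaled weight grows only from 1 to about 5.6. That is the diminishing-returns behavior noted in the table.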
Scoring with TF-IDF
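Documents and queries are represented as TF-IDF weighted vectors in the Vector Space Model. Similarity is computed via cosine:
$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert} = \frac{\sum_{t} w_{t,q} \, w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^2} \, \sqrt{\sum_{t} w_{t,d}^2}}$$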
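
A minimal sketch of cosine scoring over sparse vectors, assuming each vector is stored as a `{term: weight}` dictionary. The name `cosine_similarity` and the sample weights are illustrative, not taken from any particular library. Returning 0 for an all-zero vector is a design choice made here to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF weights for a query and two documents.
query = {"cat": 1.0, "dog": 1.0}
doc_a = {"cat": 2.0, "sat": 1.0}
doc_b = {"bird": 3.0}

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    [("doc_a", cosine_similarity(query, doc_a)),
     ("doc_b", cosine_similarity(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because the score divides by both vector norms, a long document does not outrank a short one simply by containing more terms. `doc_b` shares no terms with the query and scores 0.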
Connections
- Foundation of: Vector Space Model
- Extended by: BM25 (adds saturation + length normalization)
- Used in: Inverted Index for term weighting