TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a term weighting scheme that reflects how important a word is to a document within a collection. It combines two intuitions: a term that occurs often in a document is likely to be important to that document (TF), and a term that occurs in few documents discriminates between documents better than one that occurs almost everywhere (IDF).

TF-IDF Weight

$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

where:

$\mathrm{tf}(t, d)$: term frequency (count of term $t$ in document $d$, or log-scaled: $1 + \log \mathrm{tf}(t, d)$)

$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$: inverse document frequency ($N$ = total docs, $\mathrm{df}(t)$ = docs containing $t$)
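The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes documents are already tokenized into lists of strings, uses raw counts for TF and the natural logarithm for IDF, and the function name `tf_idf_weights` and the toy corpus are made up for this example.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df(t): number of documents that contain t at least once
    df = Counter(term for doc in docs for term in set(doc))
    # idf(t) = log(N / df(t)), natural log (assumption: any base works up to scaling)
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency f(t, d)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

# Hypothetical toy corpus
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "cat", "purred"],
]
weights = tf_idf_weights(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is $\log(3/3) = 0$ and its weight vanishes everywhere, while "purred" occurs in one document only and gets the largest IDF, $\log 3$.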

TF Variants

Variant	Formula	Behavior
Raw	$f (t, d)$	Linear with count
Log-scaled	$1 + \log f(t, d)$	Sublinear: diminishing returns
Boolean	$1$ if $t \in d$ else $0$	Presence/absence only
Augmented	$0.5 + 0.5 \cdot \frac{f(t, d)}{\max_{t'} f(t', d)}$	Normalized by max TF
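To compare the variants side by side, the sketch below evaluates all four for one term in one tokenized document. The helper name `tf_variants` is hypothetical. Two choices here are assumptions rather than something the table states: a term with zero count gets a log-scaled TF of 0 (since $\log 0$ is undefined), and the augmented formula is applied as written, so an absent term still scores 0.5.

```python
import math
from collections import Counter

def tf_variants(term, doc):
    """Compute the four TF variants for `term` in the token list `doc`."""
    counts = Counter(doc)
    f = counts[term]  # raw count f(t, d); 0 if the term is absent
    max_f = max(counts.values()) if counts else 0
    return {
        "raw": f,
        # Assumption: define log-scaled TF as 0 when the term is absent
        "log": 1 + math.log(f) if f > 0 else 0.0,
        "boolean": 1 if f > 0 else 0,
        # Formula as written; an empty document is mapped to 0.0 (assumption)
        "augmented": 0.5 + 0.5 * f / max_f if max_f > 0 else 0.0,
    }

doc = ["a", "a", "a", "a", "b"]
for term in ("a", "b", "z"):
    print(term, tf_variants(term, doc))
```

For "a" (4 occurrences) the raw TF is four times that of "b", but the log-scaled TF is only about 2.4 times larger, which is the diminishing-returns behavior noted in the table.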

Scoring with TF-IDF

Documents and queries are represented as TF-IDF weighted vectors in the Vector Space Model. Similarity is computed via cosine:

$\mathrm{sim}(q, d) = \frac{q \cdot d}{\|q\| \cdot \|d\|}$
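As a rough sketch of the cosine formula, the function below scores sparse vectors stored as `{term: weight}` dicts, which is one convenient representation and not the only one. The name `cosine_similarity`, the example weights, and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_q * norm_d)

# Hypothetical TF-IDF weights for a query and two documents
query = {"cat": 1.0, "purred": 1.0}
doc_a = {"cat": 2.0, "purred": 2.0}  # same direction as the query
doc_b = {"dog": 1.5, "sat": 0.4}     # no terms in common with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the norms divide out vector length, `doc_a` scores close to 1 even though its weights are twice the query's, and `doc_b` scores 0 because it shares no terms with the query.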

Connections

Foundation of: Vector Space Model
Extended by: BM25 (adds saturation + length normalization)
Used in: Inverted Index for term weighting
