DeepCT

DeepCT (Deep Contextualized Term Weighting)

DeepCT is a method that uses BERT to estimate context-aware term importance (term weights) for documents and queries. It maps BERT’s contextual embedding of each term to a single scalar importance score, which can then replace or augment the traditional term frequency (TF) in a standard inverted index.
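As a rough illustration of the embedding-to-scalar step, the sketch below applies a per-token linear layer to toy vectors. It assumes a linear regression head; the names `term_score`, `score_tokens`, `W`, and `B`, as well as every number, are invented for illustration and do not come from a trained model.

```python
from typing import List, Sequence


def term_score(embedding: Sequence[float], w: Sequence[float], b: float) -> float:
    """Map one contextual embedding to one scalar via a linear layer (w . e + b)."""
    if len(embedding) != len(w):
        raise ValueError("embedding and weight vector must have the same length")
    return sum(e * wi for e, wi in zip(embedding, w)) + b


def score_tokens(embeddings: Sequence[Sequence[float]],
                 w: Sequence[float], b: float) -> List[float]:
    """Score every token embedding of a passage independently."""
    return [term_score(e, w, b) for e in embeddings]


# Hypothetical 3-dimensional embeddings of the SAME word in two contexts.
# Real BERT embeddings have hundreds of dimensions; these values are made up.
W = [0.5, -0.25, 0.25]
B = 0.125
in_title = [1.0, 0.0, 1.0]   # the word as it appears in a descriptive sentence
in_aside = [0.0, 1.0, 0.5]   # the same word in a tangential remark

scores = score_tokens([in_title, in_aside], W, B)
print(scores)
```

Because the embedding of a word depends on its surrounding text, the same surface term can receive different scores in different passages, which a raw occurrence count cannot express.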

DeepCT Weighting

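For a document $d$, DeepCT predicts a weight for each term $t$:

$$\text{TF}_{\text{DeepCT}}(t, d) = \text{round}(\hat{y}_{t,d} \cdot N)$$

where:

  • $\hat{y}_{t,d}$ is the predicted importance score from BERT for term $t$ in the context of $d$
  • $N$ is a scaling factor to convert scores into integer frequencies (often $N = 100$)
  • These predicted scores replace the raw $\text{tf}(t, d)$ in the BM25 formula.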
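The scaling step and its use in BM25 can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names, the clamp to zero for negative predictions, the example scores, and the BM25 parameter values `k1` and `b` are all assumptions made for illustration. How document length should be computed once raw counts are replaced is also left to the caller here.

```python
import math
from typing import Dict


def deepct_tf(score: float, n: int = 100) -> int:
    """Scale a predicted score to an integer pseudo-frequency.

    Clamping at zero is an assumption of this sketch: a regression output
    can be slightly negative, and a term frequency cannot.
    Note that Python's round() rounds exact halves to the nearest even integer.
    """
    return max(0, round(score * n))


def bm25_term(tf: int, df: int, num_docs: int, doc_len: float,
              avg_doc_len: float, k1: float = 0.9, b: float = 0.4) -> float:
    """BM25 contribution of one query term, with tf supplied by the caller."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Hypothetical predicted scores for three terms of one passage.
predicted: Dict[str, float] = {"bert": 0.42, "the": 0.003, "ranking": 0.17}
tfs = {term: deepct_tf(score) for term, score in predicted.items()}
print(tfs)

# The pseudo-frequency is passed where the raw count would normally go.
high = bm25_term(tfs["bert"], df=50, num_docs=1000, doc_len=60, avg_doc_len=60)
low = bm25_term(tfs["ranking"], df=50, num_docs=1000, doc_len=60, avg_doc_len=60)
```

A term whose score rounds to zero contributes nothing to the BM25 sum, so it effectively drops out of the index for that passage.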

Context over Count

Traditional IR relies on raw Term Frequency (counting occurrences). DeepCT recognizes that a term appearing once in a critical, descriptive sentence (e.g., a title or summary) is often more important than a term appearing multiple times in tangential contexts. By using BERT, DeepCT “sees” the context and predicts how much a term actually contributes to the document’s meaning.

Key Features

  • Inverted Index Compatibility: Because it produces integer weights, the output can be stored in the inverted index of a standard search engine (such as Lucene or Elasticsearch) without changing the retrieval architecture.
  • Improved Recall/Precision: Up-weights the terms that are central to a passage and down-weights “stop-word-like” occurrences, where a term that may matter elsewhere is only incidental in the current context.
  • Efficiency: Expensive neural computation is done offline during indexing; retrieval remains as fast as standard BM25.
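One simple way to get integer weights into an unmodified engine is to repeat each term as many times as its weight, so that ordinary term counting at indexing time recovers the weights. The sketch below illustrates that idea with a toy in-memory index; `to_pseudo_document` and `build_inverted_index` are hypothetical helpers, not part of Lucene or Elasticsearch, and the weights are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def to_pseudo_document(term_weights: Dict[str, int]) -> str:
    """Repeat each term `weight` times; zero-weight terms disappear."""
    tokens: List[str] = []
    for term, weight in term_weights.items():
        tokens.extend([term] * weight)
    return " ".join(tokens)


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Toy inverted index: term -> list of (doc_id, term frequency) postings."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term].append((doc_id, tf))
    return dict(index)


# Hypothetical integer weights produced offline, one dict per document.
weights = {
    "d1": {"deepct": 3, "bert": 2, "the": 0},
    "d2": {"bert": 1, "index": 4},
}
pseudo_docs = {doc_id: to_pseudo_document(w) for doc_id, w in weights.items()}
index = build_inverted_index(pseudo_docs)
print(index["bert"])
```

All neural work happens before `build_inverted_index` is called; at query time the postings look like ordinary term frequencies, so lookup cost is unchanged.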
