Stop Words

Stop words are high-frequency words that carry little semantic weight and so contribute little to distinguishing one document from another. Examples include “the”, “is”, “at”, “which”, and “on”.

Why Remove Them?

  • Index Size: Stop words can account for 20-30% of the postings in an Inverted Index. Removing them significantly reduces disk space and memory usage.
  • Efficiency: Queries containing “the” are expensive to process because its posting list covers nearly every document in the collection.
  • Relevance: In vector-based models, high-frequency words can “wash out” the signal from rarer, more meaningful terms.
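
As a rough illustration of the index-size point, the sketch below builds a toy inverted index with and without a stop list and counts the postings in each. The stop list is just the five examples given above, and the tokenizer, the helper names (build_index, posting_count), and the three-document corpus are assumptions made for this example, not a reference implementation.

```python
from collections import defaultdict

# Illustrative stop list: only the five examples from the text.
# Real systems use longer, curated lists.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            if token not in stop_words:
                index[token].add(doc_id)
    return index

def posting_count(index):
    """Total number of (term, document) postings in the index."""
    return sum(len(postings) for postings in index.values())

docs = [
    "The cat is on the mat",
    "The dog is at the door",
    "Which cat is the loudest",
]

full = build_index(docs)
pruned = build_index(docs, STOP_WORDS)

print(posting_count(full))    # 15
print(posting_count(pruned))  # 6
```

The reduction here is exaggerated because the toy documents are mostly function words; the point is only that every stop word removed drops one posting per document it appears in.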

The Modern Perspective

Modern Information Retrieval (especially Neural IR and LLM-based systems) often keeps stop words. They are crucial for understanding phrases (“To be or not to be”), dependency structures, and local context. For large-scale web search, storage is cheap enough that the benefits of keeping them outweigh the costs.
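
To see what is lost when stop words are dropped, the sketch below filters a few queries against a slightly broader stop list. The extra entries (“to”, “be”, “or”, “not”, “from”) and the helper name content_terms are assumptions chosen for this example; actual stop lists vary by system.

```python
# Hypothetical stop list: the examples from the text plus a few
# common function words, added only to make the effect visible.
STOP_WORDS = frozenset({
    "the", "is", "at", "which", "on",
    "to", "be", "or", "not", "from",
})

def content_terms(query, stop_words=STOP_WORDS):
    """Return the query terms that survive stop word removal."""
    return [t for t in query.lower().split() if t not in stop_words]

# The whole phrase is made of stop words, so nothing is left to search on.
print(content_terms("To be or not to be"))      # []

# Two queries with opposite meanings collapse to the same terms.
print(content_terms("flights to London"))       # ['flights', 'london']
print(content_terms("flights from London"))     # ['flights', 'london']
```

Systems that keep stop words can still distinguish these queries, which is one reason phrase search and context-sensitive models tend to retain them.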

Connections

  • Preprocessing pipeline: Tokenization → Stop Word Removal → Stemming.
  • Term Weighting: TF-IDF naturally downweights stop words via the IDF component even if they aren’t explicitly removed.
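
The downweighting effect of IDF can be checked on a toy corpus. The sketch below assumes the plain variant idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; many systems use smoothed variants instead. The function name and the corpus are made up for this example.

```python
import math

def idf(term, docs):
    """Plain inverse document frequency: log(N / df).

    Returns 0.0 for terms that appear in no document, which is a
    convention chosen for this sketch to avoid division by zero.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

# Each document is represented as the set of terms it contains.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
]

for term in ("the", "cat", "dog"):
    print(term, round(idf(term, docs), 3))
# the 0.0
# cat 0.405
# dog 1.099
```

Because “the” occurs in every document, its IDF is log(3/3) = 0, so it contributes nothing to a TF-IDF score no matter how often it appears, while the rarer “dog” gets the highest weight.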
