IR Chapter 4: Processing Text
Overview
Text processing (or text transformation) is the pipeline used to convert raw document text into consistent index terms. This process is fundamental to Information Retrieval, moving beyond simple “exact match” find features to handle linguistic variations, noise, and statistical properties of human language.
4.1 From Words to Terms
The primary goal of text processing is to normalize the many forms in which words occur into a standardized representation.
Index Terms
The representation of the content of a document used for searching.
Key Decisions in Text Processing:
- Case-sensitivity: Most search engines utilize case folding (lowercasing) to increase match probability.
- Punctuation: Stripping punctuation to simplify queries.
- Tokenization: Splitting text into individual tokens.
- Stop Words: Removing high-frequency, low-content words.
- Stemming: Grouping morphological variants (e.g., “run”, “running”).
4.2 Text Statistics
Language is highly predictable. Understanding the statistical distribution of words is crucial for ranking models and indexing.
Zipf’s Law
Zipf’s law describes the skewed distribution of word frequencies: a few words occur very frequently, while many occur only once.
Zipf's Law
The frequency of the -th most common word () is inversely proportional to its rank (): Or in terms of probability : For English, .
Key Implications:
- The top 50 words account for ~40% of all text.
- Roughly half of the unique words in a corpus occur only once (Hapax Legomena).
- Mandelbrot’s Modification: (allows for tuning to specific corpora).
Heaps’ Law (Vocabulary Growth)
Predicts how the vocabulary size () grows relative to the number of tokens in the collection ().
Heaps' Law
Typical values: and .
Result Set Size Estimation
Estimating the number of documents containing multiple query terms ().
- Independence Assumption: Warning: Often underestimates because words are semantically dependent.
- Conditional Probability:
4.3 Document Parsing
Parsing involves recognizing the content and structure of documents.
4.3.2 Tokenization
The process of forming words from a sequence of characters (lexical analysis).
Tokenization Challenges
- Small words: “XP”, “II”, “J Lo” can be significant.
- Hyphens: “e-bay” vs “ebay”, “wal-mart” vs “walmart”.
- Apostrophes: Possessives vs. contractions vs. names (“O’Donnell”).
- Numbers: Product IDs, dates, patent numbers.
General Strategy:
- Pass 1: Identify markup/tags (HTML/XML).
- Pass 2: Tokenize text content, typically treating non-alphanumeric characters as word terminators and converting to lowercase.
4.3.3 Stop Words (Stopping)
Function words (determiners, prepositions) that provide structure but little semantic content.
Stop Words
- Determiners: “the”, “a”, “an”, “that”
- Prepositions: “over”, “under”, “above”
Removal Benefits:
- Decreases Inverted Index size.
- Increases retrieval efficiency.
- Risk: Can break queries like “to be or not to be”. Modern systems often index all terms but may remove stop words from queries dynamically.
4.3.4 Stemming
Stemming (conflation) reduces inflected or derived words to a common base form (stem).
Stemming
Searching for “swimming” should match documents containing “swam” or “swims” by reducing them to the stem “swim”.
Types of Stemmers:
- Algorithmic: Rules based on suffixes.
- Porter Stemmer: Most popular for English. Uses 5 steps of suffix stripping.
- Porter2: Improved version with exception handling.
- Dictionary-based: Matches words against a lookup table.
- Krovetz Stemmer: Hybrid approach checking stems against a dictionary to ensure they are valid words.
Error Types:
- False Positive: Grouping unrelated words (e.g., “policy” / “police”).
- False Negative: Failing to group related words (e.g., “europe” / “european”).
4.3.5 Phrases and N-grams
While single words are the default, phrases offer higher precision.
- POS Tagging: Identifying simple noun phrases (Adjective + Noun or Noun + Noun).
- N-grams: Any sequence of words.
- Unigram: “tropical”
- Bigram: “tropical fish”
- Trigram: “tropical fish aquarium”
Summary
Text processing is the foundational stage of the IR pipeline. By understanding the statistical nature of text (Zipf/Heaps) and applying transformations like Tokenization, stopping, and Stemming, search engines create a searchable Bag of Words or phrase representation that balances efficiency with retrieval effectiveness. Correct normalization ensures that the Term Weighting (e.g., BM25) can accurately estimate relevance.