Welcome back 👋 to Day 13 of our 100 Days of NLP journey. Now that we’ve explored One-Hot Encoding and the Bag of Words (BoW) model, it's time to level up our text representation with TF-IDF (Term Frequency-Inverse Document Frequency). This approach combines frequency with relevance, helping us identify not just which words are common, but which ones are meaningful across multiple documents.
Let’s dive in and learn why TF-IDF is a powerful text representation technique for NLP! 🚀
What is TF-IDF?
TF-IDF is a technique that assigns a weight to each word in a document based on:
Term Frequency (TF): How frequently a word appears in a document.
Inverse Document Frequency (IDF): How common or rare the word is across all documents in a corpus.
Together, TF and IDF help us measure how important a word is to a document in the context of a larger collection.
How Is It Calculated?
Term Frequency (TF): This simply measures how often a term appears in a single document.
Since a term might appear more frequently in a longer document than in a shorter one, we normalize these counts by dividing the frequency by the document’s length.
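In its simplest normalized form: TF(t, d) = (number of times t appears in d) / (total number of terms in d). For example, if “learning” appears 5 times in a 100-word document, its TF is 5/100 = 0.05.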
Inverse Document Frequency (IDF): This measures the rarity of the word across the entire corpus. Words that appear in many documents are less informative than those appearing in fewer documents.
IDF reduces the weight of terms that are very common across a corpus while increasing the weight of rarer terms.
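A common formulation is: IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t. (Libraries often use smoothed variants; scikit-learn’s default, for instance, adds 1 to both counts and to the final result so that no term gets a zero or undefined weight.)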
The TF-IDF score is then calculated by multiplying TF and IDF for each term:
TF-IDF score = TF * IDF
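For example, if “learning” appears 5 times in a 100-word document (TF = 0.05) and occurs in 10 out of 1,000 documents (IDF = log base 10 of 1000/10 = 2), its TF-IDF score is 0.05 × 2 = 0.1.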
TF-IDF in Python
Let’s see how we can implement TF-IDF in Python using TfidfVectorizer from scikit-learn. Additionally, if you'd like to visualize the scores across features, you can use pandas to display them as a DataFrame.
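Here’s a minimal sketch using a tiny, made-up corpus (assuming scikit-learn and pandas are installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# A tiny toy corpus, made up purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

# Learn the vocabulary and compute the TF-IDF matrix in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Wrap the sparse matrix in a DataFrame: one row per document, one column per term
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
)
print(df.round(3))
```

By default, TfidfVectorizer applies smoothed IDF and L2-normalizes each document vector; both behaviors can be adjusted through its parameters (e.g., smooth_idf, norm).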
Advantages of TF-IDF
Focus on Relevance: Unlike BoW, TF-IDF emphasizes meaningful words by balancing term frequency with document frequency.
Effective for Filtering Out Common Words: TF-IDF automatically gives less importance to common words (e.g., “the,” “is”) and prioritizes unique terms.
Limitations of TF-IDF
Context Limitations: TF-IDF treats words independently, ignoring the order or context of words.
Sparse Representation: Like BoW, TF-IDF results in a sparse matrix, which can be challenging to manage with large datasets.
Limited for Semantic Meaning: It doesn’t capture relationships between words (e.g., synonyms).
Out-of-Vocabulary (OOV) Words: TF-IDF cannot handle words that were not seen when the vocabulary was built.
Use Cases of TF-IDF
Information Retrieval: Used by search engines to rank documents based on relevance to a query.
Text Classification: Commonly used in document classification tasks, such as identifying spam emails.
Text Similarity: Helps in calculating similarity scores between documents (see the short sketch after this list).
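As a quick illustration of the similarity use case, here’s a small sketch (again with a made-up set of documents) that converts documents to TF-IDF vectors and compares them with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, made up for illustration
docs = [
    "machine learning for text classification",
    "deep learning for image recognition",
    "classifying text with machine learning",
]

# Vectorize the documents and compute pairwise cosine similarity
tfidf = TfidfVectorizer().fit_transform(docs)
similarity = cosine_similarity(tfidf)

# similarity[i][j] is the cosine similarity between document i and document j;
# documents 0 and 2 share the most terms, so they should score highest
print(similarity.round(2))
```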
What’s Next?
Now that we’ve covered TF-IDF, we’re ready to move deeper into word embeddings, where we capture more nuanced meaning and relationships between words. Join us tomorrow as we dive into GloVe (Global Vectors for Word Representation), where we’ll unlock new ways to represent words! 🌐
And that’s it for Day 13! 🎉 If you enjoyed this post, share it with others exploring NLP, and don’t forget to subscribe. We’re building up to some exciting projects! 😄📚