TFIDF Vectorizer
Dec 28, 2018
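In simple terms, TFIDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document within a collection of documents. The weight grows with how often the word appears in the document and shrinks when the word appears in many of the documents.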
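For reference, a common textbook form of the statistic is sketched below. The symbols are assumptions introduced here for illustration, and libraries often use smoothed or normalised variants, so the numbers they produce can differ from this plain version.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents that contain t.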
We create two text documents as follows:
text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"
Word Tokenization
words1 = text1.split(" ")
words2 = text2.split(" ")
Print words1 to see the following output:
['I', 'love', 'my', 'cat', 'but', 'the', 'cat', 'sat', 'on', 'my', 'face']
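Splitting on spaces keeps case and punctuation exactly as they are. As a rough sketch of a slightly stricter tokenizer, the regular expression below is meant to mimic the default token pattern documented for scikit-learn's TfidfVectorizer (lowercase text, tokens of two or more word characters). The helper name simple_tokenize is made up for this illustration.

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then keep runs of two or more word characters.
    # Assumption: this mirrors the default token_pattern r"(?u)\b\w\w+\b".
    return re.findall(r"\b\w\w+\b", text.lower())

text1 = "I love my cat but the cat sat on my face"
tokens1 = simple_tokenize(text1)
print(tokens1)
```

Note that the one-letter word "I" is dropped by this pattern, which is why it will not show up among the features later on.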
Combining the Words into a Single Set
corpus = set(words1).union(set(words2))
print(corpus)
Output (sets are unordered, so the order may vary between runs):
{'bed', 'but', 'sat', 'love', 'cat', 'my', 'face', 'the', 'I', 'dog', 'on'}
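With the combined vocabulary in hand, each document can be turned into a vector of raw term counts, which is the "TF" half of TFIDF. A minimal sketch, assuming the same whitespace tokenization as above; the names vocabulary, tf1 and tf2 are illustrative only.

```python
from collections import Counter

text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"
words1 = text1.split(" ")
words2 = text2.split(" ")

# Sort the vocabulary so the order is reproducible (sets are unordered).
vocabulary = sorted(set(words1).union(set(words2)))

counts1, counts2 = Counter(words1), Counter(words2)

# A Counter returns 0 for words it has never seen, so every
# vocabulary word gets an entry for both documents.
tf1 = {word: counts1[word] for word in vocabulary}
tf2 = {word: counts2[word] for word in vocabulary}

print(tf1["cat"], tf2["cat"])  # 2 0
```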
TFIDF Vectorization
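from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer()
# Note: this rebinds corpus from the set above to a list of the two sentences
corpus = ["I love my cat but the cat sat on my face", "I love my dog but the dog sat on my bed"]
# Learn the vocabulary and compute the TFIDF matrix (rows = sentences, columns = words)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
corpus_index = [n for n in corpus]
# Transpose so that each row is a word and each column is a sentence
df = pd.DataFrame(X.T.toarray(), index=feature_names, columns=corpus_index)
df.style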
Output: a table of TFIDF weights with one row per word and one column per sentence.
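The output shows that the words ‘cat’, ‘my’ and ‘face’ carry the largest weights in the first sentence, and ‘dog’, ‘my’ and ‘bed’ carry the largest weights in the second. ‘cat’ and ‘dog’ rank highest because each occurs twice and only in its own sentence. ‘my’ also occurs twice but appears in both sentences, and ‘face’ and ‘bed’ occur once but are unique to their sentence. The one-letter word ‘I’ is missing from the features because TfidfVectorizer's default tokenizer ignores single-character tokens.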
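To see where this ranking comes from, the weights for the first sentence can be recomputed by hand. This is only a sketch, assuming scikit-learn's default settings as documented (lowercasing, single-character tokens dropped, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector). The helper tfidf_weights is a made-up name, not part of any library.

```python
import math

docs = [
    "I love my cat but the cat sat on my face",
    "I love my dog but the dog sat on my bed",
]

# Lowercase and drop one-character tokens such as "I" (assumed default behaviour).
tokenized = [[w for w in d.lower().split() if len(w) > 1] for d in docs]
n_docs = len(tokenized)

def tfidf_weights(tokens):
    weights = {}
    for term in set(tokens):
        tf = tokens.count(term)                          # raw term count
        df = sum(term in doc for doc in tokenized)       # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf
        weights[term] = tf * idf
    # L2-normalise so the document vector has unit length
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {term: v / norm for term, v in weights.items()}

w1 = tfidf_weights(tokenized[0])
top3 = sorted(w1, key=w1.get, reverse=True)[:3]
print(top3)  # ['cat', 'my', 'face']
```

Under these assumptions ‘cat’ comes out at roughly 0.65, ‘my’ at roughly 0.46 and ‘face’ at roughly 0.32, with every other word in the first sentence at about 0.23.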
Before you leave,
If you enjoyed this post, please make sure to follow the NLP Gurukool page and visit the publication for more exciting tutorials and blogs on machine learning, data science and NLP.
Please get in touch if you would like to contribute to our publication.