TFIDF Vectorizer

Karan Arya
Published in NLP Gurukool
Dec 28, 2018

In simple words, TFIDF (term frequency-inverse document frequency) is a numerical statistic that shows how important a word is to a text document within a collection of documents.
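For reference, one common textbook formulation is sketched below. The post itself does not state a formula, so treat this as an assumption; libraries such as scikit-learn use smoothed and normalized variants of it.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents that contain t.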

We create two text documents as follows:

text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

Word Tokenization

words1 = text1.split(" ")
words2 = text2.split(" ")

Print words1 to see the following output:

['I', 'love', 'my', 'cat', 'but', 'the', 'cat', 'sat', 'on', 'my', 'face']
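The token list already contains the "term frequency" half of TFIDF. As a minimal sketch (the name tf1 is only illustrative), the raw counts can be obtained with collections.Counter:

```python
from collections import Counter

text1 = "I love my cat but the cat sat on my face"
words1 = text1.split(" ")

# Raw term frequency: how many times each token occurs in the document.
tf1 = Counter(words1)

print(tf1["cat"])   # 2
print(tf1["my"])    # 2
print(tf1["face"])  # 1
```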

Combining the Words into a Single Set

corpus = set(words1).union(set(words2))
print(corpus)

Output (a set is unordered, so the order may differ on your machine):

{'bed', 'but', 'sat', 'love', 'cat', 'my', 'face', 'the', 'I', 'dog', 'on'}
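With the combined vocabulary in hand, the "document frequency" side of TFIDF can be illustrated by counting how many documents contain each word. This is a rough sketch; the names docs, vocabulary and doc_freq are illustrative and not part of the original post.

```python
text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

docs = [text1.split(" "), text2.split(" ")]
vocabulary = set(docs[0]).union(docs[1])

# Document frequency: the number of documents in which each word appears.
doc_freq = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Words that occur in only one of the two documents are the distinctive ones.
unique_words = sorted(w for w, df in doc_freq.items() if df == 1)
print(unique_words)  # ['bed', 'cat', 'dog', 'face']
```

Words shared by both documents (such as 'love' or 'the') have a document frequency of 2 and therefore carry less discriminating power.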

TFIDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer()
corpus = ["I love my cat but the cat sat on my face", "I love my dog but the dog sat on my bed"]
# Learn the vocabulary and compute the TFIDF matrix (rows: documents, columns: words)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
feature_names = vectorizer.get_feature_names_out()
# Transpose so that rows are words and columns are the two sentences
df = pd.DataFrame(X.T.toarray(), index=feature_names, columns=corpus)
df.style

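Output: a table of TFIDF weights with one row per word and one column per sentence.

The table shows that the words ‘cat’, ‘my’ and ‘face’ have the highest weights in the first sentence, and the words ‘dog’, ‘my’ and ‘bed’ have the highest weights in the second sentence. Note that ‘my’ ranks high because it occurs twice in each sentence, even though it appears in both.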
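The ranking above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (lowercasing, tokens of at least two characters, smoothed IDF, L2 normalization) and uses only the standard library; the helper tfidf_weights is hypothetical and written just for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "I love my cat but the cat sat on my face",
    "I love my dog but the dog sat on my bed",
]

# Mimic the default tokenization: lowercase, keep tokens of 2+ word characters
# (this is why the single-letter word "I" does not become a feature).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

n_docs = len(tokenized)
vocab = sorted(set(tok for doc in tokenized for tok in doc))
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_weights(tokens):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights1 = tfidf_weights(tokenized[0])
top3 = sorted(weights1, key=weights1.get, reverse=True)[:3]
print(top3)  # ['cat', 'my', 'face']
```

'cat' comes first because it is both frequent in the first sentence and absent from the second; 'my' is equally frequent but shared by both sentences, so its IDF is lower.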

Before you leave,

If you enjoyed this post, please make sure to follow the NLP Gurukool page and visit the publication for more exciting tutorials and blogs on machine learning, data science and NLP.

Please get in touch if you would like to contribute to our publication.
