Creating a TF-IDF in Python

From scratch in Python code

Iftekher Mamun
5 min read · Jun 19, 2019

As part of a technical interview, I was asked to implement pseudocode for TF-IDF in Python. Given my relatively limited experience with NLP libraries, it is fair to say that I did not do a great job of explaining my code. So, to make sure that does not happen again, let's work through it here.

But before all of that, I think a simple explanation of what NLP stands for is necessary:

NLP stands for Natural Language Processing, not to be confused with Neuro-Linguistic Programming.

It's basically about language. For programmers, it means getting the computer to read and understand our spoken and written language. The body of text an NLP task works over is called a corpus (a collection of documents, each containing many words). In Python, libraries such as NLTK are commonly used for NLP work.
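For instance, splitting raw text into tokens is one of the first steps NLTK handles. A minimal sketch, assuming NLTK is installed and its 'punkt' tokenizer data has been downloaded:

import nltk

# nltk.download('punkt')  # one-time download of the tokenizer data
tokens = nltk.word_tokenize("The car is driven on the road.")
print(tokens)  # ['The', 'car', 'is', 'driven', 'on', 'the', 'road', '.']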

With that established, let's move on to the TF-IDF part of the post. TF-IDF stands for Term Frequency-Inverse Document Frequency. The TF part counts how many times a word occurs in a given document; since a corpus is made up of many documents, each document and its words get their own TF counts. The IDF part measures how rarely a word occurs across the documents: the rarer a word is, the higher its weight will be.

[Figure: the TF-IDF formula, combining TF with IDF]
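In symbols, the variant implemented below (base-10 logarithm, no smoothing) is:

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\text{number of terms in } d}, \qquad
\mathrm{idf}(t) = \log_{10}\!\left(\frac{N}{\mathrm{df}(t)}\right), \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

where f_{t,d} is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.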

There is a great example on Free Code Camp that we will use here as well:

Sentence 1 : The car is driven on the road.

Sentence 2: The truck is driven on the highway.

[Table from Free Code Camp: TF, IDF, and TF-IDF values for the two sentences]

You can also reproduce this in Excel using the log function. Here we see that the unique words, i.e. the words that occurred only once and in only one of the documents, are weighted heavily compared to the words that appear in both. We use TF to count words per document and IDF to measure how rare each word is. The rarer the words (car, truck, road, highway), the more significant they are. This comes into play when we build a sparse matrix: the majority of words in a given corpus are common words that play no role in telling the documents apart, but the unique words help the program learn how to differentiate between documents.
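To see the weighting concretely, here is the IDF arithmetic for a shared word versus a unique one in our two-document example, using only Python's math module and matching the log-base-10 table:

import math

N = 2  # two documents in the toy corpus
print(math.log10(N / 2))  # idf of 'the', which appears in both documents: 0.0
print(math.log10(N / 1))  # idf of 'car', which appears in only one: ~0.301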

To build TF-IDF from scratch in Python, we need two separate steps. First we create the TF function to calculate the word frequencies for each document. Here is the code:

# import necessary libraries
import pandas as pd
import sklearn as sk  # only needed for the TfidfVectorizer example at the end
import math  # this is for the IDF portion only

# load up our sample sentences
first = 'The car is driven on the road'
second = 'The truck is driven on the highway'

# split so each word has its own string
first = first.split(" ")
second = second.split(" ")

# union the two sets to remove duplicate words
total = set(first).union(set(second))
print(total)
# {'The', 'car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck'}

# now let's count the words, using a dictionary key-value pairing for both sentences
wordDictA = dict.fromkeys(total, 0)
wordDictB = dict.fromkeys(total, 0)
for word in first:
    wordDictA[word] += 1
for word in second:
    wordDictB[word] += 1

# put them in a dataframe and view the result
pd.DataFrame([wordDictA, wordDictB])

# now writing the TF function
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)  # total number of words in the document
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

# running our sentences through the TF function
tfFirst = computeTF(wordDictA, first)
tfSecond = computeTF(wordDictB, second)

# converting to a dataframe for visualization
tf_df = pd.DataFrame([tfFirst, tfSecond])
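As a quick sanity check on tf_df: each sentence has seven tokens and every word present appears exactly once (the split keeps 'The' and 'the' as distinct strings), so every nonzero TF should be 1/7 ≈ 0.143:

# every present word occurs once among 7 tokens; absent words stay at 0
assert tfFirst['car'] == 1 / 7
assert tfFirst['truck'] == 0.0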

Now that we have finished the TF section, we move on to the IDF part:

# creating the log portion of the Excel table we saw earlier
def computeIDF(docList):
    idfDict = {}
    N = len(docList)  # total number of documents

    # count how many documents contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # divide N by the document count and take the log
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict

# computing the IDF values from our word-count dictionaries
idfs = computeIDF([wordDictA, wordDictB])
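A quick check on the result: words appearing in both sentences get an IDF of log10(2/2) = 0, while the unique words get log10(2/1) ≈ 0.301:

print(idfs['is'])   # 0.0 -> appears in both documents
print(idfs['car'])  # ~0.3010 -> appears in only one document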
# the actual calculation of TF*IDF, as in the table above
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

# running our two sentences through the TF-IDF calculation
tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)

# putting it in a dataframe
tfidf_df = pd.DataFrame([tfidfFirst, tfidfSecond])
[Output: the TF-IDF values. Check that they match the Excel table above]
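These numbers follow directly from the two functions above, so they make a handy sanity check: each unique word scores (1/7) × log10(2) ≈ 0.043 and every shared word scores 0:

# unique words: tf = 1/7 and idf = log10(2), so tf-idf ~ 0.043
assert abs(tfidfFirst['car'] - (1 / 7) * math.log10(2)) < 1e-12
# shared words: idf = 0, so tf-idf = 0
assert tfidfFirst['is'] == 0.0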

That was a lot of work, but it is handy to know if you are ever asked to code TF-IDF from scratch. In practice, this can be done much more simply thanks to the sklearn library. Let's look at an example:

# first step is to import the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer lowercases text by default, so for an easy comparison
# I just made the same sentences all lowercase up front
firstV = 'the car is driven on the road'
secondV = 'the truck is driven on the highway'

# calling the TfidfVectorizer
vectorize = TfidfVectorizer()

# fitting the model and passing our sentences right away
response = vectorize.fit_transform([firstV, secondV])
print(response)
#  (0, 6)    0.6043795515372431
#  (0, 0)    0.42471718586982765
#  (0, 3)    0.30218977576862155
#  (0, 1)    0.30218977576862155
#  (0, 4)    0.30218977576862155
#  (0, 5)    0.42471718586982765
#  (1, 6)    0.6043795515372431
#  (1, 3)    0.30218977576862155
#  (1, 1)    0.30218977576862155
#  (1, 4)    0.30218977576862155
#  (1, 7)    0.42471718586982765
#  (1, 2)    0.42471718586982765

Each line of the output is a (sentence, word-index) pair with its weight: in the first column, 0 is the first sentence and 1 is the second. The exact numbers will not match the from-scratch build, because sklearn's TfidfVectorizer uses a smoothed, natural-log IDF and L2-normalizes each row, but the underlying idea is the same. This method is a lot quicker and easier to use, though its output is not as easy to read and understand.
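To decode the column indices, you can ask the vectorizer for its vocabulary; the method below assumes scikit-learn 1.0 or newer (older versions call it get_feature_names()):

print(vectorize.get_feature_names_out())
# ['car' 'driven' 'highway' 'is' 'on' 'road' 'the' 'truck']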

A big thank you to the Free Code Camp post referenced above, without which this would not have been possible.
