Techniques in NLP that will boost your model

Parth Mistry · Analytics Vidhya · Aug 27, 2020

In my previous article, I showed how Tokenization, Stemming, and Lemmatization help provide features from the natural language that we as humans speak.

But this alone is not the best approach.

Two more techniques are very useful for making the model efficient and accurate:

  1. Bag Of Words
  2. TF-IDF

Bag Of Words

In plain English, Bag of Words keeps track of the total occurrences of the most frequently used words in the text. Those words are treated as features and can be supplied to an algorithm.
It creates a dictionary of all the words occurring in the text.

This is the text we have:

Sentence 1: He is a good boy
Sentence 2: She is a good girl
Sentence 3: The Boy and the Girl are good.

Two preprocessing steps come first: removing stop words and lowercasing the sentences. After performing them, the sentences will look like this:

Sentence 1: good boy
Sentence 2: good girl
Sentence 3: boy girl good

Bag of Words implementation:

              f1 (boy)   f2 (girl)   f3 (good)
Sentence 1        1          0           1
Sentence 2        0          1           1
Sentence 3        1          1           1

This is how the words in the sentences are converted to vectors, where f1, f2, and f3 are the independent features.
As we saw in the previous blog, there are specific method calls to perform tasks like stemming and lemmatization. Bag of Words takes a slightly different approach.

First, to get the unique words from the text, we will do either stemming or lemmatization, and after that, we will implement Bag of Words on top of it. Scikit-learn has a class called CountVectorizer that will make our work easier.

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and turn each sentence into a vector of word counts
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

This Python code converts the sentences to vectors. Here the text lives in the “corpus” variable as a list of strings.
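
For a concrete picture, here is a minimal end-to-end sketch on our three example sentences. It assumes scikit-learn ≥ 1.0 (for get_feature_names_out) and uses the built-in stop_words="english" list to perform the stop-word removal described above; lowercasing is on by default:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a good boy",
    "She is a good girl",
    "The Boy and the Girl are good.",
]

# stop_words="english" drops words like "he", "is", "a", "the"
cv = CountVectorizer(stop_words="english")
x = cv.fit_transform(corpus).toarray()

print(cv.get_feature_names_out())  # ['boy' 'girl' 'good']
print(x)
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]

Note that the vocabulary is sorted alphabetically, which is why boy, girl, and good line up with f1, f2, and f3 in the table above.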

This approach can be used in Sentiment Analysis, where we just need to predict whether a sentence carries a positive or a negative sentiment.

Term Frequency — Inverse Document Frequency (TF-IDF)

Term Frequency - Inverse Document Frequency, a.k.a. TF-IDF, is a statistical approach in which each word is ranked with respect to the document it appears in.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

Term Frequency (TF) = (number of times the word appears in the sentence) / (total number of words in the sentence)
Inverse Document Frequency (IDF) = log(total number of sentences / number of sentences containing the word)
TF-IDF = TF × IDF

This is how, mathematically, each word is given a rank with respect to its occurrence in the document.
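
A quick worked example on our processed sentences (using the natural log; the base only scales the scores): the word “girl” appears once among the three words of Sentence 3, so TF = 1/3. It occurs in 2 of the 3 sentences, so IDF = log(3/2) ≈ 0.405, giving TF-IDF ≈ 0.135. Meanwhile, “good” appears in every sentence, so its IDF = log(3/3) = 0 and its score is 0 everywhere; a word common to all documents carries no discriminating information.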

The implementation of TF-IDF Vectorization on our examples will be as follows:

              boy             girl            good
Sentence 1    1/2 × log(3/2)  0               0
Sentence 2    0               1/2 × log(3/2)  0
Sentence 3    1/3 × log(3/2)  1/3 × log(3/2)  0

All of these calculations happen under the hood with a small piece of code that uses the TfidfVectorizer from Scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then convert each
# sentence into a TF-IDF vector
vect = TfidfVectorizer()
x = vect.fit_transform(corpus).toarray()

We create an object of the TfidfVectorizer class and pass it a corpus of sentences that have already been lemmatized or stemmed.
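
As a quick sanity check, here is a minimal sketch on the preprocessed example sentences. Be aware that scikit-learn's TfidfVectorizer does not apply the textbook formula above verbatim: by default it smooths the IDF as log((1 + n) / (1 + df)) + 1 and L2-normalizes each row, so its numbers differ slightly from a hand calculation:

from sklearn.feature_extraction.text import TfidfVectorizer

# The three sentences after stop-word removal and lowercasing
corpus = ["good boy", "good girl", "boy girl good"]

vect = TfidfVectorizer()
x = vect.fit_transform(corpus).toarray()

print(vect.get_feature_names_out())  # ['boy' 'girl' 'good']
print(x.round(2))
# [[0.79 0.   0.61]
#  [0.   0.79 0.61]
#  [0.62 0.62 0.48]]

Notice that the relative ordering still matches the hand calculation: rarer words like “boy” and “girl” get higher weights than “good”, which appears in every sentence.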

Special thanks to Aayush Jain for helping with this article. Here is a link to an NLP project that can help you try these techniques with real-life data.
