Techniques in NLP that will boost your model

Parth Mistry · Analytics Vidhya · Aug 27, 2020

In my previous article, I showed how Tokenization, Stemming, and Lemmatization help provide features from the natural language that we as humans speak.

But this alone is not the best approach.

Two more techniques are very useful for making the model efficient and accurate:

  1. Bag Of Words
  2. TF-IDF

Bag Of Words

In plain English, Bag of Words keeps track of the total occurrences of the most frequently used words in the text. Those words are treated as features and can be supplied to an algorithm.
It creates a dictionary of all the words occurring in the text.

This is the text we have:

Sentence 1: He is a good boy
Sentence 2: She is a good girl
Sentence 3: The Boy and the Girl are good.

Two preprocessing steps come first: removing stop words and lowercasing the sentences. After performing them, the sentences will look like this:

Sentence 1: good boy
Sentence 2: good girl
Sentence 3: boy girl good

Bag of Words implementation:

              f1 (boy)   f2 (girl)   f3 (good)
Sentence 1        1          0           1
Sentence 2        0          1           1
Sentence 3        1          1           1

This is how the words in the sentences are converted to vectors, where f1, f2, and f3 are the independent features.
As we saw in the previous blog, there are specific method calls to perform tasks like stemming and lemmatization. Bag of Words takes a slightly different approach.

First, to get the unique words from the text, we will do either stemming or lemmatization, and after that, we will implement Bag of Words on top of it. Scikit-learn has a class called CountVectorizer that will make our work easier.

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and turn each sentence into a vector of word counts
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

This Python code converts the sentences to vectors. Here the text lives in the “corpus” variable as a list of strings.
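
For a concrete picture, here is a minimal end-to-end sketch on our three example sentences. It assumes scikit-learn ≥ 1.0 (for get_feature_names_out) and uses the built-in stop_words="english" list to perform the stop-word removal described above; lowercasing is on by default:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a good boy",
    "She is a good girl",
    "The Boy and the Girl are good.",
]

# stop_words="english" drops words like "he", "is", "a", "the"
cv = CountVectorizer(stop_words="english")
x = cv.fit_transform(corpus).toarray()

print(cv.get_feature_names_out())  # ['boy' 'girl' 'good']
print(x)
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]

Note that the vocabulary is sorted alphabetically, which is why boy, girl, and good line up with f1, f2, and f3 in the table above.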

This approach can be used in Sentiment Analysis, where we just need to predict whether a sentence carries a positive or a negative sentiment.

Term Frequency — Inverse Document Frequency (TF-IDF)

Term Frequency - Inverse Document Frequency, a.k.a. TF-IDF, is a statistical approach in which each word is ranked with respect to the document it appears in.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

Term Frequency (TF) = (number of times the word appears in the sentence) / (total number of words in the sentence)
Inverse Document Frequency (IDF) = log(total number of sentences / number of sentences containing the word)
TF-IDF = TF × IDF

This is how, mathematically, each word is given a rank with respect to its occurrence in the document.
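
A quick worked example on our processed sentences (using the natural log; the base only scales the scores): the word “girl” appears once among the three words of Sentence 3, so TF = 1/3. It occurs in 2 of the 3 sentences, so IDF = log(3/2) ≈ 0.405, giving TF-IDF ≈ 0.135. Meanwhile, “good” appears in every sentence, so its IDF = log(3/3) = 0 and its score is 0 everywhere; a word common to all documents carries no discriminating information.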

The implementation of TF-IDF Vectorization on our examples will be as follows:

              boy             girl            good
Sentence 1    1/2 × log(3/2)  0               0
Sentence 2    0               1/2 × log(3/2)  0
Sentence 3    1/3 × log(3/2)  1/3 × log(3/2)  0

All of these calculations happen under the hood with a small piece of code that uses the TfidfVectorizer from Scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then convert each
# sentence into a TF-IDF vector
vect = TfidfVectorizer()
x = vect.fit_transform(corpus).toarray()

We create an object of the TfidfVectorizer class and pass it a corpus of sentences that have already been lemmatized or stemmed.
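
As a quick sanity check, here is a minimal sketch on the preprocessed example sentences. Be aware that scikit-learn's TfidfVectorizer does not apply the textbook formula above verbatim: by default it smooths the IDF as log((1 + n) / (1 + df)) + 1 and L2-normalizes each row, so its numbers differ slightly from a hand calculation:

from sklearn.feature_extraction.text import TfidfVectorizer

# The three sentences after stop-word removal and lowercasing
corpus = ["good boy", "good girl", "boy girl good"]

vect = TfidfVectorizer()
x = vect.fit_transform(corpus).toarray()

print(vect.get_feature_names_out())  # ['boy' 'girl' 'good']
print(x.round(2))
# [[0.79 0.   0.61]
#  [0.   0.79 0.61]
#  [0.62 0.62 0.48]]

Notice that the relative ordering still matches the hand calculation: rarer words like “boy” and “girl” get higher weights than “good”, which appears in every sentence.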

Special thanks to Aayush Jain for helping with this article. Here is a link to an NLP project that can help you try these techniques with real-life data.
