Building a Sentiment Analyzer With Naive Bayes

Srijani Chaudhury
Published in The Startup · 9 min read · Aug 31, 2020

Sentiment analysis is contextual mining of text that identifies and extracts subjective information from source material, helping a business understand the social sentiment around its brand, product, or service while monitoring online conversations. Here I have used the IMDb 50k Movie Review dataset to predict whether a given movie review expresses a positive or a negative sentiment. I have used Naive Bayes because it outperforms most other machine learning algorithms when the data is textual. Though I will be using some NLP libraries, the main focus will be on Naive Bayes. The accuracy of my predictions comes out to roughly 89%, which is not bad. You can use other techniques like BERT or various deep learning methods to increase the accuracy further.

Let us take our baby steps towards Natural Language Processing using Naive Bayes.

To learn more about Naive Bayes, you can refer to an introduction to the algorithm before continuing.

Let’s get Started

So our target is to convert the entire set of textual reviews into a Bag of Words, i.e. to turn each unique word in our dataset into a column and simply store the frequency count of each word for each review. The steps involved in the process are:

  1. Text preprocessing
  2. Vectorize(Bag of Words)
  3. Creating a Machine Learning Model
  4. Deployment

Text Preprocessing

So at first, we have to analyze and clean the data before fitting it into the ML models, otherwise we will get poor results.

The steps involved in data cleaning are

  • Remove HTML tags
  • Remove special characters
  • Convert everything to lowercase
  • Remove stopwords
  • Stemming

We will import the necessary libraries at first that we are going to need for our sentiment analyzer.

First, we are going to need NumPy and pandas, our essential data science tools. “re” is Python’s regular expression module, which is used to extract certain portions of a string. NLTK is an NLP library, and we are going to import parts of it at certain points in our code to process the textual data. Then we are going to import scikit-learn for model creation. We are also importing some metrics from sklearn to analyze model performance.
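
A minimal version of the import cell could look like this (a sketch; it simply gathers everything used in the rest of the article):

```python
import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score

# the stopword list has to be downloaded once before it can be used
nltk.download('stopwords')
```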

Then we will import our dataset and casually go through it just to get a rough idea about the data provided to us.
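
Loading the data might look like this (the file name “IMDB Dataset.csv” and the column names “review” and “sentiment” are assumptions about how the CSV is distributed):

```python
# file and column names are assumptions about how the dataset is distributed
df = pd.read_csv('IMDB Dataset.csv')

print(df.shape)            # expect (50000, 2)
print(df.head())
print(df.isnull().sum())   # check for missing values
```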

So we have 50000 rows with only one feature column that is the “review”. You can already see the HTML tags that need processing.

There are no missing values, as we can see from above. Phew!!

Now we will replace the positive sentiment with 1 and the negative sentiment with -1.
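
A simple replacement along these lines does it (assuming the label column is called “sentiment”):

```python
# positive -> 1, negative -> -1
df['sentiment'] = df['sentiment'].replace({'positive': 1, 'negative': -1})
```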

Removing HTML tags

We will now remove the HTML tags with the help of Python’s regular expression library, which is used to extract the parts of a string that follow a certain pattern. For example, suppose the phone number and the email id were somehow merged into one column and we wanted to create two separate columns, one for the phone numbers and the other for the email ids. It would be impossible for us to process each row manually. In that case, we use regular expressions (also known as regex).
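
A small helper built on re is enough here; the pattern below simply strips everything between angle brackets:

```python
def remove_html_tags(text):
    # drop anything of the form <...>, i.e. HTML tags
    return re.sub(r'<[^>]+>', '', text)

df['review'] = df['review'].apply(remove_html_tags)
```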

As you can see all the HTML tags have been removed.

Remove special characters

We don’t want the punctuation signs or any other non-alphanumeric characters in our Bag of Words, so we will remove those.
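
A similar regex-based helper keeps only letters, digits, and whitespace:

```python
def remove_special_characters(text):
    # keep only alphanumeric characters and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

df['review'] = df['review'].apply(remove_special_characters)
```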

All the non-alphanumeric characters have been removed.

Convert everything to lowercase

For better analysis, we will convert everything to lower case.
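
With pandas this is a one-liner:

```python
df['review'] = df['review'].str.lower()
```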

Removing stopwords

Stopwords are words that might not add much value to the meaning of the document, so turning them into Bag of Words columns would be a waste of time and space. They would add unnecessary features to our dataset and may affect the correctness of the predictions. These are articles, prepositions, or conjunctions like “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, etc. Python’s “nltk” library for Natural Language Processing comes with a collection of all the probable stopwords. For this purpose, we import “stopwords” from “nltk.corpus”.
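
A sketch of this step, using NLTK's English stopword list:

```python
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # split the review into words and keep only the non-stopwords
    return [word for word in text.split() if word not in stop_words]

df['review'] = df['review'].apply(remove_stopwords)
```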

It returns a list of all the words without stopwords.

Stemming

This means that words which are just different inflected forms of the same common word have to be reduced to a single form. The basic agenda of stemming is reducing a word to its word stem, the part that suffixes and prefixes attach to, related to the root form known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP). Suppose we are given words like playing, played, and play: all of these share the same stem, “play”. The only token that is useful as a vector in our Bag of Words is “play”; the other forms would not contribute significant additional meaning and are unnecessary. The “nltk” library again comes with a class for stemming words. Here we import “SnowballStemmer” from “nltk.stem” for the purpose.
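
A sketch of the stemming step; since the previous step left each review as a list of words, we stem every word and join the result back into a single string for the vectorizers:

```python
stemmer = SnowballStemmer('english')

def stem_words(words):
    # reduce each word to its stem and rebuild the review as one string
    return ' '.join(stemmer.stem(word) for word in words)

df['review'] = df['review'].apply(stem_words)
```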

With this, we are done with our text processing.

Forming Bag of Words and implementing them into our model

As I have mentioned earlier, like in all other Natural Language Processing methods, we have to vectorize all the unique words and store the frequency of each word as its data point. In this article, we will be vectorizing the unique words with

  1. CountVectorizer
  2. TfidfVectorizer

We will construct a separate model for both of these vectorizers and check their accuracy.

Building Model with CountVectorizer

CountVectorizer simply converts all the unique words into columns and stores their frequency counts. It is the simplest vectorizer used in Machine Learning.

Now we will split the data
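
A sketch of the vectorizing and splitting steps (the vocabulary cap of 1,000 features and the 80/20 split are illustrative choices, not values from the article; the cap keeps the dense array needed by GaussianNB manageable):

```python
# CountVectorizer returns a sparse matrix; GaussianNB needs a dense array
cv = CountVectorizer(max_features=1000)
X = cv.fit_transform(df['review']).toarray()
y = df['sentiment'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```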

Then we will create our models and fit the data into them. Here we will be using GaussianNB, MultinomialNB, and BernoulliNB.
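
Creating and fitting the three classifiers is straightforward:

```python
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)
```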

We will understand the performance of the model by calculating the accuracy score, precision score, and recall score. You might think that the accuracy score alone tells us how well the model performs, but it is not that simple. So let us understand it. In our classification model, we can assess the performance by considering the following factors:

  • True Positives (TP) — These are the correctly predicted positive values which mean that the value of the actual class is positive and the value of the predicted class is also positive.
  • True Negatives (TN) — These are the correctly predicted negative values which mean that the value of the actual class is negative and the value of the predicted class is also negative.
  • False Positives (FP) — When the actual class is negative and the predicted class is positive.
  • False Negatives (FN) — When actual class is positive but predicted class is negative.

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy, then our model is the best. Yes, accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the reviews that were labeled as positive, how many actually were positive?

Precision = TP / (TP + FP)

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The question recall answers is: of all the truly positive reviews, how many did we label as positive?

Recall = TP / (TP + FN)

So let us check the scores:
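
A sketch of the evaluation loop, using the metric functions imported from sklearn:

```python
for name, model in [('GaussianNB', gnb),
                    ('MultinomialNB', mnb),
                    ('BernoulliNB', bnb)]:
    y_pred = model.predict(X_test)
    print(name,
          'accuracy:', accuracy_score(y_test, y_pred),
          'precision:', precision_score(y_test, y_pred),
          'recall:', recall_score(y_test, y_pred))
```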

BernoulliNB is by far the best predictor here, with an accuracy of 82.8%, a precision of 81.6%, and a recall of 84.5%. Thus we can say the model is performing well on the data.

We can further increase the performance of Gaussian Naive Bayes

Gaussian naive Bayes assumes that the data for each feature is normally distributed. This assumption rarely holds; in most cases the data is not normally distributed and may be skewed. Moreover, the smoothing factor, which affects the accuracy of GaussianNB, also varies depending on the data. So here we will use the PowerTransformer class, which makes the numerical features closer to normally distributed so they perform better with GaussianNB. Then we will use GridSearchCV to find the best smoothing factor.
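
A sketch of this step; the var_smoothing grid handed to GridSearchCV is an illustrative assumption:

```python
# make the count features closer to a normal distribution
pt = PowerTransformer()
X_train_pt = pt.fit_transform(X_train)
X_test_pt = pt.transform(X_test)

# search over a range of smoothing factors (grid values are illustrative)
param_grid = {'var_smoothing': np.logspace(-9, 0, 10)}
grid = GridSearchCV(GaussianNB(), param_grid, scoring='accuracy', cv=5)
grid.fit(X_train_pt, y_train)

print(grid.best_params_)
print(accuracy_score(y_test, grid.best_estimator_.predict(X_test_pt)))
```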

Thus we see the GaussianNB accuracy score has increased from 78 percent to 80 percent, so this has clearly enhanced the performance of Gaussian Naive Bayes. The best smoothing factor comes out to be 0.00351.

Now let’s build the model with TfidfVectorizer

CountVectorizer Vs TfidfVectorizer

In CountVectorizer we only count the number of times a word appears in the document, which results in a bias in favor of the most frequent words. This ends up ignoring rare words that could have helped us process our data more efficiently.

To overcome this, we use TfidfVectorizer.

In TfidfVectorizer we consider the overall document weightage of a word. It helps us in dealing with the most frequent words. Using it we can penalize them. TfidfVectorizer weights the word counts by a measure of how often they appear in the documents. Thus the integrity of the rare words which possess a lot of information regarding a document is preserved.

So, as above, we will vectorize the data, split it, and then create our models.

Note: TF-IDF vectorized data cannot be used directly with GaussianNB.

Now we will fit the data into the models and calculate their scores.
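
A sketch of the TF-IDF pipeline, mirroring the earlier steps (GaussianNB is left out, as noted above, since TfidfVectorizer returns a sparse matrix that MultinomialNB and BernoulliNB handle directly):

```python
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['review'])

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)

for name, model in [('MultinomialNB', MultinomialNB()),
                    ('BernoulliNB', BernoulliNB())]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name,
          'accuracy:', accuracy_score(y_test, y_pred),
          'precision:', precision_score(y_test, y_pred),
          'recall:', recall_score(y_test, y_pred))
```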

Wow. We observe significant changes in the scores as compared to CountVectorizer. Here MultinomialNB gives the best accuracy score of around 88.3%, with a precision of 88.2% and a recall of 88.4%. Thus the performance of the analyzer has significantly increased with the TF-IDF vectorizer, so we shall use this in our project deployment.

Chi-Square Test

To enhance the performance of the analyzer further, we can perform the chi-square test, which helps us in effective sampling and in identifying which features carry more importance than others. When our dataset becomes huge, with hundreds of thousands of features, there will be some features that do not contribute much information about the dataset. We can very well drop those features using the chi-square test. Moreover, an effective approach in NLP is to rank the features by the chi-square test and then use them in our models. Further information about the chi-square test can be found here.
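
For illustration, a chi-square feature selection step with sklearn could look like this (the value of k is an arbitrary example, not a number from the article):

```python
from sklearn.feature_selection import SelectKBest, chi2

# keep only the k features with the highest chi-square scores
selector = SelectKBest(chi2, k=20000)
X_train_chi2 = selector.fit_transform(X_train, y_train)
X_test_chi2 = selector.transform(X_test)
```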

I have kept this entire article as simple as possible. I hope you find the information useful. :)))))
