Sentiment Analysis Project — with traditional ML & NLP

10 min read · Aug 12, 2022

Using Bag-of-Words Representation & Naive Bayes Classifier

In this article, we build a business-relevant Sentiment Analysis project, end to end, for a dummy client, ABC Restaurant. Happy learning :)

Watch the video tutorial instead

We have done a video tutorial on this project. If you are more of a video person, go ahead and watch it on YouTube. FYI, we launch new machine learning projects every week, so make sure to subscribe to our channel to get access to all of our free ML courses.

All project-related files are kept on Google Drive. On this note, let’s get started.

So, what’s Sentiment Analysis anyway?!

I saw this coming, haha!! Let’s understand a Sentiment Analysis problem from a business standpoint.

The good news is: with the power of the internet, businesses today get a huge amount of customer feedback through their business website, social media pages, business listings, etc. The bad news is: the majority of businesses do not even know how to use this information to improve themselves.

Even the ones who do know primarily focus on structured customer feedback, like a review on Google, Amazon, etc. Structured feedback has a rating along with the written review, which makes it easy to tell whether the feedback is positive or negative. As is the case here:

These are reviews of a restaurant we picked from Google. Here, we can clearly see that ratings are a good proxy for review sentiment. Aggregated ratings also signify the overall sentiment for this restaurant, which is seemingly positive.

However, unstructured feedback is where the volume lies. Over a billion people worldwide use Facebook/Instagram and half a billion use Twitter. A sizeable number of people share their product and service experiences directly on these platforms. The problem is that this feedback is unstructured. These customer reviews of Domino’s are picked from Twitter:

As you can see here, one has to manually go through each review to figure out customer sentiment. And there is definitely no aggregated sentiment that we can conclude from here, either.

So, the problem is: how do businesses analyse this unstructured customer feedback at scale?

Well, this is where machine learning comes in.

Natural Language Processing (NLP) based Sentiment Analysis models can predict sentiment for such unstructured reviews at scale.

Business Case — Introduction

Alright, now that we understand the business pain points behind picking a Sentiment Analysis project, it’s time to understand our dummy client ABC Restaurant’s business case.

Our client ABC Restaurant intends to build a binary review classification model for reviews received on their Facebook page. Binary here means: positive and negative. So essentially, positive reviews are appreciation and negative reviews are criticism of our client’s restaurant. The business intends to build an in-house customer support team to call back all customers who give negative feedback and try to resolve their issues or give them discounts, so as to ensure they revisit.

The client has shared historical data on customer reviews for their restaurant along with positive/negative labels. Additionally, they have shared a fresh customer review dump with us to generate labels for. So, our deliverables to the client would be:

a sentiment-based binary review classification model, and
positive/negative labels for the fresh reviews dump

Let’s understand our datasets further. Here’s a sneak peek:

The historical review dataset contains 900 reviews as text, along with labels 1 and 0 in the Liked column (“1” meaning a positive and “0” a negative review), which our client prepared through a manual audit of these reviews. We also have a second dataset of 100 reviews from the current week, for which we have to populate the Liked column using our sentiment analysis model. Project files are here.

Now, let’s understand how we would go about solving our client’s business problem with a machine learning approach.

Plan of action

As we are working with text data, there are a couple of tasks we need to complete before we get to the training part of our sentiment analysis model. These are:

Data Preprocessing: we filter out the unwanted, non-value-adding parts of our textual data, so as to be computationally efficient
Text-to-Numeric Representation: as computers only work with numbers, we need a way to represent our data in numeric form

Let’s discuss these guys one by one now.

Data Preprocessing — Intuition

Let’s take 6 sample reviews for me to explain the intuition behind preprocessing. Our objective with this data cleaning exercise is to drop unwanted non-value-adding characters & words that would otherwise unnecessarily consume computational resources.

Let’s go step by step now to understand how we would achieve it.

Firstly, as you can see, we have dots and commas as special characters which, strictly speaking, do not tell us whether a review is good or bad. Even the numbers, as in the 4th review here, do not tell us anything about the sentiment behind the statement. So, let’s drop these special characters and numbers.

Next up, as you can see, the two “Not” here are the same English dictionary word, depicting the same negative sentiment. But a computer would treat them as different words: ‘not’ with a small n and ‘Not’ with a capital N. To tackle this, we can simply convert all sentences to lowercase. Like this:

Moving on, as the third step, we would do a couple of things:

All words highlighted in red here are called stopwords, which are generally non-value-adding. These words mostly do not help us understand the sentiment behind a review. So, we would drop these.

Also, the green highlight shows how the same dictionary word ‘recommend’ is used in different forms but carries the same sentiment. To ensure the computer understands these as the same word, we would convert all English words to their root. This process is called stemming.

This is how our dataset would look after we drop stopwords and perform stemming on the remaining words. As you can see for yourself, we definitely have a reduced number of words in these sample reviews now. That’s data preprocessing.
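The three cleaning steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the project notebook's code: the stopword list is a tiny hypothetical one (note that "not" is deliberately kept, since it carries sentiment), and toy_stem is a crude suffix stripper standing in for a real stemmer such as NLTK's PorterStemmer.

```python
import re

# Hypothetical mini stopword list; a real project would use a fuller one.
# "not" is intentionally left out because it signals negative sentiment.
STOPWORDS = {"the", "is", "a", "an", "this", "i", "it", "was", "and", "to", "of", "would"}

# Toy suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text):
    """Apply the three preprocessing steps to a single review."""
    # Step 1: keep letters only (drops punctuation and digits).
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Step 2: lowercase so that "Not" and "not" become the same token.
    words = letters_only.lower().split()
    # Step 3: drop stopwords, then stem what remains.
    return " ".join(toy_stem(word) for word in words if word not in STOPWORDS)


cleaned = clean_review("This place is WOW!!")
```

With this toy setup, "recommended" and "recommending" both reduce to "recommend", which is the effect stemming is meant to achieve.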

Bag-of-Words Representation — Intuition

To make a computer understand our textual data, we need to somehow convert our reviews’ text into numbers. For this, we convert our cleaned reviews to a bag of words representation. From the 6 reviews we have discussed till now, let’s pick the first three for this discussion on bag-of-words intuition.

So this is our dataset post cleaning. To transform it to a bag-of-words representation, the system would simply identify all unique words (also called tokens) in the review column here, and form a separate column for each of these tokens. Like this.

As you can observe, for every review, the system puts a 1 if that token is present in the review, or a 0 if it is not.

The bag-of-words representation is so called because it discards information on the order and sequencing of words.

Also, when we come to the model building part, we would drop some of these tokens that rarely appear in our reviews. That way, we are reducing sparsity.
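To make the idea concrete, here is a minimal pure-Python sketch of a binary bag of words. The three cleaned reviews and the function names are hypothetical, chosen only for illustration.

```python
def build_vocabulary(cleaned_reviews):
    """Collect every unique token, sorted so the column order is stable."""
    return sorted({token for review in cleaned_reviews for token in review.split()})


def to_binary_bow(cleaned_reviews, vocabulary):
    """Return one row per review: 1 if the token is present, else 0."""
    rows = []
    for review in cleaned_reviews:
        present = set(review.split())
        rows.append([1 if token in present else 0 for token in vocabulary])
    return rows


# Hypothetical cleaned reviews (already lowercased, stopword-free and stemmed).
reviews = ["wow love place", "crust not good", "not tasti textur nasti"]
vocabulary = build_vocabulary(reviews)
bow = to_binary_bow(reviews, vocabulary)
```

Because each row only records presence, "place love wow" produces exactly the same row as "wow love place": the word order is gone.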

Cool, this completes data preprocessing & text-to-numeric representation intuitions for our sentiment analysis model. Next up, we will discuss intuition on Naive Bayes, which we are using as our model classifier.

Naive Bayes Classifier — Intuition

We are using a Naive Bayes classifier for our sentiment analysis model. Let me give you a quick intuition on how a Naive Bayes classifier works. Here, I am using the same bag of words we prepared in the previous section.

Assume we get a new customer review saying “This place is wow”, and we have to predict the sentiment for this review. As we know by now, after data cleaning this review should look like this, with two tokens: “place” and “wow”. In the bag-of-words representation, this new review would look like this, with 1’s against the tokens WOW and PLACE, and zeros against the other model dictionary tokens. BTW, all these tokens collectively are called the model dictionary.

For classification, a Naive Bayes classifier uses conditional probabilities. As we can observe in the historical reviews, when the token WOW is present, the review is always positive, given the limited training data we have here. So, it depicts a positive sentiment. Similarly, when the token LOVE is absent, the review is always negative, depicting a negative sentiment. Likewise for PLACE.

For CRUST, however, the review is both positive and negative when this token is absent. With 50:50 chances, we can’t take a call here. Thus, we would need more data to decide whether CRUST as a token contributes to positive or negative sentiment. We have done this exercise for all tokens, analysing their respective conditional probabilities.

With the limited information we have, we can see that three tokens show positive sentiment, against one that suggests negative. Thus, we conclude that this review is a positive review.

This is, loosely speaking, how a Naive Bayes classifier would classify reviews for us. Strictly, it does not count votes: it multiplies the class prior by the per-token conditional probabilities (treating tokens as independent given the class) and picks the class with the higher result.
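For readers who want to see the arithmetic, below is a minimal from-scratch sketch of a Bernoulli-style Naive Bayes on a toy bag of words. It is an illustration under stated assumptions, not the notebook's classifier: the three training rows and their labels are hypothetical, the presence/absence (Bernoulli) variant is chosen because it matches the 1/0 intuition above, and Laplace smoothing (alpha) is added so that no probability is ever exactly zero.

```python
import math

# Hypothetical toy data: three cleaned reviews in binary bag-of-words form.
vocabulary = ["crust", "good", "love", "nasti", "not", "place", "tasti", "textur", "wow"]
rows = [
    [0, 0, 1, 0, 0, 1, 0, 0, 1],  # "wow love place"         -> liked
    [1, 1, 0, 0, 1, 0, 0, 0, 0],  # "crust not good"         -> not liked
    [0, 0, 0, 1, 1, 0, 1, 1, 0],  # "not tasti textur nasti" -> not liked
]
labels = [1, 0, 0]


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate P(class) and P(token present | class) with Laplace smoothing."""
    priors, present_prob = {}, {}
    for c in sorted(set(labels)):
        class_rows = [row for row, y in zip(rows, labels) if y == c]
        priors[c] = len(class_rows) / len(rows)
        present_prob[c] = [
            (sum(column) + alpha) / (len(class_rows) + 2 * alpha)
            for column in zip(*class_rows)
        ]
    return priors, present_prob


def predict(row, priors, present_prob):
    """Pick the class with the highest log posterior score."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for present, p in zip(row, present_prob[c]):
            # Use P(present) for a 1 and P(absent) = 1 - P(present) for a 0.
            score += math.log(p if present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


priors, present_prob = train_bernoulli_nb(rows, labels)
# New review "This place is wow" after cleaning: tokens "place" and "wow".
new_review = [1 if token in {"place", "wow"} else 0 for token in vocabulary]
predicted = predict(new_review, priors, present_prob)
```

On this toy data the new review is scored as positive (predicted is 1). Working in log space simply turns the product of probabilities into a sum, which avoids numerical underflow when the dictionary has many tokens.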

With this understanding, now we are all set for our sentiment analysis model training.

Model building

You can access the Jupyter notebook for model training, with the complete Python code, here: link (named: b1_Sentiment_Analysis_Model.ipynb). Within this code file, we do the following in sequence:

Import required dependencies & our reviews dataset from Google Drive
Perform data preprocessing, following the same steps discussed in the intuition part earlier, this time in Python code
Next up, we transform our dataset into a bag-of-words representation, using CountVectorizer from sklearn. The number ‘1420’ here means that we keep the 1420 most frequent tokens and drop the rest. This reduces sparsity and is a standard data science practice. We arrived at 1420 by trial and error. We save this bag-of-words dictionary back to Google Drive
At this point, we split our data 80:20 into training and test sets and train our Naive Bayes sentiment analysis model on the training set. We also check model performance on the test set, which is a decent ~73%. Again, we save the model file back to Drive for later use

Well, this completes the model building part of our sentiment-based review classifier for ABC Restaurant Ltd. Congratulations on making it to this point!!
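The vectorise, split, train and save steps can be sketched as follows. This is a hedged outline rather than the notebook itself: the ten-review corpus is a hypothetical stand-in for the 900 cleaned reviews, MultinomialNB is an assumption (the article does not say which Naive Bayes variant the notebook uses), and the objects are pickled to bytes in memory instead of being written to Google Drive. Note also that CountVectorizer stores token counts rather than 0/1 flags unless binary=True is passed.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the cleaned historical reviews and their Liked labels.
corpus = [
    "wow love place", "crust not good", "not tasti textur nasti",
    "great food love servic", "slow servic rude staff", "fresh tasti food",
    "overpr drink", "amaz pizza great price", "never go back", "friendli staff love",
]
liked = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

# Keep at most the 1420 most frequent tokens, as described above.
vectorizer = CountVectorizer(max_features=1420)
X = vectorizer.fit_transform(corpus)

# 80:20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, liked, test_size=0.20, random_state=0
)

# Assumed variant: multinomial Naive Bayes on token counts.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))

# Serialise both objects; in practice these bytes would be written to files.
vectorizer_bytes = pickle.dumps(vectorizer)
model_bytes = pickle.dumps(classifier)
```

Both the fitted vectorizer and the classifier need to be saved, because the prediction step must map fresh reviews onto exactly the same token columns the model was trained on.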

Predictions on unseen reviews

Next up, we use our sentiment predictor (b2_Sentiment_Predictor.ipynb) from our project folder to predict sentiments on the fresh/unseen reviews dataset.

The prediction output file (c3_Predicted_Sentiments_Fresh_Dump.tsv) is stored back in the Drive folder for further analysis.
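A minimal sketch of this prediction step is shown below, under a few assumptions. The vectorizer and classifier are refitted here on a tiny hypothetical corpus only so that the snippet runs on its own (in the project they would be loaded from the saved files), MultinomialNB is an assumed variant, label_fresh_reviews is a hypothetical helper name, and the "Review" column header is illustrative. The output is built as tab-separated text in memory rather than written to the Drive folder.

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in for the vectorizer and model saved by the training step.
train_corpus = ["wow love place", "crust not good", "not tasti textur nasti", "great food love servic"]
train_liked = [1, 0, 0, 1]
vectorizer = CountVectorizer(max_features=1420).fit(train_corpus)
classifier = MultinomialNB().fit(vectorizer.transform(train_corpus), train_liked)


def label_fresh_reviews(cleaned_reviews, vectorizer, classifier):
    """Return one 0/1 prediction per review using the already-fitted objects."""
    # transform (not fit_transform): reuse the training vocabulary as-is.
    features = vectorizer.transform(cleaned_reviews)
    return [int(label) for label in classifier.predict(features)]


fresh_reviews = ["love place", "crust nasti"]  # hypothetical cleaned fresh reviews
predictions = label_fresh_reviews(fresh_reviews, vectorizer, classifier)

# Write review/label pairs as tab-separated text (in memory here, a file in practice).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Review", "Liked"])
writer.writerows(zip(fresh_reviews, predictions))
tsv_text = buffer.getvalue()
```

The key design point is that fresh reviews go through the same cleaning and the same fitted vectorizer as the training data; refitting the vectorizer on the fresh dump would produce columns the model has never seen.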

With this our job is more or less done, except for the last part, which is delivering this solution to our client ABC Restaurant.

Conclusion

We had a couple of deliverables: one, to train a sentiment-based review classification model, and two, to generate labels for fresh customer reviews. We have generated labels for the fresh customer reviews that were shared with us, which look something like this.

From the sentiment predictions our model produced on the fresh customer reviews, it is evident that the majority of ABC Restaurant’s customers are not happy with the services on offer. This should definitely be a cause for concern for the business. With a further deep dive, we were able to gather this list of issues that ABC Restaurant’s customer support team needs to pick up as soon as possible, to ensure customers revisit:

Restaurant staff being rude
Bad food (too much garlic/not fresh)
Disliked concept/theme
Slow service
Overpriced drinks
Cleanliness issues
Live green caterpillar found

Alright, with this, we have completed our course on Sentiment Analysis.

Brief about Skillcate

At Skillcate, we are on a mission to bring you application-based machine learning education. We launch new machine learning projects every week. So, make sure to subscribe to our YouTube channel and also hit that bell icon, so you get notified when our new ML projects go live.

Shall be back soon with a new ML project. Until then, happy learning 🤗!!