Code Diaries: Sentiment Analysis (Naive Bayes)

Jadesse Chan
Apr 17 · 6 min read

When searching for my next favorite vegan restaurant, Google Reviews is the first place I check before I hop in the car. And the same can be said of my k-drama selections. Out of five stars, how many do you think this user rated the k-drama Crash Landing on You?

“Hands down one of the best kdrama I’ve ever seen. It’s special in [its] own kind and this drama is really close to my heart.”

-Aurora Nebula, Google Reviews

CLOY = Crash Landing On You

Shame on AnnaSophia Robb...it was five stars, which I wholeheartedly agree with ;)

Okay enough about CLOY, let’s get on with another interesting acronym: NLP!!!

As part of my quest to learn more about NLP, I coded a project that uses a Naive Bayes classifier to perform sentiment analysis. Similar to the CLOY review above, I trained my classifier on movie reviews from the IMDb dataset. This is one of the most common examples of sentiment analysis, but other use cases include predicting election outcomes and market trends. In preparation for my project, I learned that sentiment is categorized under Scherer’s typology of affective states:

Emotion (e.g. angry, sad, joyful)

Mood (e.g. gloomy, cheerful, depressed)

Interpersonal stance (e.g. friendly, flirtatious, distant)

Attitude (e.g. liking, loving, hating)

Personality Trait (e.g. arrogant, reckless, confident)

For the k-drama review, we analyzed the user’s attitude, because attitude reflects a disposition toward an object or person. Similarly, my program analyzes the attitudes of movie reviews to predict a positive or negative polarity.

Human language is full of nuances that make sentiment analysis, and NLP in general, challenging. I touched on this in my previous article regarding crash blossoms. These nuances include, but are not limited to, subtlety and thwarted expectations. For example, consider two of the movie reviews in the IMDb dataset:

The first review excerpt is subtle because there aren’t many keywords that indicate a strong polarization towards positive or negative. Understanding reviews like this one requires background political and historical knowledge about the French Revolution and 9/11. You would need to train your classifier not only on sentiment polarity, but on public sentiment as well. Ultimately, the IMDb dataset listed this review as negative.

At a glance, the second review seems positive, since it includes words with positive associations: ‘amazing’, ‘fresh’, etc. However, as you read further, the true sentiment is revealed: the review draws on the context of when the film was made, the 1990s.

Luckily, the IMDb dataset pre-determined the polarity of each review. I chose to base my program on Pang and Lee’s baseline method, splitting my code into four steps that include tokenization, feature extraction, and classification.

However, the baseline algorithm assumes equal class frequencies within the dataset. Most real-world datasets don’t satisfy that assumption, but I still decided to follow the baseline method because the purpose of my project was to get introductory practice with sentiment analysis. Many additional considerations arise when working with unbalanced class frequencies, such as re-sampling the training data, cost-sensitive learning, and using the F-score rather than raw accuracy to evaluate the classifier.

So, to check whether my selected dataset was balanced, I performed Exploratory Data Analysis (EDA) to get a summary of each candidate dataset.
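A minimal sketch of that check, assuming the reviews live in a csv with a ‘sentiment’ column (the filename here is a placeholder for your local copy):

```python
import pandas as pd

# Read the csv into a DataFrame (the filename is a placeholder)
df = pd.read_csv("imdb_reviews.csv")

# EDA: count how many reviews fall into each class
print(df["sentiment"].value_counts())
# A balanced dataset prints two equal counts, e.g.:
# positive    25000
# negative    25000
```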

Here we can see that the IMDb dataset has equal class frequencies! Yay! I was inclined to analyze the financial news dataset, but I’ll save that for another day ;) I’ll continue doing EDA before working with datasets to ensure I know what I’m working with!

Out of the Naive Bayes, MaxEnt, and SVM classifiers, I chose Naive Bayes because it’s good for large datasets (IMDb has 50,000 reviews) and it’s simple for an NLP novice like me to understand.

Naive Bayes is a linear classifier: it applies Bayes’ theorem under the “naive” assumption that all features (here, words) are independent of each other given the class.

A linear classifier draws a linear decision boundary like this:

credit: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes02-wordvecs2.pdf

A limitation of this model is that it may misclassify some data points (note the green dots in the red space), unlike a nonlinear decision boundary:

credit: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes02-wordvecs2.pdf

Below is Bayes’ Theorem:
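$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$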

credit: https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html

And finally, here is the Naive Bayes algorithm:
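$$c_{NB} = \operatorname*{argmax}_{c \in C} \; P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$$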

credit: http://spark-public.s3.amazonaws.com/nlp/slides/sentiment.pdf

Now on to the technical details:

The IMDb dataset was the first time I worked with a csv file. I used the pandas library to read the contents of the csv file into a DataFrame, which meant I also learned how to access elements within a DataFrame. My previous text prediction project worked with a txt file, so I accessed words through iteration. This time I used the pandas documentation to learn about the differences between pandas’ indexers (such as .loc and .at) for accessing specific elements. Ultimately, I found .at most useful because I wanted to access a single element, as opposed to the Series for a whole row or column that .loc returns.
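For example, a quick sketch using the DataFrame loaded earlier:

```python
# .loc with just a row label returns that whole row as a Series
first_row = df.loc[0]

# .at takes a (row, column) pair and returns a single scalar value
first_label = df.at[0, "sentiment"]   # e.g. 'positive'
```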

An if-statement checks whether the element at each row (index) in the ‘sentiment’ column is positive or negative.
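Here is a sketch of that step (the ‘review’ column name is an assumption about the csv layout, and the list names are illustrative):

```python
positive_reviews = []
negative_reviews = []

# Sort each review into a class bucket based on its label
# (assumes the review text lives in a 'review' column)
for index in df.index:
    if df.at[index, "sentiment"] == "positive":
        positive_reviews.append(df.at[index, "review"])
    else:
        negative_reviews.append(df.at[index, "review"])
```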

Last but not least, I implemented the Naive Bayes algorithm with add-1 smoothing to account for words not seen in the training set. For each ‘token’ in a test review, I counted how often it appears in each class (positive or negative) of the training data, then divided that count by the total number of words in the class. Add-1 smoothing adds one to each count (and the vocabulary size to each denominator), ensuring that I wouldn’t get a ‘zero’ probability for a word that did not appear in my training set.
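In symbols, the smoothed word likelihood is

$$P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{\left(\sum_{w' \in V} \operatorname{count}(w', c)\right) + |V|}$$

And here is a condensed sketch of the whole classifier, reusing the positive_reviews and negative_reviews lists from the sketch above. It is not my full program: tokenization is a bare lowercase split, and I work in log space to avoid numerical underflow when multiplying thousands of tiny probabilities.

```python
import math
from collections import Counter

def tokenize(text):
    # Bare-bones tokenization: lowercase and split on whitespace
    return text.lower().split()

# Per-class word counts, built from the training reviews
pos_counts = Counter(w for r in positive_reviews for w in tokenize(r))
neg_counts = Counter(w for r in negative_reviews for w in tokenize(r))
vocab_size = len(set(pos_counts) | set(neg_counts))

def log_likelihood(tokens, counts):
    total = sum(counts.values())
    # Add-1 smoothing: an unseen word gets probability 1 / (total + |V|)
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens)

def predict(review):
    tokens = tokenize(review)
    n = len(positive_reviews) + len(negative_reviews)
    pos = math.log(len(positive_reviews) / n) + log_likelihood(tokens, pos_counts)
    neg = math.log(len(negative_reviews) / n) + log_likelihood(tokens, neg_counts)
    # Predict whichever class has the greater (log) probability
    return "positive" if pos > neg else "negative"
```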

Ultimately, the program predicts a sentiment based on which class probability is greater. And here is the final result and percent error of my Naive Bayes algorithm:

predictions and actual are two lists that hold the classification decision (former) and the actual classification (latter) of each test review
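That comparison boils down to a few lines (a sketch assuming the two lists above):

```python
# Count disagreements between predicted and actual labels
errors = sum(p != a for p, a in zip(predictions, actual))
percent_error = 100 * errors / len(actual)
print(f"Percent error: {percent_error:.2f}%")
```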

And that’s it! A future improvement would be to use cross-validation for training. In the meantime, pandas’ df.sample() was handy because it selects a random chunk of the data on each run, as in the sketch below.
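Roughly, that split looks like this (the 80/20 ratio here is an illustrative choice, not a rule from my program):

```python
# Randomly sample 80% of the rows for training; the rest become the test set
train_df = df.sample(frac=0.8)
test_df = df.drop(train_df.index)
```

For now, I’m going to take a break and watch my next favorite k-drama, Vincenzo😍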

Thanks so much for reading until the end! Happy coding!
