Code Diaries: Sentiment Analysis (Naive Bayes)
When searching for my next favorite vegan restaurant, Google Reviews is the first place I check before I hop in the car. And the same can be said of my k-drama selections. Out of five stars, how many do you think this user rated the k-drama Crash Landing on You?
“Hands down one of the best kdrama I’ve ever seen. It’s special in [its] own kind and this drama is really close to my heart.”
-Aurora Nebula, Google Reviews
Shame on AnnaSophia Robb...it was five stars, which I wholeheartedly agree with ;)
Okay enough about CLOY, let’s get on with another interesting acronym: NLP!!!
As part of my quest to learn more about NLP, I coded a project that uses a Naive Bayes classifier to perform sentiment analysis. Similar to the CLOY review above, I trained my classifier on movie reviews from the IMDb dataset. This is one of the most common examples of sentiment analysis, but other use cases include predicting election outcomes and market trends. In preparation for my project, I learned that sentiment is categorized under Scherer’s typology of affective states:
Emotion (e.g. angry, sad, joyful)
Mood (e.g. gloomy, cheerful, depressed)
Interpersonal stance (e.g. friendly, flirtatious, distant)
Attitude (e.g. liking, loving, hating)
Personality Trait (e.g. arrogant, reckless, confident)
For the k-drama review, we analyzed the user’s attitude because it reflects the disposition towards an object and/or person. Similarly, my program analyzes the attitudes of movie reviews to predict a positive or negative polarity.
Human language is full of nuances that make sentiment analysis, and NLP in general, challenging. I touched on this in my previous article regarding crash blossoms. These nuances include, but are not limited to, subtlety and thwarted expectations. For example, here are two reviews from the IMDb dataset:
"The 33 percent of the nations nitwits that still support W. Bush would do well to see this movie, which shows the aftermath of the French Revolution and the terror of 1794 as strikingly similar to the post 9/11 socio-political landscape."
This review excerpt is subtle because there aren’t many keywords that indicate a strong polarization towards positive or negative. Interpreting reviews like this one requires background political and historical knowledge about the French Revolution and 9/11. You would need to train your classifier not only on sentiment polarity, but on public sentiment as well. Ultimately, the IMDb dataset listed this review as negative.
"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore and it [has] continued its decline further to the complete waste of time it is today."
At a glance, this review seems positive, since it opens with words that have positive associations: ‘amazing’, ‘fresh’, etc. However, as you read further the true sentiment is revealed. This is a case of thwarted expectations: the review praises the show’s early years in the 1970s, then pans what it had become by 1990 and what it is today.
Luckily, the IMDb dataset pre-determined the polarity of each review. I chose to base my program on Pang and Lee’s baseline method. I split my code into four steps, including tokenization, feature extraction, and classification.
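To make the first two steps concrete, here is a minimal sketch of what tokenization and bag-of-words feature extraction could look like. The helper names `tokenize` and `extract_features` and the regular expression are illustrative assumptions, not the project's actual code.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def extract_features(tokens):
    """Bag-of-words features: map each token to how often it occurs."""
    return Counter(tokens)

review = "Hands down one of the best kdrama I've ever seen."
tokens = tokenize(review)
features = extract_features(tokens)
print(tokens)
print(features.most_common(3))
```

A bag-of-words representation throws away word order, which is part of why reviews with thwarted expectations are hard for this kind of classifier.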
However, the baseline algorithm assumes equal class frequencies within the dataset. Most real-world datasets don’t fit that assumption, but I still decided to follow the baseline method because the purpose of my project was to get some introductory practice with sentiment analysis. Unbalanced class frequencies call for extra work, such as re-sampling the training data, cost-sensitive learning, and evaluating the classifier with an F-score instead of plain accuracy.
So in order to check whether my selected datasets were balanced, I performed Exploratory Data Analysis to get a summary of each dataset.
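As an illustration of the F-score mentioned above, here is a small sketch that computes F1 (the harmonic mean of precision and recall) from scratch. The function name and the toy labels are made up for the example; they are not results from the project.

```python
def f1_score(true_labels, predicted_labels, positive="positive"):
    """F1 for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy, made-up labels purely for illustration.
truth = ["positive", "negative", "negative", "negative", "positive"]
preds = ["positive", "negative", "negative", "positive", "negative"]
score = f1_score(truth, preds)
print(score)
```

Unlike accuracy, F1 does not reward a classifier for simply guessing the majority class, which is why it tends to come up when classes are unbalanced.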
Here we can see that the IMDb dataset has equal class frequencies! Yay! I was inclined to analyze the financial news dataset, but I’ll save that for another day ;) I’ll continue doing EDA before working with datasets to ensure I know what I’m working with!
Out of the Naive Bayes, MaxEnt, and SVM classifiers, I chose Naive Bayes because it scales well to large datasets (IMDb has 50,000 reviews) and it’s simple for an NLP novice like myself to understand.
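A class-balance check like this can be sketched in a few lines of pandas. The tiny DataFrame and the column names `review` and `sentiment` below are assumptions standing in for the real file.

```python
import pandas as pd

# Tiny stand-in for the real reviews file; column names are assumptions.
df = pd.DataFrame({
    "review": ["loved it", "hated it", "brilliant", "a waste of time"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

class_counts = df["sentiment"].value_counts()                 # rows per class
class_shares = df["sentiment"].value_counts(normalize=True)   # fraction per class
is_balanced = class_counts.nunique() == 1                     # True if all classes have the same count
print(class_counts)
print(class_shares)
```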
Naive Bayes is a linear classifier that applies Bayes’ theorem with the “naive” assumption that all features (here, the words in a review) are independent of one another given the class.
A linear classifier makes a linear decision boundary like this:
A limitation of this model is that it may misclassify some data points (note the green dots in the red space), unlike a nonlinear decision boundary:
Below is Bayes’ Theorem:
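In its standard form, with $c$ standing for a class (positive or negative) and $d$ for a document (a review), the theorem reads as follows. The symbol names are my own choice of notation.

```latex
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
```

Here $P(c)$ is the prior probability of the class, $P(d \mid c)$ is the likelihood of the review given the class, and $P(d)$ is the same for every class, so it can be ignored when comparing classes.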
And finally, here is the Naive Bayes algorithm:
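In the usual textbook form, for a review $d$ made up of the words $w_1, \dots, w_n$, the classifier picks the class with the highest score. This is the standard formulation with my own notation, and it may differ in detail from the original figure.

```latex
\hat{c} = \arg\max_{c \in \{\text{pos},\, \text{neg}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The product over individual words is where the independence assumption comes in: each word contributes its own $P(w_i \mid c)$, regardless of the words around it.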
Now on to the technical details:
The IMDb dataset was the first time I worked with a csv file. I used the pandas library to read the contents of the csv file into a DataFrame, which meant I also had to learn how to access elements within a DataFrame. My previous text prediction project worked with a txt file, so I accessed words through iteration. This time I used the pandas documentation to learn about the differences between
.loc[], .at[], .iloc[], .iat[]
to access specific elements (these are indexers, so they take square brackets rather than parentheses). Ultimately, I found .at[] most useful because I wanted to access a single element, as opposed to a whole row or column Series with .loc[].
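Here is a small sketch of the four accessors side by side. The DataFrame contents and column names are made up for the example.

```python
import pandas as pd

# Made-up data; the real file's column names may differ.
df = pd.DataFrame({
    "review": ["loved it", "hated it", "brilliant"],
    "sentiment": ["positive", "negative", "positive"],
})

row_by_label = df.loc[1]                 # Series: the row with index label 1
row_by_position = df.iloc[1]             # Series: the row at integer position 1
cell_by_label = df.at[1, "sentiment"]    # scalar: row label 1, column "sentiment"
cell_by_position = df.iat[1, 1]          # scalar: row position 1, column position 1
print(cell_by_label, cell_by_position)
```

With the default integer index, labels and positions coincide, so the difference between the label-based and position-based accessors only shows up once the index is something else (or the rows have been shuffled).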
Last but not least, I implemented the Naive Bayes algorithm with add-1 smoothing to account for words not seen in the training set. First I counted how often a ‘token’ appears in each class (positive or negative) in the training data. Then I divided that count by the total number of words in the class, adding 1 to the numerator and, in the standard formulation, the vocabulary size to the denominator. Add-1 smoothing ensured that I wouldn’t get a ‘zero’ probability for a word that did not appear in my training set.
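Below is a minimal sketch of Naive Bayes with add-1 smoothing in its standard textbook form. The function names, the toy training data, and the use of log probabilities (to avoid multiplying many tiny numbers) are my assumptions, not necessarily how the project's code is written.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns counts needed for prediction."""
    word_counts = {}        # label -> Counter of token counts in that class
    doc_counts = Counter()  # label -> number of training documents
    for tokens, label in labeled_docs:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def predict(tokens, word_counts, doc_counts, vocab):
    """Return the label with the highest log-probability score."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokens:
            # Add-1 smoothing: unseen tokens have count 0, so they get
            # 1 / (total_words + vocabulary size) instead of zero.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, made-up training data purely for illustration.
training = [
    (["brilliant", "fresh", "amazing"], "positive"),
    (["loved", "amazing", "show"], "positive"),
    (["complete", "waste", "of", "time"], "negative"),
    (["not", "funny", "waste"], "negative"),
]
model = train(training)
prediction = predict(["amazing", "unseen_word"], *model)
print(prediction)
```

Without the `+ 1`, the unseen token would contribute a probability of zero (and `math.log(0)` would raise an error), wiping out the evidence from every other word in the review.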
Ultimately, the program predicts a sentiment based on which class probability is greater. And here is the final result and percent error of my Naive Bayes algorithm:
And that’s it! A future improvement would be to use cross-validation for training. However, using pandas’ df.sample() was handy because it selects a random chunk of data with each run. For now, I’m going to take a break and watch my next favorite k-drama, Vincenzo😍
Thanks so much for reading until the end! Happy coding!