Understanding The Emotions Behind The Human Language

Sentiment Analysis of Reviews Using NLTK in Python

Lye Jia Jun
Geek Culture
10 min read · Jun 17, 2022


According to research by Roger Bohn and James Short at the University of California, San Diego, the average American consumes roughly 34 gigabytes of data every day (based on their 2009 report).

That is… a lot of data. At the time of writing, the amount of data consumed by an individual would likely be significantly greater.

If an individual could consume such a large amount of data per day, how much data do you think organizations are collecting?

The One Datatype That Stood Out — Textual Data

Among the vast volume of data on the web, one type stands out: text — with some sources estimating that 80% of the information on the web is textual.

Indeed, whatever the exact percentage, none of us can deny the significance that textual data holds.

Through a process known as text mining, i.e. extracting insights from textual information, companies are able to make data-driven business decisions to grow and scale their organization.

In this article, I will share a simple sentiment analysis algorithm to derive the sentiment of text using the NLTK library in Python.

What is sentiment analysis? What is NLTK? Read on to find out!

What is NLTK?

NLTK stands for Natural Language Toolkit, an NLP (Natural Language Processing) library in Python which allows us to analyze textual data and derive valuable insights.

Some features NLTK provides include the following (a quick demo follows the list):

  • Tokenization (i.e. splitting sentences into a list of words, or splitting paragraphs into a list of sentences)
  • Stemming (i.e. reducing a word to its root form — ‘caring’, ‘careful’, and ‘careless’ all share the root word ‘care’)
  • Parts of Speech Tagging (i.e. tagging each word as a noun, pronoun, adjective, verb, etc.)
  • …and more!
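
Here is a minimal sketch of these features in action, assuming nltk is installed and the punkt and averaged_perceptron_tagger resources have been downloaded (resource names can vary slightly between nltk versions):

import nltk
from nltk.stem import PorterStemmer

# One-time downloads, uncomment on first run:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "She was caring and careful, never careless."

# Tokenization: split the sentence into word tokens
tokens = nltk.word_tokenize(sentence)

# Stemming: reduce each word to its root form, e.g. 'caring' -> 'care'
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Part of speech tagging: label each token as a noun, verb, adjective, etc.
tagged = nltk.pos_tag(tokens)

print(tokens, stems, tagged, sep='\n')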

This powerful library allows us to analyze our text data and derive the sentiment of texts.

What is Sentiment Analysis?

Sometimes referred to as opinion mining, sentiment analysis refers to identifying the emotions expressed in a text.

Example

“I am very happy” is a positive sentence, and thus it is identified to have a positive sentiment.
“This camera is so bad that it made me cry” is a negative sentence, and thus it is identified to have a negative sentiment.

Our Workspace — Jupyter Lab

To demonstrate examples of certain methods in the NLTK library, I will be using JupyterLab as my primary workspace.

JupyterLab is a web app that lets us see interactive outputs from our Python code. It is widely used by data scientists and machine learning engineers.

Our Textual Data Source — Amazon Kindle Device Review

Our objective in this project is to identify the sentiment of various texts.

Here, I will be using reviews of a popular reading device on the Amazon store, the Kindle, as the data source for the sentiment analysis. To help us with this process, I have completed the data collection phase and extracted several reviews containing both positive and negative sentiments.

kindle_reviews.txt

As humans, we can quite clearly identify that the first review is negative and the second review is positive.

First review — The warranty and support is really bad: clearly negative
Second review — I highly recommend this...: clearly positive

Let’s see how we can create a sentiment analysis algorithm to do the same.

Our Sentiment Analysis Algorithm

For our sentiment analysis algorithm, we will create the following workflow:

  1. Read the file source and extract individual reviews
  2. For each review, we will tokenize the paragraph into word tokens using nltk.word_tokenize()
  3. For each word in the review, we will identify the part of speech (i.e. noun, verb, adjective, etc.) associated with it using nltk.pos_tag()
  4. After obtaining the list of words along with their respective parts of speech, we will extract all adjectives into an adjective word list.
  5. Next, we will iterate through each adjective in the adjective word list and identify if that adjective is positive or negative based on our positive.txt and negative.txt files which contain a list of positive and negative words.
    If the adjective is positive, we will add 1 to the sentiment score of the review.
    If the adjective is negative, we will subtract 1 from the sentiment score of the review.
  6. If the final score is positive, we can safely assume that the review is positive. Also, the higher the score, the more positive it is. Otherwise, the review can be deemed as neutral (if the score is 0) or negative (if the score is negative).

Confused? Don’t worry, I’ll be explaining each block of code to create this algorithm so follow along!

Note* — The source code for this project is also available on GitHub: https://github.com/cyberjj999/kindle_reviews_sentiment_analysis

Without further ado, let us jump right into coding!

Step 1. Read the file source and extract individual reviews

The first step is to extract our individual reviews from kindle_reviews.txt, a file containing reviews that I’ve prepared beforehand.

To perform programming logic on each review, we need to first extract all the reviews into a list.

Here, we open the file containing the reviews and extract each review as a list item by splitting the contents on the delimiter \n\n
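
A minimal sketch of this step might look like the following, assuming the reviews in kindle_reviews.txt are separated by blank lines:

# Read the review file and split it into individual reviews
with open('kindle_reviews.txt', 'r', encoding='utf-8') as f:
    # Each review is separated by a blank line, i.e. the '\n\n' delimiter
    review_list = f.read().split('\n\n')

print(review_list[-1])  # the last review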

Accessing the last element of the list using review_list[-1] shows us the last review.

Note* — As the data were prepared beforehand for this project, no data-wrangling procedures are required. However, if you scrape the reviews yourself using Python libraries like bs4, you will need to perform data cleaning so the source data can be used accurately in further analysis.

Step 2. For each review, we will tokenize the paragraph into word tokens using nltk.word_tokenize()

Here, we are tokenizing each review (i.e. a paragraph of text) into a list of words using the nltk.word_tokenize() method.
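
For instance, tokenizing the last review could look like this (a sketch reusing review_list from Step 1):

import nltk

review = review_list[-1]
# Split the review paragraph into individual word tokens
word_tokens = nltk.word_tokenize(review)
print(word_tokens)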

With the list of words, we can proceed to perform the part of speech identification.

Note* — The word_tokenize() method in nltk is not as simple as Python's string split() method. From the method's source code, we can see that various regex-based rules are implemented in the tokenization process.

Step 3. For each word in the reviews, we will identify the part of speech associated with it using nltk.pos_tag()

To identify parts of speech, we can pass our list of words as an argument into the nltk.pos_tag() method and specify the parameter of tagset='universal'

The purpose of tagset='universal' is to simplify the part of speech tagging into general categories such as verbs, nouns, adjectives, etc. Without using the universal tagset, there will be very specific tagging and multiple variations of verbs, nouns, adjectives, etc.
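
A sketch of this step (running nltk.download('universal_tagset') once may be required for the simplified tags):

import nltk

# Tag each token with a simplified (universal) part of speech
pos_list = nltk.pos_tag(word_tokens, tagset='universal')
print(pos_list)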

The output is a list of tuples, with the first element storing the word and the second element storing the part of speech value — i.e. ('reading', 'VERB')

Step 4. After obtaining the list of words along with their parts of speech, we will extract all adjectives into an adjective word list.

This step requires some basic Python knowledge of working with list and tuple.

We can do a simple list comprehension to filter pos_list and extract only adjectives — i.e. those tagged as "ADJ".

adj_list = [word[0] for word in pos_list if word[1] == 'ADJ']

This code iterates through pos_list and extracts the first element (the original word) if the second element (the part of speech value) is equal to ADJ (adjective).

Great! We now have a list of adjectives. The next step is simple.

Step 5. We will iterate through each adjective in the adjective word list and identify if that adjective is positive or negative based on our positive.txt and negative.txt files which contain a list of positive and negative words.

We can first extract all positive and negative words from the text file and store them in lists.
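
A sketch of the loading step, assuming positive.txt and negative.txt contain one lowercase word per line:

# Load the positive and negative word lists, one word per line
with open('positive.txt', 'r', encoding='utf-8') as f:
    positive_words = [line.strip() for line in f if line.strip()]

with open('negative.txt', 'r', encoding='utf-8') as f:
    negative_words = [line.strip() for line in f if line.strip()]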

Here, you may notice that some words are misspelled — such as ‘accessable’. This is not a mistake: internet reviews are often full of typos and misspellings, so the list deliberately includes commonly misspelled words, which allows our analysis to be more accurate.

Next, we will calculate our sentiment score by comparing our adjective list with the positive and negative word lists.

If the adjective is positive, we will add 1 to the sentiment score of the review.
If the adjective is negative, we will subtract 1 from the sentiment score of the review.

We first specify a sentiment_score variable to track our sentiment score. Then we check if each adjective is in the positive or negative word list and add/subtract the sentiment score accordingly.

It is important to note that we also normalize each review adjective by converting it to lowercase using the lower() method. This allows us to match each adjective more accurately against the positive and negative word lists, which contain only lowercase words.
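
Putting this together, the scoring loop might look like the following sketch, reusing adj_list from Step 4:

sentiment_score = 0
for adjective in adj_list:
    word = adjective.lower()  # normalize case to match the lowercase word lists
    if word in positive_words:
        sentiment_score += 1  # positive adjective: add 1
    elif word in negative_words:
        sentiment_score -= 1  # negative adjective: subtract 1

print(sentiment_score)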

Here, you can see that the final sentiment score calculated is 4.

Step 6. If the final score is positive, we can safely assume that the review is positive. Also, the higher the score, the more positive it is.

Using this logic, we can see that the sentiment of our review is positive — with a desirable score of 4!
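
The final classification is then a simple comparison — a minimal sketch:

# Map the final score to a sentiment label
if sentiment_score > 0:
    sentiment = 'positive'
elif sentiment_score < 0:
    sentiment = 'negative'
else:
    sentiment = 'neutral'

print(f'Sentiment: {sentiment} (score: {sentiment_score})')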

Let us take a look at the review once again.

"I've been reading with an e-reader for years. Mostly a Kobo. Also have the Kobo Forma, great e-reader. But I was always curious about the Kindle. Now ordered the Kindle with Amazon.nl and I'm very happy with it! Nice small size. Fits almost in my pocket. Fantastic dictionaries and great ease of reading. I can see that there are years of experience with e-readers behind this. Super purchase."

great, happy, Fantastic, super — seems like we have lots of positive words!

Accurate? It definitely seems like it!

The Output After Tidying Up The Code

As we can see, the algorithm works decently well! We were able to identify different positive and negative adjectives and make a decision on whether they contribute to the sentiment score positively or negatively.

Of course, some clearly positive/negative words were missed because nltk did not identify them as adjectives. Conversely, if we simply counted all positive/negative words regardless of their part of speech, we would end up capturing them regardless of their context.

This goes to show that good sentiment analysis isn't as simple as counting the number of positive and negative adjectives, and that there will be numerous cases where this algorithm won't perform desirably.

Improvements to the Algorithm

There are many algorithms we can explore to refine our current sentiment analysis algorithm further.

One improvement is making use of bigrams or trigrams (i.e. sequences of two or three consecutive words) to calculate the sentiment score more accurately.

Consider the following review:

“This device is not good at all. I heard from my friend that it was great, but it is really not very impressive.”

We can probably identify 3 positive words here — “good”, “great”, and “impressive”. However, it is clear that the review is negative.

To improve our algorithm, we can take into account the word(s) before a positive/negative word. "NOT good" = <negation> * <positive> = <negative>, and thus it should be identified as negative. "NOT bad" = <negation> * <negative> = <positive>, and thus it should be identified as positive.

Using bigrams and trigrams, we could likely make the algorithm much more robust.
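
As a rough illustration (one possible refinement, not this article's implementation), here is how bigrams could flip the polarity of a sentiment word that follows a negator — the NEGATORS set below is a hypothetical, non-exhaustive example:

import nltk

NEGATORS = {'not', 'no', 'never', "n't", 'hardly'}  # illustrative, not exhaustive

def bigram_sentiment_score(tokens, positive_words, negative_words):
    score = 0
    # Walk through consecutive (previous word, current word) pairs;
    # note the very first token is only used as context here
    for prev, word in nltk.bigrams(tokens):
        word = word.lower()
        polarity = 1 if word in positive_words else -1 if word in negative_words else 0
        # Flip the polarity when the preceding word is a negator:
        # "not good" -> negative, "not bad" -> positive
        if polarity != 0 and prev.lower() in NEGATORS:
            polarity = -polarity
        score += polarity
    return score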

Closing Thoughts

And… that’s a wrap! Thank you for reading my article on sentiment analysis with NLTK in Python! I hope you had some valuable takeaways and learned something new.

Human language is deeply complicated, but it is also extremely important. Some say that language is one of the primary reasons we were able to truly thrive and flourish as a species, and I definitely agree!

Understanding language is as much of an art as it is a science and I am hopeful for the advancement of NLP (Natural Language Processing) in the coming years!

Just a gentle reminder — you can access the full source code of this project here: https://github.com/cyberjj999/kindle_reviews_sentiment_analysis

Keep rocking and enjoy learning! Until next time!
