NLP: Sentiment Analysis or Emotion Mining on Amazon Product Reviews - Part-1

Bharati Hitesh Koli · Published in CodeX · Sep 23, 2021 · 7 min read

Kudos! Now that you have successfully mined the text corpus from the Amazon website, let’s learn the NLP techniques to perform Sentiment Analysis or Emotion Mining on the extracted product reviews.

In case you are still struggling with text mining, refer to my previous article Text Mining: How to extract Amazon Reviews using Scrapy, which explains it in a very simple manner.

So, let’s get started. Here is a systematic approach to Natural Language Processing for performing Sentiment Analysis or Emotion Mining. This article is divided into two parts: Part-1 covers text preprocessing and feature extraction, and the next part covers Sentiment Analysis or Emotion Mining on the text corpus.

Import Libraries and Dataset

First, we start by importing all the required libraries for NLP in Python.

Two libraries that are especially important for NLP here are string and spaCy. Both play a huge role in text processing.

A string in Python is an ordered sequence of characters. It is immutable: once defined, it cannot be changed in place. Various modifications are possible with methods such as replace(), join(), strip() or split(), but these do not change the original string; they return a new, modified string.
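For instance, a quick illustration of this behaviour (a minimal sketch):

```python
review = "  Great washing machine!  "
cleaned = review.strip().replace("!", "")  # strip() and replace() return new strings
print(cleaned)   # 'Great washing machine'
print(review)    # the original string is unchanged: '  Great washing machine!  '
```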

Spacy is used for advanced NLP in python. It is designed precisely to process and understand large volumes of text including information extraction, natural language understanding (NLU) or preprocess text for deep learning. It has many linguistic machine learning functionalities like Tokenization, Part-of-speech (POS) Tagging, Lemmatization, Named Entity Recognition (NER), Text Classification etc.
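Below is a minimal sketch of the imports used throughout this walkthrough; the exact set in your notebook may differ.

```python
import string                          # punctuation constants and basic string utilities
import pandas as pd                    # loading and handling the reviews dataset

import nltk                            # tokenization, stop words, stemming
import spacy                           # lemmatization and other advanced NLP

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# one-time downloads of the NLTK resources used later
nltk.download("punkt")
nltk.download("stopwords")

# small English model for spaCy (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
```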

Now that you have some background on these libraries, let’s move on and import the Amazon reviews dataset that we already extracted for the Bosch front-load washing machine, as below:
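A minimal loading sketch, assuming the scraped reviews were saved to a CSV file; the file name and column name below are placeholders for whatever your Scrapy spider produced.

```python
# hypothetical output file from the Scrapy spider built in the previous article
df = pd.read_csv("bosch_washing_machine_reviews.csv")

# assume the review text lives in a column named "review_text"
reviews = df["review_text"].astype(str).tolist()

print(df.shape)
df.head()
```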

The output shows the first few rows of the extracted reviews.

Text Preprocessing

The starting point is to clean and prepare the text data before it is fed into a model or used for analysis. Text preprocessing helps get rid of the noise present in data such as comments, reviews or tweets. It prepares the text for analytics by converting all characters to lowercase, removing punctuation marks, removing stop words and typos, etc.

1. We start by removing both leading and trailing characters, and removing empty strings

2. Next, join the previous output list into one string/text

3. Remove punctuation from the string or text

4. Perform text tokenization using the NLTK library

5. Remove stop words (frequently appearing generic words having little significance) from the text tokens; all five steps are sketched in the snippet below
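Here is a sketch of the five steps, assuming `reviews` is the list of raw review strings loaded earlier:

```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# 1. strip leading/trailing characters and drop empty strings
reviews_clean = [r.strip() for r in reviews if r.strip()]

# 2. join the list of reviews into one string/text
text = " ".join(reviews_clean)

# 3. remove punctuation from the text
no_punc = text.translate(str.maketrans("", "", string.punctuation))

# 4. tokenize the text using NLTK
tokens = word_tokenize(no_punc)

# 5. remove stop words (frequent, low-significance words)
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.lower() not in stop_words]

print(tokens[:20])
```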

Text Normalization

After preprocessing the text, the next step is Text Normalization, i.e. converting the text to all lowercase letters. It is done to reduce randomness and bias by converting the text into a standard format, which improves the computer’s efficiency in dealing with different kinds of information.
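Continuing with the `tokens` list from the previous step, lowercasing is a one-liner:

```python
# convert every token to lowercase so 'Good', 'GOOD' and 'good' are treated alike
tokens = [t.lower() for t in tokens]
```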

The goal of text normalization is also to derive the word’s root or basic form. This can be achieved using either of the two ways: Stemming or Lemmatization.

Stemming

Here, the last few characters are removed from each word, which often leaves incorrect spellings and meanings. For example: wasting -> wast

Lemmatization

It is better than stemming, as it considers the context of each word and converts it into a meaningful base form with correct spelling. For example: wasting -> waste
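A small side-by-side sketch of both approaches, using NLTK’s PorterStemmer for stemming and the spaCy model loaded earlier for lemmatization:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("wasting"))                  # 'wast' -> crude cut, not a real word

# lemmatization uses context and part of speech to find the dictionary form
doc = nlp("He is wasting water")
print([(tok.text, tok.lemma_) for tok in doc])  # 'wasting' -> 'waste'

# lemmatize the whole cleaned token list
lemmas = [tok.lemma_ for tok in nlp(" ".join(tokens))]
```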

Feature Extraction

After text normalization and deriving each word’s base form, the next step is Feature Extraction. It focuses on analyzing the similarities between different pieces of text. An NLP algorithm or model cannot process raw text, so feature extraction is needed to convert the text into a matrix or vector format. The two most popular methods are BOW and TF-IDF.

1. BOW: Bag Of Words Count Vectorizer

Bag of Words creates a vocabulary of the unique words present in a text corpus and performs text vectorization. The BOW count vectorizer creates a matrix of features by assigning a separate column to each word, with each row corresponding to a review text. The values inside the matrix signify how many times each unique word appears in that review (or simply its presence/absence in the binary variant).
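A minimal BOW sketch with scikit-learn’s CountVectorizer, applied to the cleaned review strings (`reviews_clean`) from the preprocessing step:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()                       # pass binary=True for a pure presence/absence matrix
bow_matrix = cv.fit_transform(reviews_clean)

print(bow_matrix.shape)                      # (number of reviews, vocabulary size)
print(cv.get_feature_names_out()[:10])       # first few words in the vocabulary
```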

BOW CV using N-Grams

The major disadvantage of the BOW count vectorizer is that the order in which words occur is lost, because only token counts are recorded. Thus, the text loses its context. This problem can be overcome by using N-grams, which preserve the local ordering of words. N-grams can be uni-grams (a single word like ‘happy’), bi-grams (two words together like ‘totally disappointed’) or tri-grams (three words together like ‘incomplete without service’).

However, using N-grams with BOW creates a huge sparse matrix (lots of 0’s) when the vocabulary is large. Thus, we need to remove both high-frequency n-grams (essentially stop words) and low-frequency n-grams (mostly typos); medium-frequency n-grams usually work best. We can also limit the number of features used in the matrix, as shown below.
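A sketch of BOW with n-grams, where `max_df`/`min_df` prune very frequent and very rare n-grams and `max_features` caps the matrix size; the thresholds here are illustrative.

```python
cv_ngram = CountVectorizer(
    ngram_range=(1, 3),   # uni-, bi- and tri-grams
    max_df=0.9,           # drop n-grams present in more than 90% of reviews (stop-word-like)
    min_df=2,             # drop n-grams present in fewer than 2 reviews (typos, noise)
    max_features=1000,    # limit the number of feature columns
)
bow_ngrams = cv_ngram.fit_transform(reviews_clean)
print(bow_ngrams.shape)
```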

2. TF-IDF Vectorizer using N-Grams

One more shortcoming of the BOW count vectorizer with n-grams is that it gives no special weight to n-grams which appear rarely in the corpus but carry significant meaning, like ‘No demo call’. Though this n-gram appears rarely in the corpus, it highlights a major problem to be looked into. To its rescue comes the TF-IDF vectorizer.

TF-IDF is an acronym for Term Frequency-Inverse Document Frequency and is the product of the two: TFIDF = TF × IDF. The TF-IDF value increases proportionally with the number of times a word appears in a document (review) and decreases with the number of documents in the corpus that contain that word.

TF-IDF highlights specific n-grams that occur rarely but hold great importance. The TF-IDF score is high when an n-gram has a high frequency within a document but a low document frequency across the corpus. For an n-gram with a high document frequency in the corpus, the IDF value, and hence the TF-IDF value, approaches 0.
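A matching TF-IDF sketch with scikit-learn’s TfidfVectorizer; the top-scoring n-grams of a review are the ones frequent in that review but rare in the corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=1000)
tfidf_matrix = tfidf.fit_transform(reviews_clean)

# highest-scoring n-grams for the first review
row = tfidf_matrix[0].toarray().ravel()
top = np.argsort(row)[::-1][:5]
print([(tfidf.get_feature_names_out()[i], round(row[i], 3)) for i in top])
```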

Now our text data, which a machine could not interpret directly, has been transformed into numerical data that it can easily process. From here, you can go ahead, build a suitable prediction model and find answers to business problems.

As the purpose of this article is not model building but analyzing the sentiment behind the text, let’s continue further.

Word Cloud

A very nice visualization of the high-frequency words in a text corpus can be achieved by generating a word cloud. In a word cloud, the size of each word indicates its frequency or importance. Just by looking at it, you can judge what your customers are saying about your product.
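A minimal word-cloud sketch using the wordcloud package on the cleaned tokens:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(tokens))

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```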

With that, we are done with text preprocessing and feature extraction. To read further on how to perform sentiment analysis and emotion mining, refer to the upcoming article NLP: Sentiment Analysis or Emotion Mining on Amazon Product Reviews - Part-2.

P.S. I shall write more about how to perform NLP text mining, including Text Preprocessing, Feature Extraction, Named Entity Recognition and Emotion Mining or Sentiment Analysis of Amazon product reviews, in my coming articles. So do follow my posts on Medium 😃 Happy Learning!

Please follow me on GitHub, where you will find 170+ such repositories.

Also, do let me know what you think about this article, and please give it some claps if you find it useful.
