Sentiment Analysis of Amazon Reviews Using Natural Language Processing
Reviews serve as the lifeblood of every business, offering invaluable insights into customer satisfaction, preferences, and areas for improvement.
In today’s digital age, where consumers wield unprecedented influence through online platforms, understanding and interpreting these reviews is paramount for businesses seeking to thrive in competitive markets.
However, with the large volume of reviews generated daily, manually analyzing them becomes impractical. This is where sentiment analysis emerges as a crucial tool, enabling businesses to extract actionable intelligence from the vast sea of customer feedback.
By deciphering the underlying sentiments expressed within reviews, sentiment analysis empowers businesses to make informed decisions, enhance customer experiences, and ultimately, drive success.
Project Overview
This project performs sentiment analysis on Amazon reviews using two different approaches: VADER (Valence Aware Dictionary and sEntiment Reasoner) and a pre-trained RoBERTa (Robustly Optimized BERT Pretraining Approach) model.
Overview of Dataset
I downloaded the dataset from Kaggle: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
The dataset consists of Amazon product reviews, containing the following fields:
- Review ID: A unique identifier for each review.
- Product ID: A unique identifier for the product being reviewed.
- User ID: A unique identifier for the user who wrote the review.
- Profile Name: The name of the user who wrote the review.
- Helpfulness Numerator: The number of users who found the review helpful.
- Helpfulness Denominator: The total number of users who indicated whether the review was helpful or not.
- Review Score: The star rating given by the user, ranging from 1 to 5.
- Timestamp: The timestamp of when the review was posted.
- Review Summary: A summary of the review content.
- Review Text: The main body of the review, containing detailed feedback and opinions.
The dataset contains a total of 568,454 reviews, but only 500 were selected for analysis in this project.
Technologies Used
Python, with the following libraries:
- Pandas
- Matplotlib
- Seaborn
- NLTK (Natural Language Toolkit)
- Transformers
- Torch
- TensorFlow
- Flax
Data Exploration and Processing
1. Import relevant libraries
pandas, numpy, matplotlib, and seaborn are commonly used libraries for data manipulation, numerical computation, and visualization.
The line plt.style.use('ggplot') sets the plotting style to emulate the look of ggplot2, the popular R plotting package. Plots in this style have a gray background, white grid lines, and a colorful default palette.
nltk is the Natural Language Toolkit, which provides tools for natural language processing tasks.
2. Read the data
Here is an overview of the data frame:
3. Data Pre-processing
Dataset size
The dataset has 568,454 rows and 10 columns.
To speed up processing, limit the dataset to a number of rows of your choice; I chose 500. If you have the time and resources, you can of course work with the whole dataset.
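A minimal sketch of the loading and truncation step. The helper name load_reviews and the file name Reviews.csv in the usage note are assumptions for illustration, not code taken from the original notebook.

```python
import pandas as pd


def load_reviews(source, n_rows: int = 500) -> pd.DataFrame:
    """Read the reviews CSV and keep only the first `n_rows` rows.

    `source` can be a file path or any file-like object accepted by
    pandas.read_csv. The helper name and default are illustrative.
    """
    df = pd.read_csv(source)
    print(f"Full dataset shape: {df.shape}")  # (rows, columns) before truncation
    return df.head(n_rows)
```

Calling df = load_reviews("Reviews.csv") would give the 500-row subset, assuming the Kaggle CSV is saved under that name in the working directory.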
4. Exploratory Data Analysis
Generate a bar plot of the review scores to visualize how the reviews are distributed across star ratings.
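One way to draw this plot, as a sketch. The column name Score is an assumption about how the star rating is stored in the data frame, and plot_score_distribution is a hypothetical helper; the titles and labels are illustrative.

```python
import pandas as pd


def plot_score_distribution(df: pd.DataFrame, score_col: str = "Score"):
    """Bar plot of how many reviews received each star rating."""
    # Count reviews per rating and order the bars from 1 to 5 stars.
    counts = df[score_col].value_counts().sort_index()
    ax = counts.plot(kind="bar", title="Count of Reviews by Stars", figsize=(10, 5))
    ax.set_xlabel("Review Stars")
    ax.set_ylabel("Number of Reviews")
    return counts, ax
```

Returning the counts alongside the axes makes it easy to check the numbers behind the bars.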
Result
This shows that most of the reviews are positive, and the negative reviews are very few.
Sentiment Analysis
Basic NLTK Sentiment Analysis
Download necessary NLTK resources such as lexicons, tokenizers, and part-of-speech taggers.
How it works
- First, extract one review from the 500 reviews to use as an example.
This code prints the text of the review at index 49 in the 'Text' column (which contains the review text) of the DataFrame df.
- Tokenize the content and display a slice of the result.
tokens = nltk.word_tokenize(example): This line tokenizes the review text stored in the variable example.
Tokenization is the process of breaking a text down into individual words or tokens.
The word_tokenize function from NLTK is used specifically to split text into word tokens.
tokens[:10]: This line slices the list of tokens to display only the first 10.
The [:10] syntax selects the elements from index 0 to index 9 (the first 10 elements) of the tokens list.
In short, this code tokenizes the text of a review and then displays the first 10 tokens.
- The next step is Part-of-speech (POS) tagging, which is used to assign a part-of-speech tag to each tokenized word in the review.
POS tagging plays a crucial role in various NLP tasks by providing linguistic insights and facilitating the analysis and interpretation of textual data.
This code performs part-of-speech tagging on the tokenized words of the review and then displays the part-of-speech tags assigned to the first 10 words.
- The last step is Named Entity Recognition (NER), the process in which named entities in the text are identified and classified.
This is important because it helps in extracting specific entities such as persons, organizations, locations, dates, and more from text data.
These entities often carry significant meaning and context within the text, and extracting them can provide valuable insights for various natural language processing tasks, such as information extraction, question answering, and knowledge graph construction.
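The tokenize, tag, and preview steps above can be sketched as a small helper. To keep the sketch runnable without NLTK's downloaded resources, the tokenizer and tagger are passed in as arguments; preview_pipeline and its parameters are hypothetical names of mine, not part of NLTK or the original notebook.

```python
def preview_pipeline(text, tokenize, tag, n=10):
    """Tokenize `text`, tag each token, and return the first `n` of each.

    `tokenize` maps a string to a list of tokens, and `tag` maps a list of
    tokens to (token, tag) pairs. With NLTK these would typically be
    nltk.word_tokenize and nltk.pos_tag.
    """
    tokens = tokenize(text)
    tagged = tag(tokens)
    # Slicing only limits what is displayed; the full text is still processed.
    return tokens[:n], tagged[:n]
```

With NLTK installed and its resources downloaded, preview_pipeline(example, nltk.word_tokenize, nltk.pos_tag) would return the two previews, and nltk.chunk.ne_chunk applied to the full tagged list would group the tokens into named entities. The resource names to pass to nltk.download differ between NLTK versions, so check the ones your version asks for.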
First Model: VADER
1. Initialize the VADER Sentiment Analyzer
Breakdown
from nltk.sentiment import SentimentIntensityAnalyzer: This line imports the SentimentIntensityAnalyzer class from the nltk.sentiment module. SentimentIntensityAnalyzer is NLTK's implementation of VADER, a rule-based analyzer that scores text against a sentiment lexicon rather than a trained neural model.
from tqdm.notebook import tqdm: This line imports tqdm, which provides a fast, extensible progress bar for Python. It wraps an iterable and shows a progress indicator during iteration.
The .notebook submodule is specific to Jupyter Notebook environments.
sia = SentimentIntensityAnalyzer(): This line creates an instance of the SentimentIntensityAnalyzer class and assigns it to the variable sia. The analyzer scores the sentiment of a piece of text by returning four polarity scores: negative, neutral, positive, and compound.
2. Apply VADER Sentiment Analysis on the dataset
This code snippet iterates over each row in the DataFrame df, retrieves the text of each review, and performs sentiment analysis using the SentimentIntensityAnalyzer instance (sia). It then stores the sentiment scores for each review in a dictionary called res, with the review ID (the 'Id' column) as the key.
Remember: tqdm() wraps the iterator and displays a progress bar during the iteration, making it easier to track progress.
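A sketch of that loop, written as a function so it can be tried with any analyzer. The name score_reviews and its parameters are illustrative assumptions; the analyzer only needs a polarity_scores(text) method, which is what the SentimentIntensityAnalyzer instance sia provides.

```python
import pandas as pd


def score_reviews(df: pd.DataFrame, analyzer, text_col: str = "Text", id_col: str = "Id"):
    """Return {review_id: polarity_scores_dict} for every row in `df`."""
    res = {}
    for _, row in df.iterrows():
        # Key each result by the review ID so it can be joined back later.
        res[row[id_col]] = analyzer.polarity_scores(row[text_col])
    return res
```

Wrapping df.iterrows() in tqdm(..., total=len(df)) adds the progress bar; it is left out here only to keep the sketch free of extra dependencies.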
3. Plot VADER results
Build a data frame from the results and merge it with the original data frame.
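A possible shape for this step, assuming res maps each review ID to its score dictionary and the original data frame has an Id column. The helper name results_to_frame is hypothetical.

```python
import pandas as pd


def results_to_frame(res: dict, df: pd.DataFrame, id_col: str = "Id") -> pd.DataFrame:
    """Turn {id: scores} into a DataFrame and left-merge the review columns."""
    # pd.DataFrame(res) puts IDs in the columns; transpose for one row per review.
    scores = pd.DataFrame(res).T
    scores = scores.reset_index().rename(columns={"index": id_col})
    return scores.merge(df, how="left", on=id_col)
```

The left merge keeps every scored review and attaches its star rating, text, and other metadata.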
Dataframe
In VADER (Valence Aware Dictionary and sEntiment Reasoner), the terms neg, pos, and neu represent different aspects of the sentiment expressed in a piece of text. They are determined by the intensity of the negative, positive, and neutral sentiment present in the text, respectively.
- neg (Negative Score): The neg score indicates the proportion of negative sentiment present in the text. It represents the extent to which the text expresses negative emotions, such as anger, sadness, or frustration. The value of neg ranges from 0 to 1, with higher values indicating a greater degree of negativity.
- pos (Positive Score): The pos score indicates the proportion of positive sentiment present in the text. It represents the extent to which the text expresses positive emotions, such as happiness, joy, or satisfaction. The value of pos also ranges from 0 to 1, with higher values indicating stronger positivity.
- neu (Neutral Score): The neu score indicates the proportion of neutral sentiment present in the text. It represents the extent to which the text is neutral or lacks a strong emotional polarity. A higher neu score suggests that the text contains more neutral language and less emotional content. Like neg and pos, the value of neu ranges from 0 to 1.
- compound (Compound Score): The compound score is a single value that represents the overall sentiment polarity of a piece of text. It takes into account both the positive and negative sentiment scores, along with their intensities, to provide a comprehensive assessment of the text's sentiment. It ranges from -1 to 1.
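To see why compound stays between -1 and 1: as far as I understand the reference VADER implementation, the summed word valences x are normalized as x / sqrt(x² + α) with α = 15. The sketch below reproduces only that normalization step. The name normalize_compound is mine, and the lexicon lookup and heuristics that produce the raw sum are left out.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15) -> float:
    """Squash an unbounded sum of word valences into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Guard against floating-point drift outside the interval.
    return max(-1.0, min(1.0, score))
```

A raw sum of 0 maps to 0, and increasingly large positive or negative sums approach 1 or -1 without ever exceeding them.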
4. Visualize VADER results
Plot a bar graph to visualize the results
Bar Graph
This indicates that most of the reviews are positive, with strongly positive sentiment polarity, while the negative reviews are few and have low sentiment polarity.
Sub-plot for each category
Create a subplot with three bar plots showing the distribution of positive, neutral, and negative sentiment scores across different review scores.
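A sketch of the three-panel figure using pandas and Matplotlib. It assumes a merged data frame, here called vaders, with a Score column plus pos, neu, and neg columns; those names and the helper plot_sentiment_by_score are assumptions. Each bar shows the mean score per star rating, which is also what Seaborn's barplot draws by default.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_sentiment_by_score(vaders: pd.DataFrame, score_col: str = "Score"):
    """Three side-by-side bar plots: mean pos, neu, and neg per star rating."""
    means = vaders.groupby(score_col)[["pos", "neu", "neg"]].mean()
    fig, axs = plt.subplots(1, 3, figsize=(12, 3))
    panels = zip(axs, ["pos", "neu", "neg"], ["Positive", "Neutral", "Negative"])
    for ax, col, title in panels:
        means[col].plot(kind="bar", ax=ax, title=title)
    fig.tight_layout()
    return means, fig
```

If the models agree with the star ratings, the positive panel should rise with the rating and the negative panel should fall.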
Sub-plots
Second Model: RoBERTa Pre-trained Model
1. Install the packages and dependencies required for using the RoBERTa pre-trained model.
Breakdown
The code begins by using the !pip install command to install several Python packages. These packages include:
- torch: PyTorch, a popular open-source machine learning library.
- tensorflow: TensorFlow, another widely used machine learning library developed by Google.
- flax: Flax, a neural network library that is tightly integrated with JAX, a high-performance numerical computing library.
- tensorflow-intel: An optimized version of TensorFlow for Intel architectures.
- ml-dtypes==0.2.0: A pinned version of the ml-dtypes library.
After installing the required packages, the code imports specific modules from these packages. These modules are necessary for working with transformer-based models and performing sequence classification tasks.
- AutoTokenizer and AutoModelForSequenceClassification are classes from the transformers library. These classes allow for easy loading of pre-trained transformer models and their corresponding tokenizers for various natural language processing (NLP) tasks.
- softmax is a function from the scipy.special module. It calculates the softmax function, which is commonly used to convert raw scores into probabilities.
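To make the role of softmax concrete, here is a small NumPy version for a single vector of scores. It is an illustration of the formula, not the SciPy implementation, and softmax_1d is a name I made up to avoid clashing with scipy.special.softmax.

```python
import numpy as np


def softmax_1d(logits) -> np.ndarray:
    """Convert a 1-D vector of raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()
```

The largest raw score always receives the largest probability, so the predicted class does not change; softmax only rescales the scores so they can be read as probabilities.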
2. Initialize the RoBERTa Model
- Define the Model: The variable MODEL specifies the name of the pre-trained model to be loaded. In this case, it's "cardiffnlp/twitter-roberta-base-sentiment", which refers to a RoBERTa model fine-tuned on Twitter data for sentiment analysis.
- Load the Tokenizer: The AutoTokenizer.from_pretrained() method loads the tokenizer associated with the specified pre-trained model. The tokenizer is responsible for converting raw text input into a format suitable for the model to process. With AutoTokenizer, the appropriate tokenizer for the specified model is selected automatically.
- Load the Model: Similarly, the AutoModelForSequenceClassification.from_pretrained() method loads the pre-trained model for sequence classification. This model has been fine-tuned on sentiment analysis tasks and is capable of classifying the sentiment of a given text sequence into categories like positive, negative, or neutral.
By executing these lines of code, we have a pre-trained RoBERTa model and its tokenizer, ready for use in sentiment analysis. We will then input our data into the tokenizer, pass the tokenized input to the model for inference, and obtain predictions about the sentiment of the input text.
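The tokenize, infer, and convert flow can be sketched as a scoring function. This version takes the tokenizer and model as explicit arguments, which is my addition so the function can be exercised without downloading the weights. The output keys roberta_neg, roberta_neu, and roberta_pos are assumed names, and the negative/neutral/positive label order reflects my understanding of this model's card.

```python
import numpy as np


def polarity_scores_roberta(text, tokenizer, model):
    """Return RoBERTa sentiment probabilities for one piece of text.

    `tokenizer` and `model` are expected to behave like the objects returned
    by AutoTokenizer.from_pretrained(MODEL) and
    AutoModelForSequenceClassification.from_pretrained(MODEL).
    """
    encoded = tokenizer(text, return_tensors="pt")  # PyTorch tensors
    output = model(**encoded)
    logits = output[0][0].detach().numpy()          # raw scores for the single input
    exps = np.exp(logits - logits.max())            # softmax, shifted for stability
    probs = exps / exps.sum()
    return {
        "roberta_neg": float(probs[0]),
        "roberta_neu": float(probs[1]),
        "roberta_pos": float(probs[2]),
    }
```

functools.partial(polarity_scores_roberta, tokenizer=tokenizer, model=model) gives a one-argument version that takes only the review text.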
3. Run the RoBERTa Model on the Data
This code iterates over each row of the DataFrame df, where each row represents a review. For each review, it performs sentiment analysis using both VADER and the pre-trained RoBERTa model.
Here's a breakdown of what it does:
Initialization:
- res = {}: Initializes an empty dictionary res to store the results of sentiment analysis for each review.
Iterating Over DataFrame:
- for i, row in tqdm(df.iterrows(), total=len(df)): Iterates over each row in the DataFrame df, with tqdm providing a progress bar to track the iteration.
Sentiment Analysis:
- text = row['Text']: Retrieves the review text from the current row.
- myid = row['Id']: Retrieves the unique identifier (ID) associated with the review.
- vader_result = sia.polarity_scores(text): Performs sentiment analysis using VADER on the review text, which returns a dictionary containing sentiment scores.
- vader_result_rename = {}: Initializes an empty dictionary to store the VADER scores with modified keys.
- for key, value in vader_result.items(): Iterates over the items (key-value pairs) in the VADER result dictionary.
- vader_result_rename[f"vader_{key}"] = value: Prefixes each VADER key with "vader_" and stores the value under the new key in vader_result_rename.
- roberta_result = polarity_scores_roberta(text): Calls the polarity_scores_roberta function to perform sentiment analysis using the RoBERTa model on the review text, which returns a dictionary containing sentiment probabilities.
- both = {**vader_result_rename, **roberta_result}: Combines the VADER and RoBERTa sentiment analysis results into a single dictionary named both.
- res[myid] = both: Adds the combined sentiment analysis results to the res dictionary, using the review ID as the key.
Exception Handling:
- except RuntimeError: Paired with a try block around the steps above, catches any runtime errors that occur during sentiment analysis.
- print(f'Broke for id {myid}'): Prints a message indicating which review ID caused the runtime error.
Overall, this code performs sentiment analysis on each review text in the DataFrame using both VADER and the RoBERTa model, and stores the results in a dictionary for further analysis.
It handles RuntimeError exceptions gracefully by printing a message with the failing review ID and moving on to the next review.
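Put together, the breakdown above corresponds roughly to the following sketch. It takes (id, text) pairs and the two scoring functions as arguments so it can be run with stand-ins; run_both and its parameters are illustrative names, not the notebook's.

```python
def run_both(rows, vader_fn, roberta_fn):
    """Score each (review_id, text) pair with both models.

    `vader_fn` and `roberta_fn` each map a text to a dict of scores.
    Reviews that raise RuntimeError are reported and skipped.
    """
    res = {}
    for myid, text in rows:
        try:
            # Prefix the VADER keys so they cannot collide with RoBERTa's.
            vader = {f"vader_{key}": value for key, value in vader_fn(text).items()}
            res[myid] = {**vader, **roberta_fn(text)}
        except RuntimeError:
            print(f"Broke for id {myid}")
    return res
```

With the objects from the earlier steps, a call might look like run_both(zip(df['Id'], df['Text']), sia.polarity_scores, polarity_scores_roberta). Only RuntimeError is caught, so any other exception would still stop the loop.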
4. Analysis Results
This shows that both models produce similar results, with a high number of positive reviews and very few negative reviews.
Create a data frame to plot the results
Data frame
Comparison and Evaluation
Plot a pair plot to compare results
This code generates a pair plot using Seaborn’s pairplot function to visualize the relationship between the sentiment scores predicted by VADER and the RoBERTa model, with respect to the review scores.
Pair Plot
Conclusion
Summary of Findings
- Both VADER and RoBERTa models showed similar results, with a high number of positive reviews and very few negative reviews.
- The sentiment analysis revealed that most of the reviews in the dataset were positive, indicating overall satisfaction with the products.
Limitations and future improvements
- Limited Dataset Size: The analysis was performed on a subset of the dataset (500 reviews). Utilizing the entire dataset could provide more comprehensive insights.
- Model Performance: While both VADER and RoBERTa models performed well, there is always room for improvement in model accuracy and generalization.
- Additional Features: Incorporating additional features such as reviewer demographics or product categories could enhance the analysis and provide deeper insights into customer sentiment.
View my code on GitHub, here.