Analyzing E-Commerce Customer Reviews with NLP

Yinyu Ji
Institute for Applied Computational Science
13 min read · Dec 29, 2020

By Junyi Guo, Yinyu Ji, Sam Woo, Cheng Zhan

This article was produced as part of the final project for Harvard’s AC295 Fall 2020 course. Thank you to Pavlos, Rashmi, and the teaching fellows for a great semester! A special thanks to our teaching fellow Shivas Jayaram for all of the support we received.

Image source: https://www.zdnet.com/article/amazon-dominates-holiday-shopping-plans-again/

Introduction

With the vast number of customer reviews generated on e-commerce marketplaces like Amazon, we are interested in generating insights that are useful to potential customers, to sellers on the platform, and to the e-commerce platform itself. Through such insights, customers can more easily identify useful reviews and products with popular attributes, while retailers can use the insights to improve their products and marketing strategies (Bing et al. 2016). Some of the problems we tackled using state-of-the-art natural language processing models and live data were: (1) Can we predict whether reviews are positive or negative, helpful or unhelpful, and trustworthy or fake? (2) What summaries, features, or topics can be extracted from customer reviews?

In this article, we will take you through how we answered these questions by fine-tuning BERT models on the Amazon reviews dataset to classify reviews along certain dimensions. We also leveraged BART to summarize reviews and LDA to generate keywords. Finally, we developed and locally deployed a web application that helps a user rapidly analyze customer reviews using these models.

Dataset & Preprocessing

For model training, we used “Amazon Product Data,” which contains Amazon product reviews for all departments from May 1996 to July 2014.* This dataset contains the following features:

  • reviewerID — ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin — ID of the product, e.g. 0000013714
  • reviewerName — name of the reviewer
  • helpful — helpfulness rating of the review, e.g. 2/3 (note: we created “pos_vote” and “total_vote” based on the raw data, which had both values in a list, and then we divided the two values to create our own “helpful” variable — see image below)
  • reviewText — text of the review
  • overall — star rating of the product
  • summary — summary of the review, written by the reviewer
  • unixReviewTime — time of the review (unix time)
  • reviewTime — time of the review (raw)

What the data looked like, including the variables which we created (pos_vote, total_vote, helpful, helpful_true, and sentiment).

We used a subset of the reviews in the 5-core “Electronics” product category (in the 5-core dataset, all users and items have at least five reviews), keeping only reviews that received at least five helpfulness votes so that we had some information about the helpfulness of each review. The main portion of the data that we used was the “reviewText” field, which we processed to serve as the input to our models.

Taking a look at our data, we noticed that there was quite a bit of skew in a few distributions. For example, the distribution of the number of reviews per product is right-skewed, while the distribution of mean ratings per product is left-skewed.

The mean number of helpfulness votes and the mean number of total votes per product were both right-skewed as well. Perhaps curiously, there seems to be only a very weak positive correlation between mean ratings and mean numbers of total votes, which we speculate is related to the popularity of highly rated products.

We believed that binary labels, though simplistic, would be extremely useful for our goal of simplifying the analysis of review text. Thus, for each modeling task, we created binary response variables based on the dataset (a short sketch of this step follows the list below).

  • Helpful vs. Not Helpful — generated based on the ratio of helpful votes to total votes on a review. A review with a ratio greater than or equal to 50% (i.e., at least half of the votes indicated that the review was helpful) was labelled “helpful.”
  • Positive vs. Negative Sentiment — generated based on each review’s star rating. A review with an overall rating greater than 3 stars was considered positive.
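To make this concrete, here is a minimal pandas sketch of how these labels can be derived from the raw data; it is an illustration rather than our exact preprocessing script, and df stands in for the loaded 5-core Electronics reviews.

```python
import pandas as pd

# df: the 5-core "Electronics" reviews loaded from the raw JSON.
# The raw "helpful" field is a [positive_votes, total_votes] pair.
df["pos_vote"] = df["helpful"].str[0]
df["total_vote"] = df["helpful"].str[1]
df = df[df["total_vote"] >= 5]                           # keep reviews with enough votes

df["helpful"] = df["pos_vote"] / df["total_vote"]        # helpfulness ratio
df["helpful_true"] = (df["helpful"] >= 0.5).astype(int)  # 1 = helpful, 0 = not helpful
df["sentiment"] = (df["overall"] > 3).astype(int)        # 1 = positive, 0 = negative
```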

Some of the other data pre-processing steps can be viewed in our code.

Models

We used BERT for a large part of the language tasks described later. BERT is a state-of-the-art NLP model that is widely used for language-related tasks, including Google Search. According to Devlin et al. (2019), it outperforms many earlier NLP models and achieves impressive performance across multiple NLP tasks. Moreover, it can be easily fine-tuned using pretrained models from Hugging Face. Thus, we picked BERT for our major language processing tasks.

We will not go into depth on how BERT works in this article, but some resources for learning more about the model include Chris McCormick’s BERT Research Series (a YouTube playlist covering word embeddings, attention, positional encodings, masked language models and fine-tuning) as well as Yannic Kilcher’s videos on transformers and BERT.

For other NLP tasks, we also utilized a few language models besides BERT, and we will discuss them later on.

Helpfulness Analysis of Reviews

For products with many reviews, it is often overwhelming for a user to choose which reviews to read and consider. On many platforms like Amazon, other users can vote on whether a review was helpful or not. However, if a review has not yet been seen by many people and has no votes, how can a potential customer determine whether it is useful? We believe that, while browsing products and going through reviews, users may want to know how helpful those reviews are and whether they provide valuable information. Thus, we proposed a language model that can predict whether a given review is helpful based solely on the review text. It would also be interesting to compare the purely text-driven predictions with the vote metrics from users to see whether they align.

We applied transfer learning to a pre-trained BERT classification model from Hugging Face. The model was fine-tuned to predict whether a given review is helpful based on the review text. Since the dataset is very imbalanced in the number of “helpful” versus “unhelpful” reviews, we decided to downsample to achieve better model performance. We applied the pre-trained BERT-specific tokenizer to the review text and trained the model for five epochs.
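As a rough illustration of this step (not our exact training script), here is a minimal fine-tuning sketch using the Hugging Face transformers library with PyTorch; train_texts and train_labels stand in for the downsampled review texts and helpfulness labels, and the hyperparameters are indicative rather than the exact ones we used.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

# train_texts: list of review strings; train_labels: 1 = helpful, 0 = not helpful
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(train_texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(5):  # five epochs, as described above
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()  # cross-entropy loss returned by the model
        optimizer.step()
```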

After training, our model reached a 75% overall accuracy on the validation data. However, based on the loss and accuracy plots, we find that the model begins to overfit as the validation accuracy plateaus over the epochs and starts to decrease slightly. This is not surprising as we speculate that the pre-trained BERT model likely did not “learn” much from our comparatively small reviews training data.

Sentiment Analysis of Reviews

A classic problem in NLP, sentiment analysis has been explored in many domains using many techniques. As part of our “suite” of functionalities, we also wanted to provide insight into the sentiment of a given review, in the spirit of breaking a potentially lengthy review down into something a user can easily digest.

We fine-tuned another pre-trained BERT classification model from Hugging Face. This time, the response variable was the class label (positive vs. negative) generated from the star ratings of the training data. We downsampled to balance the data, which made the model more useful and less likely to overfit. We applied the pretrained BERT-specific tokenizer to the pre-processed text data and trained the model for five epochs, following the same procedure as above. After training, our model reached an overall 75% accuracy on the validation data. As with our other fine-tuned BERT models, we find that the model begins to overfit slightly.

Generating and Detecting Fake Reviews

One of the most challenging problems which customers face online is the proliferation of fake reviews. Research has been done on detecting fake reviews based on review content, reviewer behavior, and product features. Much of this research is based on the Yelp reviews dataset by Rayana and Akoglu, and the fake reviews in this dataset are likely written by people paid to write them. However, fake reviews and spam are taking a new form as AI algorithms become better and more efficient at text generation. We took a unique approach to this problem by first using a state-of-the-art text generation model to augment our own dataset, which we then used to fine-tune a BERT classification model.

First, we leveraged OpenAI’s GPT-2 model to generate fake reviews. GPT-2 was pre-trained on a large corpus of text and only needed to be fine-tuned to our specific domain: electronic product reviews. The model itself is designed to predict the next word given the previous words, so human-like text can be generated by applying it repeatedly for an arbitrary number of steps. OpenAI initially withheld the full trained model due to concerns about malicious use, but released smaller versions for research purposes, and we used one of these for our training. Although pretrained GPT-2 checkpoints are also available through the Hugging Face API, we used the gpt-2-simple library, which wraps the model and makes fine-tuning much easier. This helpful GitHub repository provides a Google Colab notebook that can be used to train on custom .txt data files. Based on this notebook, we decided to train for 1000 steps to create our “generator model.”
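For reference, a minimal fine-tuning sketch with gpt-2-simple might look like the following; the dataset filename and run name are placeholders of our own choosing rather than names from the original notebook.

```python
import gpt_2_simple as gpt2

# Download the small publicly released GPT-2 checkpoint (124M parameters)
gpt2.download_gpt2(model_name="124M")

# "electronics_reviews.txt" is a placeholder: one review text per line
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="electronics_reviews.txt",
              model_name="124M",
              steps=1000,                   # the number of fine-tuning steps we used
              run_name="review_generator")  # placeholder run name
```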

The next step was to generate the dataset that the discriminator would train on. Because generation takes a considerable amount of time, we limited ourselves to creating 10,000 fake reviews. We then randomly sampled 10,000 real reviews from our dataset; although the original dataset contained many more, we wanted to keep the data balanced. Furthermore, we wanted the fake reviews to be as similar as possible to the original dataset in terms of length. Thus, we looked at the distribution of review lengths in the original dataset and sampled target lengths so that our 10,000 fake reviews would be similarly distributed.
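A sketch of this generation and dataset-assembly step is shown below; df and the run name are placeholders carried over from the earlier sketches, and using GPT-2’s length parameter (measured in tokens) as a proxy for word count is an approximation.

```python
import numpy as np
import pandas as pd
import gpt_2_simple as gpt2

# Empirical word-count distribution of the real reviews
word_counts = df["reviewText"].str.split().str.len()

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="review_generator")

fake_texts = []
for _ in range(10_000):
    target = int(np.random.choice(word_counts))  # draw a length to mimic
    fake_texts += gpt2.generate(sess, run_name="review_generator",
                                length=target, temperature=0.7,
                                return_as_list=True)

# Balanced training set for the discriminator: 0 = real, 1 = fake
real = df.sample(n=10_000, random_state=0)[["reviewText"]].assign(label=0)
fake = pd.DataFrame({"reviewText": fake_texts, "label": 1})
detector_df = pd.concat([real, fake], ignore_index=True).sample(frac=1, random_state=0)
```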

With the dataset created, we used BERT to perform the two-class (real vs. fake) classification of reviews. The Hugging Face API allowed us to directly fine-tune the model for our task, and we trained it for about three epochs. This was the model that we saved to be served in our web application: in practice, it takes in a review (either from the user themselves or from an Amazon product page) and outputs a prediction of whether that review is real or fake. The model reached 93% accuracy during the training process.
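As an illustration of how the saved classifier can be used at prediction time, here is a minimal inference sketch; the model directory name is a placeholder for wherever the fine-tuned weights were saved with save_pretrained().

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# "fake-review-detector" is a placeholder path to the saved fine-tuned model
tokenizer = BertTokenizer.from_pretrained("fake-review-detector")
model = BertForSequenceClassification.from_pretrained("fake-review-detector")
model.eval()

def predict_fake(review_text: str) -> str:
    enc = tokenizer(review_text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return "fake" if logits.argmax(dim=-1).item() == 1 else "real"

print(predict_fake("This TV changed my life, best purchase ever, five stars!!!"))
```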

Summarization

Going through product reviews before placing an order is a common practice for customers these days, and as e-commerce platforms accumulate more data, it has become increasingly time-consuming for shoppers to sift through reviews for relevant information. Is it possible to leverage AI to transform long reviews into a more concise and to-the-point summary, and even aggregate hundreds of reviews into a few topics and keywords to help consumers reach a buy or no-buy decision?

In addition to providing easily digestible classifications of reviews, we sought to provide users of our web application with useful summaries of product reviews. Here we utilized the open-source Transformers library from Hugging Face, using a pre-trained BART model in Transformers 4.0 to summarize long reviews into shorter, more user-friendly snippets. The outputs of this pre-trained model look reasonable to us, though it could be interesting to use transfer learning to further fine-tune it. Here is a brief summary of the BART model; a short usage sketch follows the list.

  • The architecture of BART uses a standard seq2seq/machine translation setup with a bidirectional encoder (like BERT) and a left-to-right autoregressive decoder (like GPT).
  • The pretraining task involves sentence permutation and text infilling, in which spans of text are replaced with a single mask token, an idea inspired by SpanBERT.
  • BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks.
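As referenced above, here is a minimal summarization sketch with the Transformers pipeline API; the facebook/bart-large-cnn checkpoint is an assumption on our part, chosen because it is a commonly used BART summarization model.

```python
from transformers import pipeline

# Pre-trained BART summarizer; the checkpoint name is an assumption
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_review = "I bought this monitor three months ago and ..."  # placeholder review text
summary = summarizer(long_review, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```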

We also wanted to provide keywords from the reviews of a particular product, and decided to use LDA (Latent Dirichlet Allocation) for topic modeling. The key assumption LDA makes is that each document is generated by first picking a set of topics and then, for each topic, picking a set of words. When we apply LDA to aggregated reviews, the number of selected reviews determines the appropriate number of topics; for simplicity, we set the default number of topics to 1 in the demo (a short sketch follows the examples below). Here are a few more details about LDA.

  • A generative probabilistic model for collections of discrete data such as text corpora.
  • A three-level hierarchical Bayesian model, and each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
  • In the context of text modeling, the topic probabilities provide an explicit representation of a document.

Example input and output of BART summarizer model

Example of LDA output
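Here is the minimal LDA sketch referenced above, using gensim; the reviews variable is a placeholder for the review texts collected for one product, and a single topic is extracted to match the demo default.

```python
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

# reviews: placeholder list of review strings for one product
tokens = [[w for w in simple_preprocess(r) if w not in STOPWORDS] for r in reviews]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]

# One topic by default, as in our demo
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=1, passes=10, random_state=0)
for topic_id, keywords in lda.print_topics(num_words=8):
    print(topic_id, keywords)
```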

Live Data Functionality

With our models created, we wanted to incorporate live data so that users could search for any product through our web application and get information about it. This was done with Selenium. The flow, with a code sketch following the list, went as follows:

  • The web application starts a web driver
  • User inputs a product search term (e.g., “tv”)
  • The web driver searches that term on the Amazon webpage
  • The link to the top product of the results page is scraped and the driver navigates to that product page
  • The top ~8 most voted reviews provided by Amazon are collected and then returned to the application
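Here is the sketch of this flow referenced above; the element locators are illustrative assumptions, since Amazon’s page structure changes frequently, and the real application may use different selectors and error handling.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # the web application starts a web driver
driver.get("https://www.amazon.com")

# Search for the user-supplied term; the element id is an assumption
search_box = driver.find_element(By.ID, "twotabsearchtextbox")
search_box.send_keys("tv")
search_box.submit()

# Navigate to the top product on the results page
first_result = driver.find_element(By.CSS_SELECTOR, "div.s-result-item h2 a")
driver.get(first_result.get_attribute("href"))

# Collect the top ~8 review texts shown on the product page (selector is an assumption)
review_elements = driver.find_elements(By.CSS_SELECTOR, "span[data-hook='review-body']")
reviews = [el.text for el in review_elements[:8]]

driver.quit()
```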

These reviews were then used as inputs to our BART and LDA models, allowing us to generate summaries and keywords. Though we did not implement this, these reviews could also be fed into our helpfulness, sentiment, and fake/real classification models.

Deployment

Our web application brings together the four functionalities described earlier. It allows the user to input either a review text or the name of a product on Amazon. If the user inputs a review text, our trained models predict its helpfulness, its sentiment, and its authenticity. If the user inputs a product name, our models summarize the top reviews collected from the Amazon page for that product.

For the users’ convenience, we deployed the web application using Docker containers, so that users can work directly in containers that provide virtual environments with all of the packages needed to run the application. We created two Docker containers in total: one for the front-end page and one for the database.

Deployment required several steps. First, the Docker images for both the front end and the database are built from their Dockerfiles. Next, a Docker network is created to link the two containers. Finally, the containers are run and the web application is ready for users to explore. We tested the deployment on a local host, and it would be straightforward to deploy the application on GCP, AWS, or Azure.

* There is a more recent 2018 version of the Amazon reviews dataset, but we decided not to use it because its “helpfulness” field only includes positive votes and not total votes.

References

https://github.com/InsiderPants/AmazonReview-Sentiment-Analysis

https://github.com/jeremygrace/amazon-reviews

https://colab.research.google.com/drive/1Zv6MARGQcrBbLHyjPVVMZVnRWsRnVMpV (parse and clean the data)

https://colab.research.google.com/drive/12r4KJVbNqjjhiZ6aeiaG809x4-Tg5fm8?usp=sharing (find target products and obtain their reviews)

https://github.com/pytorch/fairseq/tree/master/examples/bart

Doll, Tyler. LDA Topic Modeling: An Explanation. Towards Data Science.

Bing, Lidong, Wong, Tak-Lam, & Lam, Wai. (2016). Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews. ACM Transactions on Internet Technology (TOIT), 16(2), 1–17.

Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, & Toutanova, Kristina. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019.

Ghosh, Swarup Kr, Dey, Sowvik, & Ghosh, Anupam. (2019). Knowledge Generation Using Sentiment Classification Involving Machine Learning on E-Commerce. International Journal of Business Analytics (IJBAN), 6(2), 74–90.

He, Ruining, & McAuley, Julian. (2016). Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering.

Lewis, Mike, Liu, Yinhan, Goyal, Naman, Ghazvininejad, Marjan, Mohamed, Abdelrahman, Levy, Omer, . . . Zettlemoyer, Luke. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.

Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, & Sutskever, Ilya. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.

Shrestha, N., & Nasoz, F. (2019). Deep Learning Sentiment Analysis of Amazon.com Reviews and Ratings. arXiv preprint arXiv:1904.04096.

Blei, David, Ng, Andrew, & Jordan, Michael. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
