Amazon Review Rating Prediction with NLP

Anqing Chen, Jonathan Walsh, Matt MacDonald, Nicholas Chu, Ryed Ahmed, and Saaketh Rao

Data Science Lab, Spring 2021 · May 7, 2021

--

Introduction

The following article describes the application of a range of supervised and unsupervised machine learning models to a dataset of Amazon product reviews in an effort to predict rating value. The desired output for an input text review is a “star” rating on a continuum from 1 to 5. Because we wish to produce a continuum of sentiment rather than just polarity, this project can be categorized as a regression variant of sentiment analysis. Our goal is to produce the most versatile and accurate model that handles the widest range of mixed and polarized sentiments expressed in reviews. On top of trying multiple model types, we analyzed the impact of seven different embeddings, each of which had different built-in considerations for information like word index, directional significance, context, and frequency. These embeddings ranged from pre-trained word embedding models like BERT and Word2Vec, to embedding schemes computed from scratch, to simpler encoding options like Bag of Words and TF-IDF. We trained supervised boosting models from LightGBM and CatBoost as well as three different deep learning networks, which proved more successful.

The article will first describe the curation of the dataset utilized to train and test the models, describing all review selection methodologies, pre-processing steps, and feature engineering. We will then dive into some of the differences behind different embedding techniques used, their context for our project, and their impact on model score. Then, the next section will describe the technical meat of the article — each attempted model and its corresponding score/lessons learned. There are also some unique quirks of this project that result in some unavoidable shortcomings, which will be discussed afterward. The article will then conclude with overall results, major findings, and recommendations for improvement and future work.

Potential applications include auto-generated suggestions for rating sentiment, falsified review detection, or other use cases where regression-type sentiment analysis is useful.

Dataset

We obtained our data from the Amazon Customer Reviews Dataset provided by Amazon Public Datasets, which contains official reviews from shoppers at Amazon.com. Here are the columns of the dataset along with a brief description of each:

  • marketplace — two letter country code of the marketplace where the review was written
  • customer_id — random identifier that can be used to aggregate reviews written by a single author
  • review_id — the unique ID of the review
  • product_id — the unique Product ID the review pertains to
  • product_parent — random identifier that can be used to aggregate reviews for the same product
  • product_title — title of the product
  • product_category — Broad product category that can be used to group reviews
  • star_rating — the review rating on a one to five star scale
  • helpful_votes — number of helpful votes
  • total_votes — number of total votes the review received
  • vine — whether the review was written as part of the Vine program
  • verified_purchase — the review is on a verified purchase
  • review_headline — the title of the review
  • review_body — the review text
  • review_date — the date the review was written

We decided to focus on the electronics category dataset of reviews because we thought that reviews for these products would contain more objective evaluations based on concrete characteristics compared to more subjective product categories such as books.

Feature Selection

In terms of sentiment analysis of reviews, the dataset provides some extraneous features that we do not believe are needed for training a proper model. We decided to only consider reviews written by verified purchasers to decrease the risk of fraudulent reviews with dubious ratings. Only star_rating, review_headline, and review_body columns were considered to reduce feature complexity.

Both review_headline and review_body are free-text features, and we consider them interchangeable for sentiment purposes. As a result, review_headline and review_body were concatenated, delimited by a space, to further reduce feature complexity.

Train/Test Split

We chose to consider only a small subset of the dataset, as the entire electronics dataset contained more than three million rows of reviews. We performed an 80:20 train/test split across 50,000 uniformly sampled reviews based on rating, ensuring a stratified split to preserve a balanced dataset. We utilized the same resulting train/test split across all training and evaluations to ensure a consistent comparison of accuracy and loss among different models.

Data Preprocessing

Text preprocessing is an important step for any natural language processing task. It transforms text from its raw format into something that is more processable for computer algorithms. The goal of our preprocessing is to achieve normalization and noise removal from the dataset before we experiment with different embeddings and models.

Normalization

We first set out to normalize the dataset by converting all characters to lowercase. This levels the playing field, since lexically a capitalized word should not carry a different weight than its lowercase form. We also convert all whitespace and punctuation into single spaces to remove formatting inconsistencies.
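A minimal sketch of this normalization step might look like the following (illustrative, not our exact code); apostrophes are kept here so contractions can still be expanded in a later step:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace/punctuation into single spaces."""
    text = text.lower()
    # Replace punctuation (except apostrophes, needed later for contractions) with a space
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse any run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Great  product!!!   Works FINE,\tdoesn't overheat."))
# great product works fine doesn't overheat
```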

Noise Removal: Removing HTML Tags

The reviews appear to have been web-scraped, so they contained many HTML tags and entities such as <br /> or &nbsp;. These exist only for HTML formatting and are useless for our NLP task, so we remove them using BeautifulSoup’s HTML parser to recover plain text.
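A minimal sketch of this step with BeautifulSoup (assuming the beautifulsoup4 package and its built-in "html.parser"):

```python
from bs4 import BeautifulSoup

def strip_html(text: str) -> str:
    # get_text() drops tags like <br /> and decodes entities like &nbsp;
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")

print(strip_html("Great camera.<br />Battery life is&nbsp;excellent."))
# Great camera. Battery life is excellent.
```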

Noise Removal: Expanding Contractions

Contractions are shortened forms created by combining two words and replacing the omitted letters with an apostrophe.

Example Contractions

Contractions add noise to our dataset: there is no lexical difference between a contraction and its expanded form, so they should be made uniform across the dataset. We used regex to write a de-contract method that finds the apostrophe-letter patterns and replaces them with full words. We observed that replacing “n’t” with “ not” is not viable in all cases: it works for “isn’t” ⇒ “is not” but breaks “can’t” ⇒ “ca not”. We created special cases for these situations, as shown in the sketch below.
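A minimal sketch of such a de-contract method (the list of special cases and patterns here is illustrative, not exhaustive):

```python
import re

# Contractions that a naive "n't" -> " not" replacement would break
SPECIAL_CASES = {"can't": "cannot", "won't": "will not", "shan't": "shall not"}

def decontract(text: str) -> str:
    for contraction, expansion in SPECIAL_CASES.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"n't\b", " not", text)   # isn't -> is not, doesn't -> does not
    text = re.sub(r"'re\b", " are", text)
    text = re.sub(r"'ll\b", " will", text)
    text = re.sub(r"'ve\b", " have", text)
    text = re.sub(r"'m\b", " am", text)
    return text

print(decontract("it isn't great and i can't return it"))
# it is not great and i cannot return it
```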

Noise Removal: Removing Stop Words

Stop words are common words that structure a sentence. Words such as “I”, “are”, and “here” do not contribute to the sentiment (the rating, in our case) of reviews, so we decided to remove stop words to further denoise the input. We used NLTK’s stopwords package to provide the list of stop words. We made one adjustment: we avoided removing certain negation stop words, namely “not” and “no”, since they do influence sentence meaning. A product that is “not good” is certainly different from a “good” product.
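A minimal sketch of the stop word removal, keeping “not” and “no”:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english")) - {"not", "no"}

def remove_stopwords(text: str) -> str:
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stopwords("this is not a good product"))
# not good product
```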

Other Considerations: Lemmatization

Lemmatization is the process of converting a word to its base dictionary form; depending on the package used, results can vary. Porter’s algorithm is an empirically effective way to reduce words to their stems, following an intricate set of sequential replacement rules that produce consistent results, and we initially used the Porter Stemmer from NLTK. However, after some research we found that the pre-computed embeddings appear to be calculated without stemming, so we decided against stemming in preprocessing as well, though some of the models described below did experiment with a stemming step.

Other Considerations: Stemming

Stemming is a more crude form of lemmatization. It’s a lot faster than lemmatization since it basically just chops off the ends of words without too much intricate consideration. This could provide some inaccurate results (e.g., stemming “caring” to “car”). We decided against using this method.

Other Considerations: Treating Numbers

We considered converting digits into their English-word equivalents for the sake of consistency. However, there are some intricacies that would need further study: we were not sure whether a number written in digits carries the same meaning as its spelled-out form, and multi-digit numbers are awkward to convert. We decided to keep the original format instead.

Transforming the Text into Vectors

Before talking about what models we employed, let’s first talk about the idea of embeddings. When attempting to train a machine learning model on text, several steps are required to transform the text into the format necessary for training. Although describing these steps in detail is outside the scope of this article, the words must first be tokenized and then encoded into vectors of numbers. Some algorithms for accomplishing this transformation are simple, like Bag of Words which cares only about the presence of a word in the document and its frequency — nothing about order or context. Other embedding schemes are massively pre-computed models that create numerical significance in the output vector space for information like definition similarity (e.g. “bad” and “awful”), direction, and context. Schemes in this category, like Word2Vec, GloVe, and BERT, all have the high-level goal of optimizing a given model which creates the text embeddings.

We found little significant difference in model performance between the simpler encoding schemes like Bag of Words and TF-IDF. However, the pre-trained word-based embeddings performed the best, specifically BERT, a pre-trained NLP model from Google that we still had to fine-tune on our corpus via stochastic gradient descent.

Bag-of-Words (BOW)

Bag-of-Words is perhaps the simplest algorithm for encoding a document or paragraph of text. It simply takes a tokenized set of feature words and counts the number of occurrences of each word. As more words are found in the train set, new feature columns are added, and each additional sample will create a sparse vector with values in each column that corresponds to the number of occurrences of that word in the sample. Drawbacks include the destruction of crucial information like context and direction, massive sparse matrices that increase resource consumption, and the potential to encounter new words in the test set. You can easily create a Bag-of-Words representation using the CountVectorizer import from sklearn’s feature_extraction library.
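A minimal sketch with CountVectorizer on a toy corpus (the example reviews are made up; get_feature_names_out requires scikit-learn ≥ 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["great camera great battery", "terrible battery"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)        # sparse matrix of word counts

print(vectorizer.get_feature_names_out())    # ['battery' 'camera' 'great' 'terrible']
print(X.toarray())                           # [[1 1 2 0]
                                             #  [1 0 0 1]]
```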

Term Frequency-Inverse Document Frequency (TF-IDF)

This approach to encoding is also fairly simplistic but carries more information than just the number of occurrences, like Bag-of-Words. TF-IDF is a numerical statistic that also reflects how important a word is to a document or paragraph. The scheme is split into two calculations, TF and IDF.

TF, or Term Frequency, is the number of times a term, t, appears divided by the number of terms in the document/paragraph, d.

Equation for Term Frequency: TF(t, d) = (number of times term t appears in d) / (total number of terms in d)

IDF, or Inverse Document Frequency, is a measure of how important the term is in the corpus. It’s calculated by taking the log of the number of documents divided by the number of documents with the term t in it.

Equation for Inverse Document Frequency: IDF(t) = log(total number of documents / number of documents containing term t)

The TF-IDF encoding scheme outputs TF × IDF for each term in each review. It generally performs better than Bag-of-Words, but for our project, it didn’t yield great results as it still destroys too much information.
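A minimal sketch with sklearn’s TfidfVectorizer on the same toy corpus; note that sklearn adds smoothing and L2 normalization on top of the basic TF × IDF formula described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["great camera great battery", "terrible battery"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Words appearing in every review (e.g. "battery") receive a low IDF weight,
# while review-specific words ("great", "terrible") are weighted higher.
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))
```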

Word Embedding From Scratch

We used the Embedding layer from keras to train a word embedding customized to our vocabulary from scratch. The layer takes integer-encoded input data, which we obtain using the tokenizer provided in keras’s preprocessing package.
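A minimal sketch of this setup (the vocabulary size, sequence length, and embedding dimension are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

texts = ["not good", "great product works fine"]   # toy corpus

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)

# Trainable 100-dimensional embedding, learned jointly with the downstream model
embedding_layer = Embedding(input_dim=20000, output_dim=100, input_length=100)
```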

Below, we show a comparison between a 100-dimension embedding trained from scratch versus the GloVe embedding obtained from the 6 billion corpus with 100 dimensions. For presentation, they were reduced to 2D using Principal Component Analysis (PCA). Interestingly, they look very different.

A visualization of the embedding trained using a bidirectional LSTM model (left) compared to GloVe (right)

Word2Vec

Word2Vec is a technique that utilizes neural networks to learn word associations from large datasets, converting input text into a series of vectors. Word2Vec offers two different methodologies to train and learn word embeddings: continuous bag-of-words and continuous skip-gram.

The CBOW architecture predicts the target word based on a sliding window of other surrounding context words. CBOW differs from vanilla BOW in that the words before the target word are treated as a separate bag from words after the current word. Similar to BOW, the order of words in each bag is not considered. The window size defines the number of context words to be considered before and after the target word.

The continuous skip-gram architecture acts as an inverse to CBOW. Instead of predicting the target word from a window of context words, a continuous skip-gram predicts a window of context words based on the target word. Given the target word, Word2Vec aims to maximize the probability of predicting context words.

In general, CBOW performs better on large datasets whereas continuous skip-gram performs better on small datasets with infrequent words. Visit this documentation for more details about Word2Vec.
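A minimal sketch of both variants with gensim (gensim ≥ 4 uses vector_size; older versions use size); sg=0 selects CBOW and sg=1 selects continuous skip-gram:

```python
from gensim.models import Word2Vec

sentences = [["not", "good"], ["great", "product", "works", "fine"]]  # tokenized toy corpus

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["good"].shape)                     # (100,)
print(skipgram.wv.most_similar("good", topn=2))  # nearest neighbours in the toy space
```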

Doc2Vec

Doc2Vec is essentially a generalization of Word2Vec that generates feature representation vectors for entire documents rather than for individual words. This embedding would work better if our project goals revolved around comparing sentences within an individual review or finding similar overall reviews; however, we care more about individual word semantics and their effect on the overall rating. Words only carry so much meaning, and sometimes sentence representations are needed to garner more semantic information, but not in our case.

GloVe

GloVe is a word embedding technique that relies not only on local statistics but also on global word statistics (word co-occurrence). GloVe assumes that relationships between words can be derived from a co-occurrence matrix.

Co-occurrence matrix for “roses are red violets are blue” (note the symmetry)

GloVe uses a weighted least squares objective to learn word vectors that minimize the difference between their dot product and the logarithm of their number of co-occurrences. In other words, one can predict the probability of co-occurrence of two words by taking the dot product of their GloVe vectors.

Probabilities from the 6 billion word corpus. Source: GloVe: Global Vectors for Word Representation

In our experiment, we obtained pre-trained word vectors from Stanford. We chose the Wikipedia plus Gigaword 5 version with 6 billion tokens and a 400k vocabulary, specifically the 300-dimensional vectors. We felt this was enough for a successful model without taking up too much computation time or disk space.
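A minimal sketch of loading the pre-trained vectors into a dictionary (assumes the glove.6B.300d.txt file has been downloaded from the Stanford NLP site):

```python
import numpy as np

embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = vector

print(len(embeddings_index))           # ~400,000 words
print(embeddings_index["good"].shape)  # (300,)
```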

fastText

fastText is another word embedding and text classification library created by Facebook that is an extension of the Word2Vec model. Instead of learning vectors for words directly, fastText represents each word as an n-gram of characters. So, for example, take the word, “artificial” with n=3, the fastText representation of this word is <ar, art, rti, tif, ifi, fic, ici, ial, al>, where the angular brackets indicate the beginning and end of the word. This is generally helpful to understand the importance of things like suffixes and rarer words. It provides a representation for words that are not in the train model dictionary, which is something that Word2Vec and GloVe do not do. It can break down made up words like stupedofantabulouslyfantastic into its n-grams, creating a vector for different partitions of the word that might result in the representation for that word looking close to fantastic or fantabulous. Word2Vec would return either a 0 vector or a random vector with low magnitude.
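A minimal sketch of this subword behaviour using gensim’s FastText implementation (gensim ≥ 4; parameters are illustrative): even an out-of-vocabulary word gets a vector assembled from its character n-grams.

```python
from gensim.models import FastText

sentences = [["the", "product", "was", "fantastic"], ["truly", "fabulous", "quality"]]

model = FastText(sentences, vector_size=100, window=3, min_count=1, min_n=3, max_n=6)

# "fantastical" never appears in the corpus, but its n-grams overlap with "fantastic",
# so fastText can still produce a meaningful vector for it.
oov_vector = model.wv["fantastical"]
print(oov_vector.shape)  # (100,)
```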

BERT

BERT stands for Bidirectional Encoder Representations from Transformers and is a beefy NLP embedding model developed by Google, trained on a massive corpus including all of Wikipedia and the Book Corpus. It’s the current cutting-edge model for a lot of modern NLP applications, such as the Google search engine. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. You can train BERT on your own, but that takes days even on very powerful hardware. We used a pre-trained BERT model, but we still had to fine-tune it on our review corpus, which took several hours for even just 3 epochs. This two-step process, pre-training an NLP model on a massive corpus and then fine-tuning it for specific NLP tasks through supervised training, is the future of NLP and the main advantage BERT offers.

Models

We tried a variety of models. We first tried a few classification models, although these quickly proved to be vastly inferior to regression models given the nature of the project. Our deep learning models employed a variety of the aforementioned embeddings, several of which are themselves trained in an unsupervised fashion.

Classification Models

Although this project started off with the intent to produce a regression-based model, we wanted to experiment with the idea of performing classification-based sentiment analysis. Namely, instead of outputting a rating value on a continuum from 1 to 5, we instead would classify a review as either 1, 2, 3, 4, or 5 stars.

Classification Model: BERT

The first classification model we tried was a neural network utilizing a BERT embedding scheme and a binning of sentiment (1- and 2-star reviews were negative, 3-star reviews were neutral, and 4- and 5-star reviews were positive), as sketched below. The model was trained with the AdamW optimizer, a variant of stochastic gradient descent. The evaluation metric was accuracy, a simple calculation of the percentage of correctly predicted labels. However, even with sentiment binning, test accuracy never rose higher than 83%, although train accuracy approached 100%.
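A minimal sketch of the binning, which simply maps the five star ratings onto three sentiment classes:

```python
def bin_sentiment(star_rating: int) -> str:
    """Map a 1-5 star rating onto negative / neutral / positive."""
    if star_rating <= 2:
        return "negative"
    if star_rating == 3:
        return "neutral"
    return "positive"

print([bin_sentiment(s) for s in [1, 2, 3, 4, 5]])
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```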

Model performance metrics of BERT classification model

As shown by the following heat map, the most common difficulty was predicting the neutral sentiment. This makes sense: a 3-star rating sits immediately next to the 2- and 4-star ratings that border the negative and positive bins, whereas a 5-star (or 1-star) rating, which makes up half of its bin, is much further from a 3.

Heat map of BERT classification model on test set

Classification Model: fastText

We also tried fastText as a classification model; its character n-gram representation is described in the embeddings section above.

Classification Model: CatBoost

For CatBoost, we did no sentiment binning and simply evaluated how the CatBoost Classifier did on a multiclass problem where the classes were the values 1 through 5. Although we conducted many feature engineering steps, our test accuracy never rose above 55%. Random labeling would yield 20% accuracy, so this model at least provided a decent degree of predictive capability. However, the accuracy metric starts to break down when you realize that a 5-star review misclassified as 1 star is penalized exactly as much as a 5-star review misclassified as 4 stars.

Furthermore, as shown below, the model had trouble even predicting a neutral review (3 stars): it misclassified such reviews as 1, 2, 4, or 5 stars far more often than it classified them correctly, not to mention the abysmal misclassification rate for the most negative (1-star) reviews.

CatBoost Histograms for 3 Star Predictions and 1 Star Predictions

Additionally, we also realized that predicting the rating of a review is inherently a regression problem as reviews are on a continuous linear scale so it does not make sense to force it into a classification problem.

This realization was the final nail in the coffin for classification models, and we pivoted to regression-based ones.

Regression Models

Classification-based models had some severe shortcomings in the context of this project. We need to predict a continuum of review ratings from 1 to 5, and the presence of mixed sentiments, slight polarization, and loss function difficulties made the classification models struggle, even when the problem was made less of a multi-class problem with sentiment binning.

Regression makes more intuitive sense for the star rating aspect of the project, which is essentially a numerical sentiment strength. With regression, we used Root-Mean-Square Error (RMSE) as our loss metric, which tells us on average how far our predictions are from the actual ratings. This approach also makes more intuitive sense because we might sometimes want to rate a review as 3.5 stars or any other value between integers. For these regression models, we normalized the labels from the 1-to-5 range to the 0-to-1 range by dividing all ratings by five, so a label of 0.2 equals 1 star, 0.4 equals 2 stars, and so on (see the short sketch below). An RMSE value of 0.1 therefore means our predictions are, on average, half a star away from the actual value. As we discuss in the shortcomings section, it is virtually impossible to get a test RMSE close to 0 given the nature of the problem.
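A short sketch of the normalization and the RMSE-to-stars conversion:

```python
import numpy as np

star_ratings = np.array([1, 2, 3, 4, 5])
labels = star_ratings / 5.0              # [0.2, 0.4, 0.6, 0.8, 1.0]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# An RMSE of 0.1 on the normalized scale corresponds to 0.1 * 5 = 0.5 stars
print(rmse([0.2, 1.0], [0.3, 0.9]) * 5)  # ~0.5 stars off, on average
```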

Regression Model: LightGBM

To get a baseline for regression RMSE, we encoded the review text with TF-IDF and fit an untuned LightGBM regression model. The RMSE value on the test set was 0.178, i.e. an average of 0.89 stars away from the actual review value. The distribution of those predictions is in the graph below, and you can see a relatively high amount of overlap between the predictions. Still, not a bad start. A sketch of this baseline appears after the figure.

Predicted rating distributions of LightGBM model on test set
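A minimal sketch of this baseline (untuned, default parameters; train_reviews, test_reviews, y_train, and y_test are assumed to come from the split described earlier):

```python
import numpy as np
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error

vectorizer = TfidfVectorizer(max_features=20000)     # feature cap is illustrative
X_train = vectorizer.fit_transform(train_reviews)    # preprocessed review text
X_test = vectorizer.transform(test_reviews)

model = lgb.LGBMRegressor()                          # default hyper-parameters
model.fit(X_train, y_train)                          # labels already scaled to [0, 1]

preds = model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, preds)))    # test RMSE
```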

Regression Model: CatBoost Regressor

We were curious how well a Bag of Words-style model with heavier feature engineering would predict the rating of a review, so we paired one with a CatBoost regressor. First, we used NLTK’s Sentiment Intensity Analyzer to calculate sentiment scores (positive, negative, neutral, and an overall compound score) for each review and added them to the train and test sets. We also added simple features such as the number of characters and the number of words in each review. Next, we created a vector representation of each review text using the Gensim module’s Doc2Vec embedding model and added those vectors to the datasets. Lastly, we ran a TfidfVectorizer over every word in every document to calculate the relative importance of each word, keeping only columns for words that appear in at least 10 reviews so as to prevent an explosion of features that would make training time-consuming. With all of these feature engineering steps completed (a sketch appears below), we trained a CatBoost regression model on the augmented training set for 100 iterations, which gave us an RMSE of about 0.17 on the test set, comparable to the RMSE of one of our better-performing models.
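A minimal sketch of this pipeline (column names and data frames are illustrative; the Doc2Vec vectors are omitted here for brevity):

```python
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix
from catboost import CatBoostRegressor

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    # Sentiment scores: neg / neu / pos / compound for each review
    scores = df["review_text"].apply(sia.polarity_scores).apply(pd.Series)
    df = pd.concat([df, scores], axis=1)
    df["n_chars"] = df["review_text"].str.len()
    df["n_words"] = df["review_text"].str.split().str.len()
    return df

train_df, test_df = engineer(train_df), engineer(test_df)

# Keep only words that appear in at least 10 reviews to limit the feature count
tfidf = TfidfVectorizer(min_df=10)
X_train = hstack([
    tfidf.fit_transform(train_df["review_text"]),
    csr_matrix(train_df[["neg", "neu", "pos", "compound", "n_chars", "n_words"]].values),
]).tocsr()

model = CatBoostRegressor(iterations=100, loss_function="RMSE", verbose=False)
model.fit(X_train, train_df["star_rating"] / 5.0)    # labels normalized to [0, 1]
```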

Regression Model: ReLU (Baseline for Neural Network)

For this experiment, we referenced an introductory post about sentiment analysis. We loaded the pre-trained “word2vec-google-news-300” embedding provided by Google, which was trained on a Google News dataset and provides 300-dimensional vectors across a vocabulary of 3,000,000 words. For this model, we considered the top 20,000 most frequent words in the tokenization of the train set. We created an embedding matrix for these words using vectors from the loaded Word2Vec embedding and passed this matrix as weights to the embedding layer of our sequential model. Next, we added a flatten layer and a dense layer with the Rectified Linear Unit (ReLU) activation function. ReLU outputs the maximum of zero and its input, which introduces non-linearity into the network. For the final layer, we added another dense layer with sigmoid activation. Sigmoid is defined as

Sigmoid function: sigmoid(x) = 1 / (1 + e^(-x))

This forces outputs into the range of zero to one, which is essential given how we normalized the ratings: it prevents a rating prediction below zero stars or above five stars.

Model summary of baseline ReLU model
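A minimal sketch of this baseline network (the hidden size and sequence length are assumptions; word_index is assumed to come from the Keras tokenizer fit on the train set):

```python
import numpy as np
import gensim.downloader as api
from tensorflow.keras import layers, models

w2v = api.load("word2vec-google-news-300")        # pre-trained Word2Vec, 300-d

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 300, 100  # sequence length is an assumption

# Build the embedding matrix from the pre-trained vectors
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for word, i in word_index.items():
    if i < VOCAB_SIZE and word in w2v:
        embedding_matrix[i] = w2v[word]

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, weights=[embedding_matrix],
                     input_length=MAX_LEN, trainable=False),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # hidden size is illustrative
    layers.Dense(1, activation="sigmoid"),    # keeps predictions in [0, 1]
])
model.compile(optimizer="adam", loss="mse")
```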

With this model, we achieved an RMSE of 0.173 on the test set. From the graph of model loss below, the train loss decreases as the number of epochs increases, but the test loss appears to stay fairly constant. This indicates the possibility of the model overfitting on the train set. We could add l1 and l2 regularization or dropout layers to discourage overfitting, but we decided against this as this model will act as a baseline for comparison of other models.

Model performance metrics of baseline ReLU model

A distribution of predicted ratings on the test set is shown below, grouped together by their true labels. This distribution seems adequate for a baseline model, but the standard deviation for each group of reviews appears to be quite large.

Predicted rating distributions of baseline ReLU model on test set

A table of descriptive statistics about predicted ratings on the test set is shown below, grouped together by their true labels. Based on the mean and standard deviation, the predicted ratings for one- and five-star true label reviews appear severely biased.

Descriptive statistics of baseline ReLU model on test set

Regression Model: 1D Convolution Layer

Taking inspiration from the 2D CNN for image recognition tasks, we decided to construct a 1D CNN to extract patches and apply convolution to all of them using trained filters. The model overfits extremely fast, so we experimented with l1 and l2 regularization on the Conv1D layers. However, this did not seem to improve the test loss. We decided against pursuing this approach any further. In the end, we achieved an RMSE of 0.160 on the test set.

Model performance metrics of 1D CNN model
Predicted rating distributions of 1D CNN model on test set
Descriptive statistics of 1D CNN model on test set
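A minimal sketch of the 1D CNN variant we experimented with (filter counts, kernel size, and the pooling layer are illustrative choices, not the exact architecture):

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=100, input_length=100),
    # l1/l2 regularization on the convolution did not noticeably help test loss
    layers.Conv1D(128, kernel_size=5, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
```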

Regression Model: LSTM/GRU

We will now attempt this problem with a more powerful technique: recurrent neural networks (RNNs). RNNs were designed to capture sequential/time-series structure, which is very important in our NLP task as well. As we know, the order of words matters: a review saying a product is “not good” is vastly different from a review saying a product is “good, but not perfect”. We want our model to understand word order to distinguish between these cases. However, plain RNNs suffer from short-term memory: in a sufficiently long input, i.e. a long review, an RNN may lose important information from the beginning. This is known as the vanishing gradient problem.

In our experiment, we implemented our network with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers to circumvent these issues. In simple terms, these layers learn which pieces of memory to retain or forget. We also applied a moderate 0.5 dropout rate to regularize the RNN; this randomly turns off neurons and forces the network not to rely fully on any hidden unit but to find more meaningful connections in the data. A single LSTM layer with embeddings trained from scratch achieved a 0.15 RMSE on the test set.

We then attempted to improve the model with bidirectional layers. Traditional RNNs rely only on past and present values; in language processing, however, a sentence often cannot be interpreted without the whole sequence. A bidirectional RNN lets us “see the future” by combining two RNNs, one that moves forward and one that moves backward. By doing this, it can capture more complex relations and patterns than a single RNN layer.

In our experiment, the best result was achieved by combining bidirectional LSTM and GRU layers with a moderate dropout rate, finished with a dense layer with sigmoid activation for the final output (a sketch appears below). We also used GloVe 6B-300d embeddings instead of training our own, to get more meaningful vectors across a wide vocabulary. We achieved a 0.142 test loss with this configuration. Even with dropout layers, the model still overfits very quickly.
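A minimal sketch of this configuration (layer sizes are assumptions; embedding_matrix is assumed to be built from the GloVe 6B-300d dictionary loaded earlier):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(20000, 300, weights=[embedding_matrix],
                     input_length=100, trainable=False),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.5),
    layers.Bidirectional(layers.GRU(64)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
```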

Model summary for the best model achieved with LSTM/GRU
Model performance metrics of LSTM/GRU model
Predicted rating distributions of LSTM/GRU model on test set
Descriptive statistics of LSTM/GRU model on test set

Regression Model: BERT

And finally, we arrive at the most cutting-edge NLP model on the market, BERT. The model had very few layers and used the BERT model variation from this link: https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1.

The neural network model summary looked like the following:

Model summary of BERT model
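A minimal sketch of the regression head on top of the TF Hub BERT module (the maximum sequence length and learning rate are assumptions; the three input tensors are produced by the BERT tokenizer that ships with the module):

```python
import tensorflow as tf
import tensorflow_hub as hub

MAX_LEN = 128  # assumed maximum token length

input_word_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="segment_ids")

bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1",
    trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# A single sigmoid unit keeps predictions in [0, 1], matching the normalized ratings
rating = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_output)

model = tf.keras.Model([input_word_ids, input_mask, segment_ids], rating)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
```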

Training the model for 3 epochs took the following amount of time, with the related MSE values (note that RMSE is these values square rooted).

Model fit statistics of BERT model

Once applied to the test set, the RMSE loss was 0.136, equivalent to roughly 0.68 stars away from the actual value on average. The distribution looked like the following; you’ll notice steeper peaks around the correct prediction value and fewer predictions far from the actual label.

Predicted rating distributions of BERT model on test set

Results

Model Comparisons

Model Comparison Chart

Playing Around with Artificial Inputs to Best Model (BERT)

Example Artificial Inputs to BERT Model

Shortcomings

Discrepancy Between Review and Rating

The biggest shortcoming of this project is an unavoidable one: different people associate star ratings with different sentiment polarities, especially for the 2-, 3-, and 4-star ratings. General satisfaction with a product, but with some minor problems, might result in a 4-star rating for one individual but a 3-star rating for another individual. This results in an understandable amount of built-in loss to our model. In fact, this realization allowed us to debug models that were suspiciously performing at accuracy levels over 90%, or at RMSE values close to perfect. At some point, there is no way to perfectly replicate the star rating of a population of Amazon reviewers who have different rating beliefs for similar sentiment strength.

Output Scale

As discussed in the regression section, we divided all ratings by five to standardize the true labels between zero and one. However, we realized afterward that this makes the prediction range asymmetric: on the low end, predictions can fall all the way to zero stars (0 * 5 = 0), well below the minimum true label of one star, while on the high end they are capped at exactly five stars (1 * 5 = 5). Consequently, predicted ratings for reviews with a five-star true label exhibit a distribution skewed in the negative direction.

Double Negation and Mixed Sentiment

For the less advanced models, double negation and mixed sentiment were sometimes not factored into the predicted label as much as they should have been. Models that accounted for bidirectional representation handled this best. Typically, reviews with these characteristics were given 2-, 3-, or 4-star ratings, which this article has shown to be the most difficult to predict (e.g. “the product was not awful”, “the product was good but had some problems”, “I did not hate the product, but …”).

Inconsistent Word Usage/Capitalization

We lowercased every word as part of our pre-processing, because in the vast majority of cases, capitalization does not affect the lexical impact of a word. However, this did not account for instances where the entire word is capitalized to show more intense sentiment. The sentence “I did NOT like the product” should likely be labeled with a lower rating than the sentence “I did not like the product.”

Conclusion

Regression-based sentiment analysis does not seem to be documented much on online forums; however, it is easy enough to accomplish with tweaks to the more common classification-style sentiment analysis. With relatively little hyper-parameter tuning, we were able to train models ranging from simple boosting frameworks to the most cutting-edge NLP models with a good level of success, especially once we started utilizing deep learning. Different embedding techniques had varying strengths and weaknesses, including processing time, resource utilization, information capture, and versatility. Ultimately, the best-performing model was BERT, which achieved an RMSE value of 0.136. Future work could build on these models by increasing the training set, improving pre-processing, and accounting for the shortcomings listed above as best as possible. One could also download a more sizable version of BERT, although computing time would rapidly increase.

The idea of regression-based sentiment analysis can be used for a variety of practical purposes. It could auto-generate rating values to match your review text, detect fraudulent rating values for a text that does not match the rating to a certain degree, or be used for any other case where regression sentiment analysis is useful rather than simple classification (positive or negative).
