Sentimental Recommendation System

Unsupervised Sentimental Learning, Opinion Mining, Word2Vec, Review Contents

Arifwaghbakriwala
9 min read · May 13, 2022

In this article, we discuss an unsupervised machine learning project in which the emotions, thoughts, and opinions extracted from text as a mathematical notion are the determining factors of a sentiment that drives the recommendation.

Dataset: Amazon reviews datasets, segregated by category, were used. For instance, reviews for different musical instruments such as guitars, violins, etc. are grouped together as one dataset. Likewise, each category has a dedicated dataset.
Code: The source code is available here
Team/Contributors: Aanand Dhandapani

The broad overview of the project: retrieve a user’s search query (product and category), based on which the user is recommended the top-n products from that category alone. The underlying mechanism, in the simplest terms, is to determine the sentiment of each review for the products in the searched category, either positive or negative, then group the reviews by unique item and pick the top-n items with the highest average connotation scores.

The full-length implementation consists of the following steps:

  • Preprocessing (or Tokenizing)
  • Word Embeddings, Clustering and Connotation Calculation
  • Hyper-parameter Tuning and Evaluation

So, here’s the process:

Preprocessing

The datasets contain 15 features such as ‘marketplace’, ‘customer_id’, ‘review_id’, and more. The relevant ones are picked, and the rest are dropped along with observations with missing values. These datasets have a ‘review_body’ column, which contains string sentences (text) as feedback for the corresponding products. The reviews are then cleaned and tokenized, as in the snippet below.
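Here is a minimal sketch of that step, assuming the public Amazon reviews TSV schema; the file name and the retained columns are illustrative:

```python
import pandas as pd
from gensim.utils import simple_preprocess

# Hypothetical file name; each category has its own dataset in this schema.
df = pd.read_csv("amazon_reviews_jewelry.tsv", sep="\t", on_bad_lines="skip")

# Keep only the relevant columns and drop observations with missing values.
df = df[["product_id", "product_title", "star_rating", "review_body"]].dropna()

# simple_preprocess lowercases, strips punctuation, and tokenizes each review.
df["tokens"] = df["review_body"].apply(simple_preprocess)
```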

Word Embeddings, Clustering and Connotation Scores:

To understand word embeddings, let’s first build up the Word2Vec technique and the two architectures with which a Word2Vec model generates word embeddings.

Words as raw input cannot be interpreted for calculations by any algorithm, hence the need to represent them as a mathematical notion. This is achieved by the Word2Vec model, which converts words into vectors that near-accurately represent their meaning mathematically. Each word is represented as a vector of weights/components, where each component symbolizes a feature and holds a numerical value portraying the degree to which that feature describes the word.

Word2Vec:

Given the bulk of text (all reviews collectively), a vocabulary of unique words is generated. A one-hot encoded vector is created for each word, with one component (row) per vocabulary word, so the vector’s length equals the vocabulary size. The component for the word that the one-hot encoded vector represents is set to ‘1’, and all others to ‘0’. (see the figure below)

Image by author
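As a toy illustration of this encoding (a NumPy sketch over a four-word vocabulary; the real vocabulary is built from all reviews):

```python
import numpy as np

vocab = ["guitar", "violin", "great", "broken"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # The vector length equals the vocabulary size; a single 1 marks the word.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("great"))  # [0. 0. 1. 0.]
```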

A window of fixed size slides over the given corpus of text; each window is a training sample with the first word as the target and the remaining words as the context used as input. For example, in the figure below, the window size is 3: the first word is to be predicted (target) based on the next two words (context).

Image by author

The one-hot encoded input vectors (of the context) are fed to the neural network, producing an output vector that is essentially random (not one-hot encoded) in the first epoch. The loss is calculated between the randomly predicted output and the actual one-hot encoded output, then propagated backwards through the network, adjusting the weights to minimize the loss. This backpropagation repeats over multiple epochs until the maximum number of iterations is reached or the loss is minimal. (as in the figure below)

Image by author

Once the error is minimized after a certain number of epochs, that is, the predicted value is close to the actual value, the weights pointing to the word being predicted become the components representing that word. The vector of those weights is the word-vector representing the predicted word. (refer to the figure below)

Image by author

The approach explained above is the foundation of the CBOW (Continuous Bag of Words) architecture. In CBOW, the context words are given as input to the neural network, based on which the target is predicted. The predicted mathematical notion of a word is thus based on its surroundings.

The second architecture is known as skip-gram (SG), which has the same working principles as CBOW with a subtle difference in input/output. In skip-gram, the target word is given as input, based on which its surrounding words are predicted.

CBOW: Given the surrounding context (words), predict the word.

SG: Given the word, predict its surrounding context (words).

Now that we have a solid foundation of what Word2Vec is and how it creates a mathematical notion of words, let’s dive into the implementation.

After tokenizing the reviews, a Word2Vec object is created from the ‘gensim.models’ package. Some significant parameters need to be specified as arguments when creating the Word2Vec object, such as the minimum frequency of words, the window size, the size of the hidden layer in the neural network, etc. Then the ‘build_vocab()’ function is invoked to build the vocabulary from the tokenized reviews, upon which the model is trained by calling the ‘train()’ function.
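A sketch of that flow with the gensim 4.x API, continuing from the preprocessing sketch above (the parameter values here are illustrative, not the tuned ones):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    vector_size=300,  # dimensionality of the word vectors (hidden-layer size)
    window=13,        # context window size, the main tuned hyperparameter
    min_count=5,      # ignore words that appear fewer than 5 times
    sg=0,             # 0 = CBOW, 1 = skip-gram
)
model.build_vocab(df["tokens"])  # vocabulary from the tokenized reviews
model.train(df["tokens"], total_examples=model.corpus_count, epochs=model.epochs)
```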

As soon as the model’s training completes, the model object contains the high-dimensional mathematical representation of each word, also known as the word embeddings. Each word is represented by a 300-component feature vector (the hidden-layer size used here is 300).

Now, K-Means clustering is performed on the word embeddings to cluster them into two groups and determine their centroids, using the KMeans model from the ‘sklearn.cluster’ package. The centroid word-vectors are retrieved via the ‘cluster_centers_’ attribute of the KMeans object. The first centroid word-vector represents the cluster containing positive words, while the second represents the negative one. Each word in the Word2Vec model’s vocabulary is labeled as either positive (1) or negative (-1), depending on which centroid its word embedding is closest to.
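A sketch of the clustering and labeling step, continuing from the trained model above (which cluster is the “positive” one follows the article’s convention; K-Means itself assigns arbitrary cluster ids):

```python
import numpy as np
from sklearn.cluster import KMeans

embeddings = model.wv.vectors  # (vocab_size, 300) word-embedding matrix

kmeans = KMeans(n_clusters=2, random_state=42).fit(embeddings)
centroids = kmeans.cluster_centers_  # one centroid word-vector per cluster

# Map cluster id 0 to the positive label (+1) and cluster id 1 to negative (-1),
# following the article's convention.
labels = np.where(kmeans.labels_ == 0, 1, -1)
word_sentiment = dict(zip(model.wv.index_to_key, labels))
```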

Once the words are labeled as positive or negative, the connotation score of each review is calculated to determine its sentiment. The connotation score is simply the sum of +1 for each positive word and -1 for each negative word in a particular review.

For example, take the review “Unsupervised Machine Learning is a mandatory course for data science graduates”, and suppose that, according to the word embeddings and their distances to the centroids, the semantic words receive the following labels:

{Unsupervised: +1, Machine: +1, Learning: -1, mandatory: +1, course: -1, data: +1, science: +1, graduates: -1}

So, adding the labels: (+1) + (+1) + (-1) + (+1) + (-1) + (+1) + (+1) + (-1) = 5 - 3 = 2 ≥ 0, hence a positive review (labeled 1); negative otherwise (labeled 0). This is the predicted sentiment of the review.
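A minimal sketch of the scoring, reusing the word_sentiment mapping from the clustering sketch (treating out-of-vocabulary words as contributing 0 is an assumption):

```python
def connotation_score(tokens, word_sentiment):
    # Sum of +1 for positive words and -1 for negative words in the review.
    return sum(word_sentiment.get(token, 0) for token in tokens)

def predict_sentiment(tokens, word_sentiment):
    # Zero-split rule: score >= 0 means positive (1), otherwise negative (0).
    return 1 if connotation_score(tokens, word_sentiment) >= 0 else 0
```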

There are no ground-truth values to contrast the results against for evaluation, so a notion of ‘high raters, well-wishers’ was adopted: users who rate a product highly would also provide positive feedback. It would be rare to observe the opposite pattern, where a product is rated highly while also being criticized. So, the true value was feature-engineered into a binary class, labeled 1 if the product was rated 4 or above and 0 if it was rated less than 4. Based on this true value and its contrast with the predicted value determined by the connotation score, hyperparameter tuning was performed and the evaluation metrics were improved.
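Under the schema assumed in the earlier sketches, this ground-truth labeling is a one-liner (‘star_rating’ is the assumed column name):

```python
# Rated 4 or above -> positive (1); otherwise negative (0).
df["true_label"] = (df["star_rating"] >= 4).astype(int)
```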

Hyperparameter Tuning and Evaluation

In this section of the article, the significant hyperparameters, namely the window size and the size of the hidden layer, are emphasized. But since changing the size of the hidden layer didn’t show much impact on performance (which is odd, or we were unable to determine the cause), the size was kept at 300.

Note: The complete project was implemented on two Amazon reviews datasets (Jewelry and Watches), with the preprocessing and model training executed separately for each dataset. This is because the contextual meaning of a word in one dataset (say, Home Essentials) differs from its contextual meaning in another (say, Games), and since Word2Vec determines a word’s representation from its surroundings, it made more sense not to merge the datasets; otherwise the effect of a word in one context would be biased or neutralized by the effect of the same word in another. The implementation, however, is generic and can be applied to any of the review datasets.

The determining factors were the window size and the choice of split point on the connotation-score scale: zero-split versus mid-split.

The zero-split means that if a review’s connotation score was greater than or equal to zero, its sentiment was tagged as positive, negative otherwise.

The mid-split sets the split point at the average of the largest and smallest connotation scores: if a review’s connotation score is greater than this average, the review’s sentiment is tagged as positive, negative otherwise.
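Continuing from the earlier sketches, both split rules can be expressed over a series of per-review connotation scores:

```python
scores = df["tokens"].apply(lambda t: connotation_score(t, word_sentiment))

# Zero-split: positive when the score is at least zero.
zero_split_pred = (scores >= 0).astype(int)

# Mid-split: threshold at the midpoint of the observed score range.
mid = (scores.max() + scores.min()) / 2
mid_split_pred = (scores > mid).astype(int)
```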

The below plots show the effect of window size on Accuracy, Precision, Recall and F1-Score metrics individually for the Jewelry reviews dataset.

The X-axis represents the window size and the Y-axis the performance measure; the blue line indicates performance with the zero-split and the orange line with the mid-split.

Image by author

It can be inferred from the above plots that the mid-split gives the best results on most performance metrics. Although the zero-split performed better on precision, the mid-split’s precision was still good enough, at almost 75% for window sizes 12, 13, and 14. For the same window sizes, the F1-score was nearly 85% and recall was a remarkable 100%.

Note that the model training and hyperparameter tuning were done with more emphasis on the positive class, hence the plots only show the evaluation of the positive class: we are inclined to recommend products that received positive feedback rather than something the general public is riled up about.

The numerical values of the evaluation metrics for window sizes 11, 12, and 13 with the mid-split are as follows:

Image by author

Also, the calculated connotation scores are integers that can range from negative to positive values, and the midpoint of this range was used to determine the sentiment of reviews. The ROC-AUC curve for the integer-valued connotation scores is shown below.

Image by author

The connotation scores were also compressed with the sigmoid function into rational numbers in the range 0 to 1, and the sentiment was determined by the analogous rule with 0.5 as the threshold: compressed values greater than or equal to 0.5 were tagged positive, negative otherwise. This showed the same evaluation metrics as before compression, i.e., when the scores were on the integer scale. The ROC-AUC curve for the compressed (fractional) connotation scores is shown below.

Image by author
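A sketch of the compression, continuing from the scores above; note that thresholding sigmoid(score) at 0.5 is mathematically the same rule as the zero-split on raw scores, since sigmoid(0) = 0.5, which is consistent with the identical metrics:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

compressed = sigmoid(scores.to_numpy())        # integer scores squashed into (0, 1)
sigmoid_pred = (compressed >= 0.5).astype(int)
```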

Finally, when a user searches for a product in a particular category, the average of the connotation scores of positive reviews is calculated for each unique item based on the determined sentiments, and the top-n items with the highest average scores are recommended.

Note: The average of the connotation scores is used, rather than just the count of positive reviews, because we want the degree to which the public favored the product, not merely the number of users who reviewed it.
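Putting it together, here is a sketch of the recommendation step under the same assumed schema (‘product_id’ is the assumed grouping key):

```python
def recommend_top_n(df, scores, n=10):
    # Average the connotation scores of positive reviews per unique item.
    scored = df.assign(score=scores)
    positive = scored[scored["score"] >= 0]
    avg = positive.groupby("product_id")["score"].mean()
    return avg.sort_values(ascending=False).head(n)

print(recommend_top_n(df, scores, n=5))
```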

Below is the pictorial representation of the workflow for the whole project:

Image by author.
