Amazon Fine Food Reviews Featurization with Natural Language Processing

Sachin D N · Analytics Vidhya · Jul 25, 2020
Amazon Fine Food Review Analysis

First, what is Amazon Fine Food Review Analysis?

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. The dataset also includes reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Introduction

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

  1. Number of reviews: 568,454
  2. Number of users: 256,059
  3. Number of products: 74,258
  4. Timespan: Oct 1999 — Oct 2012
  5. Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review

Objective

Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

[Q] How to determine if a review is positive or negative?

[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review, and a rating of 1 or 2 could be considered negative. A rating of 3 is neutral and ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

Loading the data

The dataset is available in two forms

  1. .csv file
  2. SQLite Database

In order to load the data, we use the SQLite database, as it is easier to query and visualize the data efficiently.

Since we only want the global sentiment of the recommendations (positive or negative), we purposefully ignore all scores equal to 3. If the score is above 3, the recommendation is set to “positive”; otherwise, it is set to “negative”.

I used a database of over 500,000 reviews of Amazon fine foods that is available via Kaggle and can be found here.

When the dataset is loaded using pandas, the output in the Jupyter Notebook looks like this.
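Below is a minimal sketch of this loading step, assuming the Kaggle SQLite file is named database.sqlite and contains a table called Reviews:

```python
import sqlite3
import pandas as pd

# Connect to the SQLite database downloaded from Kaggle.
con = sqlite3.connect('database.sqlite')

# Ignore neutral reviews (Score == 3) while reading the data.
filtered_data = pd.read_sql_query("""
SELECT * FROM Reviews WHERE Score != 3
""", con)

# Map the numeric Score to a 'positive'/'negative' class label.
def partition(x):
    return 'positive' if x > 3 else 'negative'

filtered_data['Score'] = filtered_data['Score'].map(partition)
print(filtered_data.shape)
filtered_data.head()
```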

Looking at the output window, you may wonder: what are HelpfulnessNumerator and HelpfulnessDenominator?

HelpfulnessNumerator: the number of people who found the review helpful.

HelpfulnessDenominator: the number of people who indicated whether or not they found the review helpful.

In machine learning, data cleaning is very important, so we preprocess the data using pandas.

It is observed (as shown in the image below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.

As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text, and on doing analysis it was found that

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on

From the two images above, we can see that the brand is the same for both products, but the flavour is different.

It was inferred after analysis that reviews with the same parameters other than ProductId belonged to the same product, just with a different flavour or quantity. Hence, in order to reduce redundancy, it was decided to eliminate the rows having the same parameters.

The method used was to first sort the data according to ProductId and then keep only the first review among the similar ones and delete the others; e.g. in the case above, only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product; deduplication without sorting could leave different representatives for the same product, as sketched below.
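A sketch of that deduplication step, continuing from the filtered_data frame loaded above:

```python
# Sort by ProductId so that duplicates of the same product sit next to each other.
sorted_data = filtered_data.sort_values(
    'ProductId', axis=0, ascending=True, inplace=False,
    kind='quicksort', na_position='last')

# Keep only the first review among rows that share the same user, time and text.
final = sorted_data.drop_duplicates(
    subset=['UserId', 'ProfileName', 'Time', 'Text'],
    keep='first', inplace=False)

print(final.shape)
```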

After removing these duplicates, we can check how much of the original data remains.

We can see that only 69.25% of the data remains after removing the duplicates, i.e. 30.75% of the original data was duplicated.

Also, as shown in the image below, in two rows the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the calculations.

So we remove every row in which HelpfulnessNumerator is greater than HelpfulnessDenominator.
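A one-line filter, continuing from the deduplicated final frame above, handles this:

```python
# Drop rows where HelpfulnessNumerator > HelpfulnessDenominator (not possible in practice).
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

# Fraction of the original data that remains, and the class balance.
print(final.shape[0] / filtered_data.shape[0] * 100)
print(final['Score'].value_counts())
```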

Looking at final[‘Score’].value_counts(), in our cleaned dataset 307,061 points are positive and 57,110 points are negative.

But here is the problem: our dataset has the columns Text and Summary. These are text features, but for building a machine learning model we need numerical features. Now, the question is how to convert text features into numerical vectors.

Text Features

But why do we want to convert text to numerical vectors?

If our text features are converted to d-dimensional numerical vectors, then we can find a plane (with normal w) that separates positive reviews from negative reviews. In the image below, the blue crosses are positive reviews and the red crosses are negative reviews, separated by the plane W.

If someone gives us a new review to classify, we convert it into a numerical vector and multiply it with the weights; if the result is > 0 we classify it as a positive review, and if the result is < 0 we classify it as negative.

This follows from linear algebra: all points lying in the direction of the normal to the plane are considered positive, and points in the opposite direction are considered negative.
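In symbols, with w the normal to the plane and x_q the vector of the new review:

```latex
\hat{y}_q =
\begin{cases}
\text{positive} & \text{if } w^{T} x_q > 0 \\
\text{negative} & \text{if } w^{T} x_q < 0
\end{cases}
```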

So we want to convert text features to numerical vectors. But how?

The answer is Natural Language Processing: we use NLP techniques to convert text to numerical vectors.

Word embedding, or word vectorization, is an NLP methodology to map words or phrases from a vocabulary to corresponding vectors of real numbers, which are then used for word predictions and word similarities/semantics. The process of converting words into numbers is called vectorization.

What is the relation between text and numerical vectors?

  1. If two reviews are very similar, then the distance between their two vectors is small, i.e. similar points must be close together.
  2. If the distance between two review vectors is large, then the reviews are dissimilar.
  3. If two reviews r1 and r2 are more similar, then their vector representations v1 and v2 must be closer.
  4. Similar text must be close geometrically.
  5. We want a method that takes text as input and gives back a numerical vector as output, such that similar text is geometrically close.

We use the following techniques to find the vector representation of a given text:

Bag of Words

tf–idf

Word2vec

Average Word2vec

Average tf–idf Word2vec

In Natural Language Processing, a piece of text is called a document, and a collection of documents is called a corpus.

Before we dive into the NLP techniques, we first preprocess the given text data. We do this to reduce the size of the text data, since elements such as HTML tags, punctuation and stop words increase its length without adding useful signal.

Text Preprocessing:

  1. Begin by removing the HTML tags.
  2. Remove any punctuation or limited set of special characters like , or . or # etc.
  3. Check if the word is made up of English letters and is not alpha-numeric.
  4. Check that the length of the word is greater than 2 (it was found that there are no adjectives of two letters or fewer).
  5. Convert the word to lowercase.
  6. Remove stop words (stop words are considered noise in the text; text may contain stop words such as is, am, are, this, a, an, the, etc.).
  7. Finally, Snowball-stem the word (stemming takes related words and converts them to a base form; it reduces words to a compact form called the word stem). E.g. the words taste, tasty and tasteful all reduce to the base form tast after stemming.

The Python code for all of these text preprocessing steps is given below:
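A minimal sketch of these steps, assuming the review text lives in final['Text'] and using BeautifulSoup and NLTK:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# nltk.download('stopwords') may be needed once.
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess(review):
    # 1. Remove HTML tags.
    text = BeautifulSoup(review, 'html.parser').get_text()
    # 2. Remove punctuation and special characters.
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # 3-7. Keep alphabetic words longer than 2 letters, lowercase them,
    #      drop stop words and apply Snowball stemming.
    words = [stemmer.stem(w.lower()) for w in text.split()
             if w.isalpha() and len(w) > 2 and w.lower() not in stop_words]
    return ' '.join(words)

preprocessed_reviews = [preprocess(r) for r in final['Text'].values]
```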

After preprocessing, the text looks like the image below, and we store it for future use. We can do the same for Summary as well.

Text Featurization Techniques:

BOW(Bag of words):

Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In simple terms, it’s a collection of words to represent a sentence with word count and mostly disregarding the order in which they appear.

BOW is an approach widely used with:

  1. Natural language processing.
  2. Information retrieval from documents.
  3. Document classifications.

On a high level, it involves the following steps.

What is a word vector?

At one level, it’s simply a vector of weights. In a simple 1-of-N (or ‘one-hot’) encoding every element in the vector is associated with a word in the vocabulary. The encoding of a given word is simply the vector in which the corresponding element is set to one, and all other elements are zero.

Suppose our vocabulary has only five words: King, Queen, Man, Woman, and Child. We could encode the word ‘Queen’ as:
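As a toy illustration (assuming the vocabulary order King, Queen, Man, Woman, Child), the one-hot vector for ‘Queen’ is:

```python
vocabulary = ['King', 'Queen', 'Man', 'Woman', 'Child']

# One-hot encoding: a 1 at the position of the word, 0 everywhere else.
queen = [1 if word == 'Queen' else 0 for word in vocabulary]
print(queen)   # [0, 1, 0, 0, 0]
```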

Tokenization:

Tokenization is the first step in text analytics. The process of breaking a text paragraph down into smaller chunks, such as words or sentences, is called tokenization. A token is a single entity that is a building block of a sentence or paragraph.

Sentence Tokenization

Sentence tokenizer breaks text paragraph into sentences.

Word Tokenization

Word tokenizer breaks text paragraph into words.
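A small example with NLTK's tokenizers (nltk.download('punkt') may be needed once):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "I love dogs. I hate knitting."

# Sentence tokenization: split the paragraph into sentences.
print(sent_tokenize(paragraph))   # ['I love dogs.', 'I hate knitting.']

# Word tokenization: split the paragraph into words and punctuation tokens.
print(word_tokenize(paragraph))   # ['I', 'love', 'dogs', '.', 'I', 'hate', 'knitting', '.']
```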

Feature Generation using Bag of Words

In a text classification problem, we have a set of texts and their respective labels, but we cannot use the text directly in our model. We need to convert the text into numbers, or vectors of numbers.

The bag-of-words model (BoW) is the simplest way of extracting features from text. BoW converts text into a matrix of the occurrence of words within a document. The model is only concerned with whether given words occur in the document or not.

Example: Assume there are three documents :

Doc 1: I love dogs.

Doc 2: I hate dogs and knitting.

Doc 3: Knitting is my hobby and passion.

Now, you can create a matrix of documents and words by counting the occurrence of words in each document. Create the vocabulary, which includes all the words in the three documents, by extracting the words from the sentences.

Count how many times each word exists in the document and put that count in the vector, otherwise put 0. If the dimension of our vocabulary is large, then our vector v1 is a sparse vector, meaning most of the elements in the vector are zero.

As you can see, each sentence is compared with our generated word list. Based on the comparison, the vector element value is incremented. These vectors can be used in ML algorithms for document classification and predictions.

Example: the Euclidean distance between Doc 1 and Doc 2 is

||Doc1 − Doc2|| = sqrt(0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0) = sqrt(4) = 2
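The same counts and the distance above can be reproduced with scikit-learn's CountVectorizer (a small illustration; note that its default tokenizer drops one-letter words such as "I", which does not change the distance here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = ["I love dogs.",
        "I hate dogs and knitting.",
        "Knitting is my hobby and passion."]

# Build the vocabulary and the count matrix.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
print(bow.toarray())

# Euclidean distance between Doc 1 and Doc 2: it comes out to 2.0,
# matching the hand calculation above.
print(euclidean_distances(bow[0], bow[1]))
```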

Binary Bag of Words (BoW): in binary BoW, instead of the number of occurrences of a word, we only consider whether the word exists or not. If the word exists, put 1; otherwise assign 0 in the vector.

This matrix uses single words. It can also be built from combinations of two or more words, called a bigram or trigram model; the general approach is called the n-gram model.

Insights into bag of words:

The BOW model only considers if a known word occurs in a document or not. It does not care about meaning, context, and order in which they appear.

This gives the insight that similar documents will have word counts similar to each other. In other words, the more similar the words in two documents, the more similar the documents can be.

Limitations of BOW

  1. Semantic meaning: the basic BOW approach does not consider the meaning of the word in the document. It completely ignores the context in which it’s used. The same word can be used in multiple places based on the context or nearby words.
  2. Vector size: For a large document, the vector size can be huge resulting in a lot of computation and time. You may need to ignore words based on relevance to your use case.

Implementation of BoW on the Amazon food reviews dataset with scikit-learn

You do not have to code BoW yourself whenever you need it; it is already part of many available frameworks, such as CountVectorizer in scikit-learn.
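A minimal sketch, assuming the preprocessed_reviews list built in the preprocessing step above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words on the preprocessed review text.
count_vect = CountVectorizer()
bow_counts = count_vect.fit_transform(preprocessed_reviews)

print("Shape of the BoW matrix:", bow_counts.get_shape())
print("Number of unique words:", bow_counts.get_shape()[1])

# Binary BoW and bi-grams are one-line changes:
# CountVectorizer(binary=True)          -> binary Bag of Words
# CountVectorizer(ngram_range=(1, 2))   -> uni-grams + bi-grams
```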

TF-IDF(Term frequency -inverse document frequency):

TF-IDF stands for “Term Frequency — Inverse Document Frequency”. This is a technique to quantify a word in documents, we generally compute a weight to each word which signifies the importance of the word in the document and corpus. This method is a widely used technique in Information Retrieval and Text Mining.

Term Frequency (tf): term frequency is the frequency of the word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own tf.

Term frequency tells us how often a word ‘w’ occurs in document d; the more often ‘w’ occurs, the higher its tf.

Inverse Document Frequency (idf): this measures the importance of a term across the whole corpus. While tf counts how often a term t occurs within a single document d, df (document frequency) counts in how many of the N documents in the set the term t occurs; idf is the log-scaled inverse of this document frequency.

idf gives us the weight of rare words across all documents in the corpus. The words that occur rarely in the corpus have a high idf score.

If we already computed the tf value and if this produces a vectorized form of the document, why not use just tf to find the relevance between documents? why do we need idf?

Let me explain. Although we calculated the tf value, there are still problems. For example, the most common words such as “is” and “are” will have very high values, giving those words very high importance, but using them to compute relevance produces bad results. Such common words are called stop words; although we remove stop words in the preprocessing step, weighting each word by its importance across all the documents represents the documents much better.

Finally, by multiplying tf and idf, we get the tf-idf score. There are many different variations of tf-idf, but for now let us concentrate on this basic version.
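In its most common form (scikit-learn's implementation adds smoothing terms, so its exact numbers differ slightly):

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of words in } d}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\,\times\,\mathrm{idf}(t)
```

where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.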

Let’s take an example to get a clearer understanding.

Sentence 1 : The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the tf-idf for the above two documents, which represent our corpus.
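Working this out by hand with the formulas above (using log base 10; the exact values depend on the log base), the scores come out as:

```
word     | tf (S1) | tf (S2) | idf              | tf-idf (S1) | tf-idf (S2)
---------|---------|---------|------------------|-------------|------------
the      | 2/7     | 2/7     | log(2/2) = 0     | 0           | 0
is       | 1/7     | 1/7     | 0                | 0           | 0
driven   | 1/7     | 1/7     | 0                | 0           | 0
on       | 1/7     | 1/7     | 0                | 0           | 0
car      | 1/7     | 0       | log(2/1) ≈ 0.301 | ≈ 0.043     | 0
road     | 1/7     | 0       | ≈ 0.301          | ≈ 0.043     | 0
truck    | 0       | 1/7     | ≈ 0.301          | 0           | ≈ 0.043
highway  | 0       | 1/7     | ≈ 0.301          | 0           | ≈ 0.043
```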

From the above table, we can see that the tf-idf of the common words is zero, which shows they are not significant. On the other hand, the tf-idf of “car”, “truck”, “road”, and “highway” is non-zero; these words have more significance.

In tf-idf we give more importance to:

  1. words that occur rarely in our corpus.
  2. words that occur more frequently in a given document.

Limitations of tf-idf

  1. Semantic meaning: tf-idf also does not consider the semantic meaning of a word. It completely ignores the context in which the word is used; the same word can mean different things depending on the context or nearby words.

Implementation of tf-idf on the Amazon food reviews dataset with scikit-learn

The top 10 features based on their scores are given by:
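A minimal sketch with TfidfVectorizer, again assuming the preprocessed_reviews list from earlier; listing the features with the highest idf is one simple way to inspect the highest-weighted (rarest) terms:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf on the preprocessed review text.
tf_idf_vect = TfidfVectorizer()
tfidf_matrix = tf_idf_vect.fit_transform(preprocessed_reviews)

print("Shape of the tf-idf matrix:", tfidf_matrix.get_shape())

# Top 10 features by idf score (the rarest, hence highest-weighted, terms).
features = tf_idf_vect.get_feature_names_out()
top_idx = np.argsort(tf_idf_vect.idf_)[::-1][:10]
print([features[i] for i in top_idx])
```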

word2vec:

This technique is a state-of-the-art algorithm; it considers the semantic meaning of a word.

Given a word, it converts it into a vector. It also learns relationships between words automatically from the text.

The output of the word2vec model is dense vectors. The word2vec model requires a large text corpus.

In word2vec, a distributed representation of a word is used. Take a vector with several hundred dimensions (say 1000). Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

If I label the dimensions in a hypothetical word vector (there are no such pre-assigned labels in the algorithm of course), it might look a bit like this:

Such a vector comes to represent in some abstract way the ‘meaning’ of a word. And as we’ll see next, simply by examining a large corpus it’s possible to learn word vectors that are able to capture the relationships between words in a surprisingly expressive way. We can also use the vectors as inputs to a neural network.

Vectors for King, Man, Queen, & Woman:

The result of the vector composition King — Man + Woman = ?

To understand the complete working of word2vec model visit here.

Implementation of word2vec on the Amazon food reviews dataset with gensim
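A minimal training sketch with gensim (assuming gensim >= 4.0, where the parameter is vector_size rather than size, and that each preprocessed review is split into a list of words; the stemmed token 'tasti' is just an example query):

```python
from gensim.models import Word2Vec

# Each review becomes a list of word tokens.
list_of_sentences = [review.split() for review in preprocessed_reviews]

w2v_model = Word2Vec(sentences=list_of_sentences,
                     vector_size=50,   # dimensionality of the word vectors
                     min_count=5,      # ignore words appearing fewer than 5 times
                     workers=4)

# Words most similar to the (stemmed) word 'tasti', if it is in the vocabulary.
print(w2v_model.wv.most_similar('tasti'))
```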

After training the model, the word-similarity output looks like the image below:

The similarity between individual words is also given by:
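For example, continuing with the model trained above (both words must be in the vocabulary):

```python
# Cosine similarity between two individual (stemmed) words.
print(w2v_model.wv.similarity('tasti', 'delici'))
```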

Average word2vec:

We can train a Word2Vec model and then use it to build sentence vectors: the average of the Word2Vec vectors. We simply take the average of all the word vectors in a sentence, and this average vector represents the sentence vector.

We need to give it a large text corpus, and it creates a vector for every word. It tries to learn the relationships between words automatically from the raw text. The larger the dimension, the richer in information the vector is.

Implementation of average word2vec on the Amazon food reviews dataset
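A sketch of the averaging step, reusing w2v_model and list_of_sentences from the word2vec section:

```python
import numpy as np

# Average Word2Vec: the sentence vector is the mean of the vectors
# of the words it contains.
sent_vectors = []
for sentence in list_of_sentences:
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    count = 0
    for word in sentence:
        if word in w2v_model.wv:          # skip words not in the vocabulary
            sent_vec += w2v_model.wv[word]
            count += 1
    if count != 0:
        sent_vec /= count
    sent_vectors.append(sent_vec)

print(len(sent_vectors), len(sent_vectors[0]))
```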

tf-idf weighted Word2vec:

In this method, we first calculate the tf-idf value of each word, then multiply the word2vec vector of each word by its tf-idf value, and take the weighted mean.

Implementation of tf-idf weighted word2vec on the Amazon food reviews dataset

First, we calculate the tf-idf value of each word.

Then we take the word2vec vector of each word, multiply it by the word's tf-idf value, and take the weighted mean.
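A sketch of this weighting, reusing tf_idf_vect, w2v_model and list_of_sentences from the previous steps (here the tf part is computed per review and multiplied by the stored idf):

```python
import numpy as np

# Map each vocabulary word to its idf value.
tfidf_feat = tf_idf_vect.get_feature_names_out()
idf_dictionary = dict(zip(tfidf_feat, tf_idf_vect.idf_))

tfidf_sent_vectors = []
for sentence in list_of_sentences:
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    weight_sum = 0
    for word in sentence:
        if word in w2v_model.wv and word in idf_dictionary:
            # tf-idf of the word in this review = tf (count / length) * idf
            tf_idf = idf_dictionary[word] * (sentence.count(word) / len(sentence))
            sent_vec += w2v_model.wv[word] * tf_idf
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
```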

This was a short introduction to text featurization with Natural Language Processing, using the real-world Amazon Fine Food Reviews dataset from Kaggle.

For the complete code, visit my GitHub repository: click here.


Thanks for reading, and for your patience. Let me know if there are any errors in the post, and let's discuss in the comments if you find anything wrong or have anything to add.
