Truncated Singular Value Decomposition (SVD) using Amazon Food Reviews

mrunal sawant · Published in The Startup · Jul 9, 2019

Introduction

Matrix factorization involves describing a given matrix in terms of its constituent elements. The most widely used matrix factorization method is the Singular Value Decomposition, also known as SVD. Every matrix has an SVD, which makes it more stable than other methods such as the eigendecomposition. It is used in a wide array of applications, including dimensionality reduction, de-noising and compression.

In this article, you will discover the practical use of the Truncated Singular Value Decomposition method on a real-world dataset: the Amazon Fine Food Reviews dataset.

The flow of this article is as follows.

  1. Understanding the Real-world Dataset
  2. Importing the Dataset into a Pandas Dataframe
  3. Exploratory Data Analysis on the Dataset
  4. Data Preprocessing
  5. Featurization
  6. Brief Theory about SVD and Truncated SVD
  7. Application of Truncated SVD on the Dataset

Understanding the Dataset

The dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review

Importing the Dataset

The dataset is available in two forms:

  1. .csv file
  2. SQLite Database

In order to load the data, we have used the SQLite database, as it makes it easier to query and visualize the data efficiently.

Since we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score is above 3, then the recommendation will be set to “positive”. Otherwise, it will be set to “negative”.

# using the SQLite table to read the data
import sqlite3
import pandas as pd

con = sqlite3.connect('database.sqlite')

# filtering only positive and negative reviews, i.e.
# not taking into consideration those reviews with Score = 3
# LIMIT 100000 takes the top 100,000 data points; you can change this
# number based on your computing power
filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3 LIMIT 100000""", con)

# give reviews with Score > 3 a positive rating (1), and reviews with Score < 3 a negative rating (0)
def partition(x):
    if x < 3:
        return 0
    return 1

actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

In the code above, we use a SQL query to read the reviews whose score is not equal to 3 from the SQLite database into a pandas dataframe. Once read, we replace the Score column of the dataframe with 0 (negative) if the score is less than 3 and 1 (positive) otherwise.

Exploratory Data Analysis

After reading the data into a pandas dataframe, we performed EDA on the dataset. First, we checked for duplicate entries. It was observed (as shown in the table below) that the reviews data had many duplicate entries, so it was necessary to remove them in order to get unbiased results from the analysis. The following is an example:

display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()

From the above table, it can be seen that the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis, it was found that ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8), ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8), and so on.

It was inferred that reviews with the same parameters other than ProductId belonged to the same product, just in a different flavor or quantity. Hence, in order to reduce redundancy, it was decided to eliminate the rows having the same parameters.

The method used was to first sort the data according to ProductId and then keep only the first of the similar product reviews and delete the others. For example, in the table above, only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product. De-duplication without sorting could leave different representatives for the same product.

# sorting data according to ProductId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

# de-duplication of entries
final = sorted_data.drop_duplicates(subset={"UserId", "ProfileName", "Time", "Text"}, keep='first', inplace=False)
final.shape

The code shown above first sorts the data according to ProductId and then drops all the duplicate copies while keeping the first copy.

It was also seen that in the two rows given below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the analysis.

display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", con)
display.head()

The code shown below removes all the rows where HelpfulnessNumerator is greater than HelpfulnessDenominator.

final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In the next section, we will do preprocessing of the dataset.

Data Preprocessing

After finishing with Exploratory Data Analysis, next we performed preprocessing on the data.

In the preprocessing phase we did the following:

  1. Remove the HTML tags
  2. Remove any punctuation or limited set of special characters like , or . or # etc.
  3. Check if the word is made up of English letters and is not alphanumeric
  4. Check if the length of the word is greater than 2 (as it was found that there are no 2-letter adjectives)
  5. Convert the word to lowercase
  6. Remove stopwords
  7. Finally, Snowball Stemming the word (it was observed to be better than Porter Stemming)

After this, we collect the words used to describe positive and negative reviews.

When we deal with text problems in Natural Language Processing, stop-word removal is one of the important steps for getting better input for any model. Stop words are very common words in a language (e.g. a, an, the in English) that do not help in most NLP problems such as semantic analysis, classification, etc. Hence we will remove all the stop words from our reviews.

In the code below we define a set of all stop words in English.

# https://gist.github.com/sebleier/554280
# we are removing the words 'no', 'nor', 'not' from the stop-words list
# <br /><br /> ==> after the above steps, we are left with "br br",
# so we are including 'br' in the stop-words list
# (if we had <br/> instead of <br />, these tags would have been removed in the 1st step)
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"])

The code mentioned below performs all the steps mentioned above.

# https://stackoverflow.com/a/47091490/4084039
import re
from bs4 import BeautifulSoup
from tqdm import tqdm

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

preprocessed_reviews = []
# tqdm is for printing the status bar
for sentence in tqdm(final['Text'].values):
    sentence = re.sub(r"http\S+", "", sentence)             # remove http/https links from the review
    sentence = BeautifulSoup(sentence, 'lxml').get_text()   # remove HTML tags using the BeautifulSoup library
    sentence = decontracted(sentence)                       # expand contractions such as won't, can't, n't, 're
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()    # remove words containing digits: https://stackoverflow.com/a/18082370/4084039
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)          # keep only English letters (removes punctuation and special characters)
    # https://gist.github.com/sebleier/554280
    # convert words to lowercase and remove stopwords
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentence.strip())           # append the processed review to the list
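
Note that the snippet above does not include the Snowball stemming step listed earlier (the full code is on the author's GitHub). A minimal sketch of that step, assuming NLTK is available and using a hypothetical stemmed_reviews list, could look like this:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

# assumption: stemming is applied after the cleaning steps shown above
stemmed_reviews = [' '.join(stemmer.stem(word) for word in review.split())
                   for review in preprocessed_reviews]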

After preprocessing the text reviews, the next step is featurization.

Featurization

Based on the text reviews, we can predict the sentiment of the user. But these text reviews are strings, so first we need to convert these string features into numerical features. To convert string data into numerical data, we can use the following methods:

  1. Bag of Words
  2. Term Frequency — Inverse Document Frequency (TFIDF)
  3. Word2Vec

In order to show the use of Truncated SVD, we have used the TF-IDF method. One can try the other methods; the procedure will remain the same. Since we have used TF-IDF, we will dig a bit into how TF-IDF works.

TF-IDF stands for Term Frequency-Inverse Document Frequency, which basically tells us the importance of a word in the corpus or dataset. TF-IDF combines two concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency

Term Frequency is defined as how frequently a word appears in a document. Since documents are not all of the same length, a word in a long document may occur more often than the same word in a shorter document; hence the count is normalized by the document length. Term frequency is defined as:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Inverse Document Frequency

Inverse Document Frequency is another concept used for finding out the importance of a word. It is based on the fact that less frequent words are more informative and important. IDF is given by the formula:

IDF(t) = log(total number of documents / number of documents containing term t)

See below for a simple example to understand how TFIDF works.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

The above example is taken from here.
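
As a quick sanity check of that arithmetic, here is the same calculation in Python (just an illustration, using the base-10 logarithm as in the example):

import math

tf = 3 / 100                           # term frequency of "cat" in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # inverse document frequency = log(10,000,000 / 1,000) = 4
print(tf * idf)                        # 0.03 * 4 = 0.12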

TF-IDF can be easily implemented using the TfidfVectorizer class in sklearn. The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

Below is the code we used for getting the TFIDF vectors for the text reviews using the TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1), min_df=10, use_idf=True)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features (unique words in the corpus)", tf_idf_vect.get_feature_names()[0:10])
print('=' * 50)
final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)

In TfidfVectorizer, ngram_range=(1,1) indicates that only unigrams are used for building the vectors.

When building the vocabulary, we can ignore words that have a document frequency strictly lower than a given threshold. Setting min_df=10 means that words appearing in fewer than 10 documents are ignored.

use_idf=True enables inverse-document-frequency reweighting.

After initializing the TfidfVectorizer, we fitted and transformed the preprocessed reviews that we obtained in the Preprocessing stage as shown in the code above.

Below are the few feature names we obtained from TfidfVectorizer.

some sample features(unique words in the corpus) ['aa', 'aafco', 'aback', 'abandon', 'abandoned', 'abdominal', 'ability', 'able', 'abroad', 'absence']

The shape of our text-review matrix after applying the TfidfVectorizer was (87773, 11524). So we had 87,773 text reviews in the dataset, and each review is vectorized into a dimension of 11,524.

The Hughes phenomenon shows that as the number of features increases, the classifier’s performance increases as well, until we reach the optimal number of features. Adding more features while keeping the training-set size the same will then degrade the classifier’s performance.

In order to have optimal classifier performance, we thought of reducing the number of features in our text reviews.

In the next section, we will discuss how we achieved dimensionality reduction in our case.

SVD and Truncated SVD

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.

A = U . Sigma . V^T

Where A is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix V.

“The Singular Value Decomposition is a highlight of linear algebra.”

— Page 371, Introduction to Linear Algebra, Fifth Edition, 2016.

The diagonal values in the Sigma matrix are known as the singular values of the original matrix A. The columns of the U matrix are called the left-singular vectors of A, and the columns of V are called the right-singular vectors of A.
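
As a small illustration (not part of the original article's code), here is how the decomposition looks with NumPy on a toy matrix:

import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # a 3 x 2 matrix

U, s, Vt = np.linalg.svd(A)                    # s holds the singular values in decreasing order
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)           # build the 3 x 2 diagonal matrix Sigma

print(U.shape, Sigma.shape, Vt.shape)          # (3, 3) (3, 2) (2, 2)
print(np.allclose(A, U @ Sigma @ Vt))          # True: A = U . Sigma . V^T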

SVD for Dimensionality Reduction

A popular application of SVD is dimensionality reduction. Data with a large number of features, such as more features (columns) than observations (rows), may be reduced to a smaller subset of features that are most relevant to the prediction problem. The result is a matrix with a lower rank that is said to approximate the original matrix.

To do this, we can perform an SVD on the original data and select the k largest singular values in Sigma. The corresponding k columns can be selected from Sigma and the corresponding k rows from V^T. An approximation B of the original matrix A can then be reconstructed:

B = U . Sigma_k . V^T_k

In natural language processing, SVD can be applied to matrices of word occurrences.

Truncated SVD

So far, we have seen that the singular value decomposition breaks any matrix A down so that A = U . Sigma . V^T.

Let’s take a closer look at the matrix Sigma. Remember that Sigma is an m x n diagonal matrix of the form

Sigma = diag(sigma_1, sigma_2, ..., sigma_r, 0, ..., 0)

where sigma_1 >= sigma_2 >= ... >= sigma_r > 0 are the singular values of the matrix A with rank r.

We can find a truncated SVD of A by setting all but the first k largest singular values to zero and using only the first k columns of U and V.
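
Continuing the NumPy sketch from the previous section (again only an illustration, not the article's code), a rank-k approximation can be formed like this:

k = 1                        # keep only the largest singular value
Sigma_k = Sigma[:, :k]       # first k columns of Sigma
Vt_k = Vt[:k, :]             # first k rows of V^T
B = U @ Sigma_k @ Vt_k       # rank-k approximation B = U . Sigma_k . V^T_k
print(np.round(B, 2))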

Choosing the value of k

So what k should we choose? If we choose a higher k, we get a closer approximation to A. On the other hand, choosing a smaller k saves more processing time. Basically, we must choose between time and accuracy. Which do you think is more important? How can we find a good balance between sacrificing time and sacrificing data?

In the next section, we will see how to use Truncated SVD for dimensionality reduction on our text reviews, which have a dimension of 11,524.

Dimensionality Reduction using Truncated SVD

The first step is to get the word co-occurrence matrix. A word co-occurrence matrix describes how words occur together, which in turn captures the relationships between words. It is computed simply by counting how often two or more words occur together in a given corpus.

As an example of words co-occurrence, consider a corpus consisting of the following documents:

penny wise and pound foolish

a penny saved is a penny earned

Letting count(w(next)|w(current)) represent how many times the word w(next) follows the word w(current), we can summarize the co-occurrence statistics for the words “a” and “penny” as:

w(current) \ w(next)   penny   wise   saved   earned
a                        2      0       0       0
penny                    0      1       1       1

The above table shows that “a” is followed by “penny” twice, while the words “earned”, “saved”, and “wise” each follow “penny” once in our corpus. Thus, “earned” appears after “penny” one out of three times. The count shown above is called the bigram frequency; it looks only at the next word after the current word. Given a corpus with a vocabulary of N words, we need a table of size N x N to represent the bigram frequencies of all possible word pairs. Such a table is highly sparse, as most frequencies are equal to zero.
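
A quick way to verify those bigram counts (an illustrative sketch, not the article's code) is to count them directly:

from collections import Counter

corpus = ["penny wise and pound foolish",
          "a penny saved is a penny earned"]

bigram_counts = Counter()
for doc in corpus:
    words = doc.split()
    for current, nxt in zip(words, words[1:]):
        bigram_counts[(current, nxt)] += 1

print(bigram_counts[('a', 'penny')])       # 2
print(bigram_counts[('penny', 'saved')])   # 1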

Below is the code for generating the co-occurrence matrix for our text reviews. We obtained the unique words in our corpus from the TfidfVectorizer using the .get_feature_names() method, and used the top 6,000 features to build the matrix.

import numpy as np

# to build the co-occurrence matrix using the top 6000 features of the TF-IDF vectorizer
cooccurrenceMatrix = np.zeros((6000, 6000))  # co-occurrence matrix
context_window = 4                           # context window for the co-occurrence matrix
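
The article does not show the code that fills this matrix (the full version is on the author's GitHub). A minimal sketch, assuming the top 6,000 TF-IDF features form the vocabulary and that co-occurrence is counted within the 4-word context window on both sides, might look like this:

# assumption: the top 6000 TF-IDF features form the vocabulary of the matrix
top_features = tf_idf_vect.get_feature_names()[:6000]
feature_index = {word: i for i, word in enumerate(top_features)}

for review in preprocessed_reviews:
    words = review.split()
    for i, word in enumerate(words):
        if word not in feature_index:
            continue
        # count the neighbours within the context window on both sides
        start = max(0, i - context_window)
        end = min(len(words), i + context_window + 1)
        for j in range(start, end):
            if j != i and words[j] in feature_index:
                cooccurrenceMatrix[feature_index[word], feature_index[words[j]]] += 1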

Once the co-occurrence matrix is constructed, we apply Truncated SVD on it for dimensionality reduction.

As we saw, for Truncated SVD we need to decide on the value of k. We decide the value of k, or the number of components, based on the explained variance contributed by the singular values.

This analysis is shown in the code below.

# Program to find the optimal number of components for Truncated SVD
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# list containing the different numbers of components to try
n_comp = [4, 10, 15, 20, 50, 100, 150, 200, 500, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500]
explained = []  # explained variance for each choice of n_components

for x in n_comp:
    svd = TruncatedSVD(n_components=x)
    svd.fit(cooccurrenceMatrix)
    explained.append(svd.explained_variance_ratio_.sum())
    print("Number of components = %r and explained variance = %r" % (x, svd.explained_variance_ratio_.sum()))

plt.plot(n_comp, explained)
plt.xlabel('Number of components')
plt.ylabel("Explained Variance")
plt.title("Plot of Number of components v/s explained variance")
plt.show()
[Plot of the number of components vs. explained variance, used to select the optimal k for Truncated SVD]

From the above graph, we observe that with k = 3000 we get an explained variance above 97%. Hence 3000 components explain about 97% of the variance in our data.

Therefore, instead of representing each review with the old ~11k features, we can represent it with the top 3000 components obtained from the Truncated SVD.
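
A minimal sketch of that final step, assuming (as the article implies) that the Truncated SVD is fitted on the co-occurrence matrix to obtain 3000-dimensional word vectors, and that each review is then represented by combining its word vectors (the averaging shown here is an assumption, and it reuses the hypothetical feature_index mapping from the earlier sketch):

# fit the final Truncated SVD with the chosen number of components
svd = TruncatedSVD(n_components=3000)
word_vectors = svd.fit_transform(cooccurrenceMatrix)   # one 3000-dimensional vector per vocabulary word

# assumption: a review is represented as the average of its in-vocabulary word vectors
def review_vector(review):
    vectors = [word_vectors[feature_index[w]] for w in review.split() if w in feature_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3000)

review_vectors = np.array([review_vector(r) for r in preprocessed_reviews])
print(review_vectors.shape)   # (number of reviews, 3000)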

This is how we can implement dimensionality reduction using Truncated SVD.

Once we decide upon the features, we can apply various classical machine learning algorithms to predict whether a given review by a customer about a product is positive or negative.

Conclusion

In order to have optimal classifier performance, we need to ensure that our model does not have too many dimensions, while at the same time making sure that the dimension is not too small: classifiers with very small feature sets tend to underfit, while very large feature sets (relative to the training-set size) tend to overfit.

We learnt and practically implemented a dimensionality reduction technique for text data using Truncated SVD, and saw how to optimally decide on the number of components for Truncated SVD.

Refer to my GitHub profile for the detailed code.
