Amazon Fine Food Reviews: Collaborative Filtering

Building a recommendation system with collaborative filtering using the Surprise library in Python.

Published in Web Mining [IS688, Spring 2021]

9 min read · May 4, 2021

Amazon is an international e-commerce giant with a current net worth of 314.9 billion USD and a massive portfolio of products offered online, ranging from electronics and cosmetics to home goods and fresh produce. Amazon reviews are often the most publicly visible reviews of consumer products, and online reviews influence customers’ buying decisions. A product with positive reviews has a high chance of being bought again by the same customer or of being bought by other customers.

As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information to be a smarter consumer and reviewer and to understand customers’ behavior, likes, and dislikes towards products.

I have lately been buying a lot of food products from Amazon due to COVID and lockdown conditions, and I recently came across the Amazon fine food reviews dataset of about 500,000 reviews. It inspired me to explore this huge dataset using web mining and machine learning skills and to create a recommendation system. Both customers and sellers can benefit from the study.

From the seller’s point of view, the study shows which products are more liked or purchased by customers, so sellers can stock more of those products, and it also shows how likely a customer who buys item A is to buy item B. From the customer’s point of view, the system suggests related products, so customers can figure out which other items they would prefer to buy.

My interest in large datasets, human behavior, and food led me to choose the Amazon fine food reviews dataset of about 500,000 reviews from the Stanford Network Analysis Project (SNAP) and to build a recommendation system on it using collaborative filtering and machine learning algorithms.

Before diving further, let us understand what a recommender system is.

“A recommender system, or a recommendation system (sometimes replacing ‘system’ with a synonym such as a platform or engine), is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item.”

There are two main types of recommender systems: content-based and collaborative-based. As the data contains user ratings, I have used the collaborative filtering method.

“Collaborative-based filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future.”

Dataset

The CSV data file was downloaded from Kaggle. This dataset contains reviews of fine foods from Amazon. The data spans a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

Link to the Dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews

These are the statistics of the original dataset:

The dataset has over 500,000 rows and 10 columns, with reviews dating from 1999 to 2012. It contains the UserId, ProductId, helpfulness of the review, review score (1–5 stars), time in Unix format, and review text.

Attribute information:

Id
ProductId: unique identifier for the product
UserId: unique identifier for the user
ProfileName
HelpfulnessNumerator: number of users who found the review helpful
HelpfulnessDenominator: number of users who indicated whether they found the review helpful or not
Score: a rating between 1 and 5
Time: timestamp for the review
Summary: summary of the review
Text: text of the review

Data Visualization

The first count plot shows the count of each rating. A huge spike at 5 shows that many users gave the top rating, followed by a rating of 4, which also has a high count. The dataset is imbalanced, as ratings of 1, 2, and 3 are comparatively rare.
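To make the imbalance concrete, here is a minimal sketch of how the share of each rating could be computed. The scores below are made-up toy values, not the real dataset, and the variable names are only for illustration.

```python
from collections import Counter

# Toy scores standing in for the Score column (assumption: not real data).
scores = [5, 5, 4, 5, 1, 3, 5, 4, 2, 5]

counts = Counter(scores)
total = len(scores)

# Share of each star rating from 1 to 5; missing ratings count as zero.
shares = {star: counts.get(star, 0) / total for star in range(1, 6)}

for star in range(1, 6):
    print(f"{star} stars: {counts.get(star, 0)} reviews ({shares[star]:.0%})")
```

On the real data frame, `reviews['Score'].value_counts(normalize=True)` in pandas should give the same kind of summary.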

The second count plot shows the ratings of the products that have 600 or more reviews.

Libraries and Packages

I have used the following packages and libraries for the programming in Python 3.

# Libraries and packages
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Surprise library
from surprise import accuracy
from surprise.model_selection.validation import cross_validate
from surprise.dataset import Dataset
from surprise.reader import Reader
from surprise import SVD
from surprise.model_selection import train_test_split
from surprise.model_selection import RandomizedSearchCV

from collections import defaultdict

# For uploading files to Google Colab
from google.colab import files

Pandas and NumPy are used for data preprocessing and basic linear algebra. Seaborn and Matplotlib helped in creating visual graphics and bar plots for the dataset. As I work in Google Colab, the files package is used to upload the input data files.

The Surprise library, a SciKit for recommender systems, helps us build and analyze recommender systems. It provides many prediction algorithms, such as SVD, as well as similarity measures. I have used the SVD algorithm, which is equivalent to probabilistic matrix factorization when baselines are not used; it allows us to discover the latent features underlying the interactions between users and items.

The collections module implements specialized container datatypes that provide alternatives to Python’s general-purpose built-in containers: dict, list, set, and tuple.

Data Preprocessing

Even though the data has missing values in the ProfileName and Summary attributes, we won’t be needing these columns. For collaborative filtering with the SVD algorithm, only UserId, ProductId, and Score (the rating) are required. To simplify the data preprocessing, I followed these 3 steps.

Step 1: Drop the columns Id, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Time, Summary, and Text. ProductId is kept because the model needs it.

Step 2: Move the UserId column to position 0. The Surprise reader expects the data frame columns in the following order: the user (raw) ids, the item (raw) ids, and the ratings. So we need to change the position of the UserId column.

Step 3: Keep only the users who have given 50 or more ratings. I have done this to improve the RMSE and MSE values, and because the complete dataset takes a long time to fit and evaluate.

# Drop the columns that are not needed
reviews.drop('Id', axis=1, inplace=True)
reviews.drop('Time', axis=1, inplace=True)
reviews.drop('Summary', axis=1, inplace=True)
reviews.drop('Text', axis=1, inplace=True)
reviews.drop('ProfileName', axis=1, inplace=True)
reviews.drop('HelpfulnessNumerator', axis=1, inplace=True)
reviews.drop('HelpfulnessDenominator', axis=1, inplace=True)

# Move the UserId column to position 0
first_column = reviews.pop('UserId')
reviews.insert(0, 'UserId', first_column)

# Keep only the users with at least 50 ratings
reviews_groupby_users_Ratings = reviews.groupby('UserId')['Score']
reviews_groupby_users_Ratings = pd.DataFrame(reviews_groupby_users_Ratings.count())
user_list_min50_ratings = reviews_groupby_users_Ratings[reviews_groupby_users_Ratings['Score'] >= 50].index
reviews = reviews[reviews['UserId'].isin(user_list_min50_ratings)]

The current shape of the dataset:

The total number of rows : 22941
The total number of columns : 3
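For readers who want to try the three preprocessing steps on something small, here is a compact sketch on a toy data frame. The rows and the `MIN_RATINGS` threshold of 3 are assumptions chosen to keep the example tiny (the study itself uses 50), and this is an alternative formulation rather than the exact code I ran.

```python
import pandas as pd

# Toy stand-in for the reviews frame (assumption: made-up rows).
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "ProductId": ["P1", "P2", "P1", "P3", "P2"],
    "UserId": ["U1", "U1", "U2", "U1", "U3"],
    "Score": [5, 4, 3, 5, 1],
})

MIN_RATINGS = 3  # the study uses 50; 3 keeps the toy example small

# Steps 1 and 2: keep only the needed columns, in the order Surprise expects.
ratings = toy[["UserId", "ProductId", "Score"]]

# Step 3: count ratings per user and keep rows from sufficiently active users.
ratings_per_user = ratings.groupby("UserId")["Score"].transform("count")
filtered = ratings[ratings_per_user >= MIN_RATINGS]

print(filtered.shape)  # (3, 3): only U1 has three or more ratings
```

Selecting the three columns in order replaces both the column drops and the column move, which is why the sketch is shorter.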

Now we are ready to run the recommender system.

Collaborative Filtering Recommender Model using SURPRISE Library

Read the data into a Surprise dataset through the reader and split it into training and test sets. I have split the data in a 70:30 ratio.

reader = Reader()
surprise_data = Dataset.load_from_df(reviews, reader)
trainset, testset = train_test_split(surprise_data, test_size=.3, random_state=42)

The objective is to recommend the top 5 products to a specific user.

First, I created a function that recommends products to a given UserId: it maps the predictions to each user, sorts each user’s predictions, and retrieves the 5 highest ones.
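A minimal sketch of such a top-N function is shown below. It follows the pattern from the Surprise FAQ and assumes each prediction is a 5-tuple of (user id, item id, true rating, estimated rating, details), which is the shape of Surprise's `Prediction` objects. The toy predictions are made up for illustration.

```python
from collections import defaultdict


def get_top_n(predictions, n=5):
    """Return a dict mapping each user id to their n best (item id, estimate) pairs."""
    top_n = defaultdict(list)

    # Map the predictions to each user.
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's predictions by estimate and keep the n highest.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# Toy predictions (assumption: made-up values, not model output).
toy_predictions = [
    ("U1", "P1", 5, 4.8, {}),
    ("U1", "P2", 4, 3.9, {}),
    ("U1", "P3", 5, 4.9, {}),
    ("U2", "P1", 3, 3.2, {}),
]

top = get_top_n(toy_predictions, n=2)
print(top["U1"])  # [('P3', 4.9), ('P1', 4.8)]
```

With Surprise, the list passed in would typically be the output of the fitted algorithm's `test` method.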

Then I created a class that fits the model on the training dataset, performs cross-validation over the testing dataset, and recommends products to the user.

The Surprise package provides an accuracy module, from which I have used RMSE and MSE as measures.

In the matrix factorization setting, RMSE (root mean square error) measures how close the product UV of the factor matrices is to the known entries of the rating matrix M.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

MSE or Mean square error is the average squared difference between the estimated values and the actual value.
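The two measures can be written in a few lines of plain Python. This is only an illustrative sketch with made-up ratings; in the study itself the values come from the Surprise accuracy module.

```python
import math


def mse(actual, predicted):
    """Average squared difference between actual and predicted ratings."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the ratings."""
    return math.sqrt(mse(actual, predicted))


# Toy ratings (assumption: made-up values).
actual = [5, 4, 3, 5]
predicted = [4, 4, 4, 3]

print(mse(actual, predicted))             # 1.5
print(round(rmse(actual, predicted), 4))  # 1.2247
```

Because MSE is simply the square of RMSE, an RMSE of 0.97 corresponds to an MSE of about 0.94.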

To evaluate the performance of the model and check whether it is under-fitting or over-fitting, I have used cross-validation (CV). CV is a re-sampling procedure that is useful for evaluating a model when data is limited: a sample of the data is kept aside and not used to train the model, then used later for testing/validating. The Surprise library provides a cross-validation function. I have used K-fold cross-validation with 5 folds and measured RMSE and MAE.
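To show what K-fold splitting does, here is a small sketch in plain Python. The function name `k_fold_indices` is hypothetical, and this is not how Surprise implements it internally; it only illustrates that every sample is used for testing exactly once.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} train, {len(test_idx)} test")
```

The model is trained and scored once per fold, and the reported metric is usually the average over the folds.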

Moving further, I have used SVD for the collaborative-filtering recommendations.

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for reducing a matrix to its constituent parts to make certain subsequent matrix calculations simpler. It provides another way to factorize a matrix, into singular vectors and singular values.

$A = U \Sigma V^{T}$
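The decomposition can be checked numerically with NumPy. The matrix below is a made-up, fully filled toy rating matrix (4 users by 3 items), so this is a sketch of the classical SVD only. The SVD algorithm in Surprise is different in that it learns the factors from the observed ratings alone, using stochastic gradient descent.

```python
import numpy as np

# Toy, fully observed rating matrix (assumption: made-up values).
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Decompose A into singular vectors (U, Vt) and singular values (s).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three parts back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True

# Keeping only the k largest singular values gives a low-rank approximation.
k = 2
A_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_rank_k, 2))
```

The low-rank approximation is the idea that matters for recommenders: a few latent factors are enough to describe most of the rating patterns.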

To find the optimal model, RandomizedSearchCV is used. The RandomizedSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. As opposed to GridSearchCV, which uses an exhaustive combinatorial approach, RandomizedSearchCV samples randomly from the parameter space. This is useful for finding the best set of parameters for a prediction algorithm, especially using a coarse-to-fine approach.
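The difference between the two search strategies can be sketched in plain Python. Everything here is hypothetical: the candidate values in `param_space` are not the ones I searched over, and `toy_score` is a stand-in for a cross-validated RMSE, not a real model evaluation.

```python
import itertools
import random

# Hypothetical search space (assumption: illustrative values only).
param_space = {
    "n_epochs": [5, 10, 20],
    "lr_all": [0.002, 0.005, 0.01],
    "reg_all": [0.02, 0.1, 0.4],
}


def toy_score(params):
    """Hypothetical stand-in for a cross-validated RMSE (lower is better)."""
    return params["reg_all"] + params["lr_all"] + 1.0 / params["n_epochs"]


def grid_candidates(space):
    """Exhaustive approach: every combination of parameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_candidates(space, n_iter, seed=0):
    """Randomized approach: n_iter random draws (with replacement, for simplicity)."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in space.items()}
            for _ in range(n_iter)]


grid = grid_candidates(param_space)
sampled = random_candidates(param_space, n_iter=10)
best = min(sampled, key=toy_score)

print(len(grid), "grid candidates vs", len(sampled), "random candidates")
print("Best sampled parameters:", best)
```

In Surprise, the equivalent step would be along the lines of `RandomizedSearchCV(SVD, param_distributions, n_iter=10, measures=['rmse'], cv=5)` followed by `fit` on the dataset, after which `best_params['rmse']` holds the winning combination.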

Output:

The optimal model was the one with parameters n_epochs = 20, lr_all = 0.005, and reg_all = 0.4.

Results and Predictions:

Step 1: Initialise the model

svd = clf.best_estimator['rmse']
col_fil_svd = collab_filtering_based_recommender_model(svd, trainset, testset, surprise_data)

Step 2: Fit and Predict

After fitting on the training data, the RMSE is 0.97 and the MSE is 0.94.

Step 3: Cross Validate

These are the results of cross validation.

Step 4: Recommend

For UserID: AQLL2R1PPR46X

For UserID: A2GEZJHBV92EVR

For UserID: A1IU7S4HCK1XK0

The SVD (Singular Value Decomposition) model has a test RMSE of 0.97 and a cross-validation RMSE of 0.95. Using the SVD model, we get a reduced RMSE value.

As for the recommendations, each user has different products recommended to them, as they are inferred by filling in the missing entries of the rating matrix during matrix factorization using SVD.
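To illustrate how missing entries get filled in, here is a small from-scratch sketch of the kind of factorization the Surprise SVD algorithm performs: each rating is estimated as the global mean plus a user bias, an item bias, and the dot product of user and item factors, all learned by stochastic gradient descent. The function names, toy ratings, and hyperparameter values are assumptions for illustration; this is not the Surprise implementation.

```python
import math
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.005, reg=0.02, seed=0):
    """Learn biases and latent factors from (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            # Regularized gradient steps on the biases and the factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, u, i):
    """Estimate a rating, clipped to the 1-5 scale; unknown ids fall back to the mean."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in p and i in q:
        est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
    return min(5.0, max(1.0, est))


def train_rmse(model, ratings):
    """RMSE of the model on the ratings it was trained on."""
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Toy ratings (assumption: made-up values); U1 has not rated P3.
toy_ratings = [
    ("U1", "P1", 5), ("U1", "P2", 4),
    ("U2", "P1", 4), ("U2", "P3", 2),
    ("U3", "P2", 5), ("U3", "P3", 1),
]

model = train_funk_svd(toy_ratings)
print("Training RMSE:", round(train_rmse(model, toy_ratings), 3))
print("Estimated rating of P3 by U1:", round(predict(model, "U1", "P3"), 2))
```

The estimate for the pair (U1, P3) is one of the "missing entries": the user never rated that product, but the learned biases and factors still produce a score that can be ranked against other products.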

Limitations

I faced a few limitations while doing this study.

When I tried to fit the whole dataset, it took a long time to fit and train, and the RMSE value was quite high.
Collaborative filtering can lead to problems like cold start for new items that are added to the list. Until someone rates them, they don’t get recommended.
Data sparsity can affect the quality of user-based recommenders.
The dataset is imbalanced, as most of the ratings are 5, which means the system will recommend only highly rated products.

Conclusion

Overall, the recommendation system performed well on the Amazon dataset, with an RMSE of 0.97, which is a pretty good result for almost no feature engineering. The recommended products have ratings greater than or equal to 4.5, which means only highly rated products are being recommended; this is not always right, as a person can be interested in a low-rated product. SVD can be very slow and computationally expensive. We cannot rely solely on the recommendations, as collaborative filtering can suffer from low accuracy and works only from user ratings.

This dataset is a great start for learning about different types of recommender systems. You can also create a KNN-based recommender or a popularity-based recommender.

I hope you enjoyed the article.