E-Commerce Recommendation Engine with Collaborative Filtering

Divya Chandana
Published in The AI Guide
9 min read · May 6, 2021

Introduction

Online e-commerce companies use recommendation engines to suggest products to many different types of customers. They commonly rely on collaborative filtering, which scales to very large datasets and produces high-quality suggestions. This kind of filtering works from the products a user has purchased and rated: based on that past data, the model produces a list of suggestions for each customer. In this post, we build a collaborative filtering recommender for an electronics items dataset.

A recommendation system like this helps e-commerce users who are looking for similar products to purchase, and it helps e-commerce companies that want to offer good recommendations on their websites.

Collaborative Filtering

For each user, collaborative recommender systems recommend items based on how similar users rated them.

Collaborative filtering collects and analyzes data on users’ behavior, activities, and ratings, and anticipates what a user will like based on their similarity to other customers.

A key advantage of this approach is that it does not depend on detailed product information, so it can accurately recommend complex products without requiring an “in-detail” understanding of the products themselves.[1]

Collaborative filtering rests on the assumption that people who agreed in the past will agree in the future. For instance, if person “A” likes products 1, 2, and 3, and person “B” likes products 2, 3, and 4, then they have similar interests, so “A” ought to like product 4 and “B” ought to like product 1.
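The “A” and “B” example can be sketched in a few lines of plain Python. This is only an illustration: the Jaccard overlap used as the similarity measure and the helper names are assumptions, not something the article prescribes.

```python
# Toy data from the example: the set of products each person liked.
likes = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}


def jaccard(a, b):
    """Overlap between two sets of liked items: 0 is disjoint, 1 is identical."""
    return len(a & b) / len(a | b)


def suggest(user, other, likes):
    """Items the other person liked that this user has not liked yet."""
    return sorted(likes[other] - likes[user])


similarity = jaccard(likes["A"], likes["B"])  # 2 shared items out of 4 distinct
for_a = suggest("A", "B", likes)
for_b = suggest("B", "A", likes)
print(similarity, for_a, for_b)
```

Because A and B share products 2 and 3, each one is offered the single product only the other has liked.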

This system matches persons with similar interests and provides recommendations based on this matching.

Our objective is to build a recommendation engine that suggests similar items to customers based on their past ratings of other items. To begin, we perform Exploratory Data Analysis (EDA), then implement collaborative recommendation algorithms, and finally discuss hybrid recommenders. [2]

Libraries

NumPy is a Python library for working with arrays. It also provides functions for linear algebra, Fourier transforms, and matrices.

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Data Collection

To design a collaborative system, I need a dataset with userId, productId, and rating columns. I found one on Kaggle that contains electronics product ratings, which is very useful for this analysis. Here is the link: electronics dataset

The Electronics dataset includes:

  • userId : unique id of the user
  • productId : unique product id
  • Rating : Rating given by the user for the product
  • timestamp : Time of the rating [not used for this analysis]

Data Cleaning

I removed the null values from the dataset using the dropna method provided by pandas. I used only 20,000 records, since the full dataset is very large and analysis took a long time.

This is an overview of what the data looks like.

The info() method is very useful because it provides an overview of the data: the number of records, the number of columns, and the data type of each column. It shows what kind of data I’m dealing with.

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution.

From the results of describe(), I found that the rating column is clean: there are no negative or null values, and we don’t need to normalize the data because ratings always range from 1 to 5.

The timestamp is of no use here, so I dropped that column.
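The cleaning steps in this section might look roughly like the sketch below. The tiny in-memory DataFrame is made-up stand-in data, and taking the first 20,000 rows with head() is an assumption about how the sample was selected.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Kaggle file, with the same four columns.
df = pd.DataFrame({
    "userId": ["U1", "U2", "U3", "U4", None],
    "productId": ["P1", "P1", "P2", "P3", "P3"],
    "Rating": [5.0, 4.0, np.nan, 1.0, 3.0],
    "timestamp": [1365811200, 1341100800, 1367193600, 1374451200, 1334707200],
})

df = df.dropna()                      # remove rows with any missing value
df = df.head(20000)                   # keep at most 20,000 records (assumed selection)
df.info()                             # record count, column names, data types
summary = df["Rating"].describe()     # count, mean, std, min, quartiles, max
df = df.drop(columns=["timestamp"])   # not used in this analysis

print(summary)
print(df)
```

Checking the min and max rows of the describe() output is a quick way to confirm that every rating lies in the 1 to 5 range.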

Exploratory Data Analysis

I analyzed the electronics dataset extensively, first plotting the major factors as graphs to understand the data.

Here, we start by plotting the number of ratings for each rating value (1, 2, 3, 4, or 5 stars).

More than 5,000 users rated products with 5 stars.

For the next graph, we will analyze the distribution of the number of ratings and mean ratings recorded for each product.

The histograms below show that most products have between 0 and 200 ratings, and that most products have a mean rating of 5.

To observe the relationship between the number of ratings and the mean rating, I created a scatterplot. If we invert the graph, it reads as a right-skewed distribution. Most users rated products in the range of 4 to 5 stars.
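The tables behind these plots can be computed with pandas before any plotting. The sketch below uses made-up ratings; the names star_counts, per_product, num_ratings, and mean_rating are illustrative rather than taken from the original notebook.

```python
import pandas as pd

# Made-up ratings with the same columns as the cleaned dataset.
ratings = pd.DataFrame({
    "userId": ["U1", "U2", "U3", "U1", "U2", "U3"],
    "productId": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "Rating": [5, 4, 5, 3, 5, 1],
})

# How many times each star value was given.
star_counts = ratings["Rating"].value_counts().sort_index()

# Number of ratings and mean rating per product.
per_product = (
    ratings.groupby("productId")["Rating"]
    .agg(num_ratings="count", mean_rating="mean")
    .sort_values("num_ratings", ascending=False)
)

print(star_counts)
print(per_product)
```

The two columns of per_product are what the histograms and the scatterplot would be drawn from.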

SVD

Surprise is a Python scikit for building and analyzing simple recommender systems that work with rating data. Here we use its implementation of Singular Value Decomposition (SVD), an effective algorithm for this task, and measure its Root Mean Square Error (RMSE) with K-fold cross-validation to check that it gives good recommendations.

SVD is a well-known matrix factorization technique from linear algebra that is widely used in machine learning. It works on a matrix whose rows are users, whose columns are items, and whose elements are the users’ ratings.

Surprise’s Dataset class lets us load the 20k electronics ratings as user-product rating interactions. Conceptually, the rows of the resulting matrix represent users and the columns represent products.

After loading the data with Dataset.load_from_df, we split it into training and test sets with a 70:30 ratio.

For our task, we want to use a similarity measure between products, such as cosine or Pearson similarity, to make new recommendations.
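To make the two measures concrete, here is a small NumPy sketch that compares products by the ratings the same four users gave them. The rating vectors are made up, and the helper names are illustrative. In Surprise itself, the measure for the KNN algorithms is selected through the sim_options argument.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_sim(a, b):
    """Cosine similarity after subtracting each vector's mean rating."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))


# Ratings given by the same four users to three hypothetical products.
p1 = np.array([5.0, 4.0, 1.0, 2.0])
p2 = np.array([4.0, 5.0, 2.0, 1.0])
p3 = np.array([1.0, 2.0, 5.0, 4.0])

print("cosine  p1-p2:", round(cosine_sim(p1, p2), 3))
print("cosine  p1-p3:", round(cosine_sim(p1, p3), 3))
print("pearson p1-p2:", round(pearson_sim(p1, p2), 3))
print("pearson p1-p3:", round(pearson_sim(p1, p3), 3))
```

Cosine similarity stays positive for all-positive rating vectors, while Pearson centers each vector first, so products that users rate in opposite ways (p1 and p3 here) end up with a negative score.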

Matrix Factorization

Matrix factorization is a subclass of collaborative filtering. It works by decomposing the user-item interaction matrix into the product of two rectangular matrices of lower dimensionality. The advantage of this approach is that, rather than working with one large matrix with a huge number of missing values, we work with much smaller matrices in a lower-dimensional space. [3]
This method has several advantages. It handles the sparsity of the original matrix better than memory-based methods. Comparing similarity on the resulting matrices is also much more reliable, especially for large sparse datasets.
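The idea can be illustrated with a small NumPy sketch that learns two low-dimensional factor matrices by stochastic gradient descent on the observed ratings only. This is a simplified illustration, not Surprise’s implementation: the rating matrix, the number of factors k, the learning rate, and the regularization value are all assumed.

```python
import numpy as np

# Small user x product rating matrix; 0 marks a missing rating.
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
observed = [(u, i) for u in range(R.shape[0])
            for i in range(R.shape[1]) if R[u, i] > 0]


def train_rmse(P, Q):
    """RMSE over the observed ratings only."""
    errors = [R[u, i] - P[u] @ Q[i] for u, i in observed]
    return float(np.sqrt(np.mean(np.square(errors))))


rng = np.random.default_rng(0)
k = 2                                             # latent factors (assumed)
P = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # product factors
rmse_before = train_rmse(P, Q)

lr, reg = 0.01, 0.02   # learning rate and L2 penalty (assumed values)
for _ in range(2000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        p_u = P[u].copy()                 # keep the old value for Q's update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])

rmse_after = train_rmse(P, Q)
filled = P @ Q.T   # dense matrix: every missing entry now has an estimate
print(rmse_before, rmse_after)
print(np.round(filled, 2))
```

The two factor matrices hold 16 numbers in total, and their product provides an estimate for every cell of the 4 x 4 matrix, including the ones that were missing.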

We start by loading the dataset with Dataset.load_from_df.

Observe the K-fold cross-validation results for MSE (Mean Squared Error) and RMSE (Root Mean Squared Error).

We get a mean RMSE of approximately 1.3, which is good enough for our case.

As an example, we use the algorithm to predict the rating that userId AKM1MP6P0OYPR might give to productId ‘0132793040’ (5 stars).

Evaluation Metrics

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.[4]
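As a worked example of the definition, RMSE can be computed by hand with NumPy: square each residual, take the mean, then take the square root. The four ratings and predictions below are made up.

```python
import numpy as np


def rmse(actual, predicted):
    """Square root of the mean squared difference between two arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
# Residuals are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4 (mean 1.5).
score = rmse(actual, predicted)
print(round(score, 4))
```

Because the errors are squared before averaging, the single 2-star miss contributes more than the two 1-star misses combined.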

For the test data we got an RMSE of 1.3620, which is reasonably good: the predictions do not deviate much from the original ratings.

We fit the model on the training data and predict the test data using the .fit_and_predict() method. I also used cross-validation to get a more reliable estimate, which gave an RMSE (Root Mean Squared Error) of 1.366, not far from the test RMSE above.

After building the recommendation engine, I want to test it with 3 query users. I pass the userId and the number of recommendations to print to a custom function, recommend, which returns a list of products and their respective ratings for that user.
The image below shows how to call the recommend function.
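Since the original screenshot is not reproduced here, the following is a hypothetical sketch of what such a recommend function could look like, not the author’s code. It takes any predictor as a callable; with Surprise, that callable could wrap algo.predict(user_id, product_id).est. The ratings dictionary and the mean_rating predictor are placeholders.

```python
def recommend(user_id, n, ratings, predict_fn):
    """Return the n products the user has not rated with the highest predictions.

    ratings:    dict mapping userId -> {productId: rating}
    predict_fn: callable (userId, productId) -> predicted rating
    """
    seen = ratings.get(user_id, {})
    all_products = {p for user_ratings in ratings.values() for p in user_ratings}
    candidates = all_products - set(seen)
    scored = [(p, predict_fn(user_id, p)) for p in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # best first, ties by id
    return scored[:n]


# Made-up ratings for three users.
ratings = {
    "U1": {"P1": 5, "P2": 3},
    "U2": {"P2": 4, "P3": 5, "P4": 2},
    "U3": {"P1": 4, "P5": 5},
}


def mean_rating(user_id, product_id):
    """Placeholder predictor: the product's average rating across all users."""
    values = [r[product_id] for r in ratings.values() if product_id in r]
    return sum(values) / len(values)


top = recommend("U1", 2, ratings, mean_rating)
print(top)
```

Products the user has already rated are excluded, so the list contains only new suggestions.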

Below is the top 5 product recommendation list for user ANTN61S4L7WG9.

Recommendations for User 1

Below is the top 5 product recommendation list for user AYNAH993VDECT.

Recommendations for User 2

Below is the top 5 product recommendation list for user A18YMFFW974QS.

Recommendations for User 3

Observations

The SVD (Singular Value Decomposition) model has a test RMSE of 1.362 and a cross-validation (CV) RMSE of 1.366. This is lower than the RMSE of the KNN model, which is 1.41.

As for the recommendations, each customer is offered a variety of products, which are obtained by filling in the missing entries of the rating matrix during matrix factorization with SVD.

SVD is a superior model compared to KNN, with a lower RMSE of 1.36 against 1.41. It is especially useful when the data is sparse, with many missing ratings.

Bugs encountered

The dataset I initially worked with contained nearly 7,824,482 records. Loading it into a DataFrame used to throw a memory-exceeded error. To fix this, I changed the code

from df = pd.read_csv(myfile,sep='\t') # memory error

to df = pd.read_csv(myfile,sep='\t',low_memory=False)

This fixed the error, and I also reduced the data to 20K records because less data takes less time to analyze.
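Two other ways to keep memory bounded when reading a large file are sketched below, as alternatives rather than as what the original notebook did. The pandas documentation describes low_memory as controlling whether the file is parsed in internal chunks for type inference, so limiting rows with nrows or streaming with chunksize may be a more direct fix. The in-memory text, the header row, and the row counts are stand-ins.

```python
import io

import pandas as pd

# In-memory stand-in for the large tab-separated file (header row assumed).
raw = "userId\tproductId\tRating\ttimestamp\n" + "".join(
    f"U{n}\tP{n % 7}\t{n % 5 + 1}\t{1300000000 + n}\n" for n in range(1000)
)

# Option 1: read only the first rows, e.g. nrows=20000 on the real file.
sample = pd.read_csv(io.StringIO(raw), sep="\t", nrows=200)

# Option 2: stream the file in chunks and keep only the needed columns.
chunks = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    chunksize=250,
    usecols=["userId", "productId", "Rating"],
)
total_rows = sum(len(chunk) for chunk in chunks)

print(len(sample), total_rows)
```

With nrows the rest of the file is never parsed, and with chunksize only one chunk is held in memory at a time.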

Limitations

Collaborative filtering makes for solid recommendation systems while requiring very little detail about the items themselves. However, it has a few limitations in particular scenarios.

It has a cold-start problem: when a new item comes in, it cannot be recommended until customers have rated or reviewed it, and the model has no extra information about the product to base a recommendation on. Collaborative filtering also lacks transparency and explainability.

To overcome these scenarios, I would recommend a hybrid recommender, which combines content-based filtering with collaborative filtering and can be quite effective.
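One simple way to combine the two approaches is a weighted blend that leans on the content-based score while a product has few ratings. The sketch below is only one possible design: the function name, the full_trust_at threshold, and the linear weighting are all assumptions.

```python
def hybrid_score(cf_score, content_score, n_ratings, full_trust_at=20):
    """Blend two predicted ratings for one product.

    The collaborative score gets more weight as the product collects ratings;
    a product with no ratings relies entirely on the content-based score.
    """
    weight = min(n_ratings / full_trust_at, 1.0)
    return weight * cf_score + (1.0 - weight) * content_score


print(hybrid_score(4.0, 2.0, n_ratings=0))    # new product: content only
print(hybrid_score(4.0, 2.0, n_ratings=10))   # halfway: equal blend
print(hybrid_score(4.0, 2.0, n_ratings=20))   # enough ratings: collaborative only
```

This addresses the cold-start case directly, because a brand-new product still receives a score from its content features.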

Conclusion

The recommendation system we developed helps customers and e-commerce companies by recommending products based on customers’ experience with the products. However, we can improve this engine with deep learning techniques, such as adding RNNs, CNNs, and extra layers to train for better accuracy. In deep hybrid recommendation models, many neural building blocks can also be integrated to form more powerful and expressive models.

References

[1]

[2]

[3]

[4]
