Collaborative Filtering Based Recommendation System Using SVD

Mansi Arora
Published in Analytics Vidhya
8 min read · Mar 29, 2020


What is a collaborative recommendation system?

Collaborative filtering methods build a model based on users' past behavior (items previously purchased, movies viewed and rated, etc.) and use decisions made by the current user and by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

How is it different from content-based recommendation systems?

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on product features.

What is the Surprise library?

The name SurPRISE (roughly :) ) stands for Simple Python Recommendation System Engine.

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Data Gathering Step:

We took the data from the Kaggle website, where we have 4 data files and one movie title file.

MovieIDs range from 1 to 17770 sequentially. CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users. Ratings are on a five star (integral) scale from 1 to 5. Dates have the format YYYY-MM-DD.

Type of machine learning problem:

For a given movie and user, we need to predict the rating the user would give to the movie. This is a recommendation problem, and it can also be seen as a regression problem.

So the goal is to minimize RMSE, which we use as the performance metric.
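To make the metric concrete, here is a minimal sketch of RMSE, together with MAPE, which is reported alongside RMSE in the results further down. The function names and the toy ratings are made up for illustration and are not taken from the original notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Ratings are on a 1 to 5 scale, so the division by the actual rating is safe.
    """
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Toy ratings, made up for illustration only.
y_true = [4, 3, 5, 1]
y_pred = [3.5, 3.0, 4.0, 2.0]
print("RMSE:", rmse(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
```

RMSE penalizes large errors more strongly than small ones, which is why a single badly mispredicted rating moves it more than it moves MAPE.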

Exploratory Data Analysis (EDA):

We combined all the files together and formed a single dataframe.

We then checked for NaN values and duplicates, but there weren’t any.

We computed basic statistics on the data.

So, in total, we had 480189 users and 17770 movies in the dataset.

We split the data into train and test sets in an 80:20 ratio.
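The article does not say how the split was made. The sketch below assumes a chronological split (oldest 80% of ratings for training), which is one reasonable option because every rating carries a date. The `split_80_20` helper and the row layout are hypothetical.

```python
def split_80_20(rows, train_fraction=0.8):
    """Split rows into (train, test) after sorting by date.

    Dates in YYYY-MM-DD format sort correctly as plain strings.
    """
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Ten toy ratings with shuffled dates, made up for illustration.
rows = [
    {"user": 1, "movie": 10, "rating": 4, "date": f"2005-01-{day:02d}"}
    for day in (7, 2, 9, 1, 5, 10, 3, 8, 6, 4)
]
train_rows, test_rows = split_80_20(rows)
print(len(train_rows), len(test_rows))
```

A random split would look the same except that the rows are shuffled instead of sorted. With a chronological split, users and movies that first appear late in the data end up only in the test set.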

On the train data, we looked at the distribution of ratings:

Some movies are rated by a very large number of users, while most movies have only a few hundred ratings.

As the data is very sparse, we created train and test sparse matrices to save space.
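The article does not name the sparse format used. A common choice is compressed sparse row (CSR), for example via `scipy.sparse.csr_matrix`. The sketch below builds the three CSR arrays by hand to show why it saves space: only the observed ratings are stored. `build_csr` and `get_rating` are illustrative helpers, not library functions.

```python
def build_csr(triples, n_rows):
    """Build CSR arrays (data, indices, indptr) from (row, col, value) triples."""
    ordered = sorted(triples)                 # sort by row, then by column
    data = [value for _, _, value in ordered]
    indices = [col for _, col, _ in ordered]
    indptr = [0] * (n_rows + 1)
    for row, _, _ in ordered:
        indptr[row + 1] += 1                  # count stored entries per row
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]            # running total gives row boundaries
    return data, indices, indptr

def get_rating(csr, row, col):
    """Look up one entry; entries that are not stored count as 0 (no rating)."""
    data, indices, indptr = csr
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0

# (user index, movie index, rating) triples, made up for illustration.
ratings = [(0, 2, 5), (2, 0, 3), (0, 0, 4), (2, 3, 1)]
csr = build_csr(ratings, n_rows=3)
print(csr)
```

Here a 3 x 4 user-movie matrix with 12 cells is stored using only 4 ratings plus index arrays. For a matrix with 480189 users and 17770 movies, storing every cell densely would be impractical.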

Finding global average of all movie ratings:

Cold Start problem:

The cold start problem occurs when the system is not able to recommend items to users. Every recommender system needs to build a user profile that captures the user's preferences and likes. The profile is built from the user's activities and behavior in the system. Based on this history, the system makes decisions and recommends items accordingly. The problem arises when a new user or a new item enters the system: for such users or items, the system does not have enough information to make a decision. For example, if a new user has not yet rated or viewed enough items, it is difficult for the system to build a model for that user. The cold start problem arises in three situations: for new users, for new items, and for new communities.

How to tackle the cold start problem?

There are multiple ways to handle the cold start problem:

1. Use content-based features of the movies and recommend movies similar to the 2 to 3 movies the new user has already watched.

2. Initially recommend the globally top movies to a new user.

3. Show movies that have recently become popular in the region the user's IP address points to.
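The second option above can be sketched as a simple fallback: known users get personalized recommendations, new users get the globally top-rated movies. This is a minimal sketch, not the article's implementation. `top_movies`, `recommend`, the `personalised` callable and the `min_ratings` threshold are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

def top_movies(ratings, k=2, min_ratings=2):
    """Rank movies by mean rating, ignoring movies with fewer than min_ratings ratings."""
    by_movie = defaultdict(list)
    for _user, movie, rating in ratings:
        by_movie[movie].append(rating)
    means = {m: sum(v) / len(v) for m, v in by_movie.items() if len(v) >= min_ratings}
    # Highest mean first; ties are broken by movie id for a stable order.
    return sorted(means, key=lambda m: (-means[m], m))[:k]

def recommend(user, ratings, personalised, k=2):
    """Use the personalised recommender for known users, global top movies otherwise."""
    known_users = {u for u, _m, _r in ratings}
    if user in known_users:
        return personalised(user, k)
    return top_movies(ratings, k)

# (user, movie, rating) triples, made up for illustration.
ratings = [
    ("u1", "A", 5), ("u2", "A", 4), ("u3", "A", 5),
    ("u1", "B", 3), ("u2", "B", 2),
    ("u3", "C", 5),
]
print(recommend("new_user", ratings, personalised=lambda user, k: ["C"]))
```

The `min_ratings` threshold keeps a movie with a single 5-star rating from outranking widely rated movies.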

Here, 15% of the users do not appear in the train data.

And 1.96% of the movies are new in the test data.
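Percentages like these can be computed with plain set operations. The sketch below assumes the share is taken over all distinct ids in the dataset (train and test together). The article does not state the denominator, so treat that as an assumption. `cold_start_share` is a hypothetical helper.

```python
def cold_start_share(train_ids, test_ids):
    """Percentage of all distinct ids (train or test) that never occur in train."""
    train_set, test_set = set(train_ids), set(test_ids)
    all_ids = train_set | test_set
    new_ids = test_set - train_set
    return 100.0 * len(new_ids) / len(all_ids)

# Toy user ids, made up for illustration: user 5 appears only in test.
train_users = [1, 2, 3, 3, 4]
test_users = [3, 4, 5]
print(cold_start_share(train_users, test_users))
```

The same function works for movie ids.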

Movie-movie similarity matrix:
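The article does not state which similarity measure was used. Cosine similarity between the rating vectors of two movies is a common choice and is assumed in this minimal sketch. The function names and toy data are illustrative only.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two movies given {user: rating} dicts.

    A user who did not rate a movie contributes 0 for that movie.
    """
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(movie_ratings):
    """Nested dict sim[a][b] with the similarity of every pair of movies."""
    movies = sorted(movie_ratings)
    return {
        a: {b: cosine_similarity(movie_ratings[a], movie_ratings[b]) for b in movies}
        for a in movies
    }

# Toy data, made up for illustration: A and B are rated identically, C shares no users.
movie_ratings = {
    "A": {"u1": 4, "u2": 3},
    "B": {"u1": 4, "u2": 3},
    "C": {"u3": 5},
}
sim = similarity_matrix(movie_ratings)
print(sim["A"]["B"], sim["A"]["C"])
```

For 17770 movies the full matrix has about 316 million entries, so in practice one would usually keep only the top few similar movies per movie.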

Features in our dataset:

  • GAvg : Average rating of all the ratings
  • Similar users' ratings of this movie: sur1, sur2, sur3, sur4, sur5 (ratings from the top 5 similar users who rated this movie)
  • Similar movies rated by this user: smr1, smr2, smr3, smr4, smr5 (ratings of the top 5 similar movies rated by this user)
  • UAvg : User’s Average rating
  • MAvg : Average rating of this movie
  • rating : Rating of this movie by this user.
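The three average features in the list above (GAvg, UAvg, MAvg) can be sketched as follows. This assumes they are computed from (user, movie, rating) triples, presumably from the train data only. `average_features` is a hypothetical helper and the ratings are made up.

```python
from collections import defaultdict

def average_features(ratings):
    """Return (GAvg, UAvg per user, MAvg per movie) from (user, movie, rating) triples."""
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating)
        by_movie[movie].append(rating)
    g_avg = sum(r for _, _, r in ratings) / len(ratings)
    u_avg = {u: sum(v) / len(v) for u, v in by_user.items()}
    m_avg = {m: sum(v) / len(v) for m, v in by_movie.items()}
    return g_avg, u_avg, m_avg

# Toy ratings, made up for illustration.
ratings = [("u1", "A", 4), ("u1", "B", 2), ("u2", "A", 5), ("u2", "C", 1)]
g_avg, u_avg, m_avg = average_features(ratings)
print(g_avg, u_avg, m_avg)
```

Together with the five `sur` and five `smr` columns, these three averages make up the 13 input features, and `rating` is the target.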

Using the Surprise library for Python:

More about Surprise can be found in its documentation.

We picked SVD and SVD++ as matrix factorization models, and XGBoost to build a single model using the features above.

Training starts by creating an XGBoost model with 13 features.

Model 1: XGBoost with 13 features:

TEST DATA for Xgboost with 13 Features:
------------------------------
RMSE : 1.0761851474385373
MAPE : 34.504887593204884

Model 2: Surprise baseline model

What are baseline models?

A baseline model is a simple model that we use as a “starting point” and compare the rest of the models against. It is important to try out various models and see which ones work well on this dataset, as the cost of experimenting is not very high and we cannot be fully certain which model will work best.
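A typical baseline for rating prediction estimates a rating as the global mean plus a user bias plus an item bias. The sketch below is a simplified version that uses plain averages of deviations. Surprise's own baseline algorithm fits regularized biases instead, so this is an illustration of the idea, not a reimplementation. `fit_baseline` and `predict_baseline` are hypothetical names.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Estimate the global mean, item biases and user biases from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    # Item bias: average deviation of the item's ratings from the global mean.
    item_dev = defaultdict(list)
    for _user, item, r in ratings:
        item_dev[item].append(r - mu)
    b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}
    # User bias: average of what is left after removing the mean and the item bias.
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        user_dev[user].append(r - mu - b_i[item])
    b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, b_u, b_i

def predict_baseline(model, user, item):
    """Predict mu + b_u + b_i; unknown users or items get a bias of 0."""
    mu, b_u, b_i = model
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Toy ratings, made up for illustration.
ratings = [("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4), ("u2", "B", 2)]
model = fit_baseline(ratings)
print(predict_baseline(model, "u1", "A"), predict_baseline(model, "new_user", "A"))
```

Falling back to a bias of 0 means a brand-new user simply gets the global mean adjusted by the item bias, which ties in with the cold start discussion.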

Test Data Surprise baseline model:
---------------
RMSE : 1.0730330260516174

MAPE : 35.04995544572911

Model 3: SVD Matrix Factorization User Movie interactions

Let $R$ be a rating matrix containing the ratings of $m$ users for $n$ items. Each matrix element $r_{ui}$ refers to the rating of user $u$ for item $i$. Given a lower dimension $d$, MF factorizes the raw matrix $R$ into two latent factor matrices: one is the user-factor matrix $P$ and the other is the item-factor matrix $Q$. The factorization is done such that $R$ is approximated as the inner product of $P$ and $Q$ (i.e., $R \approx P Q^T$), and each observed rating $r_{ui}$ is approximated by $\hat{r}_{ui} = q_i^T p_u$ (also called the predicted value).

However, $q_i^T p_u$ only captures the relationship between the user $u$ and the item $i$. In the real world, the observed rating may be affected by the preference of the user or the characteristics of the item. In other words, the relationship between the user and the item can be replaced by the bias information. For instance, suppose one wants to predict the rating of the movie “Batman” by the user “Tom.” Now, the average rating of all movies on one website is 3.5, and Tom tends to give a rating that is 0.3 lower than the average because he is a critical man. The movie “Batman” is better than the average movie, so it tends to be rated 0.2 above the average. Therefore, considering the user and movie bias information, by performing the calculation $3.5 - 0.3 + 0.2$, it is predicted that Tom will give the movie “Batman” a rating of 3.4. The user and item bias information can reflect the truth of the rating more objectively.

SVD is a typical factorization technology (known as a baseline predictor in some works in the literature). Thus, the predicted rating is changed to $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$, where $\mu$ is the overall average rating and $b_u$ and $b_i$ indicate the observed deviations of user $u$ and item $i$, respectively.
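The biased prediction rule and one step of learning it can be sketched in a few lines. This is a minimal sketch of SGD-trained matrix factorization with biases, not the code used in the article. The learning rate `lr` and regularization `reg` values are assumptions chosen for illustration, and the function names are hypothetical.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix factorization prediction: mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient step on a single observed rating.

    Returns updated (b_u, b_i, p_u, q_i). Both factor updates use the
    old values of p_u and q_i, so the update is simultaneous.
    """
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Tom and "Batman" from the text, with the latent factors set to zero,
# so only the global mean and the two biases contribute.
tom_batman = predict(mu=3.5, b_u=-0.3, b_i=0.2, p_u=[0.0, 0.0], q_i=[0.0, 0.0])
print(round(tom_batman, 2))
```

Training repeats `sgd_step` over all observed ratings for several passes. Each step nudges the biases and factors in the direction that shrinks the error on that rating, while `reg` keeps the parameters from growing too large.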

Test Data
---------------
RMSE : 1.0726046873826458

MAPE : 35.01953535988152

Model 4: SVD Matrix Factorization with implicit feedback from user ( user rated movies )

The SVD++ model introduces implicit feedback information on top of SVD; that is, it adds a factor vector ($y_j$) for each item, and these item factors are used to describe the characteristics of the item, regardless of whether it has been evaluated. Then, the user’s factor matrix is modeled, so that a better user bias can be obtained.
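In the usual SVD++ formulation, the user vector is extended with the normalized sum of the implicit factor vectors of all items the user has rated, and the rest of the prediction stays the same as in SVD. A minimal sketch of that prediction rule follows. The function name and toy numbers are illustrative, not the article's code.

```python
import math

def predict_svdpp(mu, b_u, b_i, p_u, q_i, y_rated):
    """SVD++ prediction.

    y_rated holds one implicit factor vector y_j for each item the user has
    rated. Their sum, scaled by 1 / sqrt(number of rated items), is added to
    the explicit user vector p_u before taking the dot product with q_i.
    """
    implicit = [0.0] * len(p_u)
    if y_rated:
        scale = 1.0 / math.sqrt(len(y_rated))
        implicit = [scale * sum(y[f] for y in y_rated) for f in range(len(p_u))]
    user_vector = [p + z for p, z in zip(p_u, implicit)]
    return mu + b_u + b_i + sum(q * v for q, v in zip(q_i, user_vector))

# Toy numbers, made up for illustration: a user who has rated four items.
y_rated = [[0.2, 0.0]] * 4
print(predict_svdpp(3.5, 0.0, 0.0, [0.1, 0.1], [1.0, 2.0], y_rated))
```

With an empty `y_rated` the function reduces to the plain SVD prediction, which makes the relationship between the two models easy to see.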

Test Data
---------------
RMSE : 1.0728491944183447

MAPE : 35.03817913919887

Comparison between all the models (test RMSE):

  1. XGBoost with 13 features: 1.0761851474385373
  2. Surprise baseline model: 1.0730330260516174
  3. SVD Matrix Factorization User Movie interactions: 1.0726046873826458
  4. SVD Matrix Factorization with implicit feedback from user ( user rated movies ): 1.0728491944183447

So, the SVD matrix factorization model (Model 3) produced the lowest RMSE.

Hence, we will use this model to predict ratings for the test data.
