MOVIE RECOMMENDER SYSTEM USING CONTENT-BASED AND COLLABORATIVE FILTERING

Shruti Bendale

Published in Analytics Vidhya, Mar 26, 2020 (5 min read)

Created a movie recommender system using collaborative filtering and content-based filtering approaches.
Compared the results of all the approaches by calculating the RMSE values.

INTRODUCTION

Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyse patterns of user interest in products to provide personalized recommendations that suit a user’s taste.

RECOMMENDER SYSTEM APPROACHES

Machine learning algorithms in recommender systems are typically classified into two categories:

Collaborative filtering
Content-based filtering

COLLABORATIVE FILTERING:

This approach is based on the past interactions between users and the target items.
The input to a collaborative filtering system will be all historical data of user interactions with target items.
This data is typically stored in a matrix where the rows are the users, and the columns are the items.

Source: https://medium.com/@CVxTz/movie-recommender-model-in-keras-e1d015a0f513
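As a small illustration of the user-item matrix described above, the sketch below pivots a few made-up ratings into a matrix with users as rows and items as columns. The column names `userId`, `movieId` and `rating` follow the MovieLens ratings file; the values themselves are invented for the example.

```python
import pandas as pd

# Toy ratings in long format (one row per user-movie interaction).
# The numbers are made up for illustration only.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Rows are users, columns are items; pairs with no interaction stay NaN.
user_item = ratings.pivot(index="userId", columns="movieId", values="rating")
print(user_item)
```

Most entries of a real user-item matrix are missing, which is why collaborative filtering models try to predict the empty cells rather than store the matrix densely.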

CONTENT-BASED FILTERING:

Works with data that the user provides, either explicitly (rating) or implicitly (clicking on a link).
Based on that data, a user profile is generated, which is then used to make suggestions to the user.
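One common way to turn this idea into code is sketched below, under assumptions that are not from this article: each movie is described by a hypothetical genre vector, the user profile is the rating-weighted average of the vectors of the movies the user rated, and candidates are scored by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical item features: one row per movie, one column per genre
# (Adventure, Comedy, Horror). Values are invented for illustration.
item_features = np.array([
    [1.0, 1.0, 0.0],   # movie 0: adventure comedy
    [1.0, 0.0, 0.0],   # movie 1: adventure
    [0.0, 0.0, 1.0],   # movie 2: horror
    [0.0, 1.0, 0.0],   # movie 3: comedy
])

def build_user_profile(rated_items, ratings, features):
    """Rating-weighted average of the feature vectors of the rated items."""
    weights = np.asarray(ratings, dtype=float)
    return weights @ features[rated_items] / weights.sum()

def score_items(profile, features):
    """Cosine similarity between the user profile and every item."""
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    return features @ profile / norms

# Explicit feedback: the user rated movie 0 highly and movie 2 poorly.
profile = build_user_profile([0, 2], [5.0, 1.0], item_features)
scores = score_items(profile, item_features)
ranking = np.argsort(-scores)  # indices of the best-matching movies first
print(ranking, scores.round(3))
```

In practice the movies the user has already rated would be filtered out before the ranking is shown.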

MOVIELENS DATASET:

20 million ratings
465,000 tag applications applied to 27,000 movies by 138,000 users

PERFORMANCE METRICS:

Mean Absolute Error: measures the average deviation (error) in the predicted rating vs. the true rating.
Root Mean Squared Error: punishes big mistakes more severely.
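Both metrics can be computed in a few lines. The sketch below uses NumPy and a handful of invented ratings; it shows how a single large error (the last prediction is off by 2 stars) pulls RMSE above MAE.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented example ratings; the errors are 0.5, 0, 1 and -2.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

print("MAE: ", mae(true_ratings, predicted))
print("RMSE:", rmse(true_ratings, predicted))
```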

Recommender using Collaborative Filtering

Approach 1: Stochastic Gradient Descent as a learning algorithm

We optimized the model with stochastic gradient descent, looping through all ratings in the training set.
For each rating, the system predicts the rating of user u for item i (r_ui) and computes the associated prediction error, i.e. the difference between the true rating and the prediction.

The parameters are then modified by a magnitude proportional to the learning rate γ in the opposite direction of the gradient.
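Assuming the standard regularized matrix-factorization model, with a latent factor vector p_u for each user, q_i for each item, and a regularization strength λ (these symbols are assumptions, as they are not named in this article), the prediction error and the SGD updates can be written as:

```latex
\begin{aligned}
e_{ui} &= r_{ui} - q_i^{\top} p_u \\
q_i &\leftarrow q_i + \gamma \left( e_{ui} \, p_u - \lambda \, q_i \right) \\
p_u &\leftarrow p_u + \gamma \left( e_{ui} \, q_i - \lambda \, p_u \right)
\end{aligned}
```

Here γ is the learning rate: a larger γ takes bigger steps per rating, while λ shrinks the factors towards zero to limit overfitting.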

RMSE after training for 100 epochs: 0.983
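The training loop for this approach can be sketched in NumPy as follows. This is a minimal illustration on invented data, not the code that produced the RMSE above; the factor count, learning rate `lr`, regularization `reg` and the toy ratings are all assumptions.

```python
import numpy as np

def train_mf_sgd(ratings, n_users, n_items, n_factors=8,
                 lr=0.01, reg=0.02, n_epochs=100, seed=0):
    """Matrix factorization trained with SGD.

    `ratings` is a list of (user_index, item_index, rating) tuples.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        # Visit every training rating once per epoch, in random order.
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - Q[i] @ P[u]          # prediction error for this rating
            p_old = P[u].copy()            # keep the old value for the Q update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """RMSE of the factorization on a list of (user, item, rating) tuples."""
    errors = [r - Q[i] @ P[u] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Invented toy data: 3 users, 3 items, 6 observed ratings.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P, Q = train_mf_sgd(toy_ratings, n_users=3, n_items=3, lr=0.05)
print("training RMSE:", rmse(toy_ratings, P, Q))
```

The RMSE printed here is measured on the training ratings of the toy data, so it is not comparable to a held-out test RMSE.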

Approach 2: Deep Learning as a learning algorithm

1. Pre-processing the data: This includes looking at the shape of the data and removing inconsistencies and garbage values.
2. Splitting the data into train and test: We made an 80–20 split using the ‘train_test_split’ function from ‘sklearn.model_selection’.
3. Defining the Model:

Creating separate embeddings for the Movies matrix and the Users matrix.
Merging the embedded matrices using a dot product.
Experimenting with various deep learning layers, such as Dense layers, Batch Normalization and Dropout, after the dot product of the matrices (in the combination that gives the best results).

4. Deciding parameters such as hidden layers, Dropout, Learning Rate, Optimizers, Epochs.

5. Training the model

6. Using the trained model for predictions
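The model definition in step 3 might look roughly like the Keras sketch below. It is an approximation under stated assumptions rather than the original code: the embedding size, the ReLU activation and the MSE loss are guesses, while the single 50-unit hidden layer, 20% dropout, Adam optimizer and 0.001 learning rate are taken from the experiments reported in this article.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_recommender(n_users, n_movies, embedding_dim=32,
                      hidden_units=50, dropout_rate=0.2, learning_rate=0.001):
    """Embedding + dot-product recommender with one hidden layer."""
    user_in = keras.Input(shape=(1,), dtype="int32", name="user_id")
    movie_in = keras.Input(shape=(1,), dtype="int32", name="movie_id")

    # Separate embeddings for users and movies, flattened to (batch, dim).
    user_vec = layers.Flatten()(layers.Embedding(n_users, embedding_dim)(user_in))
    movie_vec = layers.Flatten()(layers.Embedding(n_movies, embedding_dim)(movie_in))

    # Merge the two embeddings with a dot product, then add dense layers.
    x = layers.Dot(axes=1)([user_vec, movie_vec])
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    rating = layers.Dense(1)(x)

    model = keras.Model([user_in, movie_in], rating)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError()])
    return model

# Tiny untrained model, only to show the expected input and output shapes.
model = build_recommender(n_users=5, n_movies=7, embedding_dim=4)
preds = model.predict([np.array([[0], [1]]), np.array([[2], [3]])], verbose=0)
print(preds.shape)
```

Training would then call `model.fit` on arrays of user indices, movie indices and ratings from the training split.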

Experiments

We experimented with different numbers of hidden layers, dropout rates, optimizers and learning rates.
We found that the combination of 1 hidden layer with 50 nodes, 20% dropout, the Adam optimizer, a learning rate of 0.001 and 100 epochs gives the best root mean squared error (0.81).

Results

We used matrix factorization and Keras layers to train a deep learning model for our recommendation system.
Once the model is trained, the system can show the Top N Recommended movies for an input userID.
In the following screenshot, we get the top 15 recommended movies for the userID ‘1’.
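Producing a Top N list from a trained model amounts to scoring every movie the user has not rated yet and keeping the highest scores. The sketch below shows that step in isolation; `predict_fn` is a hypothetical stand-in for the trained model, and the scores are invented.

```python
import numpy as np

def top_n_for_user(predict_fn, user_id, n_movies, already_rated, n=15):
    """Return the n highest-scoring unseen movies as (movie_id, score) pairs.

    `predict_fn(user_id, movie_ids)` must return one predicted rating per
    movie id; here it stands in for a trained recommender.
    """
    candidates = np.array([m for m in range(n_movies) if m not in already_rated])
    scores = np.asarray(predict_fn(user_id, candidates), dtype=float)
    order = np.argsort(-scores)[:n]  # positions of the best scores first
    return [(int(candidates[k]), float(scores[k])) for k in order]

# Invented predicted ratings for 6 movies, used in place of a real model.
fake_scores = np.array([3.0, 4.5, 1.0, 5.0, 2.5, 4.0])

def fake_predict(user_id, movie_ids):
    return fake_scores[movie_ids]

# The user has already rated movie 3, so it is excluded from the list.
recommendations = top_n_for_user(fake_predict, user_id=1, n_movies=6,
                                 already_rated={3}, n=3)
print(recommendations)
```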

Recommender using Content-based Filtering

High-level flow

1. Pre-processing the data: This includes looking at the shape of the data and removing inconsistencies, garbage values, etc.
2. Splitting the data into train and test: We made an 80–20 split using the ‘train_test_split’ function from ‘sklearn.model_selection’.
3. Defining the Model:

We use the MovieLens tag data, where each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
We compute the TF-IDF matrix of this data using the TfidfVectorizer class from ‘sklearn.feature_extraction.text’.
We define an AutoEncoder model which gives an encoding of size 100, with intermediate layers of size 1000.

4. Deciding parameters such as hidden layers, Dropout, Learning Rate, Optimizers, Epochs.

5. Training the Model

6. Using the trained model for predictions
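The flow above can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: the input is taken to be one string of tags per movie, the tag strings are invented, and the activations, loss and epoch count are guesses. Only the layer sizes (1000 and 100) and the use of TfidfVectorizer come from the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical input: one string of user-applied tags per movie.
movie_tags = [
    "pixar animation toys friendship",
    "spy action james bond",
    "animation family comedy",
    "action thriller espionage",
]

# One TF-IDF row per movie, one column per distinct tag word.
tfidf = TfidfVectorizer().fit_transform(movie_tags).toarray()

def build_autoencoder(input_dim, hidden_dim=1000, encoding_dim=100):
    """Symmetric autoencoder; returns the full model and its encoder half."""
    inputs = keras.Input(shape=(input_dim,))
    h = layers.Dense(hidden_dim, activation="relu")(inputs)
    encoded = layers.Dense(encoding_dim, activation="relu", name="encoding")(h)
    h = layers.Dense(hidden_dim, activation="relu")(encoded)
    outputs = layers.Dense(input_dim, activation="sigmoid")(h)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder(tfidf.shape[1])
# The autoencoder learns to reconstruct its own input.
autoencoder.fit(tfidf, tfidf, epochs=5, verbose=0)
embeddings = encoder.predict(tfidf, verbose=0)
print(tfidf.shape, embeddings.shape)
```

The `encoder` model shares its layers with `autoencoder`, so after training it maps each TF-IDF row to a 100-dimensional embedding.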

Methodology

We train an Autoencoder using the TF-IDF values obtained from user rating data.
After the Autoencoder converges, we obtain the encoded embeddings from the Encoder half of the model and compute the cosine similarity between the embeddings.
We then query this cosine similarity matrix with a movie ID from the dataset and return the 20 most similar movies, based on the ratings alone.
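The similarity query in the last step can be illustrated with NumPy. The embeddings below are invented two-dimensional vectors standing in for the encoder output, and `most_similar` is a hypothetical helper name.

```python
import numpy as np

def most_similar(embeddings, query_index, k=20):
    """Indices and cosine similarities of the k rows closest to the query row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # guard against zero vectors
    similarity = unit @ unit.T                    # full cosine similarity matrix
    scores = similarity[query_index].copy()
    scores[query_index] = -np.inf                 # never return the query itself
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Invented embeddings for 4 movies; row 0 plays the role of the query movie.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.7, 0.7],
])

neighbours, sims = most_similar(toy_embeddings, query_index=0, k=2)
print(neighbours, sims.round(3))
```

For a catalogue of 27,000 movies the full similarity matrix is large, so computing only the query row (`unit @ unit[query_index]`) is a reasonable alternative.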

Observations

We evaluate the top 20 movies returned for the movies Toy Story and GoldenEye.
We notice that a considerable number of movies adhere to Toy Story’s Adventure, Comedy and Fantasy genres.
However, the model does not return other Disney/Pixar/animated movies, or sequels and prequels of Toy Story, which could safely be deemed similar to it.
We also notice some outliers, such as horror movies and R-rated movies, that are not pertinent to our query.

Recommendations using content-based filtering

Comparisons and conclusions

The adjoining table compares our results with the benchmark results for the MovieLens dataset published by the developers of the Surprise library (a Python scikit for recommender systems).
We can see that the deep learning algorithm performs better than the other algorithms (RMSE of 0.81, compared with 0.983 for our SGD approach), but it takes a long time to train.
The deep learning algorithm is also scalable to a larger dataset without affecting the RMSE value.