Building a Movie Recommender System

Table of contents

ThankGod Onmeje
5 min read · Aug 31, 2022

Introduction

Types of recommender system

I. Content-based

II. Popularity-based

III. Collaborative filtering (CF)

Building a movie recommender system

Repo file structure

Dataset

Model performance

Deployment

Limitation

Conclusion

References

Introduction

Technology is evolving quickly, yet many everyday tasks still take a lot of time to complete accurately. Automating such tasks has helped industries grow across many fields. One example of this automation is the recommender system. Recommender systems are now built into many of the applications and platforms we use, often combining several kinds of data and models. For example, while browsing Instagram we see ads for items we recently looked up on Google, and YouTube and Netflix automatically suggest videos and movies to users based on their past preferences, giving them a wide range of relevant options to choose from.

Types of recommender system

Recommender systems come in several types, including:

● Content-based

● Popularity-based

● Collaborative filtering

● Hybrid

Figure: Types of recommender systems

Content-based

Content-based recommender systems were among the earliest recommender systems. They predict ratings from the content of a product: items whose content or features are similar to those the user has already liked are recommended. This method does well at recommending items that suit a user when historical data is available, but it has shortcomings. With no history for a new user, it runs into the "cold-start" problem. It also suffers from sparsity, and it tends to create a filter bubble by constantly recommending items like those the user has already consumed.
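As a minimal sketch of the content-based idea, the snippet below ranks movies by how much their genre tags overlap with a movie the user already liked. The titles, the genre tags, and the `recommend_similar` helper are all invented for illustration; the project described in this post does not include a content-based model.

```python
# Toy catalogue: every title and genre tag here is invented for illustration.
MOVIES = {
    "Space Quest": {"sci-fi", "adventure"},
    "Galaxy Run": {"sci-fi", "action", "adventure"},
    "Love in Paris": {"romance", "drama"},
    "Robot Hearts": {"sci-fi", "romance"},
}


def jaccard(a, b):
    """Share of tags two movies have in common (0 = none, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_similar(liked_title, movies, n=2):
    """Rank the other movies by tag overlap with a movie the user liked."""
    liked_tags = movies[liked_title]
    scores = [
        (title, jaccard(liked_tags, tags))
        for title, tags in movies.items()
        if title != liked_title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


print(recommend_similar("Space Quest", MOVIES))
```

Note that the function needs at least one liked movie to start from, which is exactly where the cold-start problem shows up for a brand-new user.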

Popularity-based

Here the principle of popularity is used: the system finds the most popular products or content and recommends them to the user. An example is YouTube's trending section, where trending videos are recommended to users. Its advantage over content-based recommenders is that it does not need the user's historical data. One major drawback of this approach is that the user's individual preferences are not considered, so there is a high chance that the user will not like the trending product.
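A popularity-based recommender can be sketched in a few lines. The version below ranks movies by how many ratings they received, using the mean rating only to break ties. The rating triples, the `min_ratings` threshold, and the `most_popular` helper are assumptions made for this example.

```python
from collections import defaultdict

# Toy (user, movie, rating) triples, invented for illustration.
RATINGS = [
    ("u1", "Movie A", 5), ("u2", "Movie A", 4), ("u3", "Movie A", 4),
    ("u1", "Movie B", 5),
    ("u2", "Movie C", 3), ("u3", "Movie C", 4),
]


def most_popular(ratings, n=2, min_ratings=2):
    """Rank movies by number of ratings, breaking ties by mean rating."""
    totals = defaultdict(lambda: [0, 0.0])  # movie -> [count, sum of ratings]
    for _user, movie, rating in ratings:
        totals[movie][0] += 1
        totals[movie][1] += rating
    ranked = [
        (movie, count, total / count)
        for movie, (count, total) in totals.items()
        if count >= min_ratings
    ]
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return [movie for movie, _count, _mean in ranked[:n]]


print(most_popular(RATINGS))
```

The function never looks at who is asking, so every user receives the same list. That is both its strength (no history needed) and its weakness (no personalization).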

Collaborative filtering (CF)

With this method, the preferences of similar users are considered. For example, if user A and user B both like movie X and user A also likes movie Y, we can recommend movie Y to user B. CF can be:

1. User-based: measures the degree of similarity between the target user and other users.

2. Item-based: measures the similarity between the items the user has rated or interacted with and other items.
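The user-based variant can be sketched as follows, using the A/B/X/Y example from above. The sketch assumes cosine similarity over co-rated movies and a similarity-weighted average as the prediction rule; other similarity measures and prediction rules are common, and the ratings themselves are invented.

```python
from math import sqrt

# Toy ratings (user -> {movie: rating}), invented for illustration.
RATINGS = {
    "A": {"X": 5, "Y": 4, "Z": 1},
    "B": {"X": 5, "Z": 1},
    "C": {"X": 1, "Y": 2, "Z": 5},
}


def cosine(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict_user_based(target, movie, ratings):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == target or movie not in their:
            continue
        sim = cosine(ratings[target], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None


# User B has not rated Y; the estimate leans on A, whose tastes match B's.
print(predict_user_based("B", "Y", RATINGS))
```

The item-based variant follows the same pattern with the roles swapped: similarities are computed between movies (over the users who rated both) instead of between users.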

Besides cold-start and scalability issues, CF also suffers from sparsity, which arises when a collection contains a large number of items and each user rates only a small part of it. This is a big issue because the system tends to favor mainstream items at the expense of less popular ones. These limitations led to the development of hybrid models.
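Sparsity is easy to quantify: it is the fraction of user-item pairs that have no rating. The sizes in the sketch below are hypothetical and are not taken from the project's dataset.

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


# Hypothetical sizes: 1,000 users, 500 movies, 10,000 ratings.
# Only 2% of the possible user-movie pairs are filled in.
print(sparsity(10_000, 1_000, 500))
```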

Hybrid Models/Systems

A hybrid system combines two or more of the aforementioned types in one recommendation system. A common combination is CF with content-based filtering, which yields a more comprehensive model with the properties of both approaches.
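One simple way to combine two approaches is a weighted blend of their scores. The sketch below assumes both scores are already scaled to the 0 to 1 range; the weight `alpha`, the scores, and the helper names are invented for illustration, and weighted blending is only one of several hybridization strategies.

```python
def hybrid_score(cf_score, content_score, alpha=0.7):
    """Weighted blend: alpha on the CF score, the rest on the content score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * cf_score + (1.0 - alpha) * content_score


def recommend_hybrid(cf_scores, content_scores, n=2, alpha=0.7):
    """Rank movies by blended score; a missing score counts as 0."""
    titles = set(cf_scores) | set(content_scores)
    blended = {
        t: hybrid_score(cf_scores.get(t, 0.0), content_scores.get(t, 0.0), alpha)
        for t in titles
    }
    return sorted(blended, key=blended.get, reverse=True)[:n]


# Scores invented for illustration, already scaled to the 0-1 range.
cf = {"Movie A": 0.9, "Movie B": 0.2}
content = {"Movie B": 0.8, "Movie C": 0.6}
print(recommend_hybrid(cf, content))
```

"Movie C" has no CF score at all (for instance, because nobody has rated it yet), but it can still be ranked through its content score. This is how a hybrid softens the cold-start problem.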

Building a movie recommendation system

In this project, the following Python libraries were used: pandas, NumPy, scikit-learn (sklearn), Matplotlib, FastAPI, and scikit-surprise. Two models were built: the first uses a collaborative filtering algorithm (Probabilistic Matrix Factorization from the Surprise library), and the second uses Pearson correlation computed with pandas.
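To give a feel for what matrix factorization does, here is a plain-Python sketch, not the project's Surprise-based code. Each user and each movie gets a short vector of latent factors, and the predicted rating is their dot product. The vectors are fitted by stochastic gradient descent on the squared error with L2 regularization. The toy ratings and all hyperparameters (`n_factors`, `n_epochs`, `lr`, `reg`) are assumptions chosen for this small example.

```python
import random
from math import sqrt

# Toy (user, movie, rating) triples, invented for illustration.
TOY = [
    ("u1", "A", 5), ("u1", "B", 4), ("u1", "C", 1),
    ("u2", "A", 4), ("u2", "B", 5), ("u2", "C", 1),
    ("u3", "A", 1), ("u3", "B", 2), ("u3", "C", 5),
]


def predict(p, q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(p[user], q[item]))


def rmse(p, q, ratings):
    """Root-mean-square error of the predictions on `ratings`."""
    squared = [(r - predict(p, q, u, i)) ** 2 for u, i, r in ratings]
    return sqrt(sum(squared) / len(squared))


def train_mf(ratings, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Fit factor vectors by SGD on squared error with L2 regularization."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every latent factor.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


p, q = train_mf(TOY)
print("training RMSE:", rmse(p, q, TOY))
```

On real data the error should be measured on held-out ratings rather than on the training set, and a library such as Surprise takes care of the bookkeeping.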

Repo file structure

Below is a snapshot of the repository's file structure.

Dataset

The datasets used for this project are available here. The Kaggle data consists of five datasets, among them the combined data.txt files and movies_titles.csv.

Recommendation models are implemented in two major ways in this project:

● Collaborative filtering: a technique that filters for items a user might like based on the reactions of similar users. It works by searching a large group of people for a smaller set of users whose tastes are similar to those of a particular user.

● Pearson's R correlation: measures the linear correlation between the review scores of all pairs of movies, then returns the 10 movies with the highest correlation to a given movie.

● Other ways of finding the similarity between two movies include cosine similarity, TensorFlow Recommenders, etc.
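The Pearson approach can be illustrated in plain Python. The project itself used pandas, where `DataFrame.corrwith` on a user-by-movie pivot table does a similar job. The sketch below correlates one movie's ratings with every other movie's ratings over the users who rated both, then returns the most correlated titles. The ratings, the `min_common` threshold, and the helper names are invented for this example.

```python
from math import sqrt

# Toy ratings (movie -> {user: rating}), invented for illustration.
MOVIE_RATINGS = {
    "Movie A": {"u1": 5, "u2": 4, "u3": 1, "u4": 2},
    "Movie B": {"u1": 4, "u2": 5, "u3": 2, "u4": 1},
    "Movie C": {"u1": 1, "u2": 2, "u3": 5, "u4": 4},
}


def pearson(x, y, min_common=3):
    """Pearson's r over users who rated both movies, or None if undefined."""
    common = sorted(set(x) & set(y))
    if len(common) < min_common:
        return None
    xs = [x[u] for u in common]
    ys = [y[u] for u in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant rating column has no defined correlation
    return cov / sqrt(var_x * var_y)


def most_correlated(title, movie_ratings, n=10):
    """Top-n other movies ranked by Pearson's r with `title`."""
    scored = []
    for other, their in movie_ratings.items():
        if other == title:
            continue
        r = pearson(movie_ratings[title], their)
        if r is not None:
            scored.append((other, r))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


print(most_correlated("Movie A", MOVIE_RATINGS))
```

Requiring a minimum number of common raters matters in practice: two movies rated by the same two users can show a perfect correlation by accident.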

Model performance

The model was cross-validated with a five-fold split; its performance is shown below.

Figure: Model performance
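For readers unfamiliar with five-fold cross-validation, the sketch below shows the mechanics: the data is split into five folds, each fold is held out once, and the error is measured on the held-out part. To keep it short, the "model" is a baseline that always predicts the mean training rating, and the ratings are invented. In Surprise, `cross_validate` from `surprise.model_selection` handles the splitting and scoring for a real algorithm.

```python
import random
from math import sqrt


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validate_mean_baseline(ratings, k=5, seed=0):
    """RMSE per fold for a baseline that always predicts the training mean."""
    folds = k_fold_indices(len(ratings), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [r for i, r in enumerate(ratings) if i not in held_out]
        test = [ratings[i] for i in test_idx]
        mean = sum(train) / len(train)
        scores.append(sqrt(sum((r - mean) ** 2 for r in test) / len(test)))
    return scores


# Invented ratings, only to make the sketch runnable.
RATINGS = [5, 4, 4, 3, 5, 2, 1, 4, 3, 5]
fold_scores = cross_validate_mean_baseline(RATINGS)
print("RMSE per fold:", fold_scores)
print("mean RMSE:", sum(fold_scores) / len(fold_scores))
```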

Deployment

The model was wrapped in an API using FastAPI. However, it has not been deployed, for the same reason that the training files were not included in the repository.

Limitation

The dataset is very large, and working with all of it can crash the system because of limited computational power. We believe this limitation can be overcome by using cloud services from Amazon, Azure, or Google. For the same reason, the model could not be deployed on Heroku.

Conclusion

In the future, we hope to explore other ways of overcoming this limitation, such as using Azure, Google's cloud services, and DVC to manage this kind of project. Thanks for your time; we hope you learned a thing or two from this post. We welcome your comments and recommendations.

References

The GitHub repo is linked here.
