Content-based Movie Recommender System

Ankit Raj
Analytics Vidhya


Abstract

A content-based movie recommender system built using cosine similarity scores.

Table of Contents

1. Executive Summary

2. Introduction

3. Algorithm

3.1. Content-Based filtering

3.2. Cosine Similarity

4. Objective

5. Methodology

5.1. Sample Database schema

5.2. Python-Oracle Database connection

5.3. Preparing the sample dataset

5.4. Identifying the highest rated movies and getting the best movie details

5.5. Merging key features and building a cosine similarity matrix

5.6. Generating Recommendations

5.7. Results

6. Conclusion

7. Project Repository

8. References

1. Executive Summary

Whenever we visit a digital platform, we no longer need to worry about what to watch next, because we are served a set of recommendations to choose from. But how exactly does the platform decide what to recommend to a specific user, and whether the user is going to like it?

In this project, we build a specific kind of recommendation system by extending the RELMDB Oracle database. The system identifies the best movie, based on the number of IMDb votes and the average IMDb rating, and suggests movies with similar content to the user. We start by importing and munging data from the Oracle database to create our base dataset. We then identify the best movie and use cosine similarity to recommend the top 10 similar movies to the user.

2. Introduction

A recommender system is a type of information filtering system that attempts to predict a user’s preferences and makes suggestions based on these preferences.

Recommender systems have become ubiquitous in our lives and are implemented in a wide range of online platforms. Whether we want to shop online, are having trouble deciding which song to listen to or what to watch next, need a new friend, or are looking for someone to date, recommendations are available for almost everything we need from an online platform. Even Google search results take a user’s preferences into account, so the results vary between users. These systems can often collect information about a user’s choices and use it to improve future suggestions. They consider factors such as popularity, the similarity between items, and even the similarity between different users’ choices. If Netflix, for instance, notices that the trailer of a newly launched Netflix original shared on Facebook is liked by some users, it can use this information to recommend the show to those users as soon as it is released on Netflix. Similarly, if Netflix notices that some users have reacted negatively to some content shared on Facebook, it can refrain from recommending such content to those users. Additionally, recommender systems can make recommendations based on the content of a show watched or liked by a user. For example, if Prime Video observes that a user has watched plenty of highly rated sci-fi movies, it will look for other sci-fi movies with similar content and ratings and suggest those movies to the user.

Though there are plenty of approaches to building a recommender system, this project focuses on one specific type of movie recommender: it takes the movie with the highest number of IMDb votes and the highest average IMDb rating, treats it as the best movie, and recommends the top 10 movies with similar content.

3. Algorithm

For our project, we focused on Content-based filtering for generating recommendations.

3.1. Content-Based Filtering

Fig 3.1. Content-Based Filtering

Content-based filtering doesn’t involve other users. Based on our own preferences, the algorithm simply picks items with similar content to generate recommendations for us.

This algorithm offers less diverse recommendations, but it works regardless of whether the user rates items or not.

For example, a user might enjoy sci-fi action movies without ever knowing it unless they decide to try one on their own. A content-based filter will not surface such movies; if the user has been watching superhero movies, it will keep recommending superhero movies or similar titles. We can calculate this similarity on many attributes, but in our case, we build the recommender system on the following key features:

· IMDb Vote

· Mean IMDb Rating

· Genre

· Director

· Cast Member and Cast Role

· Movie Year

· User-specific tags

Now, let’s turn to the term we keep mentioning, similarity, and what it means in our context. It might not seem like something we could quantify, but it can be measured. Before we proceed to the methodology we used to build our content-based recommender system, let’s briefly discuss cosine similarity, one of the metrics that can be used to calculate the similarity between users or contents.

3.2. Cosine Similarity

Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them. In our case, each movie is represented as a vector built from its film title and key features. Thus, to calculate the similarity between two movies, we only need the film title and key features of both movies, from which we compute the cosine of the angle between the two movie vectors.

The cosine similarity formula can be mathematically described as shown below.

Fig 3.2. Cosine Similarity formula

cos(θ) = (A · B) / (||A|| ||B||), where

A · B = dot product of the two movie vectors,

||A|| ||B|| = product of the magnitudes of the two movie vectors.

Fig 3.3. Movie vectors representation

As shown in Fig 3.3, the areas below the movie vectors A and B represent the contents of the movies, and the angle θ between them indicates how similar the contents are. Thus, the smaller the angle θ, the more similar the movie contents.

For the count vectors used here, whose components are never negative, the angle θ is bounded between 0° and 90°, so the cosine similarity ranges from 0 (no features in common) to 1 (vectors pointing in the same direction).
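
As a minimal sketch of the cosine similarity formula, the snippet below computes the score for small count vectors with NumPy. The vectors, the vocabulary in the comment, and the function name cosine_sim are made up for illustration and are not taken from the project code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)

# Hypothetical word counts over the vocabulary
# ["drama", "crime", "prison", "space"]
movie_a = [1, 1, 1, 0]
movie_b = [1, 1, 0, 0]
movie_c = [0, 0, 0, 1]

print(cosine_sim(movie_a, movie_b))  # A and B share two words
print(cosine_sim(movie_a, movie_c))  # A and C share no words
```

Movies that share more feature words point in closer directions and score nearer to 1, while movies with no words in common score 0.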

4. Objective

The proposed recommender system attempts to identify movies whose content is similar to that of the best movie, using a detailed movie dataset created by merging different tables imported from the RELMDB Oracle database. The objective is to recommend movies based on their similarity scores with the best-rated movie: the system identifies the 10 movies with the highest similarity scores and presents them as the top 10 recommendations.

5. Methodology

5.1. Sample Database schema

To build the content-based movie recommender system, we used an Oracle database to extend the RELMDB database, modifying existing tables and adding a few new ones as required. Below is the final database schema that we used to create a detailed dataset.

Fig 5.1. Movie Database Schema

5.2. Python-Oracle Database Connection

To import the database tables into Python, the cx_Oracle library is used to create a connection to the Oracle database.

Fig 5.2. Database Connection
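
Since the figure is not reproduced here, the following is only a rough sketch of what such a connection could look like, not the project's actual code. The function names connect_to_oracle and fetch_rows, the credentials, and the table name are all placeholders. cx_Oracle is imported inside the function so that the generic query helper can be tried without an Oracle client installed.

```python
def connect_to_oracle(user, password, host, port, service_name):
    """Open an Oracle connection with cx_Oracle (hypothetical helper)."""
    import cx_Oracle  # imported lazily; requires the Oracle client libraries
    dsn = cx_Oracle.makedsn(host, port, service_name=service_name)
    return cx_Oracle.connect(user=user, password=password, dsn=dsn)

def fetch_rows(connection, query):
    """Run a query on any DB-API 2.0 connection and return a list of dicts."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # cursor.description holds one tuple per column; item 0 is the name
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()

# Example call; every credential and the table name are placeholders:
# conn = connect_to_oracle("user", "password", "localhost", 1521, "orclpdb")
# movies = fetch_rows(conn, "SELECT * FROM movies")
```

The rows returned this way can be passed straight to pandas.DataFrame to continue with the data preparation step.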

5.3. Preparing the sample dataset

Pandas library functions are used to create the sample dataset.

Fig 5.3. Data Preparation
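
To illustrate the kind of pandas operations this step involves, here is a small sketch that merges three toy tables into one row per film. The table names, column names, and values are hypothetical stand-ins, since the real schema is only shown in Fig 5.1.

```python
import pandas as pd

# Hypothetical stand-ins for tables read from the database
movies = pd.DataFrame({
    "film_id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "film_year": [1994, 1972, 2008],
})
genres = pd.DataFrame({
    "film_id": [1, 1, 2, 3],
    "genre": ["Drama", "Crime", "Crime", "Action"],
})
directors = pd.DataFrame({
    "film_id": [1, 2],
    "director": ["Director X", "Director Y"],
})

# Collapse the one-to-many genre table to one row per film before merging
genres_per_film = genres.groupby("film_id")["genre"].agg(" ".join).reset_index()

# Left joins keep every film even when a feature table has no match
dataset = (
    movies
    .merge(genres_per_film, on="film_id", how="left")
    .merge(directors, on="film_id", how="left")
)

# Missing text features become empty strings so they can be concatenated later
text_columns = ["genre", "director"]
dataset[text_columns] = dataset[text_columns].fillna("")
print(dataset)
```

Collapsing one-to-many tables first avoids duplicating a film across several rows after the merge.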

5.4. Identifying the highest rated movies and getting the best movie details

After sorting the detailed movies dataset by the number of IMDb votes, the mean IMDb ratings are calculated for the movies, and the movie with the highest average IMDb rating is considered the best, or highest-rated, movie.

Fig 5.4. Identifying the best movie
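
One plausible reading of this step is sketched below with pandas. The column names and values are invented, and ranking by votes first and mean rating second is an assumption about how the two criteria are combined, not a confirmed detail of the project.

```python
import pandas as pd

# Hypothetical rating rows; a film may appear more than once
ratings = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "imdb_rating": [9.4, 9.2, 9.1, 9.3, 9.9],
    "imdb_votes": [2_000_000, 2_000_000, 1_500_000, 1_500_000, 1_000],
})

# One row per film: mean rating and vote count
summary = (
    ratings.groupby("title")
    .agg(mean_rating=("imdb_rating", "mean"), votes=("imdb_votes", "max"))
    .reset_index()
)

# Assumed ranking rule: most votes first, mean rating as the tie-breaker
ranked = summary.sort_values(["votes", "mean_rating"], ascending=False)
best_movie = ranked.iloc[0]["title"]
print(best_movie)
```

Under this rule a film with a very high rating but very few votes, such as Movie C here, does not outrank a widely voted film.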

5.5. Merging key features and building a cosine similarity matrix

All the key features of each movie are merged into a single text column, which a count vectorizer from the scikit-learn library converts into a 2D matrix of count vectors. The cosine similarity function of the scikit-learn library then takes this count matrix as input and produces the cosine similarity matrix.

Fig 5.5. Cosine Similarity Matrix
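
A minimal sketch of this step follows, assuming scikit-learn's CountVectorizer and cosine_similarity are the functions in question. The toy dataset, the column names, and the combined_features column are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical dataset with a few key features per movie
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["drama crime", "drama crime", "action scifi"],
    "director": ["director_x", "director_x", "director_z"],
    "film_year": [1994, 1995, 2008],
})

# Merge all key features into a single space-separated text column
feature_columns = ["genre", "director", "film_year"]
movies["combined_features"] = (
    movies[feature_columns].astype(str).apply(" ".join, axis=1)
)

# Each row of count_matrix is the count vector of one movie
count_matrix = CountVectorizer().fit_transform(movies["combined_features"])

# similarity[i, j] is the cosine similarity between movies i and j
similarity = cosine_similarity(count_matrix)
print(similarity.round(2))
```

Note that numeric features such as the year are treated as plain tokens here, so 1994 and 1995 count as different words rather than as close values.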

5.6. Generating Recommendations

The movies are then sorted by their cosine similarity scores with the best movie, and the top 10 are returned as the final recommendations.

Fig 5.6. Generate Recommendations
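
The ranking step could be sketched as follows. The recommend function, the titles, and the similarity values are hypothetical, and a precomputed similarity matrix is assumed as input.

```python
import numpy as np

def recommend(titles, similarity, best_title, top_n=10):
    """Return the top_n (title, score) pairs most similar to best_title."""
    best_index = titles.index(best_title)
    scores = similarity[best_index]
    # Sort indices by descending score; a stable sort keeps ties in input order
    order = np.argsort(-scores, kind="stable")
    # Skip the best movie itself, which always has similarity 1 with itself
    picks = [i for i in order if i != best_index][:top_n]
    return [(titles[i], float(scores[i])) for i in picks]

# Hypothetical titles and similarity matrix
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = np.array([
    [1.00, 0.75, 0.10, 0.40],
    [0.75, 1.00, 0.20, 0.30],
    [0.10, 0.20, 1.00, 0.50],
    [0.40, 0.30, 0.50, 1.00],
])

print(recommend(titles, similarity, "Movie A", top_n=2))
# [('Movie B', 0.75), ('Movie D', 0.4)]
```

Excluding the query movie matters because it would otherwise always appear first in its own recommendation list.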

5.7. Results

Fig 5.7. Recommendation Results

To evaluate the top 10 recommendations generated by our recommender system, we compared the recommended movies with the best-rated movie, “The Shawshank Redemption”, by running a SQL script that shows key features such as the IMDb ratings, IMDb votes, and film year.

Fig 5.8. Comparison between the recommended movies profiles

Looking at these key features, we can see that the recommended movies have a profile similar to that of the best-rated movie.

6. Conclusion

To conclude, a recommender system based on content-based filtering with cosine similarity can make relevant recommendations by suggesting movies that share key features such as IMDb votes, average IMDb rating, genre, release year, cast, directors, and user tags.

7. Project Repository

Please visit my GitHub repository.

https://github.com/AnkitRajSri/MovieRecommenderSystem

8. References

https://pdfs.semanticscholar.org/767e/ed55d61e3aba4e1d0e175d61f65ec0dd6c08.pdf

https://medium.com/@bkexcel2014/building-movie-recommender-systems-using-cosine-similarity-in-python-eff2d4e60d24

https://link.springer.com/chapter/10.1007/978-3-642-21793-7_63

https://www.kdnuggets.com/2019/04/building-recommender-system.html
