Simple Recommender System for Movies

Falana Olumide
5 min read · Sep 30, 2020

Objective:

To develop a simple algorithm that recommends movies from the MovieLens data set by finding the movies closest to a given movie in terms of genre and popularity (rating).

Concept:

Every movie in the data set comes with additional information, for instance the genres it belongs to, such as science fiction, comedy, drama, and action. We shall also look at the popularity of each movie, that is, the number of people who rated it and its average rating. We shall then combine each movie's genre and rating information into a measure of the distance between films, and use that distance to implement the recommendation system.
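
The article does not spell out the distance formula, so the following is only a minimal sketch of one way such a metric could look. It assumes each movie carries a 0/1 genre vector and a popularity score already scaled to the range 0 to 1; the function names, dictionary keys, and toy values are all hypothetical.

```python
import math

def genre_distance(a, b):
    """Cosine distance between two 0/1 genre vectors (0 means identical genres)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a movie with no genre flags is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def movie_distance(movie_a, movie_b):
    """Genre distance plus the gap in (already normalised) popularity."""
    genre_gap = genre_distance(movie_a["genres"], movie_b["genres"])
    popularity_gap = abs(movie_a["popularity"] - movie_b["popularity"])
    return genre_gap + popularity_gap

# Toy movies; genre order is [action, comedy, sci-fi, drama] (illustrative only).
movie_a = {"genres": [1, 0, 1, 0], "popularity": 1.0}
movie_b = {"genres": [1, 0, 1, 0], "popularity": 0.8}
movie_c = {"genres": [0, 1, 0, 0], "popularity": 0.9}

d_ab = movie_distance(movie_a, movie_b)
d_ac = movie_distance(movie_a, movie_c)
print(d_ab < d_ac)  # the same-genre movie is the closer neighbour
```

Summing the two gaps with equal weight is an arbitrary choice made for readability; a real system would probably tune the weighting.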

Scope:

It is important to note that fully developed and deployed recommendation systems are complex and resource-intensive, and they require a solid linear algebra background. In this project, by contrast, we shall create a simple, readable version based on ‘item similarity’.

Methodology:

Before we introduce the code, let us quickly discuss the types of recommender systems. The two most common types are:

Content-Based: focuses on the attributes of the items and gives recommendations based on the similarity between them.

Collaborative Filtering: gives recommendations based on knowledge of users’ attitudes towards items. The system recommends based on the wisdom of the crowd (in such cases, you would see something like ‘people who viewed movie A also viewed movies B and C’).

In the real world, Collaborative Filtering (CF) is more commonly used than content-based systems because it usually gives better results, and it is relatively easy to understand provided you have a linear algebra background.

In addition, a CF algorithm can perform feature learning on its own, that is, it can determine which features to use when recommending items. More broadly, CF is subdivided into Memory-based CF and Model-based CF. We shall consider Memory-based CF, computed with cosine similarity, in another publication. In this publication, however, we shall create a content-based recommender system for a data set of movies.

Please click here for the entire scripts.

Exploratory Data Analysis

The distribution of the movies’ ratings is shown in the figure below. There are peaks at the whole numbers (1, 2, 3, 4, and 5). This makes sense because that is how people actually rate movies. Most ratings, however, are distributed roughly normally around 3 to 3.5 stars. There are also outliers at 1 and 5 stars. A 1-star rating indicates that the viewers were not pleased with the movie, while at the far end there is a peak at 5 stars for movies that received top marks from everyone who rated them.
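
A summary table of this kind can be sketched with a pandas group-by. The snippet below is a minimal illustration on toy data, not the author's script: the column names `User_ID`, `Title`, `Rating`, and `Num_of_Ratings` are assumptions, and "Toy Movie (1990)" is a made-up title.

```python
import pandas as pd

# Toy stand-in for the MovieLens ratings table; column names are assumed.
df = pd.DataFrame({
    "User_ID": [1, 2, 3, 1, 2, 1],
    "Title": [
        "Star Wars (1977)", "Star Wars (1977)", "Star Wars (1977)",
        "Liar Liar (1997)", "Liar Liar (1997)",
        "Toy Movie (1990)",
    ],
    "Rating": [5, 4, 5, 3, 2, 5],
})

# One row per movie: its average rating and how many ratings it received.
ratings = (
    df.groupby("Title")["Rating"]
      .agg(["mean", "count"])
      .rename(columns={"mean": "Rating", "count": "Num_of_Ratings"})
)

def plot_rating_histogram(ratings, bins=70):
    """Histogram of average ratings (requires matplotlib to be installed)."""
    ax = ratings["Rating"].hist(bins=bins)
    ax.set_xlabel("Average rating")
    ax.set_ylabel("Number of movies")
    return ax

print(ratings)
```

Calling `plot_rating_histogram(ratings)` on the full data set should produce a histogram like the one described here; the bin count of 70 is an arbitrary default.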

We shall go ahead with another visualisation, a joint plot. We are interested in the relationship between the average rating and the number of ratings.

The figure below shows that, in general, the higher a movie’s rating, the more ratings it has, while badly rated movies (1.0) are watched fewer times. Movies averaging exactly 5 stars, however, were typically rated only once, so their perfect score reflects a single viewer rather than blockbuster status.

Tentatively, we can say that most of the interesting movies are rated between 3.5 and 4.5, and that movies in this range attracted the most views. This makes sense because the better the movie, the more likely people are to watch it.
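
A joint plot like the one discussed above could be drawn with seaborn's `jointplot`. The sketch below assumes a summary table with the hypothetical columns `Rating` and `Num_of_Ratings`, and the toy numbers are illustrative, not taken from MovieLens. It also adds a small numeric check of how many ratings stand behind each perfect score.

```python
import pandas as pd

# Toy summary table: one row per movie (values are made up for illustration).
ratings = pd.DataFrame(
    {
        "Rating": [4.4, 3.2, 5.0, 1.0],
        "Num_of_Ratings": [300, 200, 1, 2],
    },
    index=pd.Index(
        ["Star Wars (1977)", "Liar Liar (1997)",
         "Toy Movie A (1990)", "Toy Movie B (1991)"],
        name="Title",
    ),
)

def plot_rating_vs_count(ratings):
    """Scatter of average rating against number of ratings, with marginals."""
    import seaborn as sns  # imported here so the check below runs without seaborn
    return sns.jointplot(x="Rating", y="Num_of_Ratings", data=ratings)

# Numeric companion to the plot: how many ratings sit behind each 5.0 average?
perfect = ratings.loc[ratings["Rating"] == 5.0, "Num_of_Ratings"]
print(perfect)
```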

Creating a Recommender System

We shall use a pivot table to arrange the data in matrix form, with the movie titles as column headings.

Not every viewer rated every movie, so the matrix contains many missing values. We then checked some of the most-rated movies again.
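
The pivot step and the check of the most-rated movies might look roughly like this. It is a sketch on toy data; the column names `User_ID`, `Title`, and `Rating`, the variable name `moviemat`, and the title "Toy Movie (1990)" are assumptions rather than the author's actual identifiers.

```python
import pandas as pd

# Toy stand-in for the MovieLens ratings table; column names are assumed.
df = pd.DataFrame({
    "User_ID": [1, 2, 3, 1, 2, 1],
    "Title": [
        "Star Wars (1977)", "Star Wars (1977)", "Star Wars (1977)",
        "Liar Liar (1997)", "Liar Liar (1997)",
        "Toy Movie (1990)",
    ],
    "Rating": [5, 4, 5, 3, 2, 5],
})

# Rows are users, columns are movie titles, cells hold that user's rating.
# A user who never rated a movie gets NaN in that cell.
moviemat = df.pivot_table(index="User_ID", columns="Title", values="Rating")

# Share of empty cells in the user-by-movie matrix.
missing_share = moviemat.isna().mean().mean()

# Movies ordered by how many ratings they received.
most_rated = df.groupby("Title")["Rating"].count().sort_values(ascending=False)

print(moviemat)
print(most_rated.head())
```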

Using Star Wars and Liar Liar, we shall create recommendations based on ratings using the following code.
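
Since the original script is not reproduced here, the following is a minimal sketch of the idea using pandas' `corrwith` on a toy user-by-movie matrix. The ratings are invented for illustration, and the helper name `correlations_with` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix (rows: users, columns: titles); values are made up.
moviemat = pd.DataFrame(
    {
        "Star Wars (1977)": [5, 4, 5, 2, 1],
        "Empire Strikes Back, The (1980)": [5, 4, 4, 2, np.nan],
        "Liar Liar (1997)": [1, 2, np.nan, 5, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="User_ID"),
)

def correlations_with(moviemat, title):
    """Correlate every movie's user ratings with those of the given title."""
    target_ratings = moviemat[title]
    # corrwith uses only the users who rated both movies in each pair.
    corr = moviemat.corrwith(target_ratings)
    return (
        corr.to_frame(name="Correlation")
            .dropna()
            .sort_values("Correlation", ascending=False)
    )

corr_starwars = correlations_with(moviemat, "Star Wars (1977)")
corr_liarliar = correlations_with(moviemat, "Liar Liar (1997)")
print(corr_starwars)
```

Each movie correlates perfectly with itself, so the target title always appears at the top of its own list.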

The result shows the correlation of other movies with Star Wars. A high correlation arises when viewers who rated Star Wars highly also rated the other movie highly. However, the result does not show how many people rated each movie, so a high correlation may rest on only a handful of viewers. To resolve this, we shall filter by the number of ratings using the following code.

We used ‘join’ because ‘Title’ is the index of the dataframe, which makes it a good fit for the join method. We then filter out any movie that does not have more than 100 ratings.
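
The join-and-filter step can be sketched as follows. The correlation values and rating counts are toy numbers chosen for illustration, and the column names `Correlation` and `Num_of_Ratings` and the title "Toy Movie (1990)" are assumptions; only the threshold of 100 ratings comes from the text.

```python
import pandas as pd

titles = pd.Index(
    ["Star Wars (1977)", "Empire Strikes Back, The (1980)",
     "Toy Movie (1990)", "Austin Powers: International Man of Mystery (1997)"],
    name="Title",
)

# Toy correlations with Star Wars (illustrative values, not real results).
corr_starwars = pd.DataFrame({"Correlation": [1.0, 0.75, 1.0, 0.38]}, index=titles)

# Toy rating counts per movie, indexed by the same 'Title' index.
ratings = pd.DataFrame({"Num_of_Ratings": [300, 250, 3, 120]}, index=titles)

# Both frames share the 'Title' index, so join lines the rows up directly.
corr_starwars = corr_starwars.join(ratings["Num_of_Ratings"])

# Keep only movies with more than 100 ratings, best matches first.
recommendations = (
    corr_starwars[corr_starwars["Num_of_Ratings"] > 100]
    .sort_values("Correlation", ascending=False)
)
print(recommendations)
```

In this toy example the made-up movie with a perfect correlation but only three ratings is dropped, which is the kind of spurious match the filter is meant to remove.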

Star Wars is an action movie, so it makes sense that The Empire Strikes Back (1980) is the movie most correlated with Star Wars (1977). Return of the Jedi, another Star Wars movie, also has a high correlation. Similarly, if you enjoy Star Wars, you would probably love Raiders of the Lost Ark. The correlation drops for Austin Powers, however, because it is a comedy, and you may or may not like it as much as Star Wars.

One reason it may still correlate with Star Wars is that it was also popular, and the recommendation system tends to recommend popular items to people who love popular things.

Conclusion

We have seen how to write Python code that recommends movies based on a set threshold, and how to filter out recommendations that do not make sense, using corr or corrwith in pandas. I hope you enjoyed the project. Thank you for reading.

Falana Olumide

Olumide is a Data Scientist familiar with gathering, cleaning, and organizing data for use by technical and non-technical personnel.