MOVIE RECOMMENDATION SYSTEM

Rahul Araveti
Web Mining [IS688, Spring 2021]
8 min read · Apr 26, 2021

Everyone, regardless of age, gender, ethnicity, colour, or geographic location, enjoys movies. Through this incredible medium, we are all linked in some way. What’s most intriguing, though, is how distinct our individual movie tastes are. Some people prefer particular genres, such as thrillers, romances, or science fiction, while others care more about the lead actors and directors. Taking all of this into account, it’s incredibly difficult to generalize about a film and conclude that everybody will enjoy it. Despite this, it is clear that similar films are enjoyed by a particular segment of the population.

What is a Recommendation System?

Simply put, a Recommendation System is a filtration program with the primary purpose of predicting a user’s “rating” or “preference” for a domain-specific object or item. Since the domain-specific object in our case is a movie, the main goal of our recommendation system is to filter and predict only those movies that a user would choose based on certain information about the user.

  • Collaborative Filtering

This filtration technique is based on comparing the user’s behavior with the behavior of other users in the database. The algorithm relies heavily on the past behavior of all users. The key difference between content-based filtering and collaborative filtering is that collaborative filtering considers the interactions of all users with the objects, while content-based filtering only considers the data of the concerned user.

Collaborative filtering can be implemented in a variety of ways, but the key principle to understand is that the data of multiple users affects the recommendation’s outcome; the model does not depend on the data of a single user.

There are 2 types of collaborative filtering algorithms:

· User-based Collaborative Filtering

The basic idea is to identify users who have similar past preference patterns as user “A” and then suggest to him or her things that those similar users have enjoyed but that “A” has not yet encountered. This is accomplished by creating a matrix of items that each user has rated/viewed/liked/clicked based on the task at hand, calculating the similarity score between the users, and then suggesting items that the concerned user is unaware of but that users similar to him/her are aware of and liked.

For example, if user ‘A’ likes ‘Batman Begins’, ‘Justice League’ and ‘The Avengers’ while user ‘B’ likes ‘Batman Begins’, ‘Justice League’ and ‘Thor’, then they have similar interests, because all of these movies belong to the superhero genre. So there is a high probability that user ‘A’ would like ‘Thor’ and user ‘B’ would like ‘The Avengers’.
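The example above can be sketched in a few lines. This is only an illustration, assuming a binary "liked" vector per user and cosine similarity as the similarity score; the names `likes`, `cosine_similarity`, and `candidates_for_a` are hypothetical and not part of the implementation built later in this article.

```python
import numpy as np

# Hypothetical like-vectors: one entry per movie, 1 = liked, 0 = not seen.
movies = ["Batman Begins", "Justice League", "The Avengers", "Thor"]
likes = {
    "A": np.array([1, 1, 1, 0]),
    "B": np.array([1, 1, 0, 1]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two like-vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# How similar are the two users' preference patterns?
similarity = cosine_similarity(likes["A"], likes["B"])

# Movies that the similar user B liked and A has not encountered yet.
candidates_for_a = [
    title
    for title, a, b in zip(movies, likes["A"], likes["B"])
    if b == 1 and a == 0
]

print(similarity, candidates_for_a)
```

With more users, the same idea would rank all other users by similarity to ‘A’ and pool the unseen movies of the closest ones.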

User-based collaborative filtering has some drawbacks:

1. People’s tastes change over time, and since this algorithm is based on user similarity, it may pick up on an early similarity between two users whose preferences have since diverged entirely.

2. Since there are many more users than objects, it is difficult to manage such large matrices, which must be recomputed on a regular basis.

3. This algorithm is vulnerable to shilling attacks, in which fictitious user profiles with skewed preference patterns are used to influence key decisions.

· Item-based Collaborative Filtering

In this case, the idea is to find similar movies rather than similar users, and then recommend movies similar to the ones ‘A’ has liked in the past. This is accomplished by finding every pair of items that were rated/viewed/liked/clicked by the same user, measuring the similarity of the two items across all users who rated/viewed/liked/clicked both, and then recommending items based on the similarity scores.

For example, take two movies, ‘A’ and ‘B’, and compare their ratings from all users who have rated both. If most of these common users have rated ‘A’ and ‘B’ similarly, it is highly probable that ‘A’ and ‘B’ are similar, so someone who has watched and liked ‘A’ should be recommended ‘B’, and vice versa.
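A minimal sketch of this comparison, assuming cosine similarity over the users who rated both movies. The ratings matrix and the helper `item_similarity` are made up for illustration.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies 'A', 'B', 'C'.
# A value of 0 means the user has not rated that movie.
ratings = np.array([
    [5.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
    [4.0, 4.0, 0.0],
    [2.0, 1.0, 4.0],
])


def item_similarity(ratings, i, j):
    """Cosine similarity of movies i and j over users who rated both."""
    both = (ratings[:, i] > 0) & (ratings[:, j] > 0)
    if not both.any():
        return 0.0
    x, y = ratings[both, i], ratings[both, j]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


sim_ab = item_similarity(ratings, 0, 1)  # 'A' vs 'B': rated alike by common users
sim_ac = item_similarity(ratings, 0, 2)  # 'A' vs 'C': rated very differently

print(sim_ab, sim_ac)
```

Here ‘A’ and ‘B’ score much higher than ‘A’ and ‘C’, so a user who liked ‘A’ would be recommended ‘B’ first.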

Let’s start coding up our own Movie recommendation system

In this implementation, when a user searches for a movie, our movie recommendation system will suggest the top 10 related films. For this, we will use an item-based collaborative filtering algorithm. The movielens-small dataset was used for this demonstration.

Getting the data up and running

First, we need to import libraries which we’ll be using in our movie recommendation system. Also, we’ll import the dataset by adding the path of the CSV files.

Now that we have added the data, let’s have a look at the files using the dataframe.head() command to print the first 5 rows of the dataset.
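A sketch of this loading step is shown here. To keep it runnable without the dataset on disk, it reads a few illustrative rows from in-memory text; the rows themselves are stand-ins, and the column layout (userId, movieId, rating for ratings; movieId, title, genres for movies) is assumed from the descriptions that follow. With the real files you would pass the paths of ratings.csv and movies.csv to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Illustrative stand-ins for ratings.csv and movies.csv.
ratings_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,1,4.0\n"
    "1,3,4.0\n"
    "2,1,3.5\n"
    "2,2,5.0\n"
)
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# With the real dataset, replace the StringIO objects with file paths.
ratings = pd.read_csv(ratings_csv)
movies = pd.read_csv(movies_csv)

# head() prints the first 5 rows (fewer if the frame is shorter).
print(ratings.head())
print(movies.head())
```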

Let’s have a look at the ratings dataset:

The ratings dataset has:

  • userId — unique for each user.
  • movieId — using this feature, we take the title of the movie from the movies dataset.
  • rating — the rating a user gave to a movie; we use these ratings to find the top 10 similar movies.

Let’s have a look at the movies dataset:

The movies dataset has:

  • movieId — once the recommendation is done, we get a list of all similar movieId and get the title for each movie from this dataset.
  • genres — which is not required for this filtering approach.

We can see that userId 1 has watched movieIds 1 and 3 and given them both a 4.0 rating, but has not given movieId 2 a rating. This is difficult to see from the dataset in its current form. As a result, we’ll create a new dataframe with each column representing a unique userId and each row representing a unique movieId to make it easier to understand and work with.
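One way to build such a dataframe is pandas' `pivot`. The sketch below uses a small made-up ratings table; the variable name `final_dataset` is an assumption used throughout these examples.

```python
import pandas as pd

# Made-up ratings in long format: one row per (user, movie) pair.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 3, 1, 2, 3],
    "rating": [4.0, 4.0, 3.5, 5.0, 2.0],
})

# Rows are movieIds, columns are userIds, cells hold the rating.
# Pairs with no rating become NaN.
final_dataset = ratings.pivot(index="movieId", columns="userId", values="rating")

print(final_dataset)
```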

Now it’s far clearer that userId 1 has given movieIds 1 and 3 a 4.0 rating but hasn’t rated movieIds 2, 4, and 5 (so they appear as NaN), and therefore their rating data is missing.

Let’s address this by replacing NaN with 0 to make it easier for the algorithm to understand while also making the data more pleasing to the eye.
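A short sketch of this replacement, assuming the pivoted dataframe is called `final_dataset` (a name chosen here for illustration) and using made-up values:

```python
import numpy as np
import pandas as pd

# Made-up movie x user matrix with missing ratings.
final_dataset = pd.DataFrame(
    [[4.0, 3.5, np.nan], [np.nan, 5.0, np.nan], [4.0, np.nan, 2.0]],
    index=pd.Index([1, 2, 3], name="movieId"),
    columns=pd.Index([1, 2, 3], name="userId"),
)

# Replace every missing rating with 0.
final_dataset = final_dataset.fillna(0)

print(final_dataset)
```

Note that a 0 here stands for "no rating", not for a rating of zero stars; the similarity computation later treats it simply as an absent value.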

Removing Noise from the data

In the real world, ratings are sparse, and most data points come from very popular films and highly active users. We don’t want to include movies rated by only a small number of users, because those ratings aren’t reliable. Similarly, users who have rated only a few movies should not be taken into consideration.

So, with all that taken into account and after some trial-and-error experimentation, we will reduce the noise by adding some filters for the final dataset.

  • To qualify a movie, a minimum of 10 users should have voted for it.
  • To qualify a user, the user should have voted for a minimum of 50 movies.

Let’s visualize what these filters look like.

Aggregating the number of users who voted for each movie and the number of movies each user voted for:
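This aggregation can be sketched with a `groupby` on the ratings table. The data is made up, and the names `no_user_voted` and `no_movies_voted` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ratings in long format.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 3, 1, 2, 3],
    "rating": [4.0, 4.0, 3.5, 5.0, 2.0],
})

# Number of users who voted for each movie (indexed by movieId).
no_user_voted = ratings.groupby("movieId")["rating"].agg("count")

# Number of movies each user voted for (indexed by userId).
no_movies_voted = ratings.groupby("userId")["rating"].agg("count")

print(no_user_voted)
print(no_movies_voted)
```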

Let’s visualize the number of users who voted with our threshold of 10.

Making the necessary modifications as per the threshold set.

Let’s visualize the number of votes by each user with our threshold of 50.

Making the necessary modifications as per the threshold set.
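Both filters can be applied by selecting rows and columns of the movie-by-user matrix. The helper `apply_thresholds` below is hypothetical, and it reads "a minimum of 10" and "a minimum of 50" as inclusive bounds, which is an assumption. The toy call lowers the thresholds so that the tiny sample still shows an effect.

```python
import pandas as pd


def apply_thresholds(ratings, min_users_per_movie=10, min_movies_per_user=50):
    """Keep movies with enough voters and users with enough votes."""
    final_dataset = ratings.pivot(
        index="movieId", columns="userId", values="rating"
    ).fillna(0)

    no_user_voted = ratings.groupby("movieId")["rating"].agg("count")
    no_movies_voted = ratings.groupby("userId")["rating"].agg("count")

    # Labels of the movies (rows) and users (columns) that qualify.
    keep_movies = no_user_voted[no_user_voted >= min_users_per_movie].index
    keep_users = no_movies_voted[no_movies_voted >= min_movies_per_user].index

    return final_dataset.loc[keep_movies, keep_users]


# Made-up ratings: movie 3 has one voter, user 3 has one vote.
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "movieId": [1, 2, 3, 1, 2, 1],
    "rating": [4.0, 3.0, 5.0, 4.5, 2.0, 3.5],
})

filtered = apply_thresholds(ratings, min_users_per_movie=2, min_movies_per_user=2)
print(filtered)
```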

Removing sparsity

Our final dataset is 2121 × 378 in size, and most of its values are zero. We are only using a small dataset here, but the original large MovieLens dataset, which contains over 100000 features, may cause our machine to run out of computational resources when fed to the model. To store the sparse data compactly, we use the csr_matrix function from the scipy library.

I’ll give an example of how it works :
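The sketch below uses a small made-up array, chosen so that the entry in row 0, column 2 is 3. The names `sample`, `sparsity`, and `csr_sample` are used for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero array.
sample = np.array([
    [0, 0, 3, 0, 0],
    [4, 0, 0, 0, 2],
    [0, 0, 0, 0, 1],
])

# Fraction of entries that are zero.
sparsity = 1.0 - np.count_nonzero(sample) / sample.size
print(sparsity)

# The CSR representation keeps only the non-zero entries.
csr_sample = csr_matrix(sample)

# Printing lists each stored value together with its (row, column) position.
print(csr_sample)
```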

As you can see, csr_sample stores no zero values; each stored value is listed with its row and column index. For example, for the 0th row and 2nd column, the value is 3.

Applying the csr_matrix method to the dataset :

Making the movie recommendation system model

We will be using the KNN algorithm to compute similarity, with cosine distance as the metric; it is fast to compute and is preferred here over the Pearson coefficient.
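A minimal sketch of this model using scikit-learn's `NearestNeighbors` with `metric="cosine"`. The tiny matrix and the choice of `algorithm="brute"` are assumptions made for illustration; the settings used on the full dataset may differ.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up movie x user matrix (0 = not rated).
final_dataset = pd.DataFrame(
    [[5.0, 4.0, 0.0], [4.0, 5.0, 0.0], [0.0, 1.0, 5.0]],
    index=pd.Index([1, 2, 3], name="movieId"),
    columns=pd.Index([1, 2, 3], name="userId"),
)

# Convert to CSR so that only the non-zero ratings are stored.
csr_data = csr_matrix(final_dataset.values)

# Brute-force nearest neighbours with cosine distance (1 - cosine similarity).
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(csr_data)

# Query with the first movie; it is its own nearest neighbour at distance ~0.
distances, indices = knn.kneighbors(csr_data[0], n_neighbors=2)
print(distances, indices)
```

The returned indices are row positions in `final_dataset`, so they need to be mapped back to movieIds before looking up titles.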

Making the recommendation function

The theory of operation is very straightforward. We first search to see if the movie name entered is in the database, and if it is, we use our recommendation method to find similar films, sort them by similarity distance, and output only the top 10 films with their distances from the input film.
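The steps described above could be sketched as follows. The function name `get_movie_recommendation`, its parameters, the movieIds, and the ratings are hypothetical; the model is refit inside the function only to keep the example self-contained.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def get_movie_recommendation(movie_name, movies, final_dataset, n_recommendations=10):
    """Return the titles closest to `movie_name`, sorted by cosine distance."""
    # 1. Check whether the movie name is in the database.
    matches = movies[movies["title"].str.contains(movie_name, regex=False)]
    if matches.empty:
        return "No movies found. Please check your input"
    movie_id = int(matches["movieId"].iloc[0])
    if movie_id not in final_dataset.index:
        return "No movies found. Please check your input"
    row = final_dataset.index.get_loc(movie_id)

    # 2. Fit the cosine-distance KNN model on the sparse ratings matrix.
    csr_data = csr_matrix(final_dataset.values)
    knn = NearestNeighbors(metric="cosine", algorithm="brute")
    knn.fit(csr_data)

    # 3. Ask for one extra neighbour, because the query movie matches itself.
    k = min(n_recommendations + 1, csr_data.shape[0])
    distances, indices = knn.kneighbors(csr_data[row], n_neighbors=k)

    # 4. Map row positions back to titles; results are already sorted by distance.
    titles = movies.set_index("movieId")["title"]
    rows = [
        {"Title": titles.loc[final_dataset.index[i]], "Distance": float(d)}
        for d, i in zip(distances[0], indices[0])
        if i != row
    ]
    return pd.DataFrame(rows[:n_recommendations])


# Made-up data: the movieIds and ratings are illustrative only.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Iron Man (2008)", "The Avengers (2012)", "Memento (2000)", "Thor (2011)"],
})
final_dataset = pd.DataFrame(
    [[5.0, 4.0, 0.0, 5.0], [4.0, 5.0, 0.0, 4.0], [0.0, 1.0, 5.0, 0.0], [5.0, 3.0, 1.0, 4.0]],
    index=pd.Index([1, 2, 3, 4], name="movieId"),
    columns=pd.Index([1, 2, 3, 4], name="userId"),
)

result = get_movie_recommendation("Iron Man", movies, final_dataset)
print(result)
```

On the full dataset it would be cheaper to fit the model once and reuse it across queries rather than refitting on every call.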

Finally, Let’s Recommend some movies!

I personally think the results are pretty good. All the movies at the top are superhero or animation movies, which are ideal for kids, as is the input movie “Iron Man”.

Let’s try another one :

All the movies in the top 10 are serious and mindful movies just like “Memento” itself, therefore I think the result, in this case, is also good.
