Collaborative Filtering Movie Recommendation

Srinidhi Vaddy
7 min read · May 10, 2020

Being at home during the lockdown and in quarantine has given all of us extra time. While many are busy with work-from-home jobs and deadlines, a lot of us are keeping ourselves occupied by binge-watching our favourite TV shows and movies. There’s a very clear trend of increased usage of Netflix, Prime Video and other media service providers.

By the time we finish one movie, we have an even better recommendation waiting for us to watch. It fascinated me how good the recommendations are and how specific they are to each user. I ended up reading more about movie recommendation systems and wanted to build one myself. While the UI/UX part of it is still under development, I was able to build a basic working model of the recommender.

Even though people’s tastes vary, they generally follow certain patterns. People tend to like things that fall in the same category or share the same characteristics, and they tend to have tastes similar to those of the people they are close to. Recommender systems try to capture these similar behaviours among users, along with the patterns specific to each user, to help predict what else they might like. They are used to personalize our experience on the web as much as possible. Everything on Netflix is driven by customers’ selections, likes and viewing patterns. On social media and news websites, you get friend, book and news recommendations based on mutual friends, likes and choices, and recently read news respectively.

They provide us with broader exposure to many different products and encourage continued usage of the service. This in turn benefits the service provider through increased revenue, a more secure customer base and less churn.

Mainly, there are two types of recommendation systems: content-based and collaborative filtering. Content-based filtering is driven by “I need to see more of what I like”: it tries to figure out which attributes of an item a user favours and then makes recommendations based on those attributes. In other words, it uses the user’s past preferences or ratings on movies already watched to suggest other movies.

Collaborative filtering is based on “Tell me what’s popular among my friends and other users and I would like to watch that too!” It tries to find a group of users who are most similar and provides recommendations to one particular user based on what that group likes. This is comparatively more complex and, from my own observation, less visible on Netflix.

Recommender systems are implemented in two ways: memory-based and model-based. The memory-based approach uses the entire user-item dataset to generate a recommendation with statistical measures such as Pearson correlation, Euclidean distance and cosine similarity. The model-based approach builds a model of users to learn their preferences, using machine learning techniques such as regression, clustering and classification.
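To make the memory-based measures a little more concrete, here is a minimal sketch that compares two users with the three measures named above. The rating vectors and the helper function names are made up for illustration, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical ratings two users gave to the same five movies
user_a = np.array([4.0, 2.5, 3.0, 4.5, 5.0])
user_b = np.array([3.5, 2.0, 3.0, 4.0, 5.0])

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def cosine(x, y):
    """Cosine of the angle between the two raw rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean(x, y):
    """Straight-line distance between the rating vectors (smaller = closer)."""
    return float(np.linalg.norm(x - y))

print(pearson(user_a, user_b), cosine(user_a, user_b), euclidean(user_a, user_b))
```

Note that Pearson and cosine are similarities (higher means more alike), while Euclidean is a distance (lower means more alike).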

COLLABORATIVE FILTERING

Also called User-User Filtering, it uses other users to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user’s and then recommends items that they have liked. There are several methods of finding similar users; the one used here is based on the Pearson correlation function.

BUILDING THE RECOMMENDER SYSTEM

For this, I used the MovieLens 20M Dataset by GroupLens, which has over 20 million movie ratings collected since 1995, along with each movie’s title and genres. https://grouplens.org/datasets/movielens/

The dataset consists of six files. For example, tag.csv contains tags applied to movies by users, rating.csv contains ratings of movies by users, and movie.csv contains movie information. For the recommender, only movie.csv and rating.csv were used, and both were read with the pandas library.

import pandas as pd
#Reading the two files needed for the recommender
movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
movie.head()

Using regular expressions with pandas’ string functions, we extract the year from the title in the movie dataframe into a separate year column and then remove it from the title. The pattern looks for a four-digit year enclosed in parentheses. We include the parentheses so that we don’t conflict with movies that have years in their titles.

#Extracting the year, including its parentheses, into a new 'year' column
movie['year'] = movie.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
movie['year'] = movie.year.str.extract(r'(\d\d\d\d)', expand=False)
#Removing the years from the 'title' column (regex=True is required in recent pandas versions)
movie['title'] = movie.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movie['title'] = movie['title'].str.strip()

Collaborative filtering doesn’t recommend based on the features of the movie. The recommendation is based on the likes, dislikes or ratings of neighbours, that is, other users. So we drop the genres column, since it is of no use here.

movie.drop(columns=['genres'], inplace=True)

Now, coming to the ratings dataframe, its movieId column is shared with the movie dataframe. Each user has given ratings for multiple movies. The timestamp column can be dropped, just like the genres column in the movie dataframe.
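As a small sketch of that step, the snippet below drops the timestamp column from a toy stand-in for rating.csv. The rows are invented, and the column name timestamp is an assumption about the file's header.

```python
import pandas as pd

# Toy stand-in for rating.csv; the column names are assumed, the rows are invented
rating = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 2, 1],
    'rating': [3.5, 4.0, 2.0],
    'timestamp': ['2005-04-02 23:53:47', '2005-04-02 23:31:16', '2000-11-21 15:29:58'],
})

# The time of rating plays no part in the similarity calculation, so drop it
rating = rating.drop(columns=['timestamp'])
print(rating.columns.tolist())
```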

The process for creating a user-based recommendation system is as follows:

  • Select a user along with the movies that user has watched and rated
  • Add movieIds to those movies so they can be matched against the ratings data
  • Based on the user’s ratings, find the top X neighbours
  • Get the watched-movie records of each neighbour
  • Calculate a similarity score using some formula
  • Recommend the items with the highest score

#Ratings given by the input user
user = [
{'title':'Breakfast Club, The', 'rating':4},
{'title':'Toy Story', 'rating':2.5},
{'title':'Jumanji', 'rating':3},
{'title':"Pulp Fiction", 'rating':4.5},
{'title':'Akira', 'rating':5}
]
inputMovie = pd.DataFrame(user)
inputMovie
#Filtering out the movies by title
Id = movie[movie['title'].isin(inputMovie['title'].tolist())]
#Then merging on title so we can get the movieId
inputMovie = pd.merge(Id, inputMovie, on='title')

With the movie IDs in our input, we can now get the subset of users from the rating dataframe who have watched and rated the same movies. We then create subgroups by grouping those rows on userId, so that we can find the users most similar to the input user.

#Filtering out users that have watched movies that the input has watched and storing it
users = rating[rating['movieId'].isin(inputMovie['movieId'].tolist())]

#Grouping the rows by userId; each group holds one user's ratings of the input movies
userSubsetGroup = users.groupby('userId')
#Sorting it so that users with movie ratings most in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

We now have to find out how similar each user is to the input user. For this we use the Pearson correlation coefficient, which measures the strength of a linear association between two variables.
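One way this step could look is sketched below. It is not necessarily the code from the notebook: the toy inputMovie and rating frames are invented, pearsonCorrelation is a hypothetical dictionary name, and the correlation itself is delegated to pandas' Series.corr, which computes Pearson by default.

```python
import pandas as pd

# Toy stand-ins for the input user's ratings and the rating dataframe
inputMovie = pd.DataFrame({'movieId': [1, 2, 3, 4],
                           'rating': [4.0, 2.5, 3.0, 4.5]})
rating = pd.DataFrame({
    'userId':  [10, 10, 10, 10, 20, 20, 20, 30, 30],
    'movieId': [1, 2, 3, 4, 1, 2, 4, 1, 2],
    'rating':  [5.0, 3.5, 4.0, 5.0, 2.0, 4.5, 1.0, 3.0, 3.0],
})

pearsonCorrelation = {}  # hypothetical name: userId -> similarity to the input user
for userId, group in rating.groupby('userId'):
    # Align this user's ratings with the input user's on the movies both have rated
    common = pd.merge(inputMovie, group, on='movieId', suffixes=('_input', '_other'))
    # With fewer than two distinct ratings on either side the variance is zero
    # and the correlation is undefined; treating that case as 0 is an assumption
    if common['rating_input'].nunique() < 2 or common['rating_other'].nunique() < 2:
        pearsonCorrelation[userId] = 0.0
        continue
    pearsonCorrelation[userId] = float(common['rating_input'].corr(common['rating_other']))

print(pearsonCorrelation)
```

In this toy data, user 10 rates in the same pattern as the input user and gets a score close to 1, user 20 rates in the opposite pattern and gets a negative score, and user 30 gave the same rating to every shared movie, so the score falls back to 0.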

Why Pearson correlation? It is invariant to shifting and positive scaling. For example, if you have two vectors X and Y, then pearson(X, Y) == pearson(X, 2 * Y + 3). This is important because two users might rate the same series of items very differently in absolute terms, yet still be similar users (i.e. with similar ideas) whose ratings simply sit on different scales. The value varies from r = -1 to r = 1, where 1 means that the two users have similar tastes while -1 means the opposite.

Pic courtesy: Byjus.com
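The invariance property is easy to check numerically. The sketch below uses made-up rating vectors and NumPy's corrcoef; it also shows that a negative scale factor flips the sign of the coefficient.

```python
import numpy as np

x = np.array([4.0, 2.5, 3.0, 4.5, 5.0])   # hypothetical ratings from user X
y = np.array([3.0, 2.0, 2.0, 4.0, 4.5])   # hypothetical ratings from user Y

r_original = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(x, 2 * y + 3)[0, 1]   # stretch and shift Y
r_flipped = np.corrcoef(x, -y)[0, 1]           # a negative scale flips the sign

print(r_original, r_rescaled, r_flipped)
```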

Once we have the similarity index for all the users, we attach each user’s userId, movieId and rating columns, so that we know which movies the similar users have seen and can recommend from them. We then calculate a weighted rating for each seen movie: the rating is multiplied by the similarity index, so it now reflects how similar that user is to the input user. Ratings from the most similar users carry the most weight.
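A minimal sketch of how the similarity scores might be joined to the ratings is shown below. The similarity values and ratings are invented, pearsonCorrelation, pearsonDF and topUsers are hypothetical names, and the cut-off of 50 users is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical similarity scores, keyed by userId
pearsonCorrelation = {10: 0.97, 20: -0.62, 30: 0.45}
rating = pd.DataFrame({
    'userId':  [10, 10, 20, 30, 30],
    'movieId': [5, 6, 5, 5, 7],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Turn the dictionary into a two-column dataframe
pearsonDF = pd.DataFrame(list(pearsonCorrelation.items()),
                         columns=['userId', 'similarityIndex'])
# Keep the most similar users; 50 is an arbitrary cut-off
topUsers = pearsonDF.sort_values('similarityIndex', ascending=False).head(50)
# Attach every rating those users gave, one row per (user, movie) pair
topUsersRating = topUsers.merge(rating, on='userId', how='inner')
print(topUsersRating)
```

The resulting frame has the similarityIndex and rating columns side by side, which is the shape the weighting step expects.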

#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']

For each distinct movie, we then sum the similarity indices and the weighted ratings of all the users who rated it. Dividing the summed weighted rating by the summed similarity index gives the weighted average recommendation score for that movie.

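#Sums the similarity and weighted rating columns after grouping by movieId
tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()
#Creates an empty dataframe to hold the scores
recommendation_df = pd.DataFrame()
#Weighted average: sum of weighted ratings divided by sum of similarities
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index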

When sorted in descending order of this score, the first 10 rows are the movies with the highest recommendation scores. These are the top 10 recommendations for the input user based on what others are watching!
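The final sort can be sketched as follows. The scores and titles are placeholders, and the sketch assumes recommendation_df carries a movieId column so that titles can be looked up in the movie dataframe.

```python
import pandas as pd

# Placeholder scores and titles, for illustration only
recommendation_df = pd.DataFrame({
    'movieId': [5, 6, 7],
    'weighted average recommendation score': [3.2, 4.6, 4.1],
})
movie = pd.DataFrame({'movieId': [5, 6, 7],
                      'title': ['Movie A', 'Movie B', 'Movie C']})

top = (recommendation_df
       .sort_values('weighted average recommendation score', ascending=False)
       .head(10)                                 # keep at most the ten best scores
       .merge(movie, on='movieId', how='left'))  # look up the titles
print(top[['title', 'weighted average recommendation score']])
```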

The entire source code, dataset and output is available on Kaggle. https://www.kaggle.com/srinidhi14vaddy/collaborative-filtering-movie-recommendation Please upvote if you find it helpful!

All suggestions and comments on tweaks and improvements are most welcome! I will soon publish a post on the content-based recommendation system.
