Content-Based Filtering

Published in

Movie Recommendation System

3 min readApr 23, 2019

Basically, this algorithm is recommending stuff based on attributes of the item. For example, recommending movies based on their genres and released year. If we have genres like romance, comedy, action, drama, and horror for a movie and user liked that movie, it is good to recommend the movies with the same genres. For more accurate recommendation we can use the released year attribute.

In our discussion, we are having 18 genres for movies. Take a look at the above dataset. There should be some sort of similarity measure to check how many genres the movies have in common.Cosine Similarity metrics is the best way to measure the similarity.

Cosine Similarity

For easy understanding, if a movie A has romance and drama as it’s genres. Movie B has romance only as it’s genre. When we draw a 2D graph for the genres, we have to consider the romance as X-axis and drama as the Y-axis. If we give value one for a genre the movie A will have the coordinate of (1,1) in the graph and movie B will have the coordinate of (1,0).

For the above-said example, the theta between the genres is 45 then the value is around 0.7. So the similarity score is 0.7. If we consider movies without any genres in similar then the theta between those genres is 90 and the similarity score is 0 (Cosine 90=0). If the movies have the same genres the theta is 0 and the similarity score is 1 (Cosine 0 = 1). So the similarity score range is between 0 & 1.

So, when we are considering all genres, how can we scale up to all genres, We have to think each genre as a dimension. It is hard to imagine with our mind but the calculation is simple exactly movies with 2 genres.

The equation for multi-dimensional cosine

Here A and B are the movies. We have to through each genre at a time and after calculating all genres sum up them all and finding the total similarity scores. This will look like a complex equation but it can be easily understood when implemented in python.

This function takes in two movie IDs and a dictionary that maps movie IDs to their 18-dimensional coordinates in genre space. It just goes through every genre, one at a time, extracts the value of the genre from movies X and Y, and keeps a running sum of X squared, Y squared, and X times Y. Then, the cosine similarity is just a matter of putting it all together.

KNN

The concept of k-nearest-neighbors is just selecting some number of things
that are close to the thing you’re interested in, that is, its neighbors, and predicting something about that item based on the properties of its neighbors. So to turn these top 40 (K=40) closest movies into an actual rating prediction, we can just take a weighted average of their similarity scores to the movie whose rating we’re trying to predict, weighting them by the rating the user gave them. And recommending the movies for the user.

This is the core methods for a content-based recommendation algorithm.

CodeSource @http://media.sundog-soft.com/RecSys/RecSys-Materials.zip

Content-Based Filtering

Written by Premashanth Kumanan