Content-based Recommender Systems in Python

This blog illustrates a content-based recommender system in Python

Saket Garodia
Analytics Vidhya
5 min read · Jan 2, 2020


This is my first blog series of the new decade, starting in 2020, so I am quite excited. Before illustrating content-based recommender systems in Python, I recommend a short 4-minute read of the following blog, which defines a recommender system and its types in layman's terms.

https://medium.com/@saketgarodia/the-world-of-recommender-systems-e4ea504341ac?source=friends_link&sk=508a980d8391daa93530a32e9c927a87

Through this blog, I will show how to implement a content-based recommender system in Python on Kaggle’s MovieLens 100k dataset.

Let us start implementing it.

Problem formulation

The goal is to build a recommender system that recommends movies based on the plot of a previously watched movie.

Implementation

First, let us import all the necessary libraries that we will be using to make a content-based recommendation system. Let us also import the necessary data files.

Since we are building a plot-based recommender system, let us select only the columns we will be using: movie ‘id’, movie ‘title’ and ‘overview’ (the overview details the plot of each movie).
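The column selection can be sketched as follows. This is only a minimal illustration on a tiny hand-made DataFrame: the rows, the extra ‘budget’ column and the ids are invented stand-ins for the real data file, and filling missing overviews with empty strings is an assumption about how missing plots might be handled.

```python
import pandas as pd

# Toy stand-in for the movie data; every value here is invented for illustration.
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A toy cowboy meets a toy astronaut.", None, "An astronaut is stranded on Mars."],
    "budget": [100, 200, 300],  # an extra column we do not need
})

# Keep only the columns the plot-based recommender uses.
movies = movies[["id", "title", "overview"]].copy()

# A missing plot cannot be vectorized, so replace it with an empty string.
movies["overview"] = movies["overview"].fillna("")

print(movies)
```

With the real dataset, the same two steps would apply after loading the file into a DataFrame.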

Here’s how the data looks now.

Let us see what a movie plot looks like in the dataset.

This is how the plot of the movie ‘Toy Story’ looks in the dataset: “Led by Woody, Andy’s toys live happily in his room until Andy’s birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy’s heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.”

We have a dataset of 45,466 movies, which is large enough to build a model that will recommend movies to us based on the plot. It is going to be very interesting.

As a first step, we will use TfidfVectorizer, which converts our ‘overview’ (a text column) into numerical features. Machine learning models operate on numerical values, so text has to be turned into numbers before we can compare movies.

TF-IDF stands for Term Frequency-Inverse Document Frequency. The number of features it creates is equal to the total number of distinct words used in the overview column. Each value increases with the number of times a particular word is used in a movie's overview and decreases with the number of documents (movies here) in which the word is used. A word is therefore penalized when it is common to many movies, even if it occurs many times in one movie's overview. Words that occur often but are common to many movies are not very helpful in telling movies apart.
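To make the weighting concrete, here is a minimal sketch on three made-up one-line overviews (the sentences are invented, not taken from the dataset). It uses scikit-learn's TfidfVectorizer with its default settings, which apply a smoothed logarithmic inverse document frequency and normalize every row to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented overviews, used only for illustration.
overviews = [
    "a toy cowboy meets a toy astronaut",
    "a cowboy rides across the desert",
    "an astronaut is stranded on mars",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(overviews)

# One row per overview, one column per distinct word.
print(tfidf_matrix.shape)

# Non-zero weights of the first overview, keyed by word.
first_row = tfidf_matrix[0].toarray().ravel()
weights = {
    word: float(first_row[col])
    for word, col in vectorizer.vocabulary_.items()
    if first_row[col] > 0
}
print(weights)
```

In the first overview, ‘toy’ gets the largest weight because it appears twice and in no other overview. ‘meets’ outweighs ‘cowboy’ even though both appear once, because ‘cowboy’ also occurs in a second overview. The default tokenizer ignores single-character words such as ‘a’.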

Now, we have a ‘tfidf’ feature matrix for all the movies. Every movie has 75,927 features (words). To find the similarity between the movies, we will use cosine similarity. In our case, the linear_kernel function computes the same thing: TfidfVectorizer normalizes every row to unit length by default, so the plain dot product of two rows is already their cosine similarity.

Cosine similarity is a measure of the similarity between two vectors: it is the cosine of the angle between them. Here, we have 75,927 features (tfidf values) for each movie. Let us now find the similarity matrix using the linear_kernel function:
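A minimal sketch of this step on three invented overviews is shown below. It also checks, under the assumption that TfidfVectorizer is used with its default L2 normalization, that linear_kernel and cosine_similarity return the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three invented overviews, used only for illustration.
overviews = [
    "a toy cowboy meets a toy astronaut",
    "a cowboy rides across the desert",
    "an astronaut is stranded on mars",
]

tfidf_matrix = TfidfVectorizer().fit_transform(overviews)

# Dot product between every pair of rows.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Explicit cosine similarity, computed only for comparison.
reference = cosine_similarity(tfidf_matrix)

print(cosine_sim.shape)
print(np.allclose(cosine_sim, reference))
```

The result is a square matrix with one row and one column per movie, and a diagonal of 1s because every movie is identical to itself. Note that for 45,466 movies a dense matrix of 64-bit floats has about 2 billion entries, on the order of 16 GB, so memory can become a concern at full scale.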

similarity-matrix

Now, let us create a series that maps the index of the matrix to movie names to make it easy for us to just feed in movie names and get the recommendation.
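One way to build such a mapping is sketched below on a toy DataFrame with invented titles. Keeping only the first row for a repeated title is an assumption made here so that a lookup by title always returns a single index.

```python
import pandas as pd

# Toy titles, invented for illustration; "Movie A" is repeated on purpose.
movies = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C", "Movie A"]})

# Map each title to its row position in the similarity matrix.
indices = pd.Series(movies.index, index=movies["title"])

# If a title occurs more than once, keep only its first occurrence.
indices = indices[~indices.index.duplicated(keep="first")]

print(indices["Movie B"])
```

A lookup such as indices["Movie B"] then gives the row of that movie in the similarity matrix.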

Now, we will write a recommender function that recommends movies using cosine similarity. The function takes a movie name as input and returns the 15 most similar movies according to the similarity matrix we computed above.
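Putting the pieces together, a recommender function could look roughly like the sketch below. The function name get_recommendations, its top_n parameter and the four toy movies are all hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy data, invented for illustration.
movies = pd.DataFrame({
    "title": ["Space Toys", "Desert Rider", "Mars Rescue", "Toy Ranch"],
    "overview": [
        "a toy cowboy meets a toy astronaut",
        "a cowboy rides across the desert",
        "an astronaut is stranded on mars",
        "a toy cowboy guards the ranch",
    ],
})

tfidf_matrix = TfidfVectorizer().fit_transform(movies["overview"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(movies.index, index=movies["title"])


def get_recommendations(title, top_n=15):
    """Return the titles of the top_n movies whose plots are most similar."""
    idx = indices[title]
    scores = cosine_sim[idx]
    # Row positions sorted from most to least similar.
    ranked = np.argsort(scores)[::-1]
    # Drop the query movie itself, then keep the top_n best matches.
    ranked = [i for i in ranked if i != idx][:top_n]
    return movies["title"].iloc[ranked].tolist()


recs = get_recommendations("Space Toys", top_n=2)
print(recs)
```

On this toy data, the closest match for ‘Space Toys’ is the other movie about a toy cowboy, which is the behaviour we expect from a plot-based recommender.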

Let's now try to get a recommendation for the movie ‘Life Begins for Andy Hardy’ from the above recommendation function and see what it outputs.

We can see that when we input a movie, ‘Life Begins for Andy Hardy’ in this case, we get 15 recommendations of movies whose plots are similar to it. It's magical, isn't it?

To learn about the metadata-based and collaborative-filtering-based approaches, see my following blogs:

  1. Meta-data based Recommender Systems: https://medium.com/@saketgarodia/metadata-based-recommender-systems-in-python-c6aae213b25c
  2. Recommender Systems using Collaborative Filtering: https://medium.com/@saketgarodia/recommendation-system-using-collaborative-filtering-cc310e641fde

Thanks for reading.

Please do post feedback.

Saket Garodia
Analytics Vidhya

Senior Data Scientist at 84.51(Kroger), AI/Data Science, Psychology, economics, books; Linkedin — https://www.linkedin.com/in/saket-garodia/