Content-based Recommender Systems in Python

This blog illustrates a content-based recommender system in Python.

Saket Garodia
Analytics Vidhya
5 min read · Jan 2, 2020


This is my first series of blogs in the new decade starting in 2020, so I am quite excited. Before illustrating content-based recommender systems in Python, I recommend giving a short 4-minute read to the following blog, which defines a recommender system and its types in layman's terms.

https://medium.com/@saketgarodia/the-world-of-recommender-systems-e4ea504341ac?source=friends_link&sk=508a980d8391daa93530a32e9c927a87

Through this blog, I will show how to implement a content-based recommender system in Python on Kaggle’s MovieLens 100k dataset.

Let us start implementing it.

Problem formulation

To build a recommender system that recommends movies based on the plot of a previously watched movie.

Implementation

First, let us import all the necessary libraries that we will be using to make a content-based recommendation system. Let us also import the necessary data files.

#importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

#putting movies data on 'movies' dataframe
movies = pd.read_csv('movies_metadata.csv')

Since we are building a plot-based recommender system, let us select only the columns we will be using: the movie 'id', the movie 'title', and the 'overview' (which contains the plot of each movie).
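The selection itself is a one-liner; a minimal sketch, assuming the column names 'id', 'title' and 'overview' mentioned above, could look like this:

#keeping only the columns needed for a plot-based recommender
movies = movies[['id', 'title', 'overview']]
movies.head()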

Here’s how the data looks now.

Let us see what a movie plot looks like in the dataset.

movies['overview'][0]

This is how the plot of the movie ‘Toy Story’ looks in the dataset: “Led by Woody, Andy’s toys live happily in his room until Andy’s birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy’s heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.”

We have a dataset of 45,466 movies, which is large enough to build a model that recommends movies based on their plots. It is going to be very interesting.

As a first step, we will use TfidfVectorizer, which converts our 'overview' text column into numerical features. Machine learning models run on numerical values, so the text has to be converted into numbers before we can compute any similarity.

TF-IDF stands for Term Frequency-Inverse Document Frequency. The number of features it creates equals the number of distinct words in the 'overview' column, and each value is proportional to how often a word appears in a particular movie's overview (term frequency) and inversely proportional to the number of documents (movies, here) in which the word appears (inverse document frequency). This means a word is penalized even if it appears many times for one movie, as long as it is also common across many movies; such words are not very helpful in differentiating one movie from another.
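To make this concrete, here is a tiny, hypothetical example (the three overviews below are made up purely for illustration) showing how TfidfVectorizer turns text into a weighted word matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_overviews = [
    "a toy cowboy and a toy spaceman become friends",
    "a cowboy rides into town",
    "a spaceman travels to a distant planet",
]
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

#one row per overview, one column per distinct (non stop) word;
#words shared by several overviews (e.g. 'cowboy') get lower IDF weights than rarer ones
print(toy_tfidf.get_feature_names_out())
print(toy_matrix.toarray().round(2))

Now let us apply the same idea to the full 'overview' column: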

tfidf = TfidfVectorizer(stop_words='english')
movies['overview'] = movies['overview'].fillna('')

#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
overview_matrix = tfidf.fit_transform(movies['overview'])

#Output the shape of tfidf_matrix
overview_matrix.shape

#Output
(45466, 75827)

Now we have a TF-IDF feature matrix for all the movies: every movie is represented by 75,827 features (words). To find the similarity between movies, we will use cosine similarity. In our case, the linear_kernel function will compute the same values for us, because TF-IDF vectors are L2-normalized by default, so a plain dot product equals the cosine of the angle between them.

Cosine similarity is a measure of the similarity between two vectors: the cosine of the angle between them. Here, we have 75,827 features (TF-IDF values) for each movie. Let us now find the similarity matrix using the linear_kernel function:

similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
similarity_matrix
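As a quick sanity check (a sketch that is not part of the original walkthrough), linear_kernel on the L2-normalized TF-IDF rows gives the same numbers as cosine_similarity; it is simply the faster way to get them:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

#compare the two on a small slice of the TF-IDF matrix
sample = overview_matrix[:100]
assert np.allclose(linear_kernel(sample, sample), cosine_similarity(sample, sample))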

Now, let us create a series that maps each movie title to its index in the similarity matrix, so that we can simply feed in a movie name and get recommendations.

#movies index mapping
mapping = pd.Series(movies.index, index=movies['title'])
mapping
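For example, looking up a title returns its row index in the similarity matrix; since 'Toy Story' is the first movie in the dataset, the lookup below should return 0 (if the dataset contained duplicate titles, it would return several indices instead):

mapping['Toy Story']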

Now, we will write a recommender function that recommends movies using cosine similarity. The function takes a movie name as input and returns the 15 most similar movies using the cosine similarity matrix we computed above.

def recommend_movies_based_on_plot(movie_input):
    movie_index = mapping[movie_input]
    #get similarity values with all other movies
    #similarity_score is a list of (index, similarity) pairs
    similarity_score = list(enumerate(similarity_matrix[movie_index]))
    #sort the similarity scores of the input movie with all other movies in descending order
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    #get the scores of the 15 most similar movies, ignoring the first entry (the movie itself)
    similarity_score = similarity_score[1:16]
    #return movie names using the mapping series
    movie_indices = [i[0] for i in similarity_score]
    return movies['title'].iloc[movie_indices]

Let's now try to get recommendations for the movie 'Life Begins for Andy Hardy' from the above function and see what it outputs.

recommend_movies_based_on_plot('Life Begins for Andy Hardy')

We can see that when we input a movie ('Life Begins for Andy Hardy' in this case), we get 15 recommendations of movies whose plots are similar to it. It's magical, isn't it?

To learn about the metadata-based and collaborative filtering approaches, go through my following blogs:

  1. Meta-data based Recommender Systems: https://medium.com/@saketgarodia/metadata-based-recommender-systems-in-python-c6aae213b25c
  2. Recommender Systems using Collaborative Filtering: https://medium.com/@saketgarodia/recommendation-system-using-collaborative-filtering-cc310e641fde

Thanks for reading.

Please do post feedback.
