Movie Recommender Web App

Saurabh Kumar
Published in Analytics Vidhya
6 min read, Apr 5, 2020

Content Based

A movie recommendation system is one of the basic but important machine learning projects. In this blog, I will show you how to create a content-based movie recommender system with a beautiful website and deploy the model on Heroku.

To open the movie recommender system, click on this link:

https://nikmoviemaniac.herokuapp.com/

Data Gathering

I have used the TMDB 5000 movie dataset, which can be found on Kaggle using this link:

https://www.kaggle.com/tmdb/tmdb-movie-metadata

This dataset contains two CSV files:

  1. tmdb_5000_credits.csv
  2. tmdb_5000_movies.csv

Data Preprocessing

I will preprocess the dataset in a Jupyter Notebook.

Read both files using pandas.

Merge both CSV files on the id and movie_id columns and name the combined dataframe df. After merging there will be a large number of columns.
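The reading and merging steps above can be sketched roughly as follows. The function name load_and_merge and the "_credits" suffix are my own choices for illustration, and the two tiny inline tables only stand in for the real Kaggle files so that the sketch runs on its own.

```python
import io

import pandas as pd


def load_and_merge(credits_source, movies_source):
    """Read the two TMDB tables and join them on the movie identifier."""
    credits = pd.read_csv(credits_source)
    movies = pd.read_csv(movies_source)
    # The movies file calls the key "id", the credits file calls it "movie_id".
    # Columns present in both files get a "_credits" suffix on the credits side.
    return movies.merge(
        credits, left_on="id", right_on="movie_id", suffixes=("", "_credits")
    )


# Tiny inline stand-ins (assumed data) so the sketch runs without the Kaggle files.
credits_csv = io.StringIO("movie_id,title,cast\n1,Movie A,[]\n2,Movie B,[]\n")
movies_csv = io.StringIO("id,title,genres\n1,Movie A,[]\n2,Movie B,[]\n")
df = load_and_merge(credits_csv, movies_csv)
print(df.columns.tolist())
```

With the real dataset, the call would be load_and_merge("tmdb_5000_credits.csv", "tmdb_5000_movies.csv").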

To build a content-based recommender system we don’t need many of the columns, such as budget, homepage, popularity and runtime. So I will drop these columns, and the final columns left are:

First column of df dataframe
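Dropping the unused columns might look like the sketch below. Only the four column names mentioned above come from the text; the sample row is made up, and the full list of dropped columns in the notebook is longer.

```python
import pandas as pd

# A one-row stand-in for the merged dataframe (values are placeholders).
df = pd.DataFrame(
    {
        "movie_id": [1],
        "original_title": ["Movie A"],
        "budget": [1000],
        "homepage": ["http://example.com"],
        "popularity": [1.5],
        "runtime": [100.0],
    }
)

# Columns named in the text as not needed for content-based filtering.
unused_columns = ["budget", "homepage", "popularity", "runtime"]

# errors="ignore" skips any listed column that is not present.
df = df.drop(columns=unused_columns, errors="ignore")
print(df.columns.tolist())
```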

Many columns in this dataset, such as genres and keywords, are in JSON format, so we have to decode them and extract the useful content from them.

Now the dataset looks as follows:

Check whether there are any null values in the dataset and, if so, clean the dataset. Up to this point, there are 4803 rows and 10 columns.
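As a small sketch of the decoding step, assuming each cell holds a JSON list of objects with a "name" field (which is how the genres and keywords columns are stored in this dataset); the helper name extract_names is hypothetical.

```python
import json


def extract_names(json_text):
    """Return the "name" field of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(json_text)]


sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = extract_names(sample)
print(names)

# On the dataframe this would be applied column by column, for example:
# df["genres"] = df["genres"].apply(extract_names)
```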
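One way to check for and clean null values is sketched below. Filling the gaps with empty strings is an assumption on my part; dropping the affected rows with dropna() is the other common choice.

```python
import pandas as pd

# Stand-in data with one missing tagline.
df = pd.DataFrame(
    {
        "original_title": ["Movie A", "Movie B", "Movie C"],
        "tagline": ["first tagline", None, "third tagline"],
    }
)

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Replace missing text with an empty string so later string operations work.
df = df.fillna("")
```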

In many columns, the elements are lists. I will remove the commas in the lists and the spaces inside each element (for example, converting ‘science fiction’ to ‘sciencefiction’) so that each element becomes a single meaningful keyword. This matters because in later steps I am going to combine the features into one text column and count the words in it. I am also converting the text to lowercase because string comparison in Python is case-sensitive, so ‘adventure’ and ‘Adventure’ would be treated as different words.

Now, let’s remove the commas and spaces and convert each list to a string.
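A minimal sketch of this cleaning step, assuming each cell is a Python list of strings at this point; the helper name clean_list is hypothetical.

```python
def clean_list(items):
    """Lowercase each item, drop its internal spaces, and join items with spaces."""
    return " ".join(item.replace(" ", "").lower() for item in items)


cleaned = clean_list(["Science Fiction", "Adventure"])
print(cleaned)

# Applied to a dataframe column:
# df["genres"] = df["genres"].apply(clean_list)
```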

The overview, tagline and keywords columns of the dataset contain unhelpful elements such as numbers and stopwords. To tackle this, I use the Rapid Automatic Keyword Extraction (RAKE) algorithm, which extracts the important words from a text.

I am going to apply the RAKE algorithm to these columns of the dataset.

Combine all the columns except original_title and movie_id: movie_id is of no use in a content-based movie recommender system, and original_title is the name of the movie, which needs to stay in a separate column.

Now I drop all the unnecessary columns, keeping only the original_title and combined_features columns.

The dataset now looks as follows:
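To make the idea concrete, here is a simplified pure-Python sketch of RAKE scoring. It is not the code from the notebook, which may well use a library such as rake_nltk; the short stopword list and the function name rake_keywords are assumptions for illustration only.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption); real use needs a full list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to", "with", "by", "for"}


def rake_keywords(text):
    """Return candidate phrases ranked by RAKE score, highest first."""
    # Tokens are runs of letters, or single digits / punctuation marks.
    tokens = re.findall(r"[a-z]+|[^a-z\s]", text.lower())

    # Split into candidate phrases at stopwords and at non-letter tokens.
    phrases, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalpha():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    # Word score = degree / frequency, where degree sums the lengths of the
    # phrases the word appears in (so it counts the word itself too).
    frequency, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / frequency[word] for word in frequency}

    # Phrase score = sum of the scores of its words.
    ranked = sorted(
        phrases, key=lambda p: sum(word_score[w] for w in p), reverse=True
    )
    return [" ".join(phrase) for phrase in ranked]


keywords = rake_keywords("In 2035 a retired space pilot returns to the red planet.")
print(keywords)
```

The number and the stopwords act as phrase boundaries, so only the content words survive as keywords.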
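The combining and dropping steps above could be written along these lines. The sample rows are made up, and the sketch assumes every feature column already holds a plain string.

```python
import pandas as pd

# Stand-in for the cleaned dataframe (values are placeholders).
df = pd.DataFrame(
    {
        "movie_id": [1, 2],
        "original_title": ["Movie A", "Movie B"],
        "genres": ["action adventure", "drama"],
        "keywords": ["spacetravel", "family"],
    }
)

# Every column except the identifier and the title is a text feature.
feature_columns = [
    col for col in df.columns if col not in ("movie_id", "original_title")
]

# Join the feature columns of each row into one space-separated string.
df["combined_features"] = df[feature_columns].agg(" ".join, axis=1)

# Keep only the title and the combined text.
df = df[["original_title", "combined_features"]]
print(df)
```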

A user can search for a movie in many forms. For example, for the movie ‘batman v superman: dawn of justice’, the user might search for ‘batman v superman’, ‘batman superman’, ‘batman dawn of justice’ etc. So I will remove all the unhelpful characters and join all the words together, both in the original_title column and in the user’s input.

Finally, save the dataset in this format to a new CSV file.

You can find the Jupyter Notebook file containing all the code at this link: https://github.com/NIKsaurabh/Movie-Recommendation-Web-App/blob/master/tmdb.ipynb

This completes the data preprocessing steps and produces the final CSV file used in the rest of the project.
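A possible version of this normalisation is sketched below. Keeping digits and dropping everything else is my assumption about what counts as an unhelpful character, and normalize_title is a hypothetical helper name.

```python
import re


def normalize_title(title):
    """Lowercase the title and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


full_title = normalize_title("Batman v Superman: Dawn of Justice")
partial_title = normalize_title("batman v superman")
print(full_title)
print(partial_title in full_title)
```

The same function is meant to be applied to both the title column and the text typed by the user, so that the two can be compared directly.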

Creating a Flask App

In this Python file, I will use the CSV file which I created earlier.

I am using the NearestNeighbors method to find the most related movies based on the user’s input.

First I will import the dataset. After that, the main step is to create count_matrix using CountVectorizer.

CountVectorizer counts how many times each word occurs in a text.

CountVectorizer works as follows for every text:
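As a small illustration with two made-up texts, each row of the resulting matrix is one text and each column is one word of the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up combined_features strings.
texts = ["action adventure sciencefiction", "action drama"]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(texts)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(count_matrix.toarray())
```

The first text contains "action", "adventure" and "sciencefiction" once each, so its row has a 1 in those three columns and a 0 in the "drama" column.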

Now I will create the similarity score matrix, which is a square matrix containing values between 0 and 1. I am using cosine similarity, and because word counts are never negative, the cosine of the angle between two count vectors lies between 0 and 1.

If two movies are highly related, the angle between their vectors is close to zero and the similarity score is close to 1.

Now I will take the user’s input using an HTML page which I have created; a link to the whole code (HTML, CSS etc.) is provided at the end of the blog.
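A short sketch with three made-up texts shows the behaviour described above: texts that share words score closer to 1, and texts with no words in common score 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up combined_features strings.
texts = [
    "action adventure sciencefiction",
    "action adventure fantasy",
    "romance drama",
]

count_matrix = CountVectorizer().fit_transform(texts)

# Entry [i, j] is the cosine similarity between text i and text j.
similarity = cosine_similarity(count_matrix)
print(similarity.round(2))
```

The first two texts share two of their three words, so their score is 2/3, while the first and third share none.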

I will remove all the unwanted characters from the user’s input.

If a title identical to the user’s input is present in the ‘title’ column of the dataset, find the index of that movie; using this index we can find the indices and distances of the related movies with NearestNeighbors.kneighbors().

If the user’s input doesn’t match any row of the ‘title’ column, extract the list of movie names that contain the user’s input, sort the list, take the first movie name in the sorted list, find its index, and perform the steps above to find the indices and distances of the related movies.

The Flask code in the Python file looks like this:

Here main_page.html is the web page where the user enters the movie name, and movie_list.html is the web page where the list of recommended movies is shown.
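The exact-match and fallback lookup described above might be sketched like this. The four rows are invented, the function name recommend and its parameter n are hypothetical, and the choice of the cosine metric with brute-force search is my assumption rather than something taken from the project code.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented stand-in for the final CSV: normalised titles and combined features.
df = pd.DataFrame(
    {
        "title": ["spaceraiders", "spaceraidersreturn", "quietvillage", "starpatrol"],
        "combined_features": [
            "action adventure sciencefiction space",
            "action adventure sciencefiction sequel",
            "drama romance family",
            "action sciencefiction patrol",
        ],
    }
)

count_matrix = CountVectorizer().fit_transform(df["combined_features"])
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(count_matrix)


def recommend(user_input, n=2):
    """Return up to n titles closest to the movie matching user_input."""
    query = re.sub(r"[^a-z0-9]", "", user_input.lower())
    exact = df.index[df["title"] == query]
    if len(exact) > 0:
        idx = exact[0]
    else:
        # Fallback: titles containing the query, sorted, first one wins.
        partial = sorted(t for t in df["title"] if query in t)
        if not partial:
            return []
        idx = df.index[df["title"] == partial[0]][0]
    # Ask for one extra neighbour because the movie itself is returned too.
    distances, indices = model.kneighbors(count_matrix[idx], n_neighbors=n + 1)
    return [df["title"].iloc[i] for i in indices[0] if i != idx][:n]


print(recommend("Space Raiders", n=1))
print(recommend("Raiders Return", n=1))
```

The first call matches a title exactly; the second only matches as a substring, so it goes through the fallback branch.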
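A minimal sketch of how such a Flask file could be laid out is given below. Only the two template names come from the text; the route "/recommend", the form field "movie_name", the template variable "movies" and the placeholder recommend function are all assumptions, so the real file in the repository may differ.

```python
from flask import Flask, render_template, request

app = Flask(__name__)


def recommend(movie_name):
    """Placeholder for the nearest-neighbour lookup; returns a list of titles."""
    return []


@app.route("/")
def main_page():
    # Page with the form where the user enters a movie name.
    return render_template("main_page.html")


@app.route("/recommend", methods=["POST"])
def movie_list():
    # "movie_name" is the assumed name of the form field in main_page.html.
    movie_name = request.form.get("movie_name", "")
    return render_template("movie_list.html", movies=recommend(movie_name))


# Registered URL rules, useful for a quick sanity check.
routes = sorted(rule.rule for rule in app.url_map.iter_rules())
print(routes)
```

Calling app.run() at the end of the file would start the development server locally.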

To deploy the model to Heroku we need two more files:

  1. Procfile
  2. requirements.txt

Flask looks for HTML files in the “templates” folder, and in this project the CSS files go in the “static/styles” folder, so make sure to place your files in the right location; otherwise the app will not work.

Upload all the files and folders to GitHub.

Open Heroku and create an account or sign in.

Click on New/Create new app.

Enter app name and click on Create app.

Click on GitHub and search for your repository.

Click on Connect.

Then click on Deploy Branch.

If everything goes well, Heroku will start installing the packages listed in the requirements.txt file, and at the end it will give a URL ending with herokuapp.com.

You can find the whole project by clicking the link given below:

Thank You ;)
