Movie Recommendation System | Python AI Web Application | Cloud Deployment

Praveen Kumar
7 min read · Jan 30, 2020


The final Web Application can be checked HERE. (Be patient; the cloud run is slow.)

The complete code is on Github. Check HERE.

Welcome, cinephiles and coders alike! Today, we’re diving into the fascinating world of recommendation systems. Specifically, we’re constructing a Movie Recommendation System using Python, leveraging the power of Kaggle datasets. This system is designed to suggest movies to users based on their viewing history and preferences. So, grab your popcorn, and let’s get coding!

In this blog, I am going to explain the following:

  1. Data Preprocessing
  2. Building Movie Recommender Machine Learning Model
  3. Creating Web Pages and connecting them to Django Rendering
  4. Creating a complete interface with exception handling and form validation in Python
  5. Deploying to Render.com

So, without further ado, let’s get started. Also, feel free to skip any section if you are already familiar with it.

Setting the Stage: Importing Libraries and Datasets

First things first, we need to set up our environment by installing necessary packages and importing our data. We’ll use pandas for data manipulation, numpy for numerical operations, and several other tools for specific tasks like feature extraction and similarity computation.

!pip install --quiet fastparquet
!pip install --quiet pyarrow

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pyarrow as pa
import pyarrow.parquet as pq
import warnings
warnings.simplefilter('ignore')

We load our movie datasets — movies_metadata.csv, credits.csv, keywords.csv, and links.csv. These datasets provide a comprehensive look at various aspects of the movies, including metadata, cast, crew, and keywords associated with each movie.
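Loading the files is a one-liner each with pandas. The sketch below writes tiny stand-in CSVs first, only so it runs end to end; in the real project these are the Kaggle downloads sitting in your working directory.

```python
import pandas as pd

# Stand-in files so the snippet is self-contained; replace with the
# actual Kaggle downloads (movies_metadata.csv, credits.csv, keywords.csv).
pd.DataFrame({'id': ['1'], 'title': ['Toy Story'],
              'genres': ["[{'id': 16, 'name': 'Animation'}]"]}).to_csv('movies_metadata.csv', index=False)
pd.DataFrame({'id': [1], 'cast': ['[]'], 'crew': ['[]']}).to_csv('credits.csv', index=False)
pd.DataFrame({'id': [1], 'keywords': ['[]']}).to_csv('keywords.csv', index=False)

# low_memory=False avoids mixed-dtype warnings on the large metadata file.
movies_dataset = pd.read_csv('movies_metadata.csv', low_memory=False)
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')
```

Note that columns like genres, cast, crew, and keywords arrive as stringified lists, which is why the preprocessing below leans on `literal_eval`.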

Data Limitations

Let’s look at the dataset first. We are going to use the TMDB 5000 DATASET from Kaggle.com, which consists of two files: tmdb_5000_credits.csv and tmdb_5000_movies.csv. There is another dataset, THE MOVIES DATASET, which has more than a million movie reviews and ratings. However, I did not use it, for two reasons:

  1. The dataset is too large for the system, requiring an estimated 45–50 GB of RAM.
  2. The resulting machine learning model is also too large for Render, which does not allow storing more than 250 MB on a free account.

Data Preprocessing: The Art of Cleaning Data

Data preprocessing is crucial in any data science project. We start by dropping rows with inconsistent data types, extracting relevant information, and merging datasets to create a master dataset. This master dataset will serve as the backbone of our recommendation system.

Key Steps in Data Preprocessing:

  1. Extracting Genres: Convert genres from a string representation to a list.
  2. Type Conversion: Ensure the ‘id’ column across different datasets is of the same type for successful merging.
  3. Merging Datasets: Combine movies_dataset, credits, and keywords to form a unified dataset.
  4. Parsing Columns: Convert the stringified lists back into lists for columns like cast, crew, and keywords.
  5. Extracting Director’s Name: A function get_director is created to extract the director's name from the crew information.
  6. Cleaning and Trimming Data: Focus on top cast members, stem keywords, and lowercase transformations to maintain uniformity and relevance.
  7. Creating a Soup Feature: A combination of keywords, cast, director, and genres for each movie. This ‘soup’ becomes the basis of our content-based recommendation system.
from ast import literal_eval

movies_dataset['genres'] = movies_dataset['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies_dataset['id'] = movies_dataset['id'].astype('int')
master_dataset = movies_dataset.merge(credits, on='id').merge(keywords, on='id')

Here we transform the genres column into a list of genre names and ensure the id column is of the same data type (int) across datasets for smooth merging. Finally, we merge movies_dataset with the credits and keywords datasets to form a comprehensive master_dataset.
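The get_director helper and the parsing of the stringified cast and crew columns are referenced in the steps above but not shown. A minimal sketch (the toy data here is illustrative, not from the real dataset) might look like this:

```python
from ast import literal_eval

import numpy as np
import pandas as pd

def get_director(crew):
    """Return the director's name from a parsed crew list, or NaN if absent."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return np.nan

# Toy rows standing in for master_dataset; the real columns hold the same
# stringified-list format.
df = pd.DataFrame({
    'crew': ["[{'job': 'Director', 'name': 'John Lasseter'}]"],
    'cast': ["[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}]"],
})

# Convert the stringified lists back into real Python lists.
for col in ['crew', 'cast']:
    df[col] = df[col].apply(literal_eval)

df['director'] = df['crew'].apply(get_director)
```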

Feature Engineering: The Secret Sauce

Feature engineering involves transforming raw data into features that better represent the underlying problem, improving the accuracy of machine learning models. In our case, we create a ‘soup’ feature that amalgamates all the significant elements of a movie into a single string.

master_dataset['cast'] = master_dataset['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x][:3])
master_dataset['director'] = master_dataset['crew'].apply(get_director).astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
master_dataset['soup'] = master_dataset['keywords'] + master_dataset['cast'] + master_dataset['director'] + master_dataset['genres']
master_dataset['soup'] = master_dataset['soup'].apply(lambda x: ' '.join(x))

Building the Recommendation Engine

Once our data is prepped and ready, we focus on the core of our project — the recommendation engine. This engine is based on the concept of content-based filtering.

Steps to Build the Engine:

  1. Count Vectorization: We use CountVectorizer to convert the soup of words into a matrix of token counts.
  2. Cosine Similarity: This metric helps us in determining how similar the movies are to each other based on their soup feature.
  3. Storing Results: We store the master dataset and similarity matrix using pyarrow, which is efficient for handling large datasets.
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=2, stop_words='english')
count_matrix = count.fit_transform(master_dataset['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

table = pa.Table.from_pandas(pd.DataFrame(cosine_sim))
pq.write_table(table, '/content/model.parquet')

With this, our recommender engine is complete and saved to model.parquet.
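The engine still needs a lookup function that turns a row of cosine_sim into ranked movie titles. A hedged sketch of that step, with a toy three-movie dataset standing in for master_dataset so the snippet is self-contained:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for master_dataset and its soup feature.
master_dataset = pd.DataFrame({
    'title': ['Toy Story', 'Toy Story 2', 'Heat'],
    'soup': ['pixar buzz woody animation',
             'pixar buzz woody sequel animation',
             'bank heist crime thriller'],
})
count_matrix = CountVectorizer(analyzer='word', stop_words='english').fit_transform(master_dataset['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

# Reverse map from title to row index.
indices = pd.Series(master_dataset.index, index=master_dataset['title'])

def get_recommendations(title, top_n=10):
    """Return the titles most similar to `title`, best match first."""
    idx = indices[title]
    scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    scores = scores[1:top_n + 1]  # skip the movie itself at position 0
    movie_indices = [i for i, _ in scores]
    return master_dataset['title'].iloc[movie_indices].tolist()
```

With the toy data above, `get_recommendations('Toy Story', top_n=2)` ranks the sequel first because its soup shares three tokens with the query, while 'Heat' shares none.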

Integration with Django Web Application

The Django Setup

We have a Django project named movie_recommendation with a single app called recommender. This structure is clean and typical for a focused Django application.

Project Structure

  • The movie_recommendation directory contains global settings and URLs.
  • The recommender app is where the magic happens—it's responsible for generating movie recommendations.
  • Static files, including our precomputed movie dataset and similarity matrix, reside in the static directory.

Deploying to Render.com

Before we begin deployment, create an account on render.com. Simple steps for creating an account are:

  1. Visit Render’s website and click on the “Sign Up” button.
  2. Enter your email address, create a password, and agree to the terms of service.
  3. Verify your email address by clicking on the link sent to your inbox.
  4. Once verified, you will be redirected to your Render dashboard.

Then, once you are on your dashboard, select the top right-hand side toggle button to create a new web service.

Once you choose, it will ask you to pick your code base. As we have already pushed our code to GitHub, we will authorize Render to access our repository. If you have not logged in through GitHub, it will prompt you to grant access to your GitHub repositories.

VERY VERY IMPORTANT STEP BEFORE YOU AUTHORIZE YOUR REPOSITORY.

You need to create a build.sh file at the root level which contains the following code.

#!/usr/bin/env bash
set -o errexit
pip install -r requirements.txt
python manage.py collectstatic --no-input

This is used by Render.com to build your application automatically. It contains the commands that must run, in order, for your application to build smoothly: it installs all your dependencies, collects all your static files (including the model files), and then the final build is deployed to the cloud.

This file is located in my repository.

Currently, we do not use the db.sqlite3 database file, but it also needs to be present before the build. Once you complete this step, you can proceed to connect your repository.
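build.sh only covers the build phase; Render also asks for a start command. For a Django project named movie_recommendation, the usual choice is gunicorn (an assumption here; check the actual repository, and make sure gunicorn is listed in requirements.txt):

```shell
# Start command entered in the Render dashboard
gunicorn movie_recommendation.wsgi:application
```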

Once you select your repository, a screen pops up asking for the inputs required before the final build: the name of your web application, environment, build commands, the compute instance to use, and so on. The most important are three things: the name of your web service (this is shown on your dashboard and is not the domain), the GitHub branch the code is on, and the build command. I have highlighted them here for example.

You are also expected to choose an instance type; in our case, we stick with the free tier.

Once done, choose “Create Web Service” and you will be redirected to this page which shows the name of the web service, the URL on which your application will be hosted, and logs that show the current status of the application build.

By default, a deploy is triggered automatically every time you push changes to the branch. With this, your application is launched successfully, and you can visit the URL shown on the dashboard.

Hooray!!! Congratulations on building your application. These steps can be used to build any Web Application using the Python-Django Framework. You may need to add slight tweaks accordingly, but the fundamentals will be the same.

The final application can be checked HERE. (Be patient; the cloud run is slow.)

The complete code is on Github. Check HERE.

If you liked it, make sure to follow me for more upcoming articles, free of charge.


Praveen Kumar

Senior AI Engineer - Building Products for Organizations, gaining Experience, and sharing it out on social media.