Movie Recommendation System

Kunal Bharadwaj

Published in

Analytics Vidhya

7 min read · Oct 25, 2019

Original work of Kunal Bhardwaj & Veera Vignesh (Blog)

Introduction

Movies have been a part of our culture for a long time. It all started with plays, which date back to the 5th century BC. The short films projected during the 1890s are considered the breakthrough of the film industry. The 20th century acted as a catalyst for the growth of the industry, as both the movies and the technologies used to make them evolved. The industry has been through many phases, such as

Silent Era
Rise of Hollywood
Golden Era
Appearances of Blockbusters
Modern film industry

Now the industry has matured into a $40 billion industry, with the USA being the third largest market behind China and India in terms of tickets sold.

The USA is home to world-famous production houses such as Warner Bros, Sony Motion Pictures, Walt Disney and Universal Pictures, to name a few.

Problem Statement

Production houses primarily aim to make their movies likable and profitable. Suppose a production house is interested in answering the following question:

What are the factors to be considered to make a successful movie?

Objective

To analyze the factors affecting the success of a movie, such as gross, Facebook likes, critic reviews, IMDb score etc.
To recommend a suitable director, cast and plot based on the chosen genre to make our movie profitable

Data

To analyze the mentioned problem, the IMDB-5000-Movie-Dataset was obtained from data.world.
Our data consists of 5043 movies from the years 1916 to 2016.
Each observation represents an individual movie, with fields such as title, year, director, cast etc., for a total of 5043 rows and 28 columns.

Feature engineering

The genres column contains multiple values delimited with the pipe character ('|'), so Excel was used to split them into individual columns. Only the top 3 genres of each movie are kept, named genre_1, genre_2 and genre_3. Now our data set contains 5043 rows and 30 columns.

These modifications were done using Excel. After the initial modifications, the data is loaded into Python using pandas for further analysis.
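The split described above was done in Excel. As a rough pandas equivalent, here is a minimal sketch; the helper name `split_genres` and the toy rows are made up for illustration and are not part of the original workflow.

```python
import pandas as pd

def split_genres(df, column="genres", max_genres=3):
    """Split a pipe-delimited column into genre_1..genre_N columns (hypothetical helper)."""
    # A single-character pattern is treated as a literal '|', not as a regex
    parts = df[column].str.split("|", expand=True)
    # Keep only the first max_genres positions, padding with missing values if needed
    parts = parts.reindex(columns=range(max_genres))
    parts.columns = [f"genre_{i + 1}" for i in range(max_genres)]
    return df.join(parts)

# Toy rows, not taken from the real dataset
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "genres": ["Action|Adventure|Fantasy|Sci-Fi", "Drama", "Comedy|Romance"],
})
result = split_genres(sample)
print(result)
```

Movies with fewer than three genres simply get missing values in the remaining genre columns.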

Exploratory Data Analysis

# importing necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import statsmodels.api as stm

# Loading the data into the dataframe
df = pd.read_csv('movie_metadata.csv')

# Displaying 5 samples of the dataset
df.sample(5)

5 rows × 28 columns

The Output of the describe() function is

# To understand the missing values in the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

# To better understand the count of the missing values
df.isnull().sum().sort_values(ascending=False)

gross                        884
budget                       492
aspect_ratio                 329
content_rating               303
plot_keywords                153
title_year                   108
director_name                104
director_facebook_likes      104
num_critic_for_reviews        50
actor_3_name                  23
actor_3_facebook_likes        23
num_user_for_reviews          21
color                         19
duration                      15
facenumber_in_poster          13
actor_2_name                  13
actor_2_facebook_likes        13
language                      12
actor_1_name                   7
actor_1_facebook_likes         7
country                        5
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64

Since gross and budget are important fields with many missing values, let's look at their distributions.

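# Distribution of the Gross
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement)
sns.histplot(df['gross'].dropna(), kde=True, stat='density', color='g')
plt.title('Distribution of Gross')
plt.xlabel('Gross in USD')
plt.ylabel('Density')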
plt.show()
print(f'Mean: {df.gross.mean():.2f}')
print(f'Median: {df.gross.median():.2f}')

Mean: 48468407.53
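Median: 25517500.00

# Distribution of the Budget
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement)
sns.histplot(df['budget'].dropna(), kde=True, stat='density', color='g')
plt.title('Distribution of budget')
plt.xlabel('Budget in USD')
plt.ylabel('Density')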
plt.show()

print(f'Mean: {df.budget.mean():.2f}')
print(f'Median: {df.budget.median():.2f}')

It is clear that both distributions are highly skewed to the right, so imputing with the median is a better approach than imputing with the mean.
However, since the data spans a period of 100 years, imputing with the median of the entire series would be wrong, as the value of money changes over time.

Imputing with the median of the corresponding year is a better approach.
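To see how a per-year median differs from a single overall median, here is a small sketch on made-up numbers (the `toy` frame is purely illustrative and is not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: two years with very different money scales, one missing gross in each
toy = pd.DataFrame({
    "title_year": [1950, 1950, 1950, 2010, 2010, 2010],
    "gross": [1.0e6, 3.0e6, np.nan, 50.0e6, 150.0e6, np.nan],
})

# Median of the non-missing gross values within each year, broadcast back to every row
year_median = toy.groupby("title_year")["gross"].transform("median")

per_year = toy["gross"].fillna(year_median)            # fill with the median of the same year
overall = toy["gross"].fillna(toy["gross"].median())   # fill with one median for all years

print(per_year.tolist())
print(overall.tolist())
```

With the overall median, the 1950 movie and the 2010 movie receive the same filled value, which ignores how much money values differ between the two years; the per-year median keeps each fill on the scale of its own year.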

# Grouping by title_year and imputing gross and budget with median.
df.loc[df.gross.isnull(), 'gross'] = df.groupby('title_year')['gross'].transform('median')
df.loc[df.budget.isnull(), 'budget'] = df.groupby('title_year')['budget'].transform('median')

df.isnull().sum()

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        130
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       100
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

# Dropping rows where both gross and budget are not available
df.drop(df.index[df.gross.isna() & df.budget.isna()], inplace=True)

df.shape

(4946, 28)

df.isnull().sum().sort_values(ascending=False)

aspect_ratio                 309
content_rating               264
plot_keywords                140
num_critic_for_reviews        42
gross                         33
actor_3_facebook_likes        19
actor_3_name                  19
num_user_for_reviews          15
color                         15
facenumber_in_poster          13
duration                      12
actor_2_name                  11
title_year                    11
director_facebook_likes       11
actor_2_facebook_likes        11
director_name                 11
language                       9
actor_1_name                   7
actor_1_facebook_likes         7
budget                         3
country                        1
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64

Dropping these 97 observations also reduced the number of null values in director_name and director_facebook_likes. Since null values now account for less than 1% of the total observations in the key fields, we can proceed with our analysis.
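The 1% figure can be checked directly from the counts above: for example, 33 missing gross values out of 4946 rows is about 0.67%. A minimal sketch of the same check in pandas, using a hypothetical helper `missing_share` and a toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

def missing_share(df, columns):
    """Percentage of missing values in each of the given columns (hypothetical helper)."""
    return df[columns].isna().mean().mul(100).round(2)

# Toy frame for illustration only
toy_missing = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, 4.0],
    "budget": [1.0, 2.0, 3.0, 4.0],
})
share = missing_share(toy_missing, ["gross", "budget"])
print(share)

# Same arithmetic applied to the gross count reported above
print(round(33 / 4946 * 100, 2))
```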

Hypothesis 1: Is the gross of a movie related to its budget?

Let us first understand the overall trend of the budget vs gross for all the years.

The 1970s are rightly called the golden era of the industry: the number of movies produced exploded, and the budgets involved in movie production also increased drastically.

Now let's drill down into this. Let's first test our hypothesis on the year in which the maximum number of movies were produced, and later generalize it over all the years.

print(f"In the year {df.groupby('title_year')['gross'].count().idxmax()}",end='')
print(f" there were about {df.groupby('title_year')['gross'].count().max()} movies released, which maximum as per our data")

In the year 2009.0 there were about 260 movies released, which maximum as per our data

# Plotting regplot for 2009
# (x and y are passed as keywords, since recent seaborn versions no longer accept them positionally)
sns.regplot(x='budget', y='gross', data=df.loc[df.title_year==2009], scatter_kws={'alpha':0.3})
plt.title('Gross vs Budget for the 2009')
plt.show()

We can observe that there is a roughly linear relationship between gross and budget. So far this holds only for 2009, the year with the maximum number of movies.

# Understanding the plot of all the movies over the years
plt.scatter(df.budget,df.gross,alpha=0.3)
plt.xscale('log')
plt.yscale('log')
plt.show()

From the above graph, plotted on log-log axes, we can see a roughly linear trend between budget and gross across all years, which supports our hypothesis.
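The visual impression from a log-log scatter plot can also be summarized with a number. The sketch below computes the Pearson correlation between the logarithms of budget and gross; the helper `log_log_correlation` and the toy values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def log_log_correlation(df, x="budget", y="gross"):
    """Pearson correlation between log10(x) and log10(y) (hypothetical helper)."""
    data = df[[x, y]].dropna()
    # Logarithms are only defined for positive values
    data = data[(data[x] > 0) & (data[y] > 0)]
    return np.log10(data[x]).corr(np.log10(data[y]))

# Toy values, not taken from the real dataset
toy_money = pd.DataFrame({
    "budget": [1e5, 1e6, 1e7, 1e8, np.nan],
    "gross": [2e5, 3e6, 2e7, 4e8, 5e6],
})
r = log_log_correlation(toy_money)
print(f"log-log Pearson r: {r:.3f}")
```

On the real data this would be called as `log_log_correlation(df)`; a value close to 1 would agree with the trend seen in the plot, while a value near 0 would not.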

Hypothesis 2: Best genre to make a successful movie in the current era?

Before getting into the current era, let's explore the best movie genre of all time by plotting the highest grossing genres and understanding the number of movies in each individual genre.

The size of each tile in the treemap represents the number of movies released in that particular genre, and the color of the tile represents the average gross of that genre.

From this we can see that even though the number of movies produced in the animation genre is small, the average gross obtained from it is about $80 M. Adventure contains a considerably larger number of movies, and its average gross is also about $80 M.
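The two quantities behind this chart, the number of movies and the average gross per genre, can be sketched in pandas as follows. The helper `genre_summary`, the use of the `genre_1` column as the grouping key, and the toy numbers are assumptions made for illustration.

```python
import pandas as pd

def genre_summary(df, genre_col="genre_1", gross_col="gross"):
    """Movie count and average gross per genre, sorted by average gross (hypothetical helper)."""
    return (
        df.groupby(genre_col)[gross_col]
          # 'count' counts movies with a non-missing gross value
          .agg(movie_count="count", average_gross="mean")
          .sort_values("average_gross", ascending=False)
    )

# Toy values in millions of USD, not taken from the real dataset
toy_genres = pd.DataFrame({
    "genre_1": ["Animation", "Animation", "Adventure", "Adventure", "Adventure", "Drama"],
    "gross": [80.0, 100.0, 60.0, 80.0, 100.0, 10.0],
})
summary = genre_summary(toy_genres)
print(summary)
```

Restricting the input to a period, for example `df[df.title_year >= 2010]`, would give the same summary for the current decade only.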

Now let's explore the trend of the current decade, 2010–2016.

From the graph it is clear that the average gross of the animation and family genres is higher, while the number of movies released in these genres is smaller. Our production company can capitalize on this opportunity.

Hypothesis 3: Common Plots of successful movies?

For this, the plot_keywords field is used to arrive at the keywords that occur most often in the profitable movies of a particular genre. Based on the keyword frequency, we can build a plot out of them.

The pipe-separated plot_keywords were split into individual fields. For all the profitable movies in a particular genre, the keywords were filtered out to build the Tableau dashboard.
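The keyword counting was done with a Tableau dashboard. A rough pandas sketch of the same idea is shown below; the helper `top_keywords` is hypothetical, and treating a movie as profitable when its gross exceeds its budget is an assumption, since the exact rule is not stated here.

```python
import pandas as pd

def top_keywords(df, genre, n=5, genre_col="genre_1"):
    """Most frequent plot keywords among profitable movies of one genre (hypothetical helper)."""
    # Assumption: a movie counts as profitable when gross > budget
    profitable = df[(df[genre_col] == genre) & (df["gross"] > df["budget"])]
    keywords = (
        profitable["plot_keywords"]
        .dropna()
        .str.split("|")   # one list of keywords per movie
        .explode()        # one row per keyword
        .str.strip()
    )
    return keywords.value_counts().head(n)

# Toy rows, not taken from the real dataset
toy_plots = pd.DataFrame({
    "genre_1": ["Action", "Action", "Action", "Drama"],
    "plot_keywords": ["hero|battle|alien", "battle|hero", "alien|spy", "love|war"],
    "gross": [10.0, 20.0, 1.0, 5.0],
    "budget": [5.0, 10.0, 3.0, 1.0],
})
counts = top_keywords(toy_plots, "Action")
print(counts)
```

The third Action movie is excluded because its gross is below its budget, so its keywords do not contribute to the counts.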