Movie Recommendation System

Kunal Bharadwaj
Published in Analytics Vidhya
7 min read · Oct 25, 2019

Original work of Kunal Bhardwaj & Veera Vignesh (Blog)

Introduction

Movies have been a part of our culture for a long time. It all started with plays, which date back to the 5th century BC. The short films projected during the 1890s are considered the breakthrough of the film industry. The 20th century acted as a catalyst for the industry's growth, as both the movies and the technologies used to make them evolved. The industry has been through many phases, such as:

  • Silent Era
  • Rise of Hollywood
  • Golden Era
  • Appearances of Blockbusters
  • Modern film industry

The industry has now matured into a $40 billion business, with the USA being the third-largest market behind China and India in terms of tickets sold.

The USA is home to world-famous production houses such as Warner Bros, Sony Motion Pictures, Walt Disney and Universal Pictures, to name a few.

Problem Statement

Production houses primarily aim at making their movies likable and profitable. Suppose a production house is interested in answering the following question:

What are the factors to be considered to make a successful movie?

Objective

  • To analyze the factors affecting the success of a movie, such as gross, Facebook likes, critic reviews, IMDb score, etc.
  • To recommend a suitable director, cast and plot based on the chosen genre to make our movie profitable

Data

  • To analyze the problem, the IMDB-5000-Movie-Dataset was obtained from data.world
  • Our data consists of 5043 movies from the years 1916 to 2016
  • Each observation represents an individual movie with fields such as title, year, director, cast etc., for a total of 5043 rows and 28 columns.

Feature engineering

  • The genres column contains multiple values delimited by the pipe character (‘|’); Excel was used to split them into individual columns. Only the top 3 genres are considered, named genre_1, genre_2 and genre_3. The data set now contains 5043 rows and 30 columns

These modifications were done in Excel. After the initial modifications, the data is loaded into Python using pandas for further analysis
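The same split can also be reproduced in pandas instead of Excel. The following is a minimal sketch on a toy frame, not the workflow used for this post; the sample rows are made up, and only the column names genres, genre_1, genre_2 and genre_3 come from the text above.

```python
import pandas as pd

# Toy rows standing in for the real 'genres' column (values are illustrative).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "genres": ["Action|Adventure|Fantasy|Sci-Fi", "Drama", "Comedy|Romance"],
})

# Split on the pipe delimiter; expand=True gives one column per genre.
split = toy["genres"].str.split("|", expand=True)

# Keep only the first three genres; reindex pads with missing values
# if fewer than three columns exist.
top3 = split.reindex(columns=range(3))
top3.columns = ["genre_1", "genre_2", "genre_3"]

toy = pd.concat([toy, top3], axis=1)
print(toy)
```

Movies with fewer than three genres simply get missing values in the remaining genre columns.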

Exploratory Data Analysis

# importing necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import statsmodels.api as stm
# Loading the data into the dataframe

df = pd.read_csv('movie_metadata.csv')

# Displaying 5 samples of the dataset
df.sample(5)

5 rows × 28 columns

The Output of the describe() function is

# To understand the missing values in the data

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color 5024 non-null object
director_name 4939 non-null object
num_critic_for_reviews 4993 non-null float64
duration 5028 non-null float64
director_facebook_likes 4939 non-null float64
actor_3_facebook_likes 5020 non-null float64
actor_2_name 5030 non-null object
actor_1_facebook_likes 5036 non-null float64
gross 4159 non-null float64
genres 5043 non-null object
actor_1_name 5036 non-null object
movie_title 5043 non-null object
num_voted_users 5043 non-null int64
cast_total_facebook_likes 5043 non-null int64
actor_3_name 5020 non-null object
facenumber_in_poster 5030 non-null float64
plot_keywords 4890 non-null object
movie_imdb_link 5043 non-null object
num_user_for_reviews 5022 non-null float64
language 5031 non-null object
country 5038 non-null object
content_rating 4740 non-null object
budget 4551 non-null float64
title_year 4935 non-null float64
actor_2_facebook_likes 5030 non-null float64
imdb_score 5043 non-null float64
aspect_ratio 4714 non-null float64
movie_facebook_likes 5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
# To better understand the count of the missing values
df.isnull().sum().sort_values(ascending=False)
gross 884
budget 492
aspect_ratio 329
content_rating 303
plot_keywords 153
title_year 108
director_name 104
director_facebook_likes 104
num_critic_for_reviews 50
actor_3_name 23
actor_3_facebook_likes 23
num_user_for_reviews 21
color 19
duration 15
facenumber_in_poster 13
actor_2_name 13
actor_2_facebook_likes 13
language 12
actor_1_name 7
actor_1_facebook_likes 7
country 5
movie_facebook_likes 0
genres 0
movie_title 0
num_voted_users 0
movie_imdb_link 0
imdb_score 0
cast_total_facebook_likes 0
dtype: int64

Gross and budget are important fields with many missing values, so let's look at their distributions.

# Distribution of the Gross
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
sns.histplot(df['gross'].dropna(), kde=True, color='g')
plt.title('Distribution of Gross')
plt.xlabel('Gross in USD')
plt.ylabel('Frequency')
plt.show()
print(f'Mean: {df.gross.mean():.2f}')
print(f'Median: {df.gross.median():.2f}')
Mean: 48468407.53
Median: 25517500.00
# Distribution of the Budget
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
sns.histplot(df['budget'].dropna(), kde=True, color='g')
plt.title('Distribution of budget')
plt.xlabel('Budget in USD')
plt.ylabel('Frequency')
plt.show()
print(f'Mean: {df.budget.mean():.2f}')
print(f'Median: {df.budget.median():.2f}')
  • It is clear that both distributions are highly skewed to the right, so imputing with the median is a better approach than imputing with the mean.
  • Since the data spans a period of 100 years, imputing with the median of the entire series would be wrong, as the value of money changes over time.

Imputing with the median of the corresponding year is a better approach.
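To see what year-wise median imputation does, here is a small sketch on made-up numbers. The toy frame and the gross_filled column are hypothetical; only title_year and gross are names from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values: two years with some observed gross, one year with none.
toy = pd.DataFrame({
    "title_year": [1950, 1950, 1950, 2010, 2010, 2010, 1990],
    "gross": [1e6, 3e6, np.nan, 50e6, np.nan, 150e6, np.nan],
})

# Median gross of each movie's release year, aligned row by row.
year_median = toy.groupby("title_year")["gross"].transform("median")

# Fill only the missing entries; observed values are left untouched.
toy["gross_filled"] = toy["gross"].fillna(year_median)
print(toy)
```

A year in which no movie has an observed gross has no median to borrow, so its rows stay missing after imputation.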

# Grouping by title_year and imputing gross and budget with median.
df.loc[df.gross.isnull(), 'gross'] = df.groupby('title_year')['gross'].transform('median')
df.loc[df.budget.isnull(), 'budget'] = df.groupby('title_year')['budget'].transform('median')
df.isnull().sum()
color 19
director_name 104
num_critic_for_reviews 50
duration 15
director_facebook_likes 104
actor_3_facebook_likes 23
actor_2_name 13
actor_1_facebook_likes 7
gross 130
genres 0
actor_1_name 7
movie_title 0
num_voted_users 0
cast_total_facebook_likes 0
actor_3_name 23
facenumber_in_poster 13
plot_keywords 153
movie_imdb_link 0
num_user_for_reviews 21
language 12
country 5
content_rating 303
budget 100
title_year 108
actor_2_facebook_likes 13
imdb_score 0
aspect_ratio 329
movie_facebook_likes 0
dtype: int64
# Dropping rows where both gross and budget are still missing
df.drop(df.index[df.gross.isna() & df.budget.isna()], inplace=True)
df.shape
(4946, 28)
df.isnull().sum().sort_values(ascending=False)
aspect_ratio 309
content_rating 264
plot_keywords 140
num_critic_for_reviews 42
gross 33
actor_3_facebook_likes 19
actor_3_name 19
num_user_for_reviews 15
color 15
facenumber_in_poster 13
duration 12
actor_2_name 11
title_year 11
director_facebook_likes 11
actor_2_facebook_likes 11
director_name 11
language 9
actor_1_name 7
actor_1_facebook_likes 7
budget 3
country 1
movie_facebook_likes 0
genres 0
movie_title 0
num_voted_users 0
movie_imdb_link 0
imdb_score 0
cast_total_facebook_likes 0
dtype: int64

Dropping these 97 observations also reduced the number of null values in director_name and director_facebook_likes. Since the number of null values is within 1% of the total observations for the prime factors, we can proceed with our analysis.
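The share of missing values per column can be checked directly. For instance, 33 missing gross values out of 4946 rows is about 0.67%. The sketch below shows the calculation on a toy frame with made-up values; on the real data the same expression would be applied to df.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are illustrative).
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, 4.0],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "director_name": ["a", "b", None, None],
})

# isnull().mean() is the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```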

Hypothesis 1: Is the gross of a movie related to its budget?

Let us first understand the overall trend of budget vs gross across all the years.

The 1970s are rightly called the golden era of the industry: as we can see, the number of movies produced exploded and the budgets involved in movie production also increased drastically.
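One way to tabulate such a trend is to bucket the release years into decades and summarize each bucket. This is only a sketch with made-up numbers; the decade column and the aggregate names are hypothetical and not part of the dataset.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "title_year": [1965.0, 1972.0, 1978.0, 1985.0, 2009.0, 2009.0],
    "budget": [2e6, 6e6, 11e6, 20e6, 35e6, 150e6],
    "gross": [5e6, 80e6, 300e6, 40e6, 30e6, 700e6],
})

# Bucket release years into decades, e.g. 1972 -> 1970.
toy["decade"] = (toy["title_year"] // 10 * 10).astype(int)

# Number of movies plus median budget and gross per decade.
trend = toy.groupby("decade").agg(
    movies=("gross", "size"),
    median_budget=("budget", "median"),
    median_gross=("gross", "median"),
)
print(trend)
```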

Now let's drill down into this. We first test our hypothesis on the year in which the maximum number of movies were produced and later generalize it over all the years.

# Year with the largest number of movies in the data
movies_per_year = df.groupby('title_year')['gross'].count()
print(f"In the year {movies_per_year.idxmax()}", end='')
print(f" there were about {movies_per_year.max()} movies released, which is the maximum in our data")
In the year 2009.0 there were about 260 movies released, which is the maximum in our data
# Plotting regplot for 2009 (recent seaborn versions require x and y as keyword arguments)
sns.regplot(x=df.loc[df.title_year==2009,'budget'], y=df.loc[df.title_year==2009,'gross'], scatter_kws={'alpha':0.3})
plt.title('Gross vs Budget for 2009')
plt.show()

We can observe a linear relationship between gross and budget. So far this holds only for 2009, the year with the maximum number of movies.

# Understanding the plot of all the movies over the years
plt.scatter(df.budget,df.gross,alpha=0.3)
plt.xscale('log')
plt.yscale('log')
plt.show()

The graph above supports our hypothesis: across all years, gross tends to rise with budget, and the points follow a roughly straight line on the log-log axes.
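A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation and a least-squares line on log-transformed values. It runs on synthetic stand-in data generated inside the block, not on the movie dataset, so the printed figures say nothing about the real relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: gross is roughly proportional to budget,
# with multiplicative noise. These are NOT values from the dataset.
budget = 10 ** rng.uniform(5, 8, size=200)
gross = budget * 10 ** rng.normal(0, 0.4, size=200)

log_budget = np.log10(budget)
log_gross = np.log10(gross)

# Pearson correlation of the log-transformed values.
r = np.corrcoef(log_budget, log_gross)[0, 1]

# Least-squares line: log10(gross) ~ slope * log10(budget) + intercept.
slope, intercept = np.polyfit(log_budget, log_gross, 1)
print(f"r = {r:.2f}, slope = {slope:.2f}")
```

On the real data, the same two calls could be applied to the rows where both budget and gross are present and positive. A slope close to 1 on log-log axes would correspond to gross growing roughly in proportion to budget.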

Hypothesis 2: What is the best genre for a successful movie in the current era?

Before getting into the current era, let's explore the best movie genre of all time by plotting the highest-grossing movies and the number of movies in each genre.

The size of each tile in the treemap represents the number of movies released in that genre, and its color represents the average gross of that genre.

From this we can see that even though the number of movies produced in the animation genre is small, their average gross is about $80 M. Adventure contains a considerably larger number of movies, and its average gross is also about $80 M.
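The two quantities behind such a chart, the number of movies and the average gross per genre, can be computed with a single groupby. This is a sketch on made-up rows that assumes the genre_1 column from the feature engineering step is used as the genre; the names movies and avg_gross are hypothetical.

```python
import pandas as pd

# Made-up rows; genre_1 is assumed to hold each movie's first listed genre.
toy = pd.DataFrame({
    "genre_1": ["Animation", "Adventure", "Adventure", "Drama", "Drama", "Drama"],
    "gross": [90e6, 60e6, 100e6, 10e6, 20e6, 30e6],
})

# Count of movies (tile size) and mean gross (tile color) per genre.
summary = (
    toy.groupby("genre_1")["gross"]
       .agg(movies="count", avg_gross="mean")
       .sort_values("avg_gross", ascending=False)
)
print(summary)
```

Restricting the frame to a range of release years before grouping would give the same summary for a single era.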

Now let's explore the trend of the current decade, 2010–2016.

From the graph it is clear that the average gross of the animation and family genres is higher while the number of movies released in these genres is lower. Our production company can capitalize on this opportunity.

Hypothesis 3: What are the common plots of successful movies?

For this, the plot_keywords field is used to find the keywords that occur most often in the profitable movies of a particular genre. Based on keyword frequency, we can build a plot outline.

The pipe-separated plot_keywords were split into individual fields. For all the profitable movies in a particular genre, the keywords were filtered to build the Tableau dashboard.
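The keyword counting can be sketched in pandas as well. The example below uses made-up rows and assumes that "profitable" means gross greater than budget; that definition, like the use of genre_1 as the genre, is an assumption and not something the dashboard is confirmed to use.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "genre_1": ["Action", "Action", "Action", "Drama"],
    "budget": [10e6, 20e6, 30e6, 5e6],
    "gross": [50e6, 60e6, 10e6, 8e6],
    "plot_keywords": ["alien|battle|future", "battle|hero", "alien|flop", "family|love"],
})

# Assumption: a movie is "profitable" when its gross exceeds its budget.
profitable = toy[(toy["gross"] > toy["budget"]) & (toy["genre_1"] == "Action")]

# Split the pipe-separated keywords, put one keyword per row, and count them.
keyword_counts = (
    profitable["plot_keywords"]
    .dropna()
    .str.split("|")
    .explode()
    .value_counts()
)
print(keyword_counts.head())
```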

The entire analysis was then converted into a Tableau dashboard, which improves the usability of the report.

Movie Dashboard

For more details:

Github link


Kunal Bharadwaj
Analytics Vidhya

Pursuing Post Graduation in Data Science from Praxis Business School