Investigating TMDB Movie Datasets

Ayşe Bat
My Data Science Journey
8 min read · Feb 22, 2019

In this story, I will investigate the TMDB movies dataset, which covers movies released between 1960 and 2015 and includes each movie’s title, budget, revenue, cast, director, genres, release date, release year, runtime, and more.

The primary goal of the project is to perform exploratory data analysis using the numpy, pandas, seaborn, and matplotlib libraries. For this, we need to clean the data first. Before that, we should decide which questions we want the dataset to answer, because those questions guide the cleaning process.

The original data comes from Kaggle.

Questions to be Answered

Which movies have the highest and lowest profit of all time?

Which are the top 10 movies by profit of all time?

What is the highest-profit movie and the total profit for each year?

Which movies have the highest and lowest budget of all time?

Which are the top 10 movies by budget of all time?

What is the highest-budget movie and the total budget for each year?

Which movies have the highest and lowest revenue of all time?

Which are the top 10 movies by revenue of all time?

What is the highest-revenue movie and the total revenue for each year?

Which genres were used most from 1960 to 2015?

Which cast members appeared in the most movies?

Which director made the most movies?

How many movies were released in each month? What is the total profit by month?

Importing the Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

Loading the datasets

The GitHub repository for this project is here.

df = pd.read_csv('data/tmdb_movies_data.csv')
#Let's check the dataset information
df.info()

There are 10866 rows and 21 columns.

  • “id” and “imdb_id” are both identifier columns, so we can drop “imdb_id”, which adds no useful information for this analysis.
  • “popularity”, “budget”, and “revenue” are useful for this analysis, and we will calculate the profit by subtracting the budget from the revenue. First, though, we need to handle the missing values in the budget and revenue columns.
  • “original_title”, “cast”, and “director” contain useful information about the movies.
  • “homepage”, “tagline”, “keywords”, “overview”, “vote_average”, “budget_adj”, and “revenue_adj” are not needed for this analysis, so they can be dropped from the data frame.
  • “release_date” and “release_year” are also important, and we need to convert the release_date column to a pandas datetime object.

Data Cleaning

  • Drop the duplicated rows.
  • Replace the 0 values in budget and revenue with NaN, then drop the rows that have missing values.
  • Convert the release date to datetime format.
  • Delete the unused columns from the data frame.
  • Check that all columns have the desired data type.
  • Calculate the profit by subtracting the budget from the revenue.

# duplicated() marks each duplicate row as True and every other row as False,
# so summing the result counts the duplicate rows
df.duplicated().sum()
# Drop these rows with drop_duplicates()
df.drop_duplicates(inplace=True)
# Convert release_date to datetime format
df['release_date'] = pd.to_datetime(df['release_date'])
# Treat a budget or revenue of 0 as missing, then drop rows where either is missing
df[['budget', 'revenue']] = df[['budget', 'revenue']].replace(0, np.nan)
df.dropna(subset=['budget', 'revenue'], inplace=True)
# Delete the columns that are not used in this analysis
del_col = ['imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'vote_average', 'budget_adj', 'revenue_adj']
df.drop(del_col, axis=1, inplace=True)
# Before answering the questions, let's calculate the profit of each movie
df['profit'] = df['revenue'] - df['budget']
df['profit'] = df['profit'].astype('int64')

The remaining columns, with their null-value counts and data types.

Exploratory Data Analysis

Before going into the exploratory data analysis, we will create a few functions that make the questions easier to answer.

The first function finds the minimum and maximum value of any given column, so we can use it on budget, revenue, and profit to find the movies with the highest and lowest values.
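A minimal sketch of such a function might look like the following. The name find_min_max, its signature, and the toy data frame (with invented titles and numbers) are assumptions for illustration, not the code used to produce the results in this story.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; titles and numbers are invented.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100, 300, 200],
    "revenue": [500, 250, 900],
})
df["profit"] = df["revenue"] - df["budget"]


def find_min_max(data, col):
    """Return the rows holding the largest and smallest value of `col`."""
    highest = data.loc[data[col].idxmax()]
    lowest = data.loc[data[col].idxmin()]
    # Put the two rows side by side so they are easy to compare.
    return pd.concat([highest, lowest], axis=1, keys=["highest", "lowest"])


result = find_min_max(df, "profit")
print(result)
```

Returning both rows in one table keeps every column of the two movies visible, not only the column being compared.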

The top_10 function finds the all-time top 10 movies for any given column and also plots them in a bar chart.
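One way top_10 could be written is sketched below. The function name and the size argument match the calls used later in this story; the data and plot arguments, the toy data frame, and the choice of a horizontal bar chart are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; titles and numbers are invented.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "profit": [400, -50, 700, 120],
})


def top_10(col, size=10, data=df, plot=False):
    """Return the `size` movies with the largest values in `col`.

    With plot=True the ranking is also drawn as a horizontal bar chart
    (pandas uses matplotlib for this).
    """
    ranking = data.nlargest(size, col).set_index("original_title")[col]
    if plot:
        # Reverse the order so the largest bar ends up at the top.
        ax = ranking.iloc[::-1].plot.barh(figsize=(8, 6))
        ax.set_xlabel(col)
        ax.set_title(f"Top {size} movies by {col}")
    return ranking


print(top_10("profit", size=3))
```

Calling top_10("profit", plot=True) in a notebook would draw the chart as well as return the ranking.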

If we want the total or the highest value of a given column (budget, revenue, or profit) for each year separately, we can use the each_year_best function. It plots the total and the highest value of the column per year, for the last 15 years by default.
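A possible shape for each_year_best is sketched here. The function name and the 15-year default come from the description above; the argument names, the returned table layout, and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; titles and numbers are invented.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "release_year": [2014, 2014, 2015, 2015],
    "profit": [400, -50, 700, 120],
})


def each_year_best(col, size=15, data=df, plot=False):
    """Per year: the best movie by `col`, its value, and the yearly total."""
    # Row label of the movie with the highest value in each year.
    best_idx = data.groupby("release_year")[col].idxmax()
    summary = (
        data.loc[best_idx, ["release_year", "original_title", col]]
        .set_index("release_year")
        .rename(columns={"original_title": "best_movie", col: "highest"})
    )
    # The yearly totals align with `summary` on the release_year index.
    summary["total"] = data.groupby("release_year")[col].sum()
    # Keep only the most recent `size` years.
    summary = summary.sort_index().tail(size)
    if plot:
        summary[["total", "highest"]].plot.bar(figsize=(10, 5))
    return summary


print(each_year_best("profit"))
```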

Let’s use these functions on the budget, revenue, and profit columns to find the answers we are looking for.

Which movies have the highest and lowest profit of all time?

Which are the top 10 movies by profit of all time?

top_10('profit')

What is the highest-profit movie and the total profit for each year?

each_year_best('profit')

Which movies have the highest and lowest budget of all time?

Yes, you read that correctly: the highest-budget movie is The Warrior’s Way. It also has the lowest profit, which is probably why you have never heard of it… 😢

Which are the top 10 movies by budget of all time?

top_10('budget')

What is the highest-budget movie and the total budget for each year?

each_year_best('budget')

Which movies have the highest and lowest revenue of all time?

Which are the top 10 movies by revenue of all time?

top_10('revenue')

What is the highest-revenue movie and the total revenue for each year?

each_year_best('revenue')

We are going to write another function to answer the following questions. It takes a column such as genres, cast, or director and counts the values in it, so we can find the genres, cast members, or directors that appear in the most movies over this period.

The split_count_data function takes the column we want to count, finds the most frequent values in it, and then shows them in a bar plot and in a pie chart with percentages.
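The counting step could be sketched as follows. The function name and the size argument match the calls in this story; the assumption that each cell holds several values joined by "|", the sep, data, and plot arguments, and the toy data are mine and should be checked against the real columns.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; the values are invented.
df = pd.DataFrame({
    "genres": ["Action|Drama", "Drama", "Comedy|Drama", "Action"],
})


def split_count_data(col, size=15, data=df, sep="|", plot=False):
    """Count how often each individual value appears in a multi-valued column."""
    counts = (
        data[col]
        .dropna()
        .str.split(sep)   # "Action|Drama" -> ["Action", "Drama"]
        .explode()        # one row per individual value
        .value_counts()   # sorted from most to least frequent
        .head(size)
    )
    if plot:
        # Imported here so the counting part works without matplotlib.
        import matplotlib.pyplot as plt
        fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(14, 6))
        counts.iloc[::-1].plot.barh(ax=ax_bar)
        counts.plot.pie(ax=ax_pie, autopct="%1.1f%%")
        ax_pie.set_ylabel("")
    return counts


print(split_count_data("genres"))
```

Note that the pie chart percentages are relative to the values kept by size, not to every value in the column.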

Which genres were used most from 1960 to 2015?

split_count_data('genres')

Which cast members appeared in the most movies?

split_count_data('cast', size=25)

Which director made the most movies?

split_count_data('director')

Which production companies made the most movies?

split_count_data('production_companies', size=20)

How many movies were released in each month? What is the total profit by month?
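A short sketch of how both numbers could be computed in one pass is shown below. It assumes release_date has already been converted to datetime and that the profit column exists, as in the cleaning step; the toy data frame and the output column names movies and total_profit are invented for the example.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; dates and numbers are invented.
df = pd.DataFrame({
    "release_date": pd.to_datetime(["2015-06-09", "2015-06-20", "2014-12-25"]),
    "profit": [400, -50, 700],
})

# Group by calendar month (1 = January ... 12 = December), ignoring the year.
by_month = (
    df.groupby(df["release_date"].dt.month)["profit"]
    .agg(movies="count", total_profit="sum")
)
by_month.index.name = "month"

print(by_month)
```

Either column can then be plotted, for example with by_month["total_profit"].plot.bar().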

We can also use the top_10 function on the popularity and vote_count columns to find the most popular movies and the movies with the most votes on the TMDB website.

top_10('popularity', size=30)
top_10('vote_count', size=30)

Let’s try to find out whether there is any correlation between these variables.

df_related = df[['profit','budget','revenue','runtime', 'vote_count','popularity','release_year']]
sns.pairplot(df_related, kind='reg')

Let’s check out a few plots below:

  • 1. Budget vs Revenue: Budget and revenue have a positive correlation. This means there is a good chance that movies with higher investments bring in higher revenue.
  • 2. Profit vs Budget: Profit and budget have a positive correlation. This means there is a good chance that movies with higher investments earn a higher profit.
  • 3. Release Year vs Vote Count: Release year and vote count have a negative correlation, which I read as the number of votes not depending much on the release year.
  • 4. Popularity vs Profit: Popularity and profit have a positive correlation. This means that movies with high popularity tend to earn a high profit.
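The regression lines in a pair plot show the direction of each relationship, but not how strong it is. As a hedged complement, the Pearson correlation coefficients can be computed with DataFrame.corr(). The sketch below uses a toy frame with invented numbers, so its output says nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for `df_related`; the numbers are invented.
df_related = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [15, 45, 50, 100],
})
df_related["profit"] = df_related["revenue"] - df_related["budget"]

# Pearson correlation between every pair of numeric columns.
# Values near +1 or -1 mean a strong linear relationship; values near 0 a weak one.
corr = df_related.corr()

print(corr.round(2))
```

On the real data, the same call on df_related would give one coefficient for each pair of columns shown in the pair plot.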

Conclusion

We analyzed the TMDB dataset, which covers movies released between 1960 and 2015. Our goal was to answer the questions above using this dataset. The results can be summarized as follows.
1- The most profitable movie is Avatar, released in 2009. Star Wars: The Force Awakens is second, and Titanic is third.
2- The least profitable movie is The Warrior’s Way, which also has the highest budget.
3- The most common genres are Drama, Comedy, and Action.
4- The actors who appeared in the most movies are Robert De Niro, Bruce Willis, and Samuel L. Jackson.
5- The directors with the most movies are Steven Spielberg, Clint Eastwood, and Ridley Scott.
6- The production companies with the most movies are Universal Pictures, Warner Bros., and Paramount Pictures.
7- The most profitable months are June, December, and May.
8- According to the TMDB dataset, the most popular movies of all time are Jurassic World, Mad Max: Fury Road, and Interstellar.
9- The most voted movies of all time are Inception, The Avengers, and Avatar.
10- Revenue and budget have a positive correlation.
11- Movies with higher investments are likely to earn a higher profit.
