Exploratory Data Analysis: IMDb Dataset

Ömer Faruk Eker · Published in Analytics Vidhya
19 min read · Apr 7, 2020

In this notebook, we will explore IMDb’s dataset, which is available online and refreshed daily.

https://www.imdb.com/interfaces/

Seven gzipped files (tab-separated values) can be downloaded from the website. Files are:

  • title.akas.tsv.gz
  • title.basics.tsv.gz
  • title.crew.tsv.gz
  • title.episode.tsv.gz
  • title.principals.tsv.gz
  • title.ratings.tsv.gz
  • name.basics.tsv.gz

We will go through each of them one by one, except “title.episode.tsv.gz”, as in this notebook we are only interested in movies, not TV series.

Let’s start with the necessary imports of the Python libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
plt.style.use("seaborn-pastel")

Loading the huge tables from the IMDb dataset into our notebook takes a while; hence, we will do it once and save each table as a .sav file using Python’s pickle library. This will speed up data loading in later sessions.

import pickle

# Run once: save all tables one by one into separate .sav files
# import glob
# tsv_files = glob.glob("*.tsv")
# for file in tsv_files:
#     print(file)
#     pickle.dump(pd.read_table(file, sep="\t", low_memory=False, na_values=["\\N", "nan"]),
#                 open(file[:-4] + ".sav", "wb"))
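In later sessions we can then load the cached pickles instead of re-parsing the raw TSVs. Below is a minimal sketch of that load-or-build pattern (the helper name and file-naming scheme are illustrative, following the convention above):

def load_table(name):
    # Load the cached pickle if it exists, otherwise parse the raw TSV once and cache it.
    import os
    sav_path = name + ".sav"                      # e.g. "title.ratings.sav"
    if os.path.exists(sav_path):
        return pickle.load(open(sav_path, "rb"))
    df = pd.read_table(name + ".tsv", sep="\t", low_memory=False,
                       na_values=["\\N", "nan"])
    pickle.dump(df, open(sav_path, "wb"))
    return df

# df_ratings = load_table("title.ratings")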

Let’s examine the tables one by one. From time to time we will join tables to do extra analysis.

Ratings Table

title.ratings.tsv.gz — Contains the IMDb rating and votes information for titles

  • tconst (string) — alphanumeric unique identifier of the title
  • averageRating — weighted average of all the individual user ratings
  • numVotes — number of votes the title has received
df_ratings = pickle.load(open("title.ratings.sav","rb"))
df_ratings.head()

The ratings table contains over 1 million title rating entries

df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020911 entries, 0 to 1020910
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 1020911 non-null object
1 averageRating 1020911 non-null float64
2 numVotes 1020911 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 23.4+ MB

The average rating is 6.89 (mean), which differs from the median value of 7.1; we will examine this further. The average number of votes per film is close to a thousand (mean), but this time the median (20) differs from the mean dramatically

df_ratings.describe()

The average rating distribution shows a classic negatively skewed shape, where the median is larger than the mean. The vote count distribution is heavily clustered around small values (0–1000 votes), so we need to do a bit of extra work to visualise this packed distribution

ratings = dict(mean=df_ratings.averageRating.mean(),
               median=df_ratings.averageRating.median())
votes = dict(mean=df_ratings.numVotes.mean(),
             median=df_ratings.numVotes.median())

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
ax1 = sns.distplot(df_ratings.averageRating, kde_kws=dict(bw=0.2))
ax1.axvline(x=ratings["mean"], c=sns.color_palette("Set2")[1], label=f"mean={ratings['mean'].round(2)}")
ax1.axvline(x=ratings["median"], c=sns.color_palette("Set2")[2], label=f"median={ratings['median'].round(2)}")
plt.legend()

plt.subplot(1,2,2)
ax2 = sns.distplot(df_ratings.numVotes, kde_kws=dict(bw=0.2))
ax2.axvline(x=votes["mean"], c=sns.color_palette("Set2")[1], label=f"mean={votes['mean'].round(2)}")
ax2.axvline(x=votes["median"], c=sns.color_palette("Set2")[2], label=f"median={votes['median'].round(2)}")
plt.legend()
plt.tight_layout()
plt.show()

Let’s use the pandas.qcut function, which discretizes a variable into equal-sized buckets based on rank or sample quantiles. We will divide our data into 20 buckets and visualise:

  • Over 140K titles (more than 10%) have only 5 votes
  • Films with more than 1095 votes are represented by a single bar in the plot, containing around 50K titles (these are the significant/popular films we are mostly interested in)
buckets = 20
plt.figure(figsize=(15,6))
bins = pd.qcut(df_ratings.numVotes,buckets,duplicates="drop").value_counts()
sns.barplot(x=bins.values,y=bins.index,orient="h")
plt.show()

We can also plot the distribution on a logarithmic scale, which reveals more. Since it is a count variable, the shape resembles a Poisson-like distribution

plt.figure(figsize=(15,6))
ax=sns.distplot(df_ratings.numVotes,kde=False)
ax.set_ylabel("Count")
ax.set_yscale("log")

Title Basics Table

title.basics.tsv.gz — Contains the following information for titles:

  • tconst (string) — alphanumeric unique identifier of the title
  • titleType (string) — the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
  • primaryTitle (string) — the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle (string) — original title, in the original language
  • isAdult (boolean) — 0: non-adult title; 1: adult title
  • startYear (YYYY) — represents the release year of a title. In the case of TV Series, it is the series start year
  • endYear (YYYY) — TV Series end year. ‘\N’ for all other title types
  • runtimeMinutes — primary runtime of the title, in minutes
  • genres (string array) — includes up to three genres associated with the title
df_title_basics = pickle.load(open("title.basics.sav","rb"))
df_title_basics.head()

The title basics table is much larger than the ratings table, with over 6 million entries, each describing basic information about a video title. I used the word “video” because it includes not only films but also TV series, short videos (short films and music clips), and even video games

df_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6682995 entries, 0 to 6682994
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 titleType object
2 primaryTitle object
3 originalTitle object
4 isAdult int64
5 startYear float64
6 endYear float64
7 runtimeMinutes object
8 genres object
dtypes: float64(2), int64(1), object(6)
memory usage: 458.9+ MB

Let’s explore columns one by one. Title type distribution is as follows:

  • The majority of the titles are TV episodes (71%)
  • Movies (including TV movies) cover only 10% of the dataset
df_title_basics.titleType.value_counts().plot.pie(autopct="%.0f%%",figsize=(6,6),pctdistance=0.8,
wedgeprops=dict(width=0.4))
plt.show()

Let’s trim the dataset, as we are interested only in movies. After trimming, the majority of the titles are movies (82%) and the remaining ~120K titles are TV movies

df_title_basics = df_title_basics[df_title_basics.isAdult == 0]
df_title_basics.drop(["isAdult","endYear"],axis=1,inplace=True)
df_title_basics = df_title_basics[(df_title_basics.titleType == "movie") | (df_title_basics.titleType == "tvMovie")]
df_title_basics.titleType.value_counts().plot.pie(autopct="%.0f%%",figsize=(6,6),pctdistance=0.8,
wedgeprops=dict(width=0.4))
plt.show()

How about the distribution of genres? The pie chart below doesn’t look charming, right? That is because movies can have multiple genres, so what we see below is every combination of genres, each represented as its own slice of the pie

df_title_basics.genres.value_counts().plot.pie(autopct="%.0f%%",figsize=(6,6),pctdistance=0.8,
wedgeprops=dict(width=0.4))
plt.show()

To overcome this, we can use Scikit-Learn’s CountVectorizer to detect and count each unique genre (e.g. drama, comedy, etc.). The custom token_pattern below keeps hyphenated genres such as sci-fi and film-noir as single tokens. We will create a new column for each unique genre, marking whether or not a movie has that genre

from sklearn.feature_extraction.text import CountVectorizer

temp = df_title_basics.genres.dropna()
vec = CountVectorizer(token_pattern='(?u)\\b[\\w-]+\\b', analyzer='word').fit(temp)
bag_of_genres = vec.transform(temp)
unique_genres = vec.get_feature_names()
np.array(unique_genres)

Unique genres:

array(['action', 'adult', 'adventure', 'animation', 'biography', 'comedy', 'crime', 'documentary', 'drama', 'family', 'fantasy', 'film-noir', 'game-show', 'history', 'horror', 'music', 'musical', 'mystery', 'news', 'reality-tv', 'romance', 'sci-fi', 'sport', 'talk-show', 'thriller', 'war', 'western'], dtype='<U11')

It did very well at finding and counting the unique genres. Note that, since a movie may have multiple genres at once, the percentages in the bar chart below will not sum to 100%

genres = pd.DataFrame(bag_of_genres.todense(),columns=unique_genres,index=temp.index)
sorted_genres_perc = 100*pd.Series(genres.sum()).sort_values(ascending=False)/genres.shape[0]
plt.figure(figsize=(15,8))
sns.barplot(x=sorted_genres_perc.values,y=sorted_genres_perc.index,orient="h")
plt.xlabel("Percentage of Films (%)")
plt.show()
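As an aside, the same one-hot split can be sketched with pandas alone, using Series.str.get_dummies on the comma-separated genres column (a quick alternative, not the approach used in the rest of this notebook; note it keeps the original capitalisation, e.g. “Sci-Fi”):

# pandas-only alternative: one indicator column per genre, split on commas
genre_dummies = df_title_basics.genres.dropna().str.get_dummies(sep=",")
genre_share = 100 * genre_dummies.mean().sort_values(ascending=False)
genre_share.head(10)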

How about the trends in the number of films made per year and the number of votes cast per year and per film?

The graph below contains two subplots: the former gives the total number of films made each year, and the latter gives the total number of votes cast for the films made in the corresponding year. The latter peaked in 2013 (37 million votes in total for the films released in 2013) and declines dramatically afterwards. This is where things get interesting. The number of films made per year increases until 2017, so why do people stop voting on recent films? Is it because the quality of movies is getting worse? Or is the IMDb website losing its popularity? Either way, it is clear that IMDb still dominates the online film rating sector, with a mind-blowing 250 million monthly visitors.

[read more: https://www.businesswire.com/news/home/20180222005150/en/IMDb-Launches-First-Ever-Skill-Amazon-Alexa]

merged_temp = pd.merge(df_ratings,df_title_basics,on="tconst",how="left")
merged_temp = merged_temp[(merged_temp.startYear.notnull())&(merged_temp.startYear<2020)]
counts_yearly = merged_temp.groupby("startYear").agg({"averageRating":[np.median],
"numVotes":[np.sum,np.size,lambda x: np.sum(x)/np.size(x)]})
max_count_year = counts_yearly[("numVotes","sum")].idxmax().astype(int)
max_year = counts_yearly[("numVotes","size")].idxmax().astype(int)
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
ax =counts_yearly[("numVotes","size")].plot()
ax.annotate(max_year,xy=(max_year,counts_yearly[("numVotes","size")].max()),
xytext=(1980,10000), arrowprops=dict(color="sandybrown",shrink=0.05,width=1))
ax.annotate("WW I",xy=(1914,counts_yearly[("numVotes","size")].loc[1914]), xytext=(1900,2000),
arrowprops=dict(color="sandybrown",shrink=0.05,width=1))
ax.annotate("WW II",xy=(1939,counts_yearly[("numVotes","size")].loc[1939]), xytext=(1950,4000),
arrowprops=dict(color="sandybrown",shrink=0.05,width=1))
plt.title("Total Number Films per Year",fontweight="bold")
plt.subplot(1,2,2)
ax =counts_yearly[("numVotes","sum")].plot()
ax.annotate(max_count_year,xy=(max_count_year,counts_yearly[("numVotes","sum")].max()),
xytext=(1960,3e7),arrowprops=dict(shrink=0.05,color="sandybrown",width=2))
plt.title("Total Number of Voters per Year",fontweight="bold")
plt.show()

This could be another research point; let’s continue looking for more interesting patterns in the data.

The graph below has two subplots again. The former gives the average rating of films per year, whereas the latter displays the average number of votes per film, again on a yearly basis. Since the 1920s, average film ratings have fluctuated but show no monotonic increasing or decreasing trend.

The latter graph reaches its peak in the 90s and early 2000s and has been dropping dramatically since then. Recall that the total number of votes per year peaked in 2013, not in the 90s. This means that 90s films, per film, got the most attention from users. This may be related to the dominant age group of IMDb voters. Since we don’t have that information, we can only guess that the dominant age group is in the range of 30 to 50; 90s films would be the films of their childhood or teenage years, the period when a film impresses you the most, huh? There could of course be other factors, such as 90s films simply being better.

max_count_year_per_film = counts_yearly[("numVotes","<lambda_0>")].idxmax().astype(int)

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
ax =counts_yearly[("averageRating","median")].plot()
plt.title("Average Rating per Year",fontweight="bold")
plt.subplot(1,2,2)
ax = counts_yearly[("numVotes","<lambda_0>")].plot()
ax.annotate(max_count_year_per_film,xy=(max_count_year_per_film,counts_yearly[("numVotes","<lambda_0>")].max()),
xytext=(1960,5200),arrowprops=dict(shrink=0.05,color="sandybrown",width=2))
plt.title("Average Number of Voters per Year per Film",fontweight="bold")
plt.show()

Let’s now visualise the distribution of film runtimes in minutes. The single bar on the far right indicates that there are a few outlier films with runtimes over 50,000 minutes

sns.distplot(df_title_basics.runtimeMinutes.dropna().astype(int),bins=50)
plt.gca().annotate("857\nhours of\nruntime?",xy=(51000,0.00005),xytext=(40000,0.0004),
fontsize=20, ha="center",
arrowprops=dict(color="sandybrown",width=1))
plt.show()

A 35-day (857-hour) long film? Let’s find out which film it is

import warnings
warnings.filterwarnings("ignore")
use = df_title_basics[df_title_basics.runtimeMinutes.notnull()]
use["runtimeMinutes"] = use.runtimeMinutes.astype(int)
use[use.runtimeMinutes>50000]

It is Logistics (2012), the longest documentary ever made. A 72-minute edit can be seen at the following YouTube link:

https://www.youtube.com/watch?v=QYFG0xP12yE

Many may not consider it a form of art: it is essentially a camera placed on top of a container ship, shooting continuously for 35 days. Beijing 2003 and Modern Times Forever were shot with similar intentions

use.sort_values(by="runtimeMinutes",ascending=False).head()

In order to see a clear distribution of “films”, we need to remove these outliers from the data and plot the rest. We will do that by focusing only on movies with a runtime of 300 minutes or less:

  • Now it looks like a normal distribution centered around 90 minutes
rt = use.runtimeMinutes[use.runtimeMinutes<300]
mean_rt,median_rt,mode_rt = rt.mean(),rt.median(),rt.mode()[0]
plt.figure(figsize=(15,5))
sns.distplot(rt,kde_kws=dict(bw=10))
plt.gca().axvline(mean_rt,label="mean",color=sns.color_palette("Set2")[1],ymax=0.1)
plt.gca().axvline(median_rt,label="median",color=sns.color_palette("Set2")[2],ymax=0.2)
plt.gca().axvline(mode_rt,label="mode",color=sns.color_palette("Set2")[3],ymax=0.3)
plt.text(mean_rt+2,0.0025,f"Mean: {int(mean_rt)}")
plt.text(median_rt+2,0.006,f"Median: {int(median_rt)}")
plt.text(mode_rt+2,0.0085,f"Mode: {int(mode_rt)}")
plt.legend()
plt.show()

Now let’s list the Top 20 movies with the highest vote counts. First we need to merge the df_ratings and df_title_basics tables and sort by the number of votes:

  • Popular movies like Fight Club, The Matrix, and the Lord of the Rings trilogy made the list, as expected
  • Except for The Godfather, the Top 20 most-voted films are from the 90s to the present
merged = pd.merge(df_ratings,df_title_basics,on="tconst",how="right").sort_values(by="numVotes",ascending=False)
merged[["numVotes","primaryTitle","startYear"]].iloc[:20,:]

Let’s find the Top 20 highest-rated films; we will use the merged table again, this time sorting by averageRating. Like the Top 250 list on the IMDb website, we will add a precondition to avoid listing obscure highly rated films. Therefore we will only consider films that received ratings from at least 25,000 users:

  • This is still not the same as the IMDb Top 250, since a Turkish film called “Hababam Sinifi” tops our list with a 9.4 average rating
  • That is because the Top 250 list is ranked by a formula that weights each movie’s rating by the number of votes it received, and only counts votes from regular users (a sketch of such a formula follows the code below)
  • Since we do not have information about regular voters, we cannot replicate that filter
merged[merged.numVotes>25000].sort_values(by="averageRating",ascending=False).head(20)
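As a rough illustration, the weighted rating (a Bayesian average) that IMDb historically published for its Top 250 can be sketched as follows; the vote threshold m and the use of the dataset-wide mean as the prior are assumptions here, not the site’s current mechanics:

# Sketch of the classic IMDb weighted rating (Bayesian average):
#   WR = v/(v+m) * R + m/(v+m) * C
# R = a film's average rating, v = its number of votes,
# m = the vote threshold (25000 assumed here), C = the mean rating over all titles
m = 25000
C = merged.averageRating.mean()
candidates = merged[merged.numVotes > m].copy()
v, R = candidates.numVotes, candidates.averageRating
candidates["weightedRating"] = v / (v + m) * R + m / (v + m) * C
candidates.sort_values("weightedRating", ascending=False)[
    ["primaryTitle", "startYear", "averageRating", "numVotes", "weightedRating"]].head(20)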

Let’s now list the 20 worst-rated films:

  • Again, three Turkish films top the list; Turkish users have been very active, huh?
  • Some of the truly bad films include Disaster Movie, Superbabies, Epic Movie, Meet the Spartans, etc.
merged[merged.numVotes>25000].sort_values(by="averageRating",ascending=True).head(20)

Let’s visualise the average ratings (medians) of film genres. We will use the same counter we used above:

  • The highest rating averages go to the documentary and news categories (both political)
  • Horror and sci-fi sit at the bottom of the list (possibly due to too many nonsense examples in both)
merged_temp = merged[merged.genres.notnull()]
vec = CountVectorizer(token_pattern='(?u)\\b[\\w-]+\\b', analyzer='word').fit(merged_temp.genres)
bag_of_genres = pd.DataFrame(vec.transform(merged_temp.genres).todense(),
columns=vec.get_feature_names(),index=merged_temp.index)
merged_temp = pd.concat([merged_temp,bag_of_genres],axis=1)
rating_counts_means = pd.DataFrame([[merged_temp.averageRating[merged_temp[i]==1].median(),merged_temp[i].sum()]
for i in vec.get_feature_names()],columns=["median","count"],index=vec.get_feature_names()).sort_values("median",ascending=False)
plt.figure(figsize=(7,8))
sns.barplot(y=rating_counts_means.index,x=rating_counts_means["median"],orient="h")
for i, counts in enumerate(rating_counts_means["count"]):
    plt.text(0.5, i+0.25, f"{counts:>5} films")
    plt.text(rating_counts_means["median"][i], i+0.25, rating_counts_means["median"][i])
plt.show()

Do users tend to rate higher on movies with longer runtime? Let’s find out:

  • Runtimes of movies are grouped using pandas’ qcut (bins are chosen automatically by qcut to distribute an equal number of samples for each group)
  • Outlier films are not visualised in the boxplot
  • Boxplot medians (vertical lines in each bin) show that the average rating tends to increase when movie runtime increases
  • The exception of films with runtimes of 84 minutes or less could be that this group contains a lot of animated films, which tend to have high ratings
use = merged_temp[merged_temp.numVotes>1000]
use["runtimeMinutes"] = pd.to_numeric(use.runtimeMinutes)
[groups,edges] = pd.qcut(use.runtimeMinutes,10,precision=0,retbins=True)
ratings_avg = use.groupby(groups).agg({"averageRating":np.median})
sns.boxplot(y=groups,x="averageRating",data=use,orient="h",showfliers=False)
for i, rate in enumerate(ratings_avg["averageRating"]):
    plt.text(rate+0.1, i+0.2, rate)
plt.show()

Name Basics Table

name.basics.tsv.gz — Contains the following information for names:

  • nconst (string) — alphanumeric unique identifier of the name/person
  • primaryName (string)– name by which the person is most often credited
  • birthYear — in YYYY format
  • deathYear — in YYYY format if applicable, else ‘\N’
  • primaryProfession (array of strings)– the top-3 professions of the person
  • knownForTitles (array of tconsts) — titles the person is known for
df_name_basics = pickle.load(open("name.basics.sav","rb"))
df_name_basics.head()

The table has almost 10 million names from different professions such as actors, actresses, directors, writers, etc. It contains their birth and death years as well as the titles they are best known for

df_name_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9991422 entries, 0 to 9991421
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 nconst object
1 primaryName object
2 birthYear float64
3 deathYear float64
4 primaryProfession object
5 knownForTitles object
dtypes: float64(2), object(4)
memory usage: 457.4+ MB

The birth year distribution of the persons in the dataset is displayed below. The dataset even contains ancient writers, going as far back as the year 4 AD.

sns.distplot(df_name_basics.birthYear.dropna(),kde=False)
plt.show()

The Top 10 ancient persons in the dataset:

  • Lucio Anneo Seneca, born in 4 AD, is the most ancient person in the dataset
  • Some entries have wrong birth/death year info, such as negative or extreme lifespans (e.g. George Sanger: birth 23 AD, death 1911)
df_name_basics.sort_values("birthYear").head(10)

Distribution of lifespans of persons in the dataset:

  • It shows a negative skew, where mode > median > mean, which is expected for age-at-death distributions
use = df_name_basics[["birthYear","deathYear","primaryName"]].dropna()
use["lifespan"] = use.deathYear - use.birthYear
use["lifespan"][(use.lifespan>200)|(use.lifespan<0)]=use.lifespan.median()
plt.figure(figsize=(12,5))
ax = sns.distplot(use.lifespan)
ax.axvline(use.lifespan.mode()[0],label=f"mode age: {int(use.lifespan.mode()[0])}",color="forestgreen")
ax.axvline(use.lifespan.median(),label=f"median age: {int(use.lifespan.median())}",color="sandybrown")
ax.axvline(use.lifespan.mean(),label=f"mean age: {int(use.lifespan.mean())}",color="fuchsia")
plt.legend()
plt.show()

Let’s find the Top 10 longest-lived persons in the dataset:

  • Jeanne Louise Calment, who died at 122 years of age, had the longest confirmed human lifespan in history (https://en.wikipedia.org/wiki/Jeanne_Calment)
  • Even though she did not take part in any movie whatsoever, she appeared in a documentary about her, which is why she is in our dataset
use.sort_values("lifespan",ascending=False).head(10)

Title Crew Table

title.crew.tsv.gz — Contains the director and writer information for all the titles in IMDb. Fields include:

  • tconst (string) — alphanumeric unique identifier of the title
  • directors (array of nconsts) — director(s) of the given title
  • writers (array of nconsts) — writer(s) of the given title
df_title_crew = pickle.load(open("title.crew.sav","rb"))
df_title_crew.head()

The table consists of 6.6 million entries

df_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6682995 entries, 0 to 6682994
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 directors object
2 writers object
dtypes: object(3)
memory usage: 153.0+ MB

Some films have a large number of writers and directors. To analyse this, we need to combine three tables (title ratings, title crew, title basics)

merged = pd.merge(pd.merge(df_title_basics,df_title_crew,on="tconst",how="left"),df_ratings,on="tconst",how="left")
merged.head(10)

We will be able to analyse directors for 591K films and writers for 492K films, as the merged table has that many non-null rows in the respective columns

merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 658608 entries, 0 to 658607
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 658608 non-null object
1 titleType 658608 non-null object
2 primaryTitle 658608 non-null object
3 originalTitle 658608 non-null object
4 startYear 592644 non-null float64
5 runtimeMinutes 419983 non-null object
6 genres 251408 non-null object
7 directors 591101 non-null object
8 writers 492270 non-null object
9 averageRating 285470 non-null float64
10 numVotes 285470 non-null float64
dtypes: float64(3), object(8)
memory usage: 60.3+ MB

  • The pie chart below shows that it is far less common for a film to have multiple directors (8%) than multiple writers (44%)
  • The vast majority of films (92%) have a single director, while almost a third of films have a writer duo
writer_counts = merged.writers.dropna().apply(lambda x: len(x.split(",")))
writer_counts[writer_counts>4] = "5+"
director_counts = merged.directors.dropna().apply(lambda x: len(x.split(",")))
director_counts[director_counts>2] = "3+"
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
writer_counts.value_counts().plot.pie(autopct="%.0f%%",pctdistance=0.8,
wedgeprops=dict(width=0.4))
plt.title("Number of Writers",fontweight="bold")
plt.ylabel(None)
plt.subplot(1,2,2)
director_counts.value_counts().plot.pie(autopct="%.0f%%",pctdistance=0.8,
wedgeprops=dict(width=0.4))
plt.title("Number of Directors",fontweight="bold")
plt.ylabel(None)
plt.show()

Some extreme numbers of writers for a single movie:

  • Ted Bundy Had a Son has 62 writers
  • The producers of 50 Kisses promote the film with the motto: “50 writers. 50 filmmakers. One picture.”
writer_counts = merged.writers.dropna().apply(lambda x: len(x.split(",")))
temp = merged[["startYear","primaryTitle","writers"]].loc[writer_counts.nlargest(5).index]
temp["Writer Count"] = writer_counts.nlargest(5)
temp

And the top films with the highest numbers of directors:

  • Ted Bundy Had a Son again tops the list, now with a staggering 88 directors
director_counts = merged.directors.dropna().apply(lambda x: len(x.split(",")))
temp = merged[["startYear","primaryTitle","directors"]].loc[director_counts.nlargest(5).index]
temp["Director Count"] = director_counts.nlargest(5)
del merged, merged_temp
temp

Now let’s do some analysis on the success of directors. We need to define our own success criteria first. The first criterion that comes to mind is a director’s average rating score. Again, we will need some thresholds to eliminate obscure local directors. Thresholds:

  • Only films with more than 25,000 votes are accepted
  • Only directors with more than three such films are accepted
  • Only directors whose films have a median vote count above 100K are accepted

We are looking for successful directors with an international impact. To do this we will need to combine all four tables we have examined so far.

director_temp = df_title_crew.drop("writers",axis=1)
director_temp.columns = ["tconst","nconst"]
director_temp.nconst = director_temp.nconst.dropna().apply(lambda x: x.split(",")[0])
directors = pd.merge(pd.merge(pd.merge(df_title_basics,df_ratings,on="tconst"),
director_temp,on="tconst"),
df_name_basics[["nconst","primaryName"]],on="nconst")
del director_temp
directors.head()

The tables are inner-joined and the new table has 280K rows; it contains 280K films together with their director information

directors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280191 entries, 0 to 280190
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 280191 non-null object
1 titleType 280191 non-null object
2 primaryTitle 280191 non-null object
3 originalTitle 280191 non-null object
4 startYear 280182 non-null float64
5 runtimeMinutes 248302 non-null object
6 genres 116922 non-null object
7 averageRating 280191 non-null float64
8 numVotes 280191 non-null int64
9 nconst 280191 non-null object
10 primaryName 280191 non-null object
dtypes: float64(2), int64(1), object(8)
memory usage: 25.7+ MB

Now let’s group them and apply our thresholds to reveal the most successful 10 directors:

  • The table lists the top 10 directors, ordered by their median film rating. The median vote count per film and each director’s film count are given in the last two columns
  • Christopher Nolan deservedly takes first place; he is also my favourite director of all time
  • The list is full of well-known and successful directors, no local guys around :)
director_success = directors[directors.numVotes>25000].groupby("primaryName").\
agg({"averageRating":[np.median],"numVotes":[np.median],"nconst":[np.size]}).\
sort_values(("averageRating","median"),ascending=False)
director_success[(director_success[("numVotes","median")]>100000)&(director_success[("nconst","size")]>3)].\
sort_values(("averageRating","median"),ascending=False).head(10)

Let’s visualise directors with our criteria in a scatter plot:

  • The font size of each director’s name is scaled by their average rating (each director gets their own colour)
  • The circle size for each director is proportional to their total number of films
  • The x-axis is displayed on a log scale to avoid tightly packed dots
  • Directors towards the top of the y-axis have higher rating averages, whereas directors towards the right-hand side of the graph have more popular films
  • Christopher Nolan sits at the top right of the graph; one can say his films are both popular and high quality
  • Charles Chaplin sits at the top left: his films have the highest rating average but are not as popular. I guess that’s again due to the dominance of the voter age group
directors_successful = director_success[(director_success[("numVotes","median")]>150000)&\
(director_success[("nconst","size")]>4)].sort_values(("averageRating","median"),ascending=False)
plt.figure(figsize=(17,15))
sns.scatterplot(y=directors_successful[("averageRating","median")],x=directors_successful[("numVotes","median")],
s=directors_successful[("nconst","size")]*25,hue=directors_successful.index, legend=False,
)
plt.xlabel("Average Number of Votes of Director's Films")
plt.ylabel("Average Rating of Director's Films")
plt.gca().set_xscale("log")
for names in directors_successful.T:
    plt.text(directors_successful.T[names][("numVotes","median")],
             directors_successful.T[names][("averageRating","median")],
             names, rotation=np.random.randint(0,45), rotation_mode="anchor",
             fontsize=directors_successful.T[names][("averageRating","median")]**1.3)

How about writers? Let’s do the same analysis for them:

  • Now Jonathan Nolan, the younger brother of Christopher Nolan, takes the lead, as he co-wrote many of his brother’s films
  • Christopher Markus wrote many recent superhero movies, which is why he sits more towards the popular side (right)
  • Andrew Stanton also wrote many successful animated films such as Finding Nemo, WALL-E, and the Toy Story series
writer_temp = df_title_crew.drop("directors",axis=1)
writer_temp.columns = ["tconst","nconst"]
writer_temp.nconst = writer_temp.nconst.dropna().apply(lambda x: x.split(",")[0])
writers = pd.merge(pd.merge(pd.merge(df_title_basics,df_ratings,on="tconst"),
writer_temp,on="tconst"),
df_name_basics[["nconst","primaryName"]],on="nconst")
writer_success = writers[writers.numVotes>25000].groupby("primaryName").\
agg({"averageRating":[np.median],"numVotes":[np.median],"nconst":[np.size]}).\
sort_values(("averageRating","median"),ascending=False)
writers_successful = writer_success[(writer_success[("numVotes","median")]>150000)&\
(writer_success[("nconst","size")]>=4)].sort_values(("averageRating","median"),ascending=False)
plt.figure(figsize=(17,15))
sns.scatterplot(y=writers_successful[("averageRating","median")],x=writers_successful[("numVotes","median")],
s=writers_successful[("nconst","size")]*25,hue=writers_successful.index, legend=False,
)
plt.xlabel("Average Number of Votes of Writer's Films")
plt.ylabel("Average Rating of Writer's Films")
plt.gca().set_xscale("log")
for names in writers_successful.T:
    plt.text(writers_successful.T[names][("numVotes","median")],
             writers_successful.T[names][("averageRating","median")],
             names, rotation=np.random.randint(0,45), rotation_mode="anchor",
             fontsize=writers_successful.T[names][("averageRating","median")]**1.3)
del writer_temp

Title Principals Table

title.principals.tsv.gz — Contains the principal cast/crew for titles

  • tconst (string) — alphanumeric unique identifier of the title
  • ordering (integer) — a number to uniquely identify rows for a given titleId
  • nconst (string) — alphanumeric unique identifier of the name/person
  • category (string) — the category of job that person was in
  • job (string) — the specific job title if applicable, else ‘\N’
  • characters (string) — the name of the character played if applicable, else ‘\N’
df_title_principals = pickle.load(open("title.principals.sav","rb"))
df_title_principals.head(10)

It is the largest of the IMDb tables, with a size of almost 2 GB. It contains over 38 million rows of principal cast/crew credits (actors/actresses, writers, directors, producers, editors, etc.), each associated with a title.

df_title_principals.category.value_counts().plot.pie(autopct="%.0f%%", pctdistance=0.8, figsize=(7,7),
wedgeprops=dict(width=0.4))
df_title_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38588696 entries, 0 to 38588695
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 ordering int64
2 nconst object
3 category object
4 job object
5 characters object
dtypes: int64(1), object(5)
memory usage: 1.7+ GB
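At this size, loading the whole table at once can strain memory. As a side note, the raw dump can also be streamed in chunks; the file name, chunk size, and column selection below are assumptions for illustration:

# Stream the raw TSV in chunks and keep only the columns we need
chunks = pd.read_table("title.principals.tsv", sep="\t", na_values=["\\N", "nan"],
                       usecols=["tconst", "nconst", "category"], chunksize=1_000_000)
df_principals_slim = pd.concat(chunks, ignore_index=True)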

Let’s find out who has been the busiest in the film industry:

  • Ilaiyaraaja and William Shakespeare are the most-credited contributors
  • Brahmanandam was involved in more than a thousand films as an actor
  • Other than Shakespeare, there are not many widely known personalities on the list
inxs = df_title_principals.tconst.isin(df_title_basics.tconst)
use = df_title_principals[inxs]
top_names = use.nconst.value_counts().head(20)
top_names = pd.DataFrame(list(zip(top_names.index,top_names.values)),columns=["nconst","count"])
top_names = pd.merge(top_names,df_name_basics[["nconst","primaryName"]],on="nconst")
top_names["job_type"] = [use[use["nconst"] == i].category.value_counts().index[0] for i in top_names.nconst]
top_names

Title Akas Table

title.akas.tsv.gz — Contains the following information for titles:

  • titleId (string) — a tconst, an alphanumeric unique identifier of the title
  • ordering (integer) — a number to uniquely identify rows for a given titleId
  • title (string) — the localized title
  • region (string) — the region for this version of the title
  • language (string) — the language of the title
  • types (array) — Enumerated set of attributes for this alternative title. One or more of the following: “alternative”, “dvd”, “festival”, “tv”, “video”, “working”, “original”, “imdbDisplay”. New values may be added in the future without warning
  • attributes (array) — Additional terms to describe this alternative title, not enumerated
  • isOriginalTitle (boolean) — 0: not original title; 1: original title
df_title_akas = pickle.load(open("title.akas.sav","rb"))
df_title_akas.columns = ['tconst', 'ordering', 'title', 'region', 'language', 'types',
'attributes', 'isOriginalTitle']
df_title_akas.head()

The table consists of the titles’ “also known as” (AKA) information, such as variations of a title in different countries. There are 21 million rows in the table.

df_title_akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21116454 entries, 0 to 21116453
Data columns (total 8 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 ordering int64
2 title object
3 region object
4 language object
5 types object
6 attributes object
7 isOriginalTitle float64
dtypes: float64(1), int64(1), object(6)
memory usage: 1.3+ GB

We are only interested in movies, therefore the AKAs table is inner-joined with the title basics table

df_title_akas = pd.merge(df_title_akas,df_title_basics,on="tconst")
df_title_akas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1971524 entries, 0 to 1971523
Data columns (total 14 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 ordering int64
2 title object
3 region object
4 language object
5 types object
6 attributes object
7 isOriginalTitle float64
8 titleType object
9 primaryTitle object
10 originalTitle object
11 startYear float64
12 runtimeMinutes object
13 genres object
dtypes: float64(2), int64(1), object(11)
memory usage: 225.6+ MB

Let’s now convert the region codes into country names, then count them and visualise the Top 20:

  • The count does not give the number of cinemas or tickets; it gives how many different films were shown in cinemas in each country (the higher the number, the greater the variety of films shown)
  • The US clearly wins the competition
  • Greece and Hungary find themselves in the middle of the table, higher than I would have expected; interesting
import pycountry
country_counts = df_title_akas.region.value_counts()
country_names = [pycountry.countries.get(alpha_2=coun).name for coun in country_counts.drop(["XWW","XWG"]).index[:20]]
counts = country_counts.drop(["XWW","XWG"]).values[:20]
plt.figure(figsize=(12,5.5))
sns.barplot(x=counts,y=country_names,orient="h")
plt.xlabel("Total Number of Different Films Shown in Cinemas")
for i, count in enumerate(counts):
    plt.text(100, i, count, va="center")

Combine All Tables (for Movies only)

Now let’s combine all tables, keeping only movie titles. This way, we reduce the dataset size and can save it for future use. It will be much faster to load and work with the final table:

df_movies = pd.merge(pd.merge(pd.merge(pd.merge(df_title_basics,df_ratings,on="tconst"),
df_title_crew,on="tconst"),
df_title_principals,on="tconst"),
df_name_basics,on="nconst")
df_movies.head()

Create a file called “df_movies.csv” and save the final table in .csv format

df_movies.to_csv("df_movies.csv")
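In a later session, the saved file can be reloaded directly (a small sketch; index_col=0 simply re-reads the unnamed index column that to_csv writes by default):

# Reload the combined movies table without touching the raw IMDb dumps
df_movies = pd.read_csv("df_movies.csv", index_col=0, low_memory=False)
df_movies.head()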

The final table has 2.5 million rows, 21 columns

df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2549460 entries, 0 to 2549459
Data columns (total 21 columns):
# Column Dtype
--- ------ -----
0 tconst object
1 titleType object
2 primaryTitle object
3 originalTitle object
4 startYear float64
5 runtimeMinutes object
6 genres object
7 averageRating float64
8 numVotes int64
9 directors object
10 writers object
11 ordering int64
12 nconst object
13 category object
14 job object
15 characters object
16 primaryName object
17 birthYear float64
18 deathYear float64
19 primaryProfession object
20 knownForTitles object
dtypes: float64(4), int64(2), object(15)
memory usage: 427.9+ MB

I uploaded the data file (560 MB) to WeTransfer; it can be downloaded from the link below (WeTransfer deletes uploads after 7 days, so the data is available until 14/04/2020):

https://we.tl/t-cQJVvEjZgv

You can reproduce all the analysis above with this data, as well as implement your own ideas.

Cheers!
