Exploratory Data Analysis with SQL

8 min readApr 4, 2024

With a database of 350,000+ movies and actors, let’s see what we got

First off, I’d like to give a huge thanks to Stephane Rappeneau for uploading this data to Kaggle. The original datasets can be found here.

I love movies.

Ever since the day my parents set up the ‘magic box’ that gave us ‘unlimited’ pay-per-view movies, I couldn’t get enough. Name any major movie and I can probably give you the release date within a couple years of its first screening, plus the lead actors and the director. Of course, I love shows as well, but there’s a distinct magic that comes with telling a story that somehow feels complete but often only lasts less than a couple of hours. When I stumbled upon this dataset, and with that profound love for movies in my heart, I knew precisely where I’d dedicate some time reinforcing my understanding of SQL and some foundational database concepts. Without further ado, let’s take a look.

Step 1 — Importing the Dataset

The first issue I came across was loading the data into the MySQL database. The phpMyAdmin tool is an excellent free software application that helps facilitate MySQL over the internet, but it has a limit of 40 MB for importing datasets. The `movie_details` table alone accounts for 130 MB. To get around this minor speed bump, I utilized the SQLAlchemy toolkit to import my data.

# import sqlalchemy libraries

from sqlalchemy import create_engine, Column, select, Integer, Text, 
MetaData, Table, String, Date, ForeignKey, DateTime, Float

engine = create_engine('mysql://root:@localhost/movies_db', echo=False)
metadata = MetaData()

For the engine and metadata variables, we’re linking the work that we’ll be doing with SQLAlchemy to the database being run by our computer.

From there, we need to import the data into our Jupyter Notebooks script with Panda’s read_csv function.

import numpy as np
from math import pi
import pandas as pd

yup = pd.read_csv(r'...AllMoviesDetailsCleaned.csv', sep = ';')

okay = pd.read_csv(r'...AllMoviesCastingRaw.csv', sep = ';')

Now, let’s clean up some of the column names and put the Release_Date column into a datetime format.

cols = ['ID', 'Budget', 'Genres', 'IMDB_ID', 'Original_Language',
       'Original_Title', 'Overview', 'Popularity', 'Production_Companies',
       'Production_Countries', 'Release_Date', 'Revenue', 'Runtime',
       'Spoken_Languages', 'Status', 'Tagline', 'Title', 'Vote_Average',
       'Vote_Count', 'Production_Companies_Count', 
       'Production_Countries_Count', 'Spoken_Languages_Count']

yup.columns = cols

yup['Release_Date'] = pd.to_datetime(yup['Release_Date'], dayfirst = True)

Here’s how the two tables look before importing into the database:

I wanted to practice inserting data into a ready-made table and simply loading the dataset as-is with the two tables, so let’s begin with the ready-made table.

To determine the characteristics of the variables within the records, I decided to get the min/max values of each column. The Overview column explains the plot of the films and includes loads of text. Meanwhile, the Original_Language column only uses two-letter initials. The range is pretty outstanding, so let’s use str.len().min() and str.len().max() on the non-integer columns to get some answers.

print(yup['Genres'].str.len().min())
print(yup['Genres'].str.len().max())

print(yup['IMDB_ID'].str.len().min())
print(yup['IMDB_ID'].str.len().max())

print(yup['Original_Language'].str.len().min())
print(yup['Original_Language'].str.len().max())

print(yup['Original_Title'].str.len().min())
print(yup['Original_Title'].str.len().max())

print(yup['Overview'].str.len().min())
print(yup['Overview'].str.len().max())

print(yup['Production_Companies'].str.len().min())
print(yup['Production_Companies'].str.len().max())

print(yup['Production_Countries'].str.len().min())
print(yup['Production_Countries'].str.len().max())

print(yup['Spoken_Languages'].str.len().min())
print(yup['Spoken_Languages'].str.len().max())

print(yup['Status'].str.len().min())
print(yup['Status'].str.len().max())

print(yup['Tagline'].str.len().min())
print(yup['Tagline'].str.len().max())

print(yup['Title'].str.len().min())
print(yup['Title'].str.len().max())

With some accurate counts on the lengths of these columns, let’s make our database table using SQLAlchemy.

movie_details = Table(
    'movie_details', metadata,
    Column("ID", Integer, primary_key = True),
    Column("Budget", Integer),
    Column("Genres", String(100)),
    Column("IMDB_ID", String(12)),
    Column("Original_Language", String(2)),
    Column("Original_Title", String(210)),
    Column("Overview", String(1000)),
    Column("Popularity", Float),
    Column("Production_Companies", String(150)),
    Column("Production_Countries", String(50)),
    Column("Release_Date", Date),
    Column("Revenue", Integer),
    Column("Runtime", Float),
    Column("Spoken_Languages", String(20)),
    Column("Status", String(20)),
    Column("Tagline", String(1000)),
    Column("Title", String(250)),
    Column("Vote_Average", Float),
    Column("Vote_Count", Integer),
    Column("Production_Companies_Count", Integer),
    Column("Production_Countries_Count", Integer),
    Column("Spoken_Languages_Count", Integer),
)

movie_details.create(bind=engine)

To load the data into the table, we have the handy to_sql function.

yup.to_sql('movie_details', con=engine, if_exists='append', index=False)

And, boom, there we are — everything from the yup dataset has been loaded into the movie_details table.

Without cleaning up the okay dataset, we can still import it into phpMyAdmin and the to_sql function does its job very well. If there’s any cleaning to be done, it’s mostly cosmetic or a matter of preference.

okay.to_sql('movie_casting', con=engine, if_exists='replace', index=False)

Step 2 — Business Questions

Moving on to the next stage, let’s discuss some of the things we’d like to learn from our newly created database.

How many ‘big-budget’ movies are there? (> 100,000,000)
How many ‘highly-rated’ movies? (> 8.0)
How many movies have a low sample size of reviews? (< 20)
How many movies have no tagline?
How many were produced in the US?
How many countries have produced movies?
What are some of the oldest movies out there?
How many small-budget movies have more than 2–3 production companies?
How many movies retained their original titles?
What’s the average run-time?
What is the average cost of movies with reported budgets? (>1)
Who was the main actor or actress for the biggest flop?
Who are the top 10 main actors or actresses by appearance?
How many movies have English as their primary language as a percentage?
For each country, which movie had the most revenue?
What is the gender breakdown for the main actors among the first ten movies listed?
What is the running count for revenue of the fifteen highest-earning films produced in France?

Step 3 — Business Answers

#1 — How many ‘big-budget’ movies are there? (> 100,000,000)

SELECT * FROM movie_details
WHERE budget > 100000000;

-- Results in 342 records

#2 — How many ‘highly-rated’ movies? (> 8.0)

SELECT * FROM movie_details
WHERE vote_average > 8.0;

-- Results in 11,160 records

#3 — How many movies have a low sample size of reviews? (< 20)

SELECT * FROM movie_details
WHERE vote_count < 20;

-- Results in 313,489 records

#4 — How many movies have no tagline?

SELECT count(*) FROM `movie_details`
WHERE tagline IS NULL;

-- Results in 276,133 records

#5 — How many were produced in the US?

-- first, gotta check the distinct values under production_countries

SELECT DISTINCT production_countries FROM `movie_details`;

-- okay, United States has two options: United States of America... 
-- ... and United States Minor Outlying Islands

SELECT count(*) FROM `movie_details` 
WHERE Production_Countries = 'United States of America' 
OR 'United States Minor Outlying Islands';

-- Results in 55,504 records

#6 — How many countries have produced movies?

SELECT COUNT(DISTINCT(Production_Countries)) FROM `movie_details`;

-- There are 235 different countries

#7 — What are some of the oldest movies out there?

SELECT * FROM `movie_details`
WHERE Release_Date IS NOT NULL
ORDER BY Release_Date
LIMIT 5 OFFSET 10;

#8 — How many small-budget movies have more than 2–3 production companies?

SELECT count(*) FROM `movie_details`
WHERE( 
 budget BETWEEN 1 AND 50000
)
AND
(
    Production_Companies_Count >= 2
)
;

-- Results in 517 records

#9 — How many movies retained their original titles?

SELECT count(*) FROM `movie_details`
WHERE Original_Title = Title
;

-- Results in 267,431 records

#10 — What’s the average run-time?

SELECT ROUND(AVG(CAST(Runtime AS FLOAT)), 2) FROM `movie_details`;

-- The average is 60.41 minutes

#11 — What is the average cost of movies with reported budgets? (>1)

SELECT AVG(Budget) FROM `movie_details`
WHERE Budget >= 1;

-- The average is 12,779,511.82

#12 — Who was the main actor or actress for the biggest flop?

SELECT details.Title, details.Budget, details.Revenue, details.Budget, cast.actor1_name
FROM movie_details details
JOIN movie_casting cast
 on details.ID = cast.id
WHERE details.budget > 100000000 and details.Revenue BETWEEN 1 AND 50000000
order by 1;

-- It was Nicole Kidman for Australia (2008)

#13 — Who are the top 10 main actors or actresses by appearance?

DROP TABLE IF EXISTS actor_appearances;

CREATE TABLE actor_appearances(
    id bigint(20),
    Title varchar(255),
    `actor1_name` varchar(255),
    `actor2_name` varchar(255),
    `actor3_name` varchar(255),
    `actor4_name` varchar(255),
    `actor5_name` varchar(255)
);


INSERT INTO actor_appearances
SELECT cast.id, details.Title, cast.actor1_name, cast.actor2_name, cast.actor3_name, cast.actor4_name, cast.actor5_name
FROM movie_casting AS cast
LEFT JOIN movie_details AS details
 ON cast.id = details.ID;
    
SELECT actor1_name, COUNT(*) AS credits FROM actor_appearances
WHERE actor1_name != 'None'
GROUP BY actor1_name
ORDER BY credits DESC LIMIT 10;

#14 — How many movies have English as their primary language as a percentage?

SELECT COUNT(CASE Original_Language WHEN 'en' then 1 else null end) as ENG_TOTAL, 
(COUNT(CASE Original_Language WHEN 'en' then 1 else null end)/COUNT(*)) * 100 as ENG_PCT
FROM movie_details;

-- There are 216,831 movies in English, making up about 65.9% of all movies

#15 — For each country, which movie had the most revenue?

SELECT Production_Countries, Title, Budget, MAX(Revenue)
FROM movie_details
WHERE Revenue > 0 AND Budget > 0
GROUP BY 1
ORDER BY 4 DESC;

#16 — What is the gender breakdown for the main actors among the first ten movies listed?

WITH actor_genders AS 
(
    SELECT id, actor1_gender, actor2_gender, actor3_gender, actor4_gender, actor5_gender
    FROM `movie_casting`
    WHERE actor1_name != 'None'
    ORDER BY id
    LIMIT 10
)

SELECT
SUM(NTA_1 + NTA_2 + NTA_3 + NTA_4 + NTA_5) AS NTA_Total,
SUM(Girls_1 + Girls_2 + Girls_3 + Girls_4 + Girls_5) AS Girls_Total,
SUM(Boys_1 + Boys_2 + Boys_3 + Boys_4 + Boys_5) AS Boys_Total
FROM
(
 SELECT
    COUNT(CASE WHEN actor1_gender = 0 THEN actor1_gender END) AS NTA_1,
 COUNT(CASE WHEN actor1_gender = 1 THEN actor1_gender END) AS Girls_1,
 COUNT(CASE WHEN actor1_gender = 2 THEN actor1_gender END) AS Boys_1,
 COUNT(CASE WHEN actor2_gender = 0 THEN actor2_gender END) AS NTA_2,
 COUNT(CASE WHEN actor2_gender = 1 THEN actor2_gender END) AS Girls_2,
 COUNT(CASE WHEN actor2_gender = 2 THEN actor2_gender END) AS Boys_2,
 COUNT(CASE WHEN actor3_gender = 0 THEN actor3_gender END) AS NTA_3,
 COUNT(CASE WHEN actor3_gender = 1 THEN actor3_gender END) AS Girls_3,
 COUNT(CASE WHEN actor3_gender = 2 THEN actor3_gender END) AS Boys_3,
 COUNT(CASE WHEN actor4_gender = 0 THEN actor4_gender END) AS NTA_4,
 COUNT(CASE WHEN actor4_gender = 1 THEN actor4_gender END) AS Girls_4,
 COUNT(CASE WHEN actor4_gender = 2 THEN actor4_gender END) AS Boys_4,
 COUNT(CASE WHEN actor5_gender = 0 THEN actor5_gender END) AS NTA_5,
 COUNT(CASE WHEN actor5_gender = 1 THEN actor5_gender END) AS Girls_5,
 COUNT(CASE WHEN actor5_gender = 2 THEN actor5_gender END) AS Boys_5
 FROM actor_genders
) as subquery;

#17 — What is the running count for revenue of the fifteen highest-earning films produced in France?

SELECT Title, Production_Countries, ID, Revenue, SUM(Revenue) OVER(ORDER BY Revenue DESC) AS Rolling_Revenue
FROM movie_details
WHERE Revenue > 0 and Production_Countries = 'France'
ORDER BY 4 DESC
LIMIT 15;

Opportunities for Further Study

Beyond creating a dashboard to allow for people to play with the figures themselves, I believe this data can answer some interesting questions like:

What is the relationship between the number of actors on set and budget?
Which countries appear to be the cheapest ones to produce films based on budget size and revenue?
Which country has the most successful moviemaking industry outside of the United States?

Wrap-up

Thanks for taking the time to read my article on exploratory data analysis utilizing SQL (and a bit of Python).

If you’d like to take a look at my GitHub for this project, you can find it here.

Have a great day!