TMDb Movie Data Analysis

Abdulraqib Omotosho
5 min read · Dec 28, 2022

Photo by Jakob Owens on Unsplash

This is one of the projects I did in the Udacity Data Analyst Nanodegree Program. It is divided into three parts: the introduction, data wrangling, and exploratory data analysis.

The Introduction

The dataset contains observations of about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. You can find more about the dataset here. In this project, I’m going to investigate the dataset to find the movies with the highest profits, the biggest hits, and other interesting facts about movies.

Importing libraries and packages.
Reading the dataset.
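
A minimal sketch of these two steps. The real CSV is replaced here by a tiny in-memory stand-in so the snippet runs on its own; the column names and the file name mentioned in the comment are assumptions, not taken from the original notebook.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file; columns and values are made up.
sample_csv = io.StringIO(
    "id,original_title,budget,revenue,runtime,release_date\n"
    "1,Movie A,1000,5000,120,6/9/15\n"
    "2,Movie B,0,0,95,5/13/15\n"
)

# With the real data this would be something like
# pd.read_csv("tmdb-movies.csv")  (hypothetical file name).
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```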

Data Wrangling

The dataset has 10,866 rows.

Dataset columns

Next, I’ll drop some columns here as they are not important in the analysis.

Checking for duplicate rows and removing them.
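
One way to do this in pandas, sketched on a toy table (the column names are assumptions):

```python
import pandas as pd

# Toy data with one exact duplicate row.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "original_title": ["Movie A", "Movie B", "Movie B"],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))
```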

Some columns in the data contain zero (0) values. Let’s check them.

Replacing the zeros with null values.
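
A sketch of counting the zeros and swapping them for nulls. Which columns were affected in the real dataset is not stated here, so `budget`, `revenue` and `runtime` are assumptions used for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; a zero here stands for "value not recorded".
df = pd.DataFrame({
    "budget": [1000, 0, 300],
    "revenue": [0, 0, 900],
    "runtime": [120, 95, 0],
})

# Count zeros per column.
zero_counts = (df == 0).sum()
print(zero_counts)

# Replace zeros with NaN in the chosen columns only.
cols = ["budget", "revenue", "runtime"]
df[cols] = df[cols].replace(0, np.nan)
print(df.isna().sum())
```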

Next, let’s have an overview of the null values in the dataset.

There is a significant number of nulls in some columns.
Null values are now gone
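
A sketch of both steps, counting the nulls and then dropping the affected rows. It assumes rows with any missing value are removed, which matches the remark at the end of this post that many entries were dropped; the column names are again assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [1000.0, np.nan, 300.0],
    "revenue": [5000.0, np.nan, np.nan],
})

# Number of nulls in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that has at least one null.
cleaned = df.dropna()
print(cleaned.isnull().sum().sum())
```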

The release_date column is a string. I’ll convert it to a datetime column.
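
A sketch of the conversion with `pd.to_datetime`. The `month/day/two-digit-year` format used below is an assumption about how the dates are stored; adjust the `format` string if the real column looks different.

```python
import pandas as pd

# Toy dates stored as strings (format is an assumption).
df = pd.DataFrame({"release_date": ["6/9/15", "5/13/15", "12/15/09"]})

# Parse the strings into proper datetime values.
df["release_date"] = pd.to_datetime(df["release_date"], format="%m/%d/%y")

print(df["release_date"].dtype)
print(df["release_date"].dt.year.tolist())
```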

Some columns contain multiple values separated by pipe (|) characters. Let’s look at them.

Next, I’ll create new dataframes to hold the separated columns.

Concatenating the columns to the dataset.

Dropping the columns that contained pipes.
Columns of the new data.

Getting index labels of the new columns added to the dataframe.

Concatenating all the separated columns into a single column and dropping the null values.

Dropping the index labels of the new columns.
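
The steps above (split, concatenate, drop the piped column, then collapse the separated columns into one) can be sketched roughly as follows for a single piped column. This is one possible way to do it, not necessarily the exact code in the notebook; the `genres` column name, the `genre_N` helper names and the use of `melt` are assumptions.

```python
import pandas as pd

# Toy data with a pipe-separated column.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure|Science Fiction", "Drama"],
})

# Split on the pipe; expand=True gives one new column per position.
genre_parts = df["genres"].str.split("|", expand=True)
genre_parts.columns = [f"genre_{i}" for i in range(genre_parts.shape[1])]

# Attach the separated columns and drop the original piped column.
wide = pd.concat([df.drop(columns="genres"), genre_parts], axis=1)

# Collapse the separated columns into a single "genre" column,
# drop the empty slots, and remove the helper label column.
tidy = (
    wide.melt(
        id_vars="original_title",
        value_vars=list(genre_parts.columns),
        value_name="genre",
    )
    .dropna(subset=["genre"])
    .drop(columns="variable")
    .reset_index(drop=True)
)
print(tidy)
```

The result has one row per movie and genre pair, so a movie with three genres appears three times. That is convenient for grouping by genre, but per-movie statistics need the repeats removed first.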

After that, save the data to a new file for further analysis. Note: you could instead make a copy of the data and continue working with that.
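
Both options in a short sketch. The output file name is hypothetical, and the file is written to the system temp directory only so the example is safe to run anywhere.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genre": ["Action", "Drama"],
})

# Option 1: keep working on an independent in-memory copy.
df_clean = df.copy()

# Option 2: write the cleaned data to disk and reload it later.
out_path = os.path.join(tempfile.gettempdir(), "tmdb_movies_clean.csv")
df_clean.to_csv(out_path, index=False)  # index=False skips the row index

reloaded = pd.read_csv(out_path)
print(reloaded.head())
```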

The first five columns of the dataset.

Exploratory Data Analysis

Now that the data is cleaned, we can proceed to explore it.

Correlation between features.
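
A correlation matrix can be computed from the numeric columns alone. The sketch below uses made-up numbers and assumed column names; plotting the result as a heatmap is left out.

```python
import pandas as pd

# Toy data; revenue is built as an exact linear function of budget.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [10.0, 20.0, 30.0, 40.0],
    "revenue": [25.0, 45.0, 65.0, 85.0],
    "runtime": [120.0, 90.0, 110.0, 100.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```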

Now, I’ll proceed to answering some questions that I posed using this data.

1. Which movies had the highest and lowest revenue?

Comic movies generated the highest revenue while Science Fiction, Animation had the lowest revenue.
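
A genre ranking like this could come from a group-by. The sketch assumes one row per movie and genre pair with a `genre` column, and it averages revenue per genre; whether the original analysis used the mean or the sum is not stated, and the numbers are made up.

```python
import pandas as pd

# Toy data: one row per (movie, genre) pair.
df = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Comedy"],
    "revenue": [100.0, 300.0, 50.0, 400.0],
})

# Average revenue per genre, highest first.
revenue_by_genre = (
    df.groupby("genre")["revenue"].mean().sort_values(ascending=False)
)

top_genre = revenue_by_genre.index[0]
bottom_genre = revenue_by_genre.index[-1]
print(top_genre, bottom_genre)
```

Swapping `revenue` for `budget` gives the same ranking for budgets.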

2. Which movies had the highest and lowest budget?

The genres with the highest and lowest revenue also had the highest and lowest budgets.

3. Movies with the highest and lowest runtime.

Top 5 movies with the highest runtime
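
`nlargest` and `nsmallest` are a compact way to pull out such a top five. A sketch on made-up data with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E", "F", "G"],
    "runtime": [90, 200, 150, 80, 300, 120, 110],
})

# Five longest and five shortest movies by runtime.
longest = df.nlargest(5, "runtime")[["original_title", "runtime"]]
shortest = df.nsmallest(5, "runtime")[["original_title", "runtime"]]

print(longest)
print(shortest)
```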

4. Genres with the highest and lowest runtime.

Drama genres had the longest runtimes.
Genres with the longest running times.

5. Most profitable movies

Avatar was the most profitable movie.
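
Profit here is assumed to mean revenue minus budget. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100.0, 50.0, 200.0],
    "revenue": [500.0, 40.0, 450.0],
})

# Profit per movie (assumed definition: revenue minus budget).
df["profit"] = df["revenue"] - df["budget"]

# Title of the row with the largest profit.
most_profitable = df.loc[df["profit"].idxmax(), "original_title"]
print(most_profitable)

# Full ranking, most profitable first.
print(df.sort_values("profit", ascending=False))
```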

6. Hit movies! (most popular movies)

7. Number of movie genres released over the years

8. Most popular genres

Adventure, Drama, and Science Fiction were the most popular genres.

9. Action movie releases over time

Action movie releases
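
Per-year counts like those in questions 7 and 9 can be sketched with a cross-tabulation. The `release_year` and `genre` column names are assumptions, and the data is made up.

```python
import pandas as pd

# Toy data: one row per (movie, genre) pair.
df = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015, 2015],
    "genre": ["Action", "Drama", "Action", "Action", "Comedy"],
})

# Rows are years, columns are genres, cells are release counts.
genres_per_year = pd.crosstab(df["release_year"], df["genre"])
print(genres_per_year)

# A single genre over time is just one column of that table.
action_per_year = genres_per_year["Action"]
print(action_per_year)
```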

10. Month with highest number of movie releases

September was the month with the most movie releases.

11. Day with highest number of movie releases

Most movies were released on Friday.
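
Once `release_date` is a datetime column, the month and weekday counts in questions 10 and 11 can be read off with the `.dt` accessor. A sketch on four made-up dates:

```python
import pandas as pd

df = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-09-04", "2015-09-11", "2015-06-09", "2014-09-05"]
    )
})

# Count releases per month name and per weekday name.
by_month = df["release_date"].dt.month_name().value_counts()
by_day = df["release_date"].dt.day_name().value_counts()

print(by_month.idxmax())  # month with the most releases
print(by_day.idxmax())    # weekday with the most releases
```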

That brings me to the end of my analysis of the TMDb data. The dataset was somewhat limiting, as many entries were dropped due to missing data. Also, some zeros were turned into null values, which significantly increased the amount of missing data in the dataset.

Thanks for reading, I hope you enjoyed it. If you like this, check out my GitHub page for the notebook. Ensure you also follow me. Cheers!
