TMDb Movie Data Analysis

5 min read · Dec 28, 2022

This is one of the projects I did in the Udacity Data Analyst Nanodegree Program. It is divided into three parts: the introduction, data wrangling and exploratory data analysis.

The Introduction

The dataset contains observations of about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. You can find more about the dataset here. In this project, I’m going to investigate the dataset to find the movies with the highest profits, the biggest hits and other interesting facts about movies.

Reading the dataset.
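A minimal sketch of this step with pandas. The in-memory CSV stands in for the real file, its two rows are made up, and the filename mentioned in the comment is an assumption.

```python
import io
import pandas as pd

# Stand-in for the real CSV file; the rows below are invented for illustration.
sample_csv = io.StringIO(
    "id,original_title,budget,revenue\n"
    "1,Movie A,100,250\n"
    "2,Movie B,0,40\n"
)

# With the real file this would be something like
# pd.read_csv("tmdb-movies.csv") (the filename is an assumption).
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the data
```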

Data Wrangling

The dataset has 10,866 rows.

Next, I’ll drop some columns that are not needed for the analysis.
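Dropping columns could look roughly like this. The two columns dropped here are a hypothetical choice for illustration, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "homepage": ["http://example.com", None],
    "tagline": ["A tagline", None],
    "revenue": [250, 40],
})

# Hypothetical selection of columns that are not needed for the analysis.
columns_to_drop = ["homepage", "tagline"]
df = df.drop(columns=columns_to_drop)

print(df.columns.tolist())
```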

Checking for duplicate rows and removing them.
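One way to do this in pandas, shown on made-up rows:

```python
import pandas as pd

# Made-up sample with one fully duplicated row.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie B"],
    "revenue": [250, 40, 40],
})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```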

Some columns in the data contain zero (0) values. Let’s check them.
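A sketch of how the zeros can be counted and then treated as missing. Which columns were converted is an assumption here (budget, revenue and runtime), and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample values.
df = pd.DataFrame({
    "budget": [100, 0, 0],
    "revenue": [250, 40, 0],
    "runtime": [120, 95, 0],
})

# Count the zeros in each column.
zero_counts = (df == 0).sum()
print(zero_counts)

# Assumed set of columns where a zero really means "unknown".
numeric_cols = ["budget", "revenue", "runtime"]
for col in numeric_cols:
    df[col] = df[col].replace(0, np.nan)
```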

Next, let’s have an overview of the null values in the dataset.

There is a significant number of nulls in some columns.
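The overview of nulls can be produced along these lines (sample rows are made up):

```python
import pandas as pd

# Made-up sample with some missing values.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "homepage": [None, None, "http://example.com"],
    "director": ["Dir A", None, "Dir C"],
})

# Number of nulls per column, largest first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Share of nulls per column, useful for judging how significant they are.
null_share = df.isnull().mean()
print(null_share)
```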

The release_date column is a string. I’ll convert it to a datetime column.
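A minimal sketch of the conversion, assuming the dates are stored as month/day/two-digit-year strings. Two-digit years are ambiguous, so older releases may need checking against another column after parsing.

```python
import pandas as pd

# Made-up sample dates in an assumed month/day/year format.
df = pd.DataFrame({"release_date": ["6/9/15", "5/13/15", "12/15/15"]})

# Parse the strings into proper datetime values.
df["release_date"] = pd.to_datetime(df["release_date"], format="%m/%d/%y")

print(df.dtypes)
```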

Some columns contain multiple values separated by pipe (|) characters. Let’s see them.

Next, I’ll create new dataframes to hold the separated columns.
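Splitting one such column into a new dataframe could look like this. The genres column is used as the example, and the names of the new columns are hypothetical.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure|Science Fiction", "Drama"],
})

# expand=True returns a new dataframe with one column per separated value;
# rows with fewer values are padded with missing values.
genres_split = df["genres"].str.split("|", expand=True)

# Hypothetical names for the new columns: genres_1, genres_2, ...
genres_split.columns = [f"genres_{i + 1}" for i in range(genres_split.shape[1])]

print(genres_split)
```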

Concatenating the columns to the dataset.

Dropping the columns that contained pipes.

Getting index labels of the new columns added to the dataframe.

Concatenating all the separated columns into a single column and dropping the null values.

Dropping the new columns by their index labels.
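The steps above can be sketched end to end for a single column. This is one interpretation of the approach, using melt to stack the separated columns into one; the column names and sample rows are assumptions.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split the pipe column, attach the helper columns, drop the original column.
genres_split = df["genres"].str.split("|", expand=True).add_prefix("genres_")
df = pd.concat([df, genres_split], axis=1).drop(columns=["genres"])

# Labels of the helper columns that were just added.
helper_cols = list(genres_split.columns)

# Stack the helper columns into a single "genre" column (one row per
# movie and genre), drop the nulls, then drop the helper label column.
df = (
    df.melt(id_vars=["original_title"], value_vars=helper_cols, value_name="genre")
      .dropna(subset=["genre"])
      .drop(columns=["variable"])
      .reset_index(drop=True)
)

print(df)
```

A shorter route to a similar long format is to split without expand=True and call DataFrame.explode on the resulting column.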

Thereafter, save the data to a new file for further analysis. Note: you can also make a copy of the dataframe and continue working with the copy.
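Both options in a short sketch. The output filename is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Made-up cleaned data.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "revenue": [250, 40],
})

# Option 1: keep working on an independent copy of the dataframe.
df_clean = df.copy()

# Option 2: save to a new file (hypothetical filename) and load it again later.
out_path = os.path.join(tempfile.mkdtemp(), "tmdb_movies_clean.csv")
df_clean.to_csv(out_path, index=False)  # index=False avoids an extra index column

reloaded = pd.read_csv(out_path)
print(reloaded.head())
```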

The first five rows of the dataset.

Exploratory Data Analysis

Now that the data is cleaned, we can proceed to explore it.

I’ll answer some questions that I posed about this data.

1. What movies had the highest and lowest revenue?

Comic movies generated the highest revenue, while Science Fiction and Animation had the lowest revenue.
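A minimal sketch of how the highest and lowest rows can be looked up, on made-up values. The same pattern works for other numeric columns such as budget or runtime.

```python
import pandas as pd

# Made-up sample values.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "revenue": [250.0, 40.0, 900.0],
})

# idxmax/idxmin return the index label of the largest/smallest value
# (missing values are skipped by default).
highest = df.loc[df["revenue"].idxmax()]
lowest = df.loc[df["revenue"].idxmin()]

print(highest["original_title"], highest["revenue"])
print(lowest["original_title"], lowest["revenue"])
```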

2. What movies had the highest and lowest budgets?

The genres with the highest and lowest revenue also had the highest and lowest budgets, respectively.

3. Movies with the highest and lowest runtime.

4. Genres with the highest and lowest runtime.
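For a per-genre comparison, a groupby is one option. This sketch assumes a long format with one row per movie and genre, compares mean runtimes (the aggregation is an assumption), and uses made-up values.

```python
import pandas as pd

# Made-up sample in long format: one row per movie and genre.
df = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Animation"],
    "runtime": [120, 100, 150, 80],
})

# Mean runtime per genre, longest first.
mean_runtime = df.groupby("genre")["runtime"].mean().sort_values(ascending=False)

print(mean_runtime)
```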

5. Most profitable movies
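Profit is taken here as revenue minus budget, which is an assumption about the definition used. The values are made up.

```python
import pandas as pd

# Made-up sample values.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100.0, 50.0, 300.0],
    "revenue": [250.0, 40.0, 900.0],
})

# Assumed definition: profit = revenue - budget.
df["profit"] = df["revenue"] - df["budget"]

# The two most profitable movies, largest profit first.
top_profit = df.nlargest(2, "profit")[["original_title", "profit"]]

print(top_profit)
```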

6. Hit movies! (most popular movies)

7. Number of movie genres released over the years
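Counting releases per genre and year could be sketched as follows, again assuming one row per movie and genre and a release_year column. The rows are made up.

```python
import pandas as pd

# Made-up sample in long format.
df = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015, 2015],
    "genre": ["Action", "Drama", "Action", "Action", "Drama"],
})

# Rows are years, columns are genres, values are release counts.
genre_counts = (
    df.groupby(["release_year", "genre"])
      .size()
      .unstack(fill_value=0)
)

print(genre_counts)
```

Selecting a single column of this table, for example genre_counts["Action"], gives the releases of one genre over time.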

8. Most popular genres

Adventure, Drama and Science Fiction were the most popular genres.

9. Action movie releases over time

10. Month with highest number of movie releases

September was the month with the most movie releases.
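With release_date stored as a datetime column, the releases per month can be counted like this (dates are made up):

```python
import pandas as pd

# Made-up release dates.
df = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-09-04", "2015-09-18", "2015-01-09", "2014-12-25"]
    ),
})

# Month name of each release, then the number of releases per month.
releases_per_month = df["release_date"].dt.month_name().value_counts()
busiest_month = releases_per_month.idxmax()

print(releases_per_month)
print(busiest_month)
```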

11. Day with highest number of movie releases
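Reading "day" as day of the week (an assumption), the count works the same way as for months. The dates are made up.

```python
import pandas as pd

# Made-up release dates.
df = pd.DataFrame({
    "release_date": pd.to_datetime(["2015-09-04", "2015-09-18", "2014-12-25"]),
})

# Weekday name of each release, then the number of releases per weekday.
releases_per_day = df["release_date"].dt.day_name().value_counts()
busiest_day = releases_per_day.idxmax()

print(releases_per_day)
print(busiest_day)
```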

That brings me to the end of my analysis of the TMDb data. The dataset was somewhat limiting, as many entries were dropped due to missing data. Also, some zeros were turned into null values, which significantly increased the amount of missing data in the dataset.

Thanks for reading, I hope you enjoyed it. If you like this, check out my GitHub page for the notebook. Ensure you also follow me. Cheers!