TMDb Movie Data Analysis
This is one of the projects I did in the Udacity Data Analyst Nanodegree Program. It is divided into three parts: the introduction, data wrangling, and the exploratory data analysis.
The Introduction
The dataset contains observations of about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. You can find more about the dataset here. In this project, I’m going to investigate the dataset to find the movies with the highest profits, the biggest hits, and other interesting facts about movies.
Data Wrangling
The dataset has 10866 rows and columns.
Next, I’ll drop some columns here as they are not important in the analysis.
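As a rough sketch of this step, the snippet below drops columns from a small stand-in dataframe. The column names (`homepage`, `tagline` and so on) and the choice of which ones to drop are assumptions for illustration, not the list used in the notebook.

```python
import pandas as pd

# Toy stand-in for the TMDb table; the column names are assumptions.
df = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["Movie A", "Movie B"],
    "homepage": ["http://example.com/a", None],
    "tagline": ["A tagline", None],
    "revenue": [1000, 0],
})

# Hypothetical list of columns that are not needed for the analysis.
unused_columns = ["homepage", "tagline"]

# drop() returns a new dataframe, so the result is assigned back to df.
df = df.drop(columns=unused_columns)
print(df.shape)
```

Keeping the names in a list makes it easy to review what was removed later on.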
Some columns in the data contain zero(0) values. Let’s check them.
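One possible way to check for zeros is sketched below on toy data. The column names `budget`, `revenue` and `runtime` are assumptions, and replacing zeros with NaN is shown only as one option for treating them as missing.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are assumed for illustration.
df = pd.DataFrame({
    "budget": [0, 150, 0],
    "revenue": [0, 500, 20],
    "runtime": [90, 0, 120],
})

# Comparing with 0 gives a boolean table; summing it counts zeros per column.
zero_counts = (df == 0).sum()
print(zero_counts)

# Optionally mark the zeros as missing so they are excluded from statistics.
df_clean = df.replace(0, np.nan)
print(df_clean.isnull().sum())
```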
Next, let’s have an overview of the null values in the dataset.
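A minimal sketch of such an overview, using a toy dataframe with assumed column names:

```python
import pandas as pd

# Toy data with some missing entries; the column names are assumptions.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Dir X", None, "Dir Z"],
    "genres": ["Action|Drama", None, None],
})

# isnull() flags missing cells; sum() counts them column by column.
null_counts = df.isnull().sum()
print(null_counts)
```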
The release_date column is a string. I’ll convert it to a datetime column.
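The conversion could look roughly like the sketch below. The `month/day/two-digit-year` format is an assumption about how the strings are stored; adjust `format` to whatever the column actually contains.

```python
import pandas as pd

# Toy dates in an assumed m/d/yy layout.
df = pd.DataFrame({"release_date": ["6/9/15", "12/25/09", "3/14/99"]})

# Passing an explicit format avoids ambiguous day/month guessing.
df["release_date"] = pd.to_datetime(df["release_date"], format="%m/%d/%y")
print(df.dtypes)
```

With two-digit years it is worth checking older films afterwards, since the century has to be inferred from only two digits.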
Some columns contain multiple values separated by pipe (|) characters. Let’s see them.
Next, I’ll create new dataframes to hold the separated columns.
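A sketch of how one pipe-separated column could be split into its own dataframe. The `genres` column name and the `genre_` prefix are assumptions chosen for the example.

```python
import pandas as pd

# Toy data; "genres" is an assumed name for a pipe-separated column.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure|Science Fiction", "Drama"],
})

# expand=True returns one column per piece; shorter rows are padded
# with missing values.
genres_split = df["genres"].str.split("|", expand=True)

# Give the numbered columns readable labels: genre_0, genre_1, ...
genres_split = genres_split.add_prefix("genre_")
print(genres_split)
```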
Concatenating the columns to the dataset.
Getting index labels of the new columns added to the dataframe.
Concatenating all the separated columns into a single column and dropping the null values.
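The three steps above (attach the split columns, collect their labels, then stack them into one column and drop the nulls) can be sketched as follows. `melt` is just one way to do the stacking, and all column names here are assumptions.

```python
import pandas as pd

# Toy data; the column names are assumed for illustration.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split the pipe-separated values and attach them as new columns.
genre_parts = df["genres"].str.split("|", expand=True).add_prefix("genre_")
df = pd.concat([df, genre_parts], axis=1)

# Labels of the newly added columns.
genre_cols = list(genre_parts.columns)

# Stack the separated columns into a single "genre" column, one row per
# movie/genre pair, and drop the padding rows that hold no genre.
long_df = (
    df.melt(id_vars=["original_title"], value_vars=genre_cols, value_name="genre")
      .drop(columns="variable")
      .dropna(subset=["genre"])
      .reset_index(drop=True)
)
print(long_df)
```

The long format makes per-genre grouping straightforward, at the cost of repeating each movie once per genre.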
After that, I save the data to a new file for further analysis. Note: you could instead make a copy of the data and continue working with that.
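Both options are sketched below. The file name `tmdb_movies_clean.csv` is hypothetical, and the temporary directory is used only so the example can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned data; the column names are assumptions.
df = pd.DataFrame({"original_title": ["Movie A"], "revenue": [1000]})

# Option 1: write the cleaned data to a new CSV file (hypothetical path).
out_path = os.path.join(tempfile.gettempdir(), "tmdb_movies_clean.csv")
df.to_csv(out_path, index=False)  # index=False avoids an extra index column
reloaded = pd.read_csv(out_path)

# Option 2: keep working in memory on an independent copy.
df_copy = df.copy()
```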
Exploratory Data Analysis
Now that the data is cleaned, we can proceed to explore it.
Next, I’ll answer some questions that I posed about this data.
1. What movies had the highest and lowest revenue?
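A question like this can be answered with `idxmax` and `idxmin`, as in the sketch below. The toy values and column names are assumptions, not results from the dataset.

```python
import pandas as pd

# Toy data; titles and revenue figures are made up for illustration.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "revenue": [500.0, 20.0, 9000.0],
})

# idxmax/idxmin return the index label of the extreme value;
# loc then fetches the full row for that movie.
highest = df.loc[df["revenue"].idxmax()]
lowest = df.loc[df["revenue"].idxmin()]
print(highest["original_title"], lowest["original_title"])
```

The same pattern applies to other numeric columns such as budget or runtime.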
Science Fiction, Animation had the lowest revenue.
2. What movies had the highest and lowest budget?
3. Movies with the highest and lowest runtime.
4. Genres with the highest and lowest runtime.
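One reasonable reading of this question is to compare the average runtime per genre. The sketch below assumes a long-format table with one row per movie/genre pair; the names and numbers are illustrative only.

```python
import pandas as pd

# Toy long-format data: one row per movie/genre pair (assumed layout).
genre_runtime = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Documentary"],
    "runtime": [120, 100, 140, 60],
})

# Average runtime per genre, longest first.
mean_runtime = (
    genre_runtime.groupby("genre")["runtime"].mean().sort_values(ascending=False)
)
print(mean_runtime)

longest_genre = mean_runtime.idxmax()
shortest_genre = mean_runtime.idxmin()
```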
5. Most profitable movies
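A minimal sketch, assuming profit is defined as revenue minus budget (the definition and the column names are assumptions for this example):

```python
import pandas as pd

# Toy data; the figures are made up for illustration.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100.0, 50.0, 300.0],
    "revenue": [500.0, 20.0, 900.0],
})

# Assumed definition: profit = revenue - budget.
df["profit"] = df["revenue"] - df["budget"]

# Sort by profit and keep the top entries.
most_profitable = df.sort_values("profit", ascending=False).head(2)
print(most_profitable[["original_title", "profit"]])
```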
6. Hit movies! (most popular movies)
7. Number of movie genres released over the years
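Counting releases per genre and year could be done roughly as below. It assumes a long-format table with `release_year` and `genre` columns; both names are assumptions.

```python
import pandas as pd

# Toy long-format data: one row per movie/genre pair (assumed layout).
long_df = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015, 2015],
    "genre": ["Action", "Drama", "Action", "Action", "Comedy"],
})

# Count rows per (year, genre), then pivot genres into columns.
# fill_value=0 marks year/genre combinations with no releases.
genre_counts = (
    long_df.groupby(["release_year", "genre"]).size().unstack(fill_value=0)
)
print(genre_counts)

# A single genre's trend over time is then just one column of the table.
action_per_year = genre_counts["Action"]
```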
8. Most popular genres
9. Action movie releases over time
10. Month with highest number of movie releases
11. Day with highest number of movie releases
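For the month and day questions, the datetime accessor can extract the relevant part of each release date. The sketch below reads "day" as day of the week, which is an assumption, and uses made-up dates.

```python
import pandas as pd

# Toy release dates, already converted to datetime.
df = pd.DataFrame({
    "release_date": pd.to_datetime(["2015-06-09", "2015-06-12", "2009-12-25"]),
})

# Count releases by month name and by weekday name.
month_counts = df["release_date"].dt.month_name().value_counts()
day_counts = df["release_date"].dt.day_name().value_counts()

busiest_month = month_counts.idxmax()
busiest_day = day_counts.idxmax()
print(busiest_month, busiest_day)
```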
So here comes the end of my analysis of the TMDb data. The dataset was somewhat limiting, as many entries were dropped due to missing data. Also, some zeros were turned into null values, which significantly increased the amount of missing data in the dataset.
Thanks for reading, I hope you enjoyed it. If you like this, check out my GitHub page for the notebook. Ensure you also follow me. Cheers!