01. Investigate TMDb Movie Dataset (Python Data Analysis Project) — Part 1 Data Wrangling
Note: This project was completed as part of the Udacity Data Analyst Nanodegree, which I finished in March 2018. For the full project report, code, and dataset files, see my GitHub repository.
For part 2 of the project, Exploratory Data Analysis, see the next article.
This is my first data-analysis-related project. After putting a lot of effort into it, I passed on my first submission, and the Udacity reviewer rated it as a very good job, praising both the question exploration and the data wrangling. As a data science newbie and self-learner, this definitely encouraged me a lot. So through this article I want to record the project's main ideas and the techniques I have learned so far, as the milestone of my first analysis project.
Introduction
The primary goal of the project is to go through the general data analysis process, using basic data analysis techniques with NumPy, pandas, and Matplotlib. It contains four parts:
1. Asking Questions
2. Data Wrangling
3. Exploratory Data Analysis
4. Drawing Conclusion
The movie dataset, originally from Kaggle, was cleaned and provided by Udacity. According to the Kaggle introduction page, the data comes from The Movie Database (TMDb). It collects basic movie information and movie metrics for 5,000+ movies, including user ratings, popularity, and revenue data. These metrics can be seen as indicators of how successful these movies are. The basic movie information includes cast, director, keywords, runtime, genres, etc. (Remark: the dataset on the Kaggle page may have been updated to a newer version.) The whole dataset covers the years 1960 to 2015.
Since the dataset contains plentiful information, my investigation focuses on finding the properties that are associated with successful movies. Besides, the dataset also contains each movie's release year, so we can also explore some interesting trends in these movie metrics over time.
Questions Researched
Questions in the projects are as follows:
Part 1: General Exploration
- Question 1: Popularity over the years
- Question 2: The distribution of revenue across popularity levels in the last five years
- Question 3: The distribution of revenue across rating levels in the last five years
Part 2: Find the Properties Associated with Successful Movies
- Question 1: What kinds of properties are associated with movies that have high popularity?
- Question 2: What kinds of properties are associated with movies that have high voting scores?
Part 3: Top Keyword and Genre Trends by Generation
- Question 1: Number of movies released year by year
- Question 2: Keyword trends by generation
- Question 3: Genre trends by generation
Data Wrangling
Cleaning Decision
In this process, the main idea is to take a quick glance at the dataset; find potentially unreasonable values, variables unnecessary for my research questions, null data, and duplicates; and then make the data cleaning decisions.
1. Basic Exploration
First, let's look at what the dataset looks like before investigating further. Import the necessary packages and use pd.read_csv to load the movie dataset, then print the first rows.
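The loading step might look like the sketch below. The filename tmdb-movies.csv and the column subset are assumptions, and an inline CSV snippet stands in for the real file so the example is self-contained:

```python
import io

import pandas as pd

# In the project this would be pd.read_csv('tmdb-movies.csv') (assumed filename);
# here an inline snippet with a few assumed columns stands in for the real file.
csv_text = io.StringIO(
    "id,original_title,popularity,budget,revenue,runtime,release_year\n"
    "135397,Jurassic World,32.99,150000000,1513528810,124,2015\n"
    "76341,Mad Max: Fury Road,28.42,150000000,378436354,120,2015\n"
)
df = pd.read_csv(csv_text)
print(df.head())  # first rows of the frame
```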
We can see that the data are pretty neat, except for the cast, keywords, genres, and production_companies columns: they are strings, but each contains multiple values separated by the | character. Should they be split apart, with each value in its own column? Or should I just use a function to retrieve the information I want? I left this issue for a later stage.
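For reference, here is a minimal sketch of both options on a couple of made-up rows mirroring the pipe-separated genres column:

```python
import pandas as pd

# Hypothetical rows mirroring the pipe-separated genres column.
df = pd.DataFrame({
    "original_title": ["Jurassic World", "Mr. Holmes"],
    "genres": ["Action|Adventure|Science Fiction|Thriller", "Mystery|Drama"],
})
# One option: split each cell into a Python list of genres.
genre_lists = df["genres"].str.split("|")
# Another: one row per (movie, genre) pair, handy for counting genres later.
pairs = df.assign(genres=genre_lists).explode("genres")
```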
2. Null values and zero values
Then, use info to find basic information such as the data type of each column and the number of null values it contains.
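On a tiny stand-in frame (values made up), the inspection calls look like this:

```python
import pandas as pd

# Tiny stand-in frame with a missing director, mirroring the real dataset's gaps.
df = pd.DataFrame({
    "original_title": ["Mr. Holmes", "Wild Card"],
    "director": ["Bill Condon", None],
    "budget": [0, 30000000],
})
df.info()                        # prints dtypes and non-null counts per column
null_counts = df.isnull().sum()  # nulls per column, as a Series
```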
From the table above, there are 10866 entries and 21 columns in total. Each data type is reasonable. There are some null values in the cast, director, overview, and genres columns, but only in tiny amounts, so I decided to drop those rows. However, some columns have a lot of null values, such as homepage, tagline, keywords, and production_companies. The homepage and tagline columns are not even necessary for answering the questions, so I decided to drop both of those columns at this stage, and I kept keywords and production_companies in case I dropped too much data.
Let's see some descriptive statistics for the dataset, using describe.
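A small sketch of the outlier check, with illustrative popularity values rather than the dataset's:

```python
import pandas as pd

# Illustrative popularity values: the quartiles sit well below the maximum.
popularity = pd.Series([0.2, 0.3, 0.5, 0.7, 32.99], name="popularity")
stats = popularity.describe()
# A max far above the 75% quantile is the kind of gap that flags outliers.
```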
As the table above shows, there are outliers in the popularity data, because the maximum value is 32.99 while the quartiles are only around 0.2–0.7. I then found a forum thread mentioning that the popularity score is measured by counting unique views on the website, the number of favorites, the number of watchlist additions, and so on, and that it has no upper bound. I think popularity measured this way is a cumulative concept (higher simply means more popular), so having no upper bound looks reasonable, and I decided to retain the original data.
Also, there are a lot of zeros in the budget and revenue data, accounting for at least 50% of each column, and there are some zeros in runtime. Were these movies never released, or are the values simply missing? Looking at the release_year column, I find that all movies in the dataset were released, because the minimum value is 1960 and there are no null or zero values. So I assume the zero values in budget and revenue are missing data. But since these zeros might actually be small values recorded as "0", I went on to inspect some of the zero rows to decide whether they are missing values or genuinely small values.
2.1 Zero Values in Budget and Revenue Columns
Let's take a look at some of the zero budget and revenue data. I used query to filter the rows with zero budget and zero revenue.
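The filter can be sketched like this; the budget and revenue figures here are placeholders, not the dataset's actual values:

```python
import pandas as pd

# Placeholder rows; only the zero positions matter for the filter.
df = pd.DataFrame({
    "original_title": ["Mr. Holmes", "Jurassic World", "Wild Card"],
    "budget": [0, 150000000, 30000000],
    "revenue": [29355203, 1513528810, 0],
})
zero_budget = df.query("budget == 0")    # rows whose budget is recorded as 0
zero_revenue = df.query("revenue == 0")  # rows whose revenue is recorded as 0
```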
Among the rows with zero budget, I randomly chose Mr. Holmes and googled it, and its Wikipedia page definitely records a budget of $10 million. I also found a revenue record for Wild Card on another website. Neither value is a small quantity. Hence, I assume the zero values in the revenue and budget columns are missing data, and I had better either set them to null or drop the rows directly. To choose between the two, I considered that keeping these zeros in the dataset would distort some statistics and visualization results for those questions, while dropping them might lose a lot of rows for my investigation.
To decide whether to drop them or set them to null, I counted the number of zero values in the two columns.
First, count the zeros in the budget column. I used groupby to group the data by budget value and listed the first two groups to read off the number of zeros.
In the budget column, there are 5696 rows with a zero value, which is a huge amount. To avoid dropping too much raw data and to keep data integrity, I decided to retain these rows and replace the zero values with null values.
The same goes for the revenue column. It contains 6016 rows with zero values, so I also decided to keep these rows and replace the zero values with null values.
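The counting step can be sketched as follows; grouping by budget and counting any column (id here, an assumption) gives the number of rows per distinct value, with the zero group first:

```python
import pandas as pd

# Five placeholder rows; in the real data the zero group holds 5696 rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "budget": [0, 0, 0, 150000000, 30000000],
})
counts = df.groupby("budget")["id"].count()  # rows per distinct budget value
n_zero_budget = counts.get(0, 0)             # size of the zero-budget group
```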
2.2 Zero Values in the Runtime Column
Finally, let's investigate the runtime column to decide whether to drop the zero values or replace them with nulls. There are only a small number of rows with a zero runtime, so I decided to drop them.
Cleaning Decision Summary
- Drop columns unnecessary for answering the questions: homepage, tagline, imdb_id, overview, budget_adj, revenue_adj.
- Drop duplicates.
- Drop rows with null values in the columns that have only a small number of nulls: cast, director, and genres.
- Replace zero values with null values in the budget and revenue columns.
- Drop rows with zero values in the column that has only a small number of zeros: runtime.
Data Cleaning
According to the previous decisions, first let's drop the unnecessary columns: imdb_id, homepage, tagline, overview. Use drop to delete these columns.
Then, use drop_duplicates to remove the duplicate rows.
Next, drop the null values in the cast, director, and genres columns with dropna, since they occur only in tiny amounts.
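These dropping steps can be sketched together on a toy frame; the column names come from the dataset, but the row values are made up (the toy frame only carries director among the dropna columns):

```python
import pandas as pd

# Toy frame: a duplicated row, an all-null column, and a missing director.
df = pd.DataFrame({
    "original_title": ["Mr. Holmes", "Mr. Holmes", "Wild Card"],
    "imdb_id": ["tt0000001", "tt0000001", "tt0000002"],
    "homepage": [None, None, None],
    "tagline": ["a", "a", None],
    "overview": ["...", "...", "..."],
    "director": ["Bill Condon", "Bill Condon", None],
})
# 1. Drop columns unnecessary for the questions.
df = df.drop(["imdb_id", "homepage", "tagline", "overview"], axis=1)
# 2. Drop exact duplicate rows.
df = df.drop_duplicates()
# 3. Drop rows with nulls in the small-null-count columns (director here;
#    in the project, cast and genres as well).
df = df.dropna(subset=["director"])
```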
Check the dataset status after the drops so far, using isnull().sum().
We can see that there are no null values left except in the keywords and production_companies columns, which I decided earlier to keep.
Then, replace the zero values in the budget and revenue columns with null values, using replace(0, np.NaN).
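A minimal sketch of the replacement on placeholder values (note: recent NumPy spells the constant np.nan; np.NaN was the older alias):

```python
import numpy as np
import pandas as pd

# Placeholder values; zeros in budget/revenue stand for missing records.
df = pd.DataFrame({
    "budget": [0, 150000000, 0],
    "revenue": [0, 1513528810, 2000000],
})
df["budget"] = df["budget"].replace(0, np.nan)
df["revenue"] = df["revenue"].replace(0, np.nan)
```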
Check the result. As the table below shows, 5153 zero values in the budget column have been replaced with 5153 null values. What a huge amount! If I had dropped them, I would have lost a lot of useful information!
Finally, drop the rows with zero values in the column that has only a small number of zeros: runtime. Then check the result.
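The last step can be sketched with a boolean filter on placeholder rows:

```python
import pandas as pd

# Placeholder rows; one has an impossible zero runtime.
df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "runtime": [0, 95, 124],
})
df = df[df["runtime"] > 0]  # keep only rows with a positive runtime
```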
Cleaning Result Summary
From the table below, we can see that the data in each column are almost clean, without too many null values. My cleaning goal was to preserve as much of the original data as possible. Although there are still some null values in the keywords and production_companies columns, those columns remain useful for analysis, and their null counts are not very large, so I kept both of them. The dataset now has 10703 entries and 17 columns.
And from the table below, after converting all zero values to nulls in the budget and revenue data, we can see that the distributions of both budget and revenue look much better, no longer concentrated on zero or very small values. And after deleting the zero values of runtime, the minimum runtime is much more reasonable.
This wraps up the first part of the project. In the next part I will post the Exploratory Data Analysis and the findings for each question. See the part 2 article!