Let’s start the journey to explore more! Taken on the Nenggao Cross-ridge Historic Trail, Taiwan

01. Investigate TMDb Movie Dataset (Python Data Analysis Project) — Part 1: Data Wrangling

Lorna Yen
8 min read · May 12, 2018


Note: This project was completed as part of the Udacity Data Analyst Nanodegree, which I finished in March 2018. For the full project report, code, and dataset files, see my GitHub repository.

For Part 2 of the project, Exploratory Data Analysis, see the next article.

This is my first data-analysis-related project. After putting a lot of effort into it, I passed on the first submission, and the Udacity reviewer rated it as a very good job, including the question framing and the data wrangling. As a data science newbie and self-learner, this encouraged me a lot. Hence, through this article I want to record the project's main ideas and the techniques I have learned so far, as my first analysis project milestone.

Introduction

The primary goal of the project is to go through the general data analysis process — using basic data analysis techniques with NumPy, pandas, and Matplotlib. It contains four parts:

1. Asking Questions

2. Data Wrangling

3. Exploratory Data Analysis

4. Drawing Conclusions

The movie dataset, which originally came from Kaggle, was cleaned and provided by Udacity. According to the Kaggle introduction page, the data contains information provided by The Movie Database (TMDb). It collects basic information and metrics for 5,000+ movies, including user ratings, popularity, and revenue data. These metrics can be seen as indicators of how successful the movies are. The basic movie information includes cast, director, keywords, runtime, genres, etc. (Remark: the dataset on the Kaggle page may have been updated to a newer version.) The whole dataset covers movies released from 1960 to 2015.

Since the dataset contains plentiful information, my investigation focuses on finding the properties that are associated with successful movies. Besides, the dataset also contains each movie's release year, so I can also explore some interesting trends in these movie metrics over time.

Questions Researched

The questions in the project are as follows:

Part 1: General Exploration

  • Question 1: Popularity over the years
  • Question 2: The distribution of revenue across different popularity levels in the most recent five years
  • Question 3: The distribution of revenue across different score rating levels in the most recent five years

Part 2: Finding the Properties Associated with Successful Movies

  • Question 1: What kinds of properties are associated with movies that have high popularity?
  • Question 2: What kinds of properties are associated with movies that have a high voting score?

Part 3: Top Keywords and Genres Trends by Generation

  • Question 1: Number of movies released year by year
  • Question 2: Keywords Trends by Generation
  • Question 3: Genres Trends by Generation

Data Wrangling

Cleaning Decision

In this process, the main idea is to take a quick glance at the dataset, find potentially unreasonable data values, variables that are unnecessary for my research questions, null values, and duplicates, and then make data cleaning decisions.

1. Basic Exploration

First, let's look at what the dataset looks like before investigating further. Import the necessary packages and use pd.read_csv to load the movie dataset, then print the first rows.
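Since the code gist isn't embedded here, the sketch below shows roughly what this step looks like; the file name tmdb-movies.csv and the dataframe name df are my assumptions.

```python
# A minimal sketch of the loading step; 'tmdb-movies.csv' is an assumed file name.
import pandas as pd

df = pd.read_csv('tmdb-movies.csv')
df.head()
```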

We can see that the data are pretty neat, except that the cast, keywords, genres, and production_companies columns are stored as strings containing multiple values separated by the | character. Should they be split and spread into separate columns, or should I just use a function to retrieve the information I want? I left this issue for a later stage.

2. Null values and zero values

Then, use info to check basic information such as the data type of each column and how many non-null values it contains.
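Roughly, this step boils down to a single call (again assuming the dataframe is named df):

```python
# Show each column's dtype and non-null count.
df.info()
```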

From the table above, there are 10866 entries and 21 columns in total. Each data type is reasonable, and there are some null values in the cast, director, overview, and genres columns, but only in tiny amounts, so I decided to drop those rows. However, some columns have a lot of null values, such as homepage, tagline, keywords, and production_companies. The homepage and tagline columns are not even necessary for answering the questions, so I decided to drop both of those columns at this stage, while I kept keywords and production_companies in case I dropped too much data.

Let's see some descriptive statistics for the dataset, using describe.
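A sketch of the call:

```python
# Summary statistics (count, mean, quartiles, min/max) for the numeric columns.
df.describe()
```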

As the table above shows, we can find outliers in the popularity data, because the max value is 32.99 while the other quantiles are only around 0.2–0.7. I then found a forum post mentioning that the popularity score is measured from things like unique page views, the number of users marking a movie as a favorite, the number of users adding it to a watchlist, etc., and that it has no upper bound. The popularity metric is essentially cumulative: higher simply means more popular, so having no upper bound seems reasonable, and I decided to retain the original data.

Also, there are a lot of zeros in the budget and revenue data, accounting for at least 50% of each column, and there are some zeros in runtime. Were these zero-valued movies never released, or are the values simply missing? Looking at the release_year column, I find that all movies in the dataset were released, because the minimum value is 1960 and there are no null values or zeros in that column. So I assume the zero values in budget and revenue are missing data. But since these zeros might actually be small amounts recorded as "0", I proceeded to inspect some of the zero-valued rows to decide whether they are missing values or just small values.

2.1 Zero Values in Budget and Revenue Columns

Let's take a look at some of the zero budget and revenue data. I used query to filter the rows with zero budget and zero revenue.
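Something along these lines, where the column names follow the dataset description and original_title is assumed to be the movie title column:

```python
# Filter the rows whose budget or revenue is recorded as zero.
zero_budget = df.query('budget == 0')
zero_revenue = df.query('revenue == 0')

# Peek at a few zero-budget movies.
zero_budget[['original_title', 'budget', 'revenue']].head()
```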

Among the zero-budget movies, I randomly chose Mr. Holmes and searched for it on Google. I found its Wikipedia page, which definitely records a budget of $10 million. I also found a revenue record for Wild Card on another website. Neither is a small amount, so I assume the zero values in the revenue and budget columns are missing data, and I had better set them to null or drop them directly. To decide between the two options, I considered that keeping these zeros in the dataset would distort some statistics and visualization results for the questions, while dropping them might cost a lot of rows for my investigation.

Mr. Holmes' Wikipedia page, with the budget record

To decide whether to drop them or set them to null, I counted the number of zero values in the two columns.

First, count the zeros in the budget column. I used groupby to group the data by budget and listed only the first two groups, which shows the count of zeros.
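A sketch of the counting trick:

```python
# Group by budget and count rows per group; since the group keys are sorted,
# the first group (budget == 0) gives the number of zero-budget rows.
df.groupby('budget').size().head(2)
```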

In the budget column, there are 5696 rows with a zero value, which is a huge amount. To avoid dropping too much raw data and losing data integrity, I decided to retain these rows and replace the zero values with nulls.

The same goes for the revenue column.
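The same sketch, applied to revenue:

```python
# The first group (revenue == 0) shows how many zero-revenue rows there are.
df.groupby('revenue').size().head(2)
```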

It contains 6016 rows with zero values, so I also decided to keep these rows and replace the zeros with nulls.

2.2 Zero Values in Runtime Columns

Finally, let's investigate the runtime column to decide whether to drop the zero values or just replace them with nulls.
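One simple way to count them (a sketch, not necessarily the exact code in my notebook):

```python
# Count how many rows have a runtime of zero.
(df['runtime'] == 0).sum()
```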

There are only a small number of rows with a zero runtime, so I decided to drop them.

Cleaning Decision Summary

  1. Drop the columns unnecessary for answering the questions: homepage, tagline, imdb_id, overview, budget_adj, revenue_adj.
  2. Drop duplicates.
  3. Drop rows with null values in the columns that have only a small number of nulls: cast, director, and genres.
  4. Replace zero values with nulls in the budget and revenue columns.
  5. Drop rows with zero values in the column that has only a small number of zeros: runtime.

Data Cleaning

Following the decisions above, first let's drop the unnecessary columns: imdb_id, homepage, tagline, and overview. Use drop to delete these columns.
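A sketch of the drop call, matching the columns named above:

```python
# Remove the columns that are not needed for the research questions.
df.drop(['imdb_id', 'homepage', 'tagline', 'overview'], axis=1, inplace=True)
```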

Then, use drop_duplicates to drop the duplicates.
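In sketch form:

```python
# Remove fully duplicated rows.
df.drop_duplicates(inplace=True)
```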

Then, drop the rows with null values in the cast, director, and genres columns using dropna, since there are only a tiny number of them.
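Roughly:

```python
# Drop rows that are missing cast, director, or genres.
df.dropna(subset=['cast', 'director', 'genres'], inplace=True)
```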

Check the dataset status after dropping the null values so far, using isnull().sum(). We can see that there are no null values left except in keywords and production_companies, which I decided earlier to keep.
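A sketch of the check:

```python
# Remaining null counts per column.
df.isnull().sum()
```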

Then, replace the zero values in the budget and revenue columns with nulls, using replace(0, np.NaN).
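A sketch of the replacement (np.nan is the same value written as np.NaN above):

```python
import numpy as np

# Treat the zero budgets and revenues as missing values.
df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
```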

Check the result. As the table below shows, 5153 zero values in the budget column have been replaced with 5153 nulls. What a huge amount! If I had dropped them, I would have lost a lot of useful information!

Finally, drop the rows with zero values in the column that has only a small number of zeros: runtime.
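One way to do it (a sketch; the notebook may instead drop the rows by index):

```python
# Keep only the rows whose runtime is non-zero.
df = df[df['runtime'] != 0]
```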

Check the result.
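For example:

```python
# Verify that no zero-runtime rows remain.
(df['runtime'] == 0).sum()
```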

Cleaning Result Summary

From the table below, we can see that the data in each column are now almost clean, without too many null values. My cleaning goal was to preserve the integrity of the original data: although there are still some null values in the keywords and production_companies columns, those columns are still useful for analysis, and the number of nulls in them is not very large, so I simply kept both. The data now has 10703 entries and 17 columns.

And from the table below, after converting all zero values to nulls in the budget and revenue data, we can see that the distributions of budget and revenue look much better, no longer overly concentrated on zero or very small values. And after deleting the zero runtime values, the minimum runtime is now more reasonable.

This is the first part of the project. In the next part I will post the Exploratory Data Analysis and the findings for each question! See the Part 2 article.
