Data analysis process using Python libraries to analyze a dataset of 10,000+ movies.

Udacity Nanodegree Project 1: Investigate a dataset.

Data analysis process consist of five key steps:

Photo by KACHALI SARMA on Kaggle

For this project, I went through the data analysis process using Python libraries to analyze a dataset and then communicate my findings about it. The dataset contains information of about 10,000+ movies cleaned from the original TMdb movie data on kaggle. The movies in the dataset were released between the year 1960–2015 and some other features of the information in the dataset includes genre, popularity, budget and revenue.

In this article, I’ll summarily explain my data analysis process using the Python libraries numpy, Pandas, and Matplotlib with the aim to find answers for the following questions:

1. What are the features of popular movies?

2. Which genres are most popular from year to year?

Data wrangling, is the process to get the data we need in a form we can work with in three steps: gather, assess, and clean.

import python data analysis libraries (Screenshot from Jupyter Notebook written by Gilbert Adikankwu)

For the data gathering step, I downloaded the data as a CSV file from a google doc link provided by Udacity, loaded it into a pandas Dataframe. Then I assessed the data to identify any quality or structure issues. For the data cleaning step, all the identified quality and structural issues was cleaned by modifying, replacing, or removing data to ensure the dataset is of the highest quality and as well-structured as possible. Like it has been repeated by many data professionals, 80% of a data analysis project revolves around cleaning the data. Missing values are always a big issue in any data project. In this instance, it was stated by the authors of the original dataset that zeros in the budget column should be treated as missing, with the caveat that missing budgets are much more likely to have been from small budget films in the first place. You can see the detailed wrangling process on my Github.

Pandas Data frame showing movies dataset (Screenshot from Jupyter Notebook written by Gilbert Adikankwu)

Data analysis is an iterative process and what that means is that even after completing the data wrangling step, if a new issue is identified during exploratory analysis you can clean it before continuing with your analysis. This iterative process became evident as I wanted to get insight for question 2 because I observed movies where categorized with multiple unique genres separated by pipe characters and I needed to consider all the genre to find out the most popular genres because if I chose just one genre for each movie the analysis would be inconsistent and biased.

Therefore, I wrote a logic code to split the grouped genre into their individual columns for the movies attached.

Screenshot from Jupyter Notebook written by Gilbert Adikankwu

Q1: What are the features of popular movies?

To get insights for this question, I did the summary statistics of the popularity values of the movies in the dataset and observed from the result, that movies with popularity rating above 1 may be considered as popular given that the average is 1 and the maximum 32.98. Also, I did a scatter plot comparison of popularity ratings vs several features and observed that the runtimes of each movies showed the strongest correlation popularity.

Take note: This are just statistical observation and should not be taken as conclusions. Lack of enough data on the movie budgets hindered any viable conclusion for this analysis.

Q2: Which genres are most popular from year to year?

To achieve this insights, I used Matplotlib to plot the total number of movies categorized under each genre vs each release year.

Most Popular Genre From Year To Year (Plot generated by Gilbert Adikankwu using Matplotlib library for Python)

The drama genre is the most popular from year to year within the timeframe release years in the dataset. From the chart we can observe that the Drama genre held the top spot as the genre with the most numbers of movies from the 1960’s to the 1980’s and only dropped briefly to second place in the early 1990’s when the Action genre had an uptick in numbers of movies but it then regained the top spot again in the late 1990’s and ever since has experienced exponential growth.

The Horror genre is the least popular genre from year to year.

The lack of sufficient data in the budget hindered my conclusion on the features of popular movies.

Thank you for reading. If the article was interesting to you follow for more.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Gilbert Adikankwu

Data analyst with 3years+ of domain knowledge in Sales Operations Data. I talk about data analysis and visualization. #Python #Excel