Data analysis process using Python libraries to analyze a dataset of 10,000+ movies.
Udacity Nanodegree Project 1: Investigate a dataset.
Data analysis process consist of five key steps:
Step 1: Ask questions
Step 2: Wrangle data
Step 3: Perform EDA (Exploratory Data Analysis)
Step 4: Draw conclusions (or even make predictions)
Step 5: Communicate your results
For this project, I went through the data analysis process using Python libraries to analyze a dataset and then communicate my findings about it. The dataset contains information of about 10,000+ movies cleaned from the original TMdb movie data on kaggle. The movies in the dataset were released between the year 1960–2015 and some other features of the information in the dataset includes genre, popularity, budget and revenue.
Analysis Questions
In this article, I’ll summarily explain my data analysis process using the Python libraries numpy, Pandas, and Matplotlib with the aim to find answers for the following questions:
1. What are the features of popular movies?
2. Which genres are most popular from year to year?
Data Wrangling
Data wrangling, is the process to get the data we need in a form we can work with in three steps: gather, assess, and clean.
For the data gathering step, I downloaded the data as a CSV file from a google doc link provided by Udacity, loaded it into a pandas Dataframe. Then I assessed the data to identify any quality or structure issues. For the data cleaning step, all the identified quality and structural issues was cleaned by modifying, replacing, or removing data to ensure the dataset is of the highest quality and as well-structured as possible. Like it has been repeated by many data professionals, 80% of a data analysis project revolves around cleaning the data. Missing values are always a big issue in any data project. In this instance, it was stated by the authors of the original dataset that zeros in the budget column should be treated as missing, with the caveat that missing budgets are much more likely to have been from small budget films in the first place. You can see the detailed wrangling process on my Github.
Extracting The Genres
Data analysis is an iterative process and what that means is that even after completing the data wrangling step, if a new issue is identified during exploratory analysis you can clean it before continuing with your analysis. This iterative process became evident as I wanted to get insight for question 2 because I observed movies where categorized with multiple unique genres separated by pipe characters and I needed to consider all the genre to find out the most popular genres because if I chose just one genre for each movie the analysis would be inconsistent and biased.
Therefore, I wrote a logic code to split the grouped genre into their individual columns for the movies attached.
Exploratory Data Analysis
Q1: What are the features of popular movies?
To get insights for this question, I did the summary statistics of the popularity values of the movies in the dataset and observed from the result, that movies with popularity rating above 1 may be considered as popular given that the average is 1 and the maximum 32.98. Also, I did a scatter plot comparison of popularity ratings vs several features and observed that the runtimes of each movies showed the strongest correlation popularity.
Take note: This are just statistical observation and should not be taken as conclusions. Lack of enough data on the movie budgets hindered any viable conclusion for this analysis.
Q2: Which genres are most popular from year to year?
To achieve this insights, I used Matplotlib to plot the total number of movies categorized under each genre vs each release year.
Conclusion
The drama genre is the most popular from year to year within the timeframe release years in the dataset. From the chart we can observe that the Drama genre held the top spot as the genre with the most numbers of movies from the 1960’s to the 1980’s and only dropped briefly to second place in the early 1990’s when the Action genre had an uptick in numbers of movies but it then regained the top spot again in the late 1990’s and ever since has experienced exponential growth.
The Horror genre is the least popular genre from year to year.
The lack of sufficient data in the budget hindered my conclusion on the features of popular movies.
Thank you for reading. If the article was interesting to you follow for more.