IMDB DataSet Visualization & Data Analytics Using Pandas


We have a .csv file of the IMDB top 1000 movies, and today we will use Pandas to visualize this data and run several kinds of analysis on it.

Download https://github.com/thechaudharysab/imdb-data-pandas-visualization/blob/master/data/imdb_1000.csv

I’m using Jupyter Notebook for all of this code. So, first things first, let’s import the Pandas and Matplotlib libraries: Pandas for data analytics and Matplotlib for visualization.

import pandas as pd
import matplotlib.pyplot as plt

Read in 'imdb_1000.csv' and store it in a DataFrame named movies:

imdb_1000_data_url = r'data/imdb_1000.csv'
movies = pd.read_csv(imdb_1000_data_url)
movies.head()  # preview the data set

Let’s get an initial idea of the data:

# check the number of rows and columns
movies.shape
# check the data type of each column
movies.dtypes

Now that we know the data type of each column, we can perform operations on the data, such as:

# calculate the average movie duration
movies['duration'].mean()

We can also sort the data:

# sort the DataFrame by duration to find the shortest and longest movies
movies.sort_values('duration')
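If you only need the extremes rather than the whole sorted table, `nsmallest` and `nlargest` are a possible shortcut. The sketch below uses a small made-up DataFrame (the titles and durations are placeholders, not rows from the IMDB file) so that it runs on its own:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'duration': [142, 64, 238, 110],
})

# Pick the single shortest and single longest movie by duration
shortest = toy.nsmallest(1, 'duration')
longest = toy.nlargest(1, 'duration')

print(shortest)
print(longest)
```

On the real data you would call the same methods on movies instead of toy.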

Let’s look at the data in visual form.

# create a histogram of duration, choosing an "appropriate" number of bins
movies['duration'].plot(kind='hist', bins=10)

This histogram tells us that most of the movies have a duration of 100–125 minutes.

This is one of the benefits of visualization: differences in the data are easy to see. There are many kinds of visualization to choose from, and the right one depends on your data, since a histogram may not answer the business question you are asking.

# use a box plot to display that same data
movies['duration'].plot(kind='box')

The green line is the median.

The box plot tells a similar story: most of the movies have a duration somewhere between 110 and 135 minutes, and the median is clearly marked. With the histogram, by contrast, we could not read the median off the chart.
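The numbers behind a box plot can also be computed directly: the edges of the box are the first and third quartiles, and the line inside it is the median. A minimal sketch on made-up durations (not values from the IMDB file):

```python
import pandas as pd

# Made-up durations in minutes, used only to illustrate the calculation
durations = pd.Series([64, 95, 110, 120, 135, 150, 238], name='duration')

# The quartiles that a box plot draws: box edges (q1, q3) and the median line
q1, median, q3 = durations.quantile([0.25, 0.5, 0.75])

print(q1, median, q3)
```

Running movies['duration'].quantile([0.25, 0.5, 0.75]) on the real data gives the exact values that the plot only lets you estimate by eye.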

Let’s do some more intermediate data analytics and visualizations using pandas.

# count how many movies have each of the content ratings
movies[['content_rating', 'title']].groupby('content_rating').count()
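A shorter route to the same kind of count is `value_counts`, which tallies each distinct value in a column and sorts the result from most to least frequent. The sketch below runs on a made-up DataFrame (placeholder titles and ratings) rather than the IMDB file:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'content_rating': ['R', 'PG-13', 'R', 'PG', 'R'],
})

# Count how many movies carry each content rating, most common first
counts = toy['content_rating'].value_counts()

print(counts)
```

Like the groupby version, value_counts leaves out rows whose content rating is missing unless you pass dropna=False.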

To see the same data in visual form:

# use a visualization to display that same data, including a title and x and y labels
movies[['content_rating', 'title']].groupby('content_rating').count().plot(kind='bar', title='Content Rating Visualization')
plt.xlabel('Content Rating')
plt.ylabel('Title Count')

You can choose the title, the x and y labels, and many other properties when displaying data.

Here are some more analytics you can perform on the data.

# convert the following content ratings to "NC-17": X, TV-MA
# replace() returns a new Series, so assign it back to keep the change
movies['content_rating'] = movies['content_rating'].replace(['X', 'TV-MA'], 'NC-17')

Let’s see if there are any missing values:

# count the number of missing values in each column
movies.isnull().sum(axis=0)

You can also filter the data with inline conditions. For example:

# calculate the average star rating for movies 2 hours or longer,
# and compare that with the average star rating for movies shorter than 2 hours
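The code for this step is not shown above, so here is one way it could be written, using a boolean condition inside `.loc`. The DataFrame is a made-up stand-in with arbitrary values; only the column names duration and star_rating are taken from the snippets in this article:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'duration': [142, 95, 175, 110],
    'star_rating': [9.0, 8.0, 8.5, 7.5],
})

# Average star rating for movies that run 2 hours (120 minutes) or longer
long_mean = toy.loc[toy['duration'] >= 120, 'star_rating'].mean()

# Average star rating for movies shorter than 2 hours
short_mean = toy.loc[toy['duration'] < 120, 'star_rating'].mean()

print(long_mean, short_mean)
```

The toy numbers say nothing about the real data set; run the same two lines on movies to get the actual comparison.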

Let’s visualize this:

# use a visualization to detect whether there is a relationship between duration and star rating
movies.boxplot(column='duration', by='star_rating');
This tells us that if the movie duration is below 125 minutes, it’s more likely to receive a rating.

# calculate the average duration for each genre
movies[['duration', 'genre']].groupby('genre').mean()

Let’s visualize this data:

# visualize the relationship between content rating and duration
movies.boxplot(column='duration', by='content_rating')

# determine the top rated movie (by star rating) for each genre
# select several columns from a groupby with a list (double brackets), not a bare tuple
movies.sort_values('star_rating', ascending=False).groupby('genre')[['title', 'star_rating']].first()

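The first() method returns the first non-null value of each column within each group. Because the rows were sorted by star_rating in descending order beforehand, the first row of each group is the top-rated movie of that genre.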
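An alternative that does not depend on sorting first is to ask each group for the index of its highest rating with `idxmax` and then look those rows up. This is only a sketch on made-up data (placeholder titles, genres, and ratings), not the approach taken in the original notebook:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy'],
    'star_rating': [8.1, 9.0, 7.9, 8.4],
})

# For each genre, find the row label of the highest star rating
best_index = toy.groupby('genre')['star_rating'].idxmax()

# Look up those rows and present them indexed by genre
top_by_genre = toy.loc[best_index].set_index('genre')[['title', 'star_rating']]

print(top_by_genre)
```

When several movies in a genre share the top rating, idxmax keeps the first one it encounters.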

Here is some more analysis this data supports.

# check if there are multiple movies with the same title, and if so, determine if they are actually duplicates
result = movies[movies['title'].isin(movies[movies.duplicated(['title'])]['title'])]
result.sort_values('title')
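The same check can be written more compactly with the `keep=False` option of `duplicated`, which flags every row that shares a value instead of only the repeats. The sketch below uses a made-up DataFrame in which two placeholder movies share a title but differ in duration:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie C'],
    'duration': [100, 120, 130, 90],
})

# Every row whose title appears more than once
same_title = toy[toy.duplicated('title', keep=False)].sort_values('title')

# Rows that are identical in all columns, i.e. true duplicates
exact_dupes = toy[toy.duplicated(keep=False)]

print(same_title)
print(exact_dupes)
```

Here the two rows titled Movie A have different durations, so they share a title without being true duplicates, and exact_dupes comes back empty.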

Inline condition:

# calculate the average star rating for each genre, but only include genres with at least 10 movies
genre_counts = movies['genre'].value_counts()
genres = genre_counts[genre_counts >= 10].index  # >= so that genres with exactly 10 movies are kept
movies[movies['genre'].isin(genres)].groupby('genre')['star_rating'].mean()
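Another way to express this kind of condition is to compute the count and the mean per genre in one pass with `agg` and then filter on the count. The sketch uses made-up data and a lowered threshold of 2 (a placeholder for the 10 used above) so that the toy example has something to filter:

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up values, same column names)
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Horror'],
    'star_rating': [8.0, 9.0, 8.5, 7.0, 8.0, 7.5],
})

min_movies = 2  # hypothetical threshold for the toy data; the article uses 10

# Count and average the ratings of each genre in a single groupby
stats = toy.groupby('genre')['star_rating'].agg(['count', 'mean'])

# Keep the average only for genres with enough movies
avg_rating = stats.loc[stats['count'] >= min_movies, 'mean']

print(avg_rating)
```

Note that count here tallies non-missing star ratings, which matches the number of movies only when no rating is missing.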

Q: Figure out how many movies each actor appeared in

We will start by cleaning the data:

# make a function that cleans the data
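The cleaning function itself is not shown here, so the following is only a guess at what it might look like. It assumes a column named actors_list whose cells are strings written like Python lists, for example "[u'Actor One', u'Actor Two']". Both the column name and that format are assumptions, and clean_actors is a hypothetical helper name. The actor names are placeholders:

```python
import ast

import pandas as pd


def clean_actors(raw):
    """Parse a string such as "[u'Actor One', u'Actor Two']" into a list of names."""
    # literal_eval safely evaluates the list literal; the u'' prefix is valid in Python 3
    return [name.strip() for name in ast.literal_eval(raw)]


# Toy stand-in for the movies DataFrame (made-up values, assumed column name)
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'actors_list': [
        "[u'Actor One', u'Actor Two']",
        "[u'Actor Two', u'Actor Three']",
        "[u'Actor Two', u'Actor One']",
    ],
})

# One row per (movie, actor) pair, then count the appearances of each actor
actor_counts = toy['actors_list'].apply(clean_actors).explode().value_counts()

print(actor_counts)
```

If the real column is stored differently, only clean_actors needs to change; the explode and value_counts steps stay the same.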

You can see all of this code in this GitHub repository. Feel free to try more operations on this data set as practice.

https://www.buymeacoffee.com/chaudhrytalha

Passionate about using technology for Social Impact. Let’s connect: https://www.linkedin.com/in/chtalha