Published in DataCTW

IMDB DataSet Visualization & Data Analytics Using Pandas

We have a .csv file of the IMDB top 1000 movies, and today we will use Pandas to analyze and visualize this data.

Download https://github.com/thechaudharysab/imdb-data-pandas-visualization/blob/master/data/imdb_1000.csv

I’m using Jupyter Notebook to run all of this code. So, first things first: let’s import the Pandas and Matplotlib libraries. Pandas handles the data analytics and Matplotlib handles the visualization.

import pandas as pd
import matplotlib.pyplot as plt

Read in ‘imdb_1000.csv’ and store it in a DataFrame named movies

imdb_1000_data_url = r'data/imdb_1000.csv'
movies = pd.read_csv(imdb_1000_data_url)
movies.head()  # preview the data set

Let’s get an initial idea of the data:

# check the number of rows and columns
movies.shape
# check the data type of each column
movies.dtypes
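
If you don’t have the file on disk yet, the same loading and inspection steps can be tried on a few in-memory rows. This is only a sketch: it assumes the column names used later in this post (star_rating, title, content_rating, genre, duration), and the sample rows are made up for illustration, not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real imdb_1000.csv is much larger.
sample_csv = io.StringIO(
    "star_rating,title,content_rating,genre,duration\n"
    "9.0,Movie A,R,Crime,142\n"
    "8.5,Movie B,PG-13,Action,152\n"
    "8.1,Movie C,PG,Drama,96\n"
)

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(sample_csv)

print(sample.head(2))  # first two rows
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # one data type per column
```

Numeric columns such as duration and star_rating should come back as integer and float types, which is what makes the calculations in the rest of the post possible.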

Now that we know the data type of each column, we can perform operations on the data, such as:

# calculate the average movie duration
movies['duration'].mean()
Out[]: 120.97957099080695

We can also sort the data:

# sort the DataFrame by duration to find the shortest and longest movies
movies.sort_values('duration')
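
Sorting returns the whole table, so you still have to look at both ends of it. As a small sketch on made-up rows (the titles and durations below are hypothetical), here is how to pull out the shortest and longest movies directly:

```python
import pandas as pd

# Hypothetical rows standing in for the real movies DataFrame
movies_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "duration": [142, 64, 242, 120],
})

# After sorting in ascending order, the first row is the shortest
# movie and the last row is the longest.
by_duration = movies_demo.sort_values("duration")
shortest = by_duration.iloc[0]
longest = by_duration.iloc[-1]
print(shortest["title"], shortest["duration"])
print(longest["title"], longest["duration"])

# nsmallest / nlargest return the n extreme rows without a full sort
print(movies_demo.nsmallest(2, "duration"))
print(movies_demo.nlargest(2, "duration"))
```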

Let’s look at the data in visual form.

# create a histogram of duration, choosing an "appropriate" number of bins
movies['duration'].plot(kind='hist', bins=10)
This histogram tells us that most of the movies have a duration of 100–125 minutes.

This is one of the benefits of visualization: differences in the data become easy to see. There are many kinds of visualizations to choose from, and the right one depends on your data, because a histogram may not answer the business question you are asking.

# use a box plot to display that same data
movies['duration'].plot(kind='box')
The green line is the median.

The box plot tells a similar story: the box, which covers the middle 50% of the movies, spans roughly 110 to 135 minutes, and the median is clearly marked. With the histogram, we could not read off the median clearly.
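
The numbers a box plot draws can also be computed directly. The sketch below uses a made-up list of durations (not the real data) and assumes Matplotlib’s default whisker length of 1.5 times the interquartile range.

```python
import pandas as pd

# Hypothetical durations in minutes
durations = pd.Series([90, 100, 110, 120, 130, 140, 200], name="duration")

median = durations.median()      # the line inside the box
q1 = durations.quantile(0.25)    # bottom edge of the box
q3 = durations.quantile(0.75)    # top edge of the box
iqr = q3 - q1                    # height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print(median, q1, q3, iqr)
print(outliers.tolist())
```

On the real data you would replace durations with movies['duration'].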

Let’s do some more intermediate data analytics and visualizations using pandas.

# count how many movies have each of the content ratings
movies[['content_rating', 'title']].groupby('content_rating').count()
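
As a side note, value_counts() gives the same counts in a single call. This is a sketch on hypothetical rows, shown next to the groupby version for comparison:

```python
import pandas as pd

# Hypothetical rows standing in for the real movies DataFrame
demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "content_rating": ["R", "PG", "R", "PG-13", "R"],
})

# groupby + count: counts non-missing titles per rating, ordered by rating label
counts_groupby = demo.groupby("content_rating")["title"].count()

# value_counts: counts rows per rating, most frequent rating first
counts_vc = demo["content_rating"].value_counts()

print(counts_groupby)
print(counts_vc)
```

The two differ in ordering, and count() skips rows where the title is missing, so the results only match exactly when the title column has no gaps.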

To see the data above in visual form:

# use a visualization to display that same data, including a title and x and y labels
movies[['content_rating', 'title']].groupby('content_rating').count().plot(kind='bar', title='Content Rating Visualization')
plt.xlabel('Content Rating')
plt.ylabel('Title Count')
You can set the title, the x and y labels, and many other properties of the plot.

Here are some more analytics you can perform on the data.

# convert the following content ratings to "NC-17": X, TV-MA
# replace() returns a new Series, so assign it back to keep the change
movies['content_rating'] = movies['content_rating'].replace(['X', 'TV-MA'], 'NC-17')
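
One thing to watch for: replace() does not modify the column in place. It returns a new Series, and the change is lost unless you assign the result back. A minimal sketch on made-up ratings:

```python
import pandas as pd

# Hypothetical ratings column
ratings = pd.DataFrame({"content_rating": ["R", "X", "TV-MA", "PG"]})

# replace() returns a converted copy ...
converted = ratings["content_rating"].replace(["X", "TV-MA"], "NC-17")

# ... while the original column is still untouched at this point
before = ratings["content_rating"].tolist()

# Assigning the result back is what makes the change stick
ratings["content_rating"] = converted
after = ratings["content_rating"].tolist()

print(before)
print(after)
```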

Let’s see if there are any missing values:

# count the number of missing values in each column
movies.isnull().sum(axis=0)
# if there are missing values: examine them, then fill them in with "reasonable" values
movies[movies['content_rating'].isnull()]
# .at sets a single cell only, so use .loc to fill several rows at once
movies.loc[[187, 649], 'content_rating'] = 'PG'
movies.at[936, 'content_rating'] = 'PG-13'
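
Here is the same find-and-fill workflow on a tiny made-up table, as a sketch. The row labels and fill values are hypothetical. Note that .at addresses exactly one cell, while .loc accepts a list of row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with three missing content ratings
demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "content_rating": ["R", np.nan, "PG", np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()

# The rows that need attention
missing_rows = demo[demo["content_rating"].isnull()]
print(missing_rows)

# Fill several rows at once with .loc and a list of labels
demo.loc[[1, 3], "content_rating"] = "PG"

# Fill a single cell with .at
demo.at[4, "content_rating"] = "PG-13"

remaining = int(demo["content_rating"].isnull().sum())
print(remaining)
```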

You can also filter the data with inline conditions. For example:

# calculate the average star rating for movies 2 hours or longer,
# and compare that with the average star rating for movies shorter than 2 hours
print('Avg. star rating for movies 2 hours or longer: ', movies[movies['duration'] >= 120]['star_rating'].mean(),
      '\nAvg. star rating for movies shorter than 2 hours: ', movies[movies['duration'] < 120]['star_rating'].mean())
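
The condition can also be stored once and reused, or handed to groupby so both averages come out of a single call. A sketch on hypothetical numbers:

```python
import pandas as pd

# Hypothetical durations (minutes) and star ratings
demo = pd.DataFrame({
    "duration": [90, 110, 120, 150, 180],
    "star_rating": [7.5, 8.0, 8.0, 8.5, 9.0],
})

# A boolean Series: True where the movie runs 2 hours or longer
is_long = demo["duration"] >= 120

# Use the mask directly, and its negation (~) for the other group
avg_long = demo.loc[is_long, "star_rating"].mean()
avg_short = demo.loc[~is_long, "star_rating"].mean()

# Or label each row and let groupby compute both averages at once
length_label = is_long.map({True: "2 hours or longer", False: "shorter than 2 hours"})
by_length = demo.groupby(length_label)["star_rating"].mean()

print(avg_long, avg_short)
print(by_length)
```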

Let’s visualize this:

# use a visualization to detect whether there is a relationship between duration and star rating
movies.boxplot(column='duration', by='star_rating');
This suggests that movies shorter than about 125 minutes are more likely to receive a rating.
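
A single number can complement the plot. The sketch below computes the Pearson correlation between duration and star rating on made-up values. It only measures linear association, so treat it as a rough companion to the box plot, not a replacement for it.

```python
import pandas as pd

# Hypothetical durations (minutes) and star ratings
demo = pd.DataFrame({
    "duration": [90, 100, 110, 120, 130],
    "star_rating": [7.0, 7.5, 7.4, 8.0, 8.2],
})

# Series.corr uses Pearson correlation by default:
# +1 is a perfect increasing line, 0 is no linear relationship,
# -1 is a perfect decreasing line.
r = demo["duration"].corr(demo["star_rating"])
print(round(r, 3))
```

On the real data the call would be movies['duration'].corr(movies['star_rating']).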

# calculate the average duration for each genre
movies[['duration', 'genre']].groupby('genre').mean()

Next, let’s visualize duration by content rating:

# visualize the relationship between content rating and duration
movies.boxplot(column='duration', by='content_rating')

# determine the top rated movie (by star rating) for each genre
# select several columns with a list (double brackets)
movies.sort_values('star_rating', ascending=False).groupby('genre')[['title', 'star_rating']].first()

The first() method returns the first non-null value of each selected column within each group. Because the rows were sorted by star rating in descending order beforehand, that first row is the top-rated movie of its genre.
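
To see why sorting first matters, here is a sketch on four made-up movies. It also shows idxmax() as an alternative way to reach the same rows without sorting; that alternative is my addition, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real movies DataFrame
demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Drama", "Drama", "Action", "Action"],
    "star_rating": [8.1, 9.0, 8.4, 7.9],
})

# Sort so the best movie comes first, then take the first row per genre
top_sorted = (
    demo.sort_values("star_rating", ascending=False)
        .groupby("genre")[["title", "star_rating"]]
        .first()
)

# Alternative: idxmax gives the row label of the highest rating per genre
best_rows = demo.groupby("genre")["star_rating"].idxmax()
top_idx = demo.loc[best_rows, ["genre", "title", "star_rating"]].set_index("genre")

print(top_sorted)
print(top_idx)
```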

Here is some more analysis this data supports.

# check if there are multiple movies with the same title, and if so, determine if they are actually duplicates
result = movies[movies['title'].isin(movies[movies.duplicated(['title'])]['title'])]
result.sort_values('title')
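
A shorter route to the same rows is duplicated() with keep=False, which flags every member of a duplicate group instead of only the later ones. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: two movies share a title but differ in duration
demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A", "Movie C"],
    "duration": [85, 170, 128, 117],
})

# keep=False marks all rows whose title occurs more than once
same_title = demo[demo.duplicated("title", keep=False)].sort_values(["title", "duration"])
print(same_title)

# Checking all columns tells us whether any rows are exact copies
full_dupes = demo[demo.duplicated(keep=False)]
print(full_dupes.empty)
```

Here the two rows share a title but not a duration, so they are different movies and not true duplicates.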

Inline condition:

# calculate the average star rating for each genre, but only include genres with at least 10 movies
# "at least 10" means >= 10, not > 10
genre_counts = movies['genre'].value_counts()
genres = genre_counts[genre_counts >= 10].index
movies[movies['genre'].isin(genres)].groupby('genre')['star_rating'].mean()
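
Another way to express this, sketched on made-up rows, is to compute the count and the mean together with agg() and then filter on the count. The threshold of 2 below is only there to suit the tiny sample; the post uses 10.

```python
import pandas as pd

# Hypothetical rows standing in for the real movies DataFrame
demo = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Action", "Action", "Western"],
    "star_rating": [8.0, 8.5, 9.0, 7.5, 8.5, 8.2],
})

min_movies = 2  # assumption for this small sample

# One row per genre with both the movie count and the average rating
stats = demo.groupby("genre")["star_rating"].agg(["count", "mean"])

# Keep only genres with at least min_movies movies
result = stats[stats["count"] >= min_movies]
print(result)
```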

Q: Figure out how many movies each actor appeared in

We will start by cleaning the data:

# Make a function which cleans the data
def repp(string):
    return string.replace("[", "").replace("]", "").replace("u'", "").replace("',", ",")[:-1]

# Apply that function to every entry
movies_series = movies['actors_list'].apply(repp)

# Declare a list to store the split values
actors_list = []
for movie_actors in movies_series:
    actors_list.append([e.strip() for e in movie_actors.split(',')])

# Declare a dictionary; if the actor name key exists, add 1 to its count, otherwise start it at 1
actor_dict = {}
for actor in actors_list:
    for a in actor:
        if a in actor_dict:
            actor_dict[a] += 1
        else:
            actor_dict[a] = 1

actor_dict
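
The same count can be reached with less string surgery. This sketch assumes each actors_list entry is a string formatted like a Python list literal, for example [u'Name One', u'Name Two'], which is what the cleaning function above implies. The actor names below are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical entries in the assumed format
demo = pd.DataFrame({
    "actors_list": [
        "[u'Actor One', u'Actor Two']",
        "[u'Actor Two', u'Actor Three']",
    ],
})

# literal_eval safely parses each string into a real Python list
parsed = demo["actors_list"].apply(ast.literal_eval)

# explode puts each actor on its own row; value_counts tallies them
actor_counts = parsed.explode().value_counts()
print(actor_counts)
```

Parsing the list literal avoids accidentally stripping characters from names that happen to contain the sequence u'.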

You can see all of this code in this GitHub repository. Feel free to try more operations on this data set as practice.

https://www.buymeacoffee.com/chaudhrytalha

DataCTW is a website of online tutorials that are hosted in different places and are completely free. Our aim is to provide tutorials from scratch so that anyone who is looking to learn something new can benefit from them.

Chaudhry Talha

Passionate about using technology for Social Impact. Let’s connect: https://www.linkedin.com/in/chtalha