IMDB DataSet Visualization & Data Analytics Using Pandas

import pandas as pd
import matplotlib.pyplot as plt
imdb_1000_data_url = r'data/imdb_1000.csv'
movies = pd.read_csv(imdb_1000_data_url)
movies.head() #to preview the data set
# check the number of rows and columns
# check the data type of each column
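The two checks above are not shown with code, so here is a minimal sketch of one way to do them. The three rows are made-up stand-ins, not rows from imdb_1000.csv; only the `shape` and `dtypes` attributes matter.

```python
import pandas as pd

# Toy stand-in for the IMDB data (made-up rows, not the real file).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "star_rating": [8.1, 7.9, 8.4],
    "duration": [95, 120, 142],
})

n_rows, n_cols = movies.shape   # shape is a (rows, columns) tuple
column_types = movies.dtypes    # a Series holding one dtype per column

print(n_rows, n_cols)
print(column_types)
```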
# calculate the average movie duration
Out[]: 120.97957099080695
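An output like the one above is what you would get from taking the mean of the duration column. The sketch below assumes that approach and uses three made-up durations, so its result differs from the real data set.

```python
import pandas as pd

# Made-up durations in minutes, standing in for the real 'duration' column.
movies = pd.DataFrame({"duration": [90, 120, 150]})

# Series.mean() returns the arithmetic mean as a float.
average_duration = movies["duration"].mean()
print(average_duration)
```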
# sort the DataFrame by duration to find the shortest and longest movies
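One way to do this sort is `sort_values`, then reading the first and last rows. This is a sketch on made-up rows; the titles and durations are placeholders.

```python
import pandas as pd

# Made-up rows; only the 'duration' ordering matters here.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "duration": [142, 95, 120, 201],
})

# Ascending sort: shortest movie first, longest movie last.
by_duration = movies.sort_values("duration")
shortest = by_duration.head(1)
longest = by_duration.tail(1)

print(shortest)
print(longest)
```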
# create a histogram of duration, choosing an “appropriate” number of bins
movies['duration'].plot(kind='hist', bins=10)
This histogram tells us that most of the movies have a duration of 100–125 minutes.
# use a box plot to display that same data
The green line inside the box marks the median.
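The box plot call is not shown here, so this is only a sketch of one way to draw it with pandas, assuming the same `plot(kind=...)` interface used for the histogram. The durations are made up, and the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up durations in minutes, including one long outlier.
movies = pd.DataFrame({"duration": [90, 100, 110, 120, 200]})

# kind='box' draws the quartile box, whiskers, outliers, and the median line.
ax = movies["duration"].plot(kind="box")
ax.set_ylabel("Duration (minutes)")

# The line inside the box sits at this value.
median_duration = movies["duration"].median()
print(median_duration)
```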
# count how many movies have each of the content ratings
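A short way to get these counts is `value_counts` on the column. The sketch below uses a handful of made-up ratings rather than the real data.

```python
import pandas as pd

# Made-up content ratings standing in for the real column.
movies = pd.DataFrame({"content_rating": ["R", "PG-13", "R", "PG", "R", "PG-13"]})

# value_counts() returns one count per distinct value, largest first.
rating_counts = movies["content_rating"].value_counts()
print(rating_counts)
```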
# use a visualization to display that same data, including a title and x and y labels
movies[['content_rating','title']].groupby('content_rating').count().plot(kind='bar', title='Content Rating Visualization')
plt.xlabel('Content Rating')
plt.ylabel('Title Count')
You can set the title, the x and y labels, and many other properties to control how the data is displayed.
# convert the following content ratings to “NC-17”: X, TV-MA
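One possible way to do this conversion is `Series.replace` with a list of old values and a single new value. This is a sketch on made-up ratings, not necessarily how it was done in the original notebook.

```python
import pandas as pd

# Made-up ratings that include the two values to be converted.
movies = pd.DataFrame({"content_rating": ["R", "X", "TV-MA", "PG", "NC-17"]})

# Map both 'X' and 'TV-MA' to 'NC-17'; other values are left unchanged.
movies["content_rating"] = movies["content_rating"].replace(["X", "TV-MA"], "NC-17")
print(movies["content_rating"].tolist())
```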
# count the number of missing values in each column
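Counting missing values per column is usually done by chaining `isnull()` and `sum()`. The sketch below uses three made-up rows with one missing rating.

```python
import pandas as pd

# Made-up rows; the second movie has no content rating.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "content_rating": ["R", None, "PG"],
    "duration": [95, 120, 142],
})

# isnull() marks missing cells as True; sum() counts the True values per column.
missing_per_column = movies.isnull().sum()
print(missing_per_column)
```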
# if there are missing values: examine them, then fill them in with “reasonable” values
movies[movies['content_rating'].isnull()]
movies.loc[[187, 649], 'content_rating'] = 'PG'
movies.loc[936, 'content_rating'] = 'PG-13'
# calculate the average star rating for movies 2 hours or longer,
# and compare that with the average star rating for movies shorter than 2 hours
print('Avg. star rating for movies 2 hours or longer: ', movies[movies['duration'] >= 120]['star_rating'].mean(),
      '\nAvg. star rating for movies shorter than 2 hours: ', movies[movies['duration'] < 120]['star_rating'].mean())
# use a visualization to detect whether there is a relationship between duration and star rating
movies.boxplot(column='duration', by='star_rating');
This tells us that, across star ratings, most movies run under about 125 minutes.
# calculate the average duration for each genre
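A per-genre average is a natural fit for `groupby`. Here is a sketch on made-up rows; the genres and durations are placeholders.

```python
import pandas as pd

# Made-up rows: two genres with two movies each.
movies = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy"],
    "duration": [130, 150, 90, 100],
})

# Group rows by genre, then average the duration within each group.
avg_duration_by_genre = movies.groupby("genre")["duration"].mean()
print(avg_duration_by_genre)
```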
# visualize the relationship between content rating and duration
movies.boxplot(column='duration', by='content_rating')
# determine the top rated movie (by star rating) for each genre
movies.sort_values('star_rating', ascending=False).groupby('genre')[['title','star_rating']].first()
# check if there are multiple movies with the same title, and if so, determine if they are actually duplicates
result = movies[movies['title'].isin(movies[movies.duplicated(['title'])]['title'])]
# calculate the average star rating for each genre, but only include genres with at least 10 movies
genres = movies['genre'].value_counts()[movies['genre'].value_counts() >= 10].index
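Once the qualifying genres are known, the averaging step can be done by filtering with `isin` and grouping. The sketch below assumes that approach. The rows are made up, and the hypothetical `min_movies` threshold is set to 2 so the toy data has something to filter (with the real data it would be 10).

```python
import pandas as pd

# Made-up rows: Drama has 3 movies, Comedy 2, Horror only 1.
movies = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Horror"],
    "star_rating": [8.0, 8.5, 9.0, 7.0, 8.0, 7.5],
})

min_movies = 2  # hypothetical threshold for the toy data

# Keep only genres that appear at least min_movies times.
genre_counts = movies["genre"].value_counts()
genres = genre_counts[genre_counts >= min_movies].index

# Average the star rating within the remaining genres.
avg_rating = movies[movies["genre"].isin(genres)].groupby("genre")["star_rating"].mean()
print(avg_rating)
```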
# Make a function which cleans the data
def repp(string):
    return string.replace("[","").replace("]","").replace("u'","").replace("',",",")[:-1]
# Apply that function to every entry
movies_series = movies['actors_list'].apply(repp)
# Declare a list to store the split values
actors_list = []
for movie_actors in movies_series:
    actors_list.append([e.strip() for e in movie_actors.split(',')])
# Declare a dictionary, check whether the actor name key exists, and count accordingly
actor_dict = {}
for actor in actors_list:
    for a in actor:
        if a in actor_dict:
            actor_dict[a] += 1
        else:
            actor_dict[a] = 1
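The same count can also be done with the standard library. The sketch below assumes each `actors_list` entry is a string that looks like a Python list literal with `u'...'` items, which is what the string cleanup above implies. The actor names are made up.

```python
import ast
from collections import Counter

import pandas as pd

# Made-up entries in the assumed "[u'Name', u'Name', ...]" string format.
movies = pd.DataFrame({
    "actors_list": [
        "[u'Actor One', u'Actor Two', u'Actor Three']",
        "[u'Actor Two', u'Actor Four', u'Actor One']",
    ]
})

# literal_eval parses each string into a real list of names;
# Counter then tallies how many movies each name appears in.
actor_counts = Counter(
    actor
    for raw in movies["actors_list"]
    for actor in ast.literal_eval(raw)
)
print(actor_counts.most_common())
```

Parsing the string as a literal avoids hand-written replace rules, which can trip over names that contain an apostrophe or a comma.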




DataCTW is a website of free online tutorials hosted at different places. Our aim is to provide tutorials from scratch so that anyone who is looking to learn something new can benefit from them.
