Data Cleaning and Exploratory Data Analysis with Pandas on Trending Youtube Video Statistics
This article is written as a log of our work for our Data Science project. Our project was aimed at analyzing the daily YouTube trending videos data sets that are available here:
https://www.kaggle.com/datasnaek/youtube-new
YouTube maintains a list of the top trending videos on the platform. This data set includes several months of data on daily trending YouTube videos, covering the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, India and Japan. The trending list is determined from user interactions such as views, comments and likes, and the videos are ranked by a ratio of views, likes, comments and shares so that the best videos appear at the top of the page. We took these data sets and began with data cleaning, followed by exploratory data analysis (EDA). This article is divided into two sections as follows:
Data Cleaning
Exploratory Data Analysis
- Ratio of likes-dislikes in different categories?
- Users like videos from which category more?
- Top 5 videos that are on trending in each country?
- Is the most liked video also the most trending video?
- Maximum number of days to trending status for a video?
- Users like videos from which category more?
- Users comment on which category the most?
- Frequently occurring words in tags and description
- and more …
Data Cleaning
Importing Libraries
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
import json
import pickle
import six
from wordcloud import WordCloud, STOPWORDS  # used later for the word clouds
sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None
Importing all the CSV files
AllCSV = [i for i in glob.glob('*.{}'.format('csv'))]
AllCSV
Reading all CSV files
all_dataframes = []  # list to store each data frame separately
for csv in AllCSV:
    df = pd.read_csv(csv)
    df['country'] = csv[0:2]  # adding a 'country' column so that each dataset can be identified uniquely
    all_dataframes.append(df)

all_dataframes[0].head()  # index 0 to 9 for [CA, DE, FR, GB, IN, JP, KR, MX, RU, US] datasets
Fixing Data Types
The first part of the data cleaning process was to fix the data types of all the columns, to make them easier to manipulate and more manageable. Note that several columns were converted to strings; when the data types are displayed they show up as object, since strings are a type of object in pandas.
for df in all_dataframes:
    # video_id
    df['video_id'] = df['video_id'].astype('str')

    # trending_date is stored as 'yy.dd.mm'
    df['trending_date'] = df['trending_date'].astype('str')
    date_pieces = df['trending_date'].str.split('.')
    df['Year'] = date_pieces.str[0].astype(int) + 2000  # two-digit year to full year
    df['Day'] = date_pieces.str[1].astype(int)
    df['Month'] = date_pieces.str[2].astype(int)
    del df['trending_date']
    df['trending_date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
    del df['Year']
    del df['Day']
    del df['Month']

    # title
    df['title'] = df['title'].astype('str')

    # channel_title
    df['channel_title'] = df['channel_title'].astype('str')

    # category_id
    df['category_id'] = df['category_id'].astype(str)

    # tags
    df['tags'] = df['tags'].astype('str')

    # views, likes, dislikes, comment_count are already in the correct data type, i.e. int64

    # thumbnail_link
    df['thumbnail_link'] = df['thumbnail_link'].astype('str')

    # description
    df['description'] = df['description'].astype('str')

    # changing comments_disabled, ratings_disabled, video_error_or_removed from bool to categorical
    df['comments_disabled'] = df['comments_disabled'].astype('category')
    df['ratings_disabled'] = df['ratings_disabled'].astype('category')
    df['video_error_or_removed'] = df['video_error_or_removed'].astype('category')

    # publish_time
    df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')
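The piecewise year/day/month handling above can also be expressed as a single `pd.to_datetime` call with a format string. A minimal sketch, using hypothetical sample values in the data set's `yy.dd.mm` layout:

```python
import pandas as pd

# hypothetical trending_date values in the 'yy.dd.mm' layout used by these datasets
s = pd.Series(['17.14.11', '18.01.03'])
parsed = pd.to_datetime(s, format='%y.%d.%m')  # %y maps two-digit years to 2000s here
print(parsed.dt.strftime('%Y-%m-%d').tolist())  # ['2017-11-14', '2018-03-01']
```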
Separating ‘publish_time’ into ‘publish_date’ and ‘publish_time’
for df in all_dataframes:
    df.insert(4, 'publish_date', df['publish_time'].dt.date)  # loc, column name, values for column to be inserted
    df['publish_time'] = df['publish_time'].dt.time

# Changing data type for 'publish_date' from object to 'datetime64[ns]'
for df in all_dataframes:
    df['publish_date'] = pd.to_datetime(df['publish_date'], format="%Y-%m-%d")
Now a quick look at the updated data types!
# We can use any index from 0 to 9 inclusive (one for each of the 10 dataframes)
all_dataframes[1].dtypes
For the index, we chose video_id
for df in all_dataframes:
    df.set_index('video_id', inplace=True)
Examining Missing Values
Missing values must be handled because they reduce the statistical power of the data set and can introduce bias, leading to invalid conclusions. Missing data can be handled either by removing the affected rows, which is often done for non-numeric data, or by imputing the missing values with, for example, the mean or median of the column.
We checked for missing values using a heat-map, where any missing value in a column would appear as an orange square against the black background of the heat-map. As you can see from one of the screenshots from the notebook below, no data set had any missing values, so no handling was necessary.
for df in all_dataframes:
    sns.heatmap(df.isnull(), cbar=False)
    plt.figure()
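Besides the heat-map, the same check can be done numerically with `isnull().sum()`. A small sketch on a hypothetical frame with one missing description:

```python
import pandas as pd
import numpy as np

# hypothetical frame with one missing description value
df = pd.DataFrame({'title': ['a', 'b'], 'description': ['x', np.nan]})
missing = df.isnull().sum()  # count of missing values per column
print(missing.to_dict())  # {'title': 0, 'description': 1}
```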
Combining Every Dataframe Into One Huge Dataframe
We combined all the cleaned data sets into one massive data set in order to perform EDA, as all necessary operations can be performed on this unified, clean data set without any issues.
combined_df = pd.concat(all_dataframes)
Next, we decided to further clean and refine the data by sorting the entries of the data set by trending_date, so that the latest trending videos move to the top of the data set. This lets us view the current trends of the trending videos in each country, which are the most relevant to our project.
Before we did so, however, we created a duplicate copy of our data frame. We did this as a safety precaution, to keep a copy of the original data frame at hand, since we also decided to remove duplicate video entries while sorting the videos in the other data frame.
# Making a copy of the original dataframe
backup_df = combined_df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Sorting according to latest trending date while removing duplicates
combined_df = combined_df.reset_index().sort_values('trending_date', ascending=False).drop_duplicates('video_id', keep='first').set_index('video_id')

# Doing the same operation for each of the individual dataframes in the list we created earlier
# (rebuilding the list, since rebinding the loop variable would not update the stored frames)
all_dataframes = [df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')
                  for df in all_dataframes]

# Printing results
combined_df[['publish_date', 'publish_time', 'trending_date', 'country']].head()
# The latest publications and trending information are now at the top
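The sort-then-drop_duplicates pattern keeps only the latest trending row per video. A minimal sketch, with a hypothetical video that trended on two different dates:

```python
import pandas as pd

# hypothetical: one video trending on two different dates
df = pd.DataFrame({
    'video_id': ['abc', 'abc'],
    'trending_date': pd.to_datetime(['2018-01-01', '2018-01-05']),
    'views': [100, 250],
})
latest = (df.sort_values('trending_date', ascending=False)
            .drop_duplicates('video_id', keep='first')
            .set_index('video_id'))
print(latest.loc['abc', 'views'])  # 250: only the latest row survives
```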
Inserting Category Column
One of our final steps in cleaning the data sets was checking the JSON files that came with them. We needed to see whether these files contained any useful data. As there were multiple files, we decided to read two files at random, to check whether they all contained the same data or different data.
# read file
with open('US_category_id.json', 'r') as f:  # reading one randomly selected json file to make sense of its contents
    data = f.read()

# parse file
obj = json.loads(data)

# printing
obj
The other randomly selected JSON file had similar data. Each JSON file contains ids ranging from 1 to 44 (both inclusive), and with each id comes its category along with other information such as title, kind, etc. Hence, we can use any one of the JSON files to map category to category_id in our data frame.
category_id = {}

with open('DE_category_id.json', 'r') as f:
    d = json.load(f)
    for category in d['items']:
        category_id[category['id']] = category['snippet']['title']

combined_df.insert(2, 'category', combined_df['category_id'].map(category_id))
backup_df.insert(2, 'category', backup_df['category_id'].map(category_id))
for df in all_dataframes:
    df.insert(2, 'category', df['category_id'].map(category_id))

# Printing the cleaned combined dataframe
combined_df.head(3)
combined_df.head(3)
combined_df['category'].unique()
Thus, we cleaned up and refined our data sets into a finalized data frame, ready to be used for the upcoming EDA section of the project. We pickled both the finalized data frame and a copy of the original cleaned data frame into files, ready for use.
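Pickling a data frame is one line each way with pandas. A sketch using a temporary directory (the filename is an assumption):

```python
import os
import tempfile

import pandas as pd

# hypothetical stand-in for the finalized dataframe
df = pd.DataFrame({'views': [1, 2, 3]})

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'combined_df.pickle')  # hypothetical filename
    df.to_pickle(path)               # serialize the frame to disk
    restored = pd.read_pickle(path)  # load it back

print(restored.equals(df))  # True
```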
Exploratory Data Analysis
Ratio of likes-dislikes in different categories
We began by looking at the ratio of likes and dislikes on a trending video, depending on the category. The higher the ratio, the more the content of the video was liked by the users. As you can see from the resulting output below:
# calculating total likes for each category
likesdf = combined_df.groupby('category')['likes'].agg('sum')

# calculating total dislikes for each category
dislikesdf = combined_df.groupby('category')['dislikes'].agg('sum')

# calculating ratio of likes to dislikes
ratiodf = likesdf / dislikesdf

# most liked category to appear on top
ratiodf = ratiodf.sort_values(ascending=False).reset_index()
ratiodf.columns = ['category', 'ratio']

# plotting bar chart
plt.subplots(figsize=(10, 15))
sns.barplot(x="ratio", y="category", data=ratiodf,
            label="Likes-Dislikes Ratio", color="b")
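On a toy frame with hypothetical categories and counts, the ratio computation works like this:

```python
import pandas as pd

# hypothetical per-video rows across two categories
df = pd.DataFrame({
    'category': ['Music', 'Music', 'News'],
    'likes': [80, 120, 30],
    'dislikes': [10, 10, 30],
})
likes = df.groupby('category')['likes'].agg('sum')
dislikes = df.groupby('category')['dislikes'].agg('sum')
ratio = (likes / dislikes).sort_values(ascending=False)
print(ratio.to_dict())  # {'Music': 10.0, 'News': 1.0}
```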
Observations:
We see that videos in the pets and animals category have the highest ratio of likes to dislikes among the trending categories, whereas news and politics videos have the lowest. From this we can infer that people are less divided over the content of entertainment videos than over topics such as news, whose content can divide opinion among users.
Users like videos from which category more?
Next, we moved on to observe which categories were more preferred by users, based on the likes on the videos in each category. Three of the outputs were randomly selected from the ten plotted graphs; the results are as follows:
# Getting names of all countries
countries = []
allcsv = [i for i in glob.glob('*.{}'.format('csv'))]
for csv in allcsv:
    c = csv[0:2]
    countries.append(c)

# Plotting category counts per country
# (the original if/else branches for 'US' were identical, so one branch suffices)
for country in countries:
    tempdf = combined_df[combined_df['country'] == country]['category'].value_counts().reset_index()
    ax = sns.barplot(y=tempdf['index'], x=tempdf['category'], data=tempdf, orient='h')
    plt.xlabel("Number of Videos")
    plt.ylabel("Categories")
    plt.title("Categories of trending videos in " + country)
    plt.figure()
Observations
Apart from RU and GB, the category most liked by users in each of the other countries is ‘Entertainment’.
Viewers from RU prefer the category ‘People and Blogs’ the most.
Viewers from GB prefer the category ‘Music’ the most.
The categories ‘Shows’, ‘Movies’, and ‘Trailers’ were the least liked in almost all of the countries.
Top 5 videos that are on trending in each country?
Now that we know the most preferred category of users in each country, we decided to see what the top 5 videos on the trending page of each country were. As before, we selected 3 random results out of the total of 10 to display in this article:
temporary = []
for df in all_dataframes:
    temp = df
    temp = temp.reset_index().sort_values(by=['views'], ascending=False)
    temp.drop_duplicates(subset="video_id", keep='first', inplace=True)
    temp.set_index('video_id', inplace=True)
    temp = temp.head(5)  # top 5 that are on trending
    temporary.append(temp)

# Printing 3 randomly selected countries
temporary[2][['title', 'channel_title', 'category', 'views', 'likes']]
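The sort_values-then-head(5) idiom can equivalently be written with `nlargest`. A sketch on hypothetical view counts:

```python
import pandas as pd

# hypothetical view counts per video
df = pd.DataFrame({
    'video_id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'views': [50, 300, 120, 900, 10, 400],
}).set_index('video_id')

top5 = df.nlargest(5, 'views')  # same result as sort_values(ascending=False).head(5)
print(top5.index.tolist())  # ['d', 'f', 'b', 'c', 'a']
```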
Observations
Users from every country mostly prefer videos belonging to the categories of ‘Music’ and ‘Entertainment’, which potentially means users mainly use the platform for recreational purposes in comparison to other uses.
Is the most liked video also the most trending video?
Knowing the preference by country for the trending videos, a question arose: Is the most trending video in that country also the video with the most likes?
temporary = []  # to store results for each country
for df in all_dataframes:
    temp = df
    temp = temp.reset_index().sort_values(by=['likes'], ascending=False)
    temp.drop_duplicates(subset="video_id", keep='first', inplace=True)
    temp.set_index('video_id', inplace=True)
    temp = temp.head(5)  # top 5 that are most liked
    temporary.append(temp)

# Printing 3 randomly selected results
temporary[0][['views', 'likes']]
Observation
Although it may not be immediately visible, we concluded that the most liked video is not necessarily the most trending one. Digging deeper into the data set and supporting information, we came to the conclusion that the most trending video is often the most viewed video (in combination with several other factors).
Maximum number of days to trending status for a video?
We know that all the videos available in the data sets are part of the trending page but how long did it take for a video, on average, to reach the trending page starting from the day the video was published?
# Calculating days between publish date and trending date
temporary = []
for data in all_dataframes:
    temp = data
    temp['timespan'] = (temp['trending_date'] - temp['publish_date']).dt.days
    temporary.append(temp)

# Plotting
to_trending = temporary[0].sample(1000).groupby('video_id').timespan.max()  # CA
sns_ax = sns.boxplot(y=to_trending)
_ = sns_ax.set(yscale="log")
plt.show()
_ = sns.distplot(to_trending.value_counts(), bins='rice', kde=False)
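The `timespan` column is simply the day difference between the two datetime columns. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# hypothetical publish/trending date pairs
df = pd.DataFrame({
    'publish_date': pd.to_datetime(['2018-01-01', '2018-01-10']),
    'trending_date': pd.to_datetime(['2018-01-03', '2018-01-15']),
})
df['timespan'] = (df['trending_date'] - df['publish_date']).dt.days
print(df['timespan'].tolist())  # [2, 5]
```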
Observation
As we can see from both plots above, most videos take less than 100 days to reach the trending page. This can be attributed to the viral nature of content on the internet: if some online content is popular, it will often be shared and viewed within its short time span of relevancy.
Users like videos from which category more?
We also decided to check which category overall had the most likes from all the categories.
temp = combined_df
temp = temp.groupby('category')[['views', 'likes']].apply(lambda x: x.astype(int).sum())  # list indexing; tuple indexing is deprecated
temp = temp.sort_values(by='likes', ascending=False).head()
temp
Observation
As we can see, the most liked category is ‘Entertainment’ across all countries. This shows users’ preference for using YouTube as an entertainment platform.
Users comment on which category the most?
Similarly, we decided to explore which category gets the most comments overall.
temp = combined_df
temp = temp.groupby('category')[['views', 'likes', 'comment_count']].apply(lambda x: x.astype(int).sum())
temp = temp.sort_values(by='comment_count', ascending=False).head()
temp
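The groupby aggregation above, on a toy frame with hypothetical numbers:

```python
import pandas as pd

# hypothetical rows across two categories
df = pd.DataFrame({
    'category': ['Music', 'Music', 'Comedy'],
    'likes': [10, 20, 5],
    'comment_count': [1, 2, 3],
})
totals = df.groupby('category')[['likes', 'comment_count']].sum()
print(totals.loc['Music', 'likes'])  # 30
```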
Frequently occurring words in tags and description
Next, we moved on to the tags attached to videos and to video descriptions. Tags are often attached to videos to make them easier to search for and to place them in their relevant categories. Descriptions are written by the uploaders, often either giving a brief description of the video or adding relevant links. We decided to check whether there were any common words in the tags and descriptions. For this purpose, we used a word cloud, where the most frequently occurring words appear the largest.
# For tags
plt.figure(figsize=(15, 15))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
    background_color='black',
    stopwords=stopwords,
    max_words=1000,
    max_font_size=120,
    random_state=42
).generate(' '.join(combined_df['tags']))  # join all tags into one string (str(Series) would truncate the text)
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Tags', fontsize=20)
plt.axis('off')
plt.show()
# For description
plt.figure(figsize=(15, 15))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
    background_color='black',
    stopwords=stopwords,
    max_words=1000,
    max_font_size=120,
    random_state=42
).generate(' '.join(combined_df['description']))  # same joining approach as for tags
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Description', fontsize=20)
plt.axis('off')
plt.show()
Correlation between views, likes, dislikes, and comments
Next, we decided to see whether there was any correlation between views, likes, dislikes, and comments.
col = ['views', 'likes', 'dislikes', 'comment_count']
corr = combined_df[col].corr()
corr
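The observation below mentions a red diagonal, which suggests the matrix was rendered as a heatmap. A sketch of how such a plot could be produced on synthetic engagement data (the column relationships, noise levels, and colormap are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# synthetic engagement data; the proportional relationships are assumptions
rng = np.random.default_rng(42)
views = rng.integers(1_000, 1_000_000, size=200)
df = pd.DataFrame({
    'views': views,
    'likes': (views * 0.04 + rng.normal(0, 500, size=200)).astype(int),
    'dislikes': (views * 0.002 + rng.normal(0, 50, size=200)).astype(int),
    'comment_count': (views * 0.005 + rng.normal(0, 100, size=200)).astype(int),
})
corr = df[['views', 'likes', 'dislikes', 'comment_count']].corr()

# a diverging colormap makes the self-correlation diagonal (always 1) show up in red
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig('corr_heatmap.png')
```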
Observation
When evaluating the correlation between all the variables, the correlation of each feature with itself is included, which is always 1; this is why the plot always has a red diagonal from the upper left to the lower right.
We can see from the results that there is a positive relationship between views and likes, likes and comment_count, and dislikes and comment_count. We came to this conclusion because whenever the correlation coefficient is greater than zero, the relationship is positive.