Data Cleaning and Exploratory Data Analysis with Pandas on Trending Youtube Video Statistics
This article is written as a log of our work for our Data Science project. Our project was aimed at analyzing the daily YouTube trending videos data sets that are available here:
https://www.kaggle.com/datasnaek/youtube-new
YouTube maintains a list of the top trending videos on the platform. This data set includes several months of data on daily trending YouTube videos, covering the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, India and Japan. The trending list is determined from user interactions such as views, comments and likes, and the videos are ranked by a ratio of views, likes, comments and shares so that the best videos appear at the top of the page. We took these data sets and began with data cleaning, followed by exploratory data analysis (EDA). This article is divided into two sections as follows:
Data Cleaning
Exploratory Data Analysis
- Ratio of likes-dislikes in different categories?
- Users like videos from which category more?
- Top 5 videos that are on trending in each country?
- Is the most liked video also the most trending video?
- Maximum number of days to trending status for a video?
- Users like videos from which category more?
- Users comment on which category the most?
- Frequently occurring words in tags and description
- and more …
Data Cleaning
Importing Libraries
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
import json
import pickle
import six
from wordcloud import WordCloud, STOPWORDS  # used later for the word clouds
sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None
Importing all the CSV files
AllCSV = [i for i in glob.glob('*.{}'.format('csv'))]
AllCSV
Reading all CSV files
all_dataframes = []  # list to store each data frame separately
for csv in AllCSV:
    df = pd.read_csv(csv)
    df['country'] = csv[0:2]  # adding a 'country' column so that each dataset can be identified uniquely
    all_dataframes.append(df)

all_dataframes[0].head()  # index 0 to 9 for [CA, DE, FR, GB, IN, JP, KR, MX, RU, US] datasets
Fixing Data Types
The first part of the data cleaning process was to fix the data types of all the columns, to make them easier to manipulate and more manageable. Note that several columns were converted to strings; when the data types are displayed they show up as object, since strings are a type of object in pandas.
for df in all_dataframes:
    # video_id
    df['video_id'] = df['video_id'].astype('str')

    # trending_date is stored as 'yy.dd.mm'
    df['trending_date'] = df['trending_date'].astype('str')
    date_pieces = df['trending_date'].str.split('.')
    df['Year'] = date_pieces.str[0].astype(int) + 2000  # two-digit year to full year
    df['Day'] = date_pieces.str[1].astype(int)
    df['Month'] = date_pieces.str[2].astype(int)
    del df['trending_date']
    df['trending_date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
    del df['Year']
    del df['Day']
    del df['Month']

    # title
    df['title'] = df['title'].astype('str')

    # channel_title
    df['channel_title'] = df['channel_title'].astype('str')

    # category_id
    df['category_id'] = df['category_id'].astype(str)

    # tags
    df['tags'] = df['tags'].astype('str')

    # views, likes, dislikes, comment_count are already in the correct data type, i.e. int64

    # thumbnail_link
    df['thumbnail_link'] = df['thumbnail_link'].astype('str')

    # description
    df['description'] = df['description'].astype('str')

    # changing comments_disabled, ratings_disabled, video_error_or_removed from bool to categorical
    df['comments_disabled'] = df['comments_disabled'].astype('category')
    df['ratings_disabled'] = df['ratings_disabled'].astype('category')
    df['video_error_or_removed'] = df['video_error_or_removed'].astype('category')

    # publish_time
    df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')
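The piecewise year/day/month handling above can also be expressed as a single `pd.to_datetime` call with a format string. A minimal sketch, using hypothetical sample values in the data set's `yy.dd.mm` layout:

```python
import pandas as pd

# hypothetical trending_date values in the 'yy.dd.mm' layout used by these datasets
s = pd.Series(['17.14.11', '18.01.03'])
parsed = pd.to_datetime(s, format='%y.%d.%m')  # %y maps two-digit years to 2000s here
print(parsed.dt.strftime('%Y-%m-%d').tolist())  # ['2017-11-14', '2018-03-01']
```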
Separating ‘publish_time’ into ‘publish_date’ and ‘publish_time’
for df in all_dataframes:
    df.insert(4, 'publish_date', df['publish_time'].dt.date)  # loc, column name, values for column to be inserted
    df['publish_time'] = df['publish_time'].dt.time

# Changing data type for 'publish_date' from object to 'datetime64[ns]'
for df in all_dataframes:
    df['publish_date'] = pd.to_datetime(df['publish_date'], format="%Y-%m-%d")
Now a quick look at the updated data types!
# We can use any index from 0 to 9 inclusive (one for each of the 10 dataframes)
all_dataframes[1].dtypes
For the index, we chose video_id
for df in all_dataframes:
    df.set_index('video_id', inplace=True)
Examining Missing Values
Missing values must be handled because they reduce the statistical power of the data set and can introduce bias, leading to invalid conclusions. Missing data can be handled either by removing the affected rows, which is often done for non-numeric data, or by imputing the missing values with, for example, the mean or median of the column.
We checked for missing values using a heat-map, where any missing value in a column would appear as an orange square against the black background of the heat-map. As you can see from one of the screenshots from the notebook below, no data set had any missing values, so no handling was necessary.
for df in all_dataframes:
    sns.heatmap(df.isnull(), cbar=False)
    plt.figure()
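Besides the heat-map, the same check can be done numerically with `isnull().sum()`. A small sketch on a hypothetical frame with one missing description:

```python
import pandas as pd
import numpy as np

# hypothetical frame with one missing description value
df = pd.DataFrame({'title': ['a', 'b'], 'description': ['x', np.nan]})
missing = df.isnull().sum()  # count of missing values per column
print(missing.to_dict())  # {'title': 0, 'description': 1}
```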
Combining Every Dataframe Into One Huge Dataframe
We combined all the cleaned data sets into one massive data set in order to perform EDA, as all necessary operations can be performed on this unified, clean data set without any issues.
combined_df = pd.concat(all_dataframes)
Next, we decided to further clean and refine the data by sorting the entries of the data set by trending_date, so that the latest trending videos move to the top of the data set. This lets us view the current trends of the trending videos in each country, which are the most relevant to our project.
Before we did so, however, we created a duplicate copy of our data frame. We did this as a safety precaution, to keep a copy of the original data frame at hand, since we also decided to remove duplicate video entries while sorting the videos in the other data frame.
# Making a copy of the original dataframe
backup_df = combined_df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Sorting according to latest trending date while removing duplicates
combined_df = combined_df.reset_index().sort_values('trending_date', ascending=False).drop_duplicates('video_id', keep='first').set_index('video_id')

# Doing the same operation for each of the individual dataframes in the list we created earlier
# (rebuilding the list, since rebinding the loop variable would not update the stored frames)
all_dataframes = [df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')
                  for df in all_dataframes]

# Printing results
combined_df[['publish_date', 'publish_time', 'trending_date', 'country']].head()
# The latest publications and trending information are now at the top
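The sort-then-drop_duplicates pattern keeps only the latest trending row per video. A minimal sketch, with a hypothetical video that trended on two different dates:

```python
import pandas as pd

# hypothetical: one video trending on two different dates
df = pd.DataFrame({
    'video_id': ['abc', 'abc'],
    'trending_date': pd.to_datetime(['2018-01-01', '2018-01-05']),
    'views': [100, 250],
})
latest = (df.sort_values('trending_date', ascending=False)
            .drop_duplicates('video_id', keep='first')
            .set_index('video_id'))
print(latest.loc['abc', 'views'])  # 250: only the latest row survives
```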
Inserting Category Column
One of our final steps in cleaning the data sets was checking the JSON files that came with them. We needed to see whether these files contained any useful data. As there were multiple files, we decided to read two files at random, to check whether they all contained the same data or different data.
# read file
with open('US_category_id.json', 'r') as f:  # reading one randomly selected json file to make sense of its contents
    data = f.read()

# parse file
obj = json.loads(data)

# printing
obj
The other randomly selected JSON file had similar data. Each JSON file contains ids ranging from 1 to 44 (both inclusive), and with each id comes its category along with other information such as title, kind, etc. Hence, we can use any one of the JSON files to map category to category_id in our data frame.
category_id = {}

with open('DE_category_id.json', 'r') as f:
    d = json.load(f)
    for category in d['items']:
        category_id[category['id']] = category['snippet']['title']

combined_df.insert(2, 'category', combined_df['category_id'].map(category_id))
backup_df.insert(2, 'category', backup_df['category_id'].map(category_id))
for df in all_dataframes:
    df.insert(2, 'category', df['category_id'].map(category_id))

# Printing the cleaned combined dataframe
combined_df.head(3)
combined_df.head(3)
combined_df['category'].unique()
Thus, we cleaned up and refined our data sets into a finalized data frame, ready to be used for the upcoming EDA section of the project. We pickled both the finalized data frame and a copy of the original cleaned data frame into files, ready for use.
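Pickling a data frame is one line each way with pandas. A sketch using a temporary directory (the filename is an assumption):

```python
import os
import tempfile

import pandas as pd

# hypothetical stand-in for the finalized dataframe
df = pd.DataFrame({'views': [1, 2, 3]})

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'combined_df.pickle')  # hypothetical filename
    df.to_pickle(path)               # serialize the frame to disk
    restored = pd.read_pickle(path)  # load it back

print(restored.equals(df))  # True
```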
Exploratory Data Analysis
Ratio of likes-dislikes in different categories
We began by looking at the ratio of likes and dislikes on a trending video, depending on the category. The higher the ratio, the more the content of the video was liked by the users. As you can see from the resulting output below:
# calculating total likes for each category
likesdf = combined_df.groupby('category')['likes'].agg('sum')

# calculating total dislikes for each category
dislikesdf = combined_df.groupby('category')['dislikes'].agg('sum')

# calculating ratio of likes to dislikes
ratiodf = likesdf / dislikesdf

# most liked category to appear on top
ratiodf = ratiodf.sort_values(ascending=False).reset_index()
ratiodf.columns = ['category', 'ratio']

# plotting bar chart
plt.subplots(figsize=(10, 15))
sns.barplot(x="ratio", y="category", data=ratiodf,
            label="Likes-Dislikes Ratio", color="b")
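On a toy frame with hypothetical categories and counts, the ratio computation works like this:

```python
import pandas as pd

# hypothetical per-video rows across two categories
df = pd.DataFrame({
    'category': ['Music', 'Music', 'News'],
    'likes': [80, 120, 30],
    'dislikes': [10, 10, 30],
})
likes = df.groupby('category')['likes'].agg('sum')
dislikes = df.groupby('category')['dislikes'].agg('sum')
ratio = (likes / dislikes).sort_values(ascending=False)
print(ratio.to_dict())  # {'Music': 10.0, 'News': 1.0}
```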
Observations:
We see that videos in the pets and animals category have the highest ratio of likes to dislikes among the trending categories, whereas news and politics videos have the lowest. From this we can infer that people are less divided over the content of entertainment videos than over topics such as news, whose content can divide opinion among users.
Users like videos from which category more?
Next, we moved on to observe which categories were more preferred by users, based on the likes on the videos in each category. Three of the outputs were randomly selected from the ten plotted graphs; the results are as follows:
# Getting names of all countries
countries = []
allcsv = [i for i in glob.glob('*.{}'.format('csv'))]
for csv in allcsv:
    c = csv[0:2]
    countries.append(c)

# Plotting category counts per country
# (the original if/else branches for 'US' were identical, so one branch suffices)
for country in countries:
    tempdf = combined_df[combined_df['country'] == country]['category'].value_counts().reset_index()
    ax = sns.barplot(y=tempdf['index'], x=tempdf['category'], data=tempdf, orient='h')
    plt.xlabel("Number of Videos")
    plt.ylabel("Categories")
    plt.title("Categories of trending videos in " + country)
    plt.figure()
Observations
Apart from RU and GB, the category most liked by users in each of the other countries is ‘Entertainment’.
Viewers from RU prefer the category ‘People and Blogs’ the most.
Viewers from GB prefer the category ‘Music’ the most.
The categories ‘Shows’, ‘Movies’, and ‘Trailers’ were the least liked in almost all of the countries.
Top 5 videos that are on trending in each country?
Now that we know the most preferred category of users in each country, we decided to see what the top 5 videos on the trending page of each country were. As before, we selected 3 random results out of the total of 10 to display in this article:
temporary = []
for df in all_dataframes:
    temp = df
    temp = temp.reset_index().sort_values(by=['views'], ascending=False)
    temp.drop_duplicates(subset="video_id", keep='first', inplace=True)
    temp.set_index('video_id', inplace=True)
    temp = temp.head(5)  # top 5 that are on trending
    temporary.append(temp)

# Printing 3 randomly selected countries
temporary[2][['title', 'channel_title', 'category', 'views', 'likes']]
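The sort_values-then-head(5) idiom can equivalently be written with `nlargest`. A sketch on hypothetical view counts:

```python
import pandas as pd

# hypothetical view counts per video
df = pd.DataFrame({
    'video_id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'views': [50, 300, 120, 900, 10, 400],
}).set_index('video_id')

top5 = df.nlargest(5, 'views')  # same result as sort_values(ascending=False).head(5)
print(top5.index.tolist())  # ['d', 'f', 'b', 'c', 'a']
```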
Observations
Users from every country mostly prefer videos belonging to the categories of ‘Music’ and ‘Entertainment’, which potentially means users mainly use the platform for recreational purposes in comparison to other uses.
Is the most liked video also the most trending video?
Knowing the preference by country for the trending videos, a question arose: Is the most trending video in that country also the video with the most likes?
temporary = []  # to store results for each country
for df in all_dataframes:
    temp = df
    temp = temp.reset_index().sort_values(by=['likes'], ascending=False)
    temp.drop_duplicates(subset="video_id", keep='first', inplace=True)
    temp.set_index('video_id', inplace=True)
    temp = temp.head(5)  # top 5 that are most liked
    temporary.append(temp)

# Printing 3 randomly selected results
temporary[0][['views', 'likes']]
Observation
Although it may not be immediately visible, we concluded that the most liked video is not necessarily the most trending one. Digging deeper into the data set and supporting information, we came to the conclusion that the most trending video is often the most viewed video (in combination with several other factors).
Maximum number of days to trending status for a video?
We know that all the videos available in the data sets are part of the trending page but how long did it take for a video, on average, to reach the trending page starting from the day the video was published?
# Calculating days between publish date and trending date
temporary = []
for data in all_dataframes:
    temp = data
    temp['timespan'] = (temp['trending_date'] - temp['publish_date']).dt.days
    temporary.append(temp)

# Plotting
to_trending = temporary[0].sample(1000).groupby('video_id').timespan.max()  # CA
sns_ax = sns.boxplot(y=to_trending)
_ = sns_ax.set(yscale="log")
plt.show()
_ = sns.distplot(to_trending.value_counts(), bins='rice', kde=False)
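The `timespan` column is simply the day difference between the two datetime columns. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# hypothetical publish/trending date pairs
df = pd.DataFrame({
    'publish_date': pd.to_datetime(['2018-01-01', '2018-01-10']),
    'trending_date': pd.to_datetime(['2018-01-03', '2018-01-15']),
})
df['timespan'] = (df['trending_date'] - df['publish_date']).dt.days
print(df['timespan'].tolist())  # [2, 5]
```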
Observation
As we can see from both plots above, most videos take less than 100 days to reach the trending page. This can be attributed to the viral nature of content on the internet: if some online content is popular, it will often be shared and viewed within its short time span of relevancy.
Users like videos from which category more?
We also decided to check which category overall had the most likes from all the categories.
temp = combined_df
temp = temp.groupby('category')[['views', 'likes']].apply(lambda x: x.astype(int).sum())  # list indexing; tuple indexing is deprecated
temp = temp.sort_values(by='likes', ascending=False).head()
temp
Observation
As we can see, the most liked category is ‘Entertainment’ across all countries. This shows users’ preference for using YouTube as an entertainment platform.
Users comment on which category the most?
Similarly, we decided to explore which category gets the most comments overall.
temp = combined_df
temp = temp.groupby('category')[['views', 'likes', 'comment_count']].apply(lambda x: x.astype(int).sum())
temp = temp.sort_values(by='comment_count', ascending=False).head()
temp
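The groupby aggregation above, on a toy frame with hypothetical numbers:

```python
import pandas as pd

# hypothetical rows across two categories
df = pd.DataFrame({
    'category': ['Music', 'Music', 'Comedy'],
    'likes': [10, 20, 5],
    'comment_count': [1, 2, 3],
})
totals = df.groupby('category')[['likes', 'comment_count']].sum()
print(totals.loc['Music', 'likes'])  # 30
```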
Frequently occurring words in tags and description
Next, we moved on to the tags attached to videos and to video descriptions. Tags are often attached to videos to make them easier to search for and to place them in their relevant categories. Descriptions are written by the uploaders, often either giving a brief description of the video or adding relevant links. We decided to check whether there were any common words in the tags and descriptions. For this purpose, we used a word cloud, where the most frequently occurring words appear the largest.
# For tags
plt.figure(figsize=(15, 15))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
    background_color='black',
    stopwords=stopwords,
    max_words=1000,
    max_font_size=120,
    random_state=42
).generate(' '.join(combined_df['tags']))  # join all tags into one string (str(Series) would truncate the text)
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Tags', fontsize=20)
plt.axis('off')
plt.show()
# For description
plt.figure(figsize=(15, 15))
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
    background_color='black',
    stopwords=stopwords,
    max_words=1000,
    max_font_size=120,
    random_state=42
).generate(' '.join(combined_df['description']))  # same joining approach as for tags
plt.imshow(wordcloud)
plt.title('WORD CLOUD for Description', fontsize=20)
plt.axis('off')
plt.show()
Correlation between views, likes, dislikes, and comments
Next, we decided to see whether there was any correlation between views, likes, dislikes, and comments.
col = ['views', 'likes', 'dislikes', 'comment_count']
corr = combined_df[col].corr()
corr
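The observation below mentions a red diagonal, which suggests the matrix was rendered as a heatmap. A sketch of how such a plot could be produced on synthetic engagement data (the column relationships, noise levels, and colormap are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# synthetic engagement data; the proportional relationships are assumptions
rng = np.random.default_rng(42)
views = rng.integers(1_000, 1_000_000, size=200)
df = pd.DataFrame({
    'views': views,
    'likes': (views * 0.04 + rng.normal(0, 500, size=200)).astype(int),
    'dislikes': (views * 0.002 + rng.normal(0, 50, size=200)).astype(int),
    'comment_count': (views * 0.005 + rng.normal(0, 100, size=200)).astype(int),
})
corr = df[['views', 'likes', 'dislikes', 'comment_count']].corr()

# a diverging colormap makes the self-correlation diagonal (always 1) show up in red
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig('corr_heatmap.png')
```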
Observation
When evaluating the correlation between all the variables, the correlation of each feature with itself is included, which is always 1; this is why the plot always has a red diagonal from the upper left to the lower right.
We can see from the results that there is a positive relationship between views and likes, likes and comment_count, and dislikes and comment_count. We came to this conclusion because whenever the correlation coefficient is greater than zero, the relationship is positive.