YouTube Trending Videos Analysis
Okay! Let’s do this…
So, I prepared this story to familiarize myself a bit more with text mining and visualizations.
The dataset is available at the following link:
For this specific analysis I chose to work with Canadian data.
Go Raptors!! 🏀🏀
Let’s Start!
So the first thing was to pull out the CSV and JSON files for the Canadian data.
The CSV file contains the following 16 columns:

As we will see later on, the JSON file will be used to get the corresponding category name for each “category_id”.
After we load the data, the first thing to notice is the data type of each column. Both trending_date and publish_time should be converted to a date/time type.
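Here is a minimal sketch of that conversion on a couple of made-up rows. It assumes trending_date is stored as year.day.month with a two-digit year and publish_time as an ISO 8601 UTC timestamp; check your own file before reusing the format strings.

```r
# Toy rows in the layout assumed for the CSV (not the real data).
videos <- data.frame(
  trending_date = c("17.14.11", "18.14.06"),
  publish_time  = c("2017-11-10T17:00:03.000Z", "2018-06-13T09:30:00.000Z"),
  stringsAsFactors = FALSE
)

# Assumption: trending_date is yy.dd.mm.
videos$trending_date <- as.Date(videos$trending_date, format = "%y.%d.%m")

# Assumption: publish_time is ISO 8601 in UTC. Parsing stops after the
# seconds, so the trailing ".000Z" is ignored.
videos$publish_time <- as.POSIXct(videos$publish_time,
                                  format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

str(videos)
```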


Let’s visualize the summary of the data.

The data appears to be pretty clean, with no NaNs or odd negative values.🙏🏻
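A check along these lines could back up that impression. It is only a sketch on a toy data frame; the column names are assumptions.

```r
# Toy numeric columns standing in for the real data.
videos <- data.frame(
  views    = c(1200, 56000, 310),
  likes    = c(40, 2100, 5),
  dislikes = c(2, 90, 0)
)

summary(videos)                             # min, quartiles, mean, max per column

na_per_column       <- colSums(is.na(videos))  # missing values per column
negative_per_column <- colSums(videos < 0)     # negative values per column

na_per_column
negative_per_column
```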
Let’s clean up a bit, removing columns that we are not going to use.
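With dplyr, dropping columns can be sketched like this. The columns removed here (thumbnail_link and description) are just an example; they are not necessarily the ones dropped in the original analysis.

```r
library(dplyr)

# Toy data frame with a few assumed column names.
videos <- data.frame(
  video_id       = c("a1", "b2"),
  category_id    = c(10L, 24L),
  views          = c(1200, 56000),
  thumbnail_link = c("https://example.com/a.jpg", "https://example.com/b.jpg"),
  description    = c("first", "second"),
  stringsAsFactors = FALSE
)

# A minus sign inside select() drops the named columns and keeps the rest.
videos_clean <- videos %>%
  select(-thumbnail_link, -description)

names(videos_clean)
```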

Okay. So now that our data is a bit more presentable, let’s begin some EDA to find interesting stories within the data.
The first thing I’m interested in is how the number of trending videos, their views, and their likes change across categories and over time.
To do this, we first aggregate the data using the dplyr library in R.
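The aggregation could look roughly like the sketch below. The toy rows, the month column, and the summary names (n_videos, avg_views, avg_likes) are my assumptions for illustration.

```r
library(dplyr)

# Toy rows; trending_date is assumed to be a Date already.
videos <- data.frame(
  trending_date = as.Date(c("2017-11-14", "2017-11-20", "2017-12-01", "2017-12-02")),
  category_id   = c(10L, 10L, 24L, 24L),
  views         = c(1000, 3000, 500, 1500),
  likes         = c(100, 200, 10, 30)
)

trending_video <- videos %>%
  mutate(month = format(trending_date, "%Y-%m")) %>%  # e.g. "2017-11"
  group_by(category_id, month) %>%
  summarise(
    n_videos  = n(),          # trending entries in this category and month
    avg_views = mean(views),
    avg_likes = mean(likes),
    .groups   = "drop"
  )

trending_video
```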


We now save this new data to work on some visualizations later.

This is the trending_video data we have so far:

So… remember I talked about a JSON file earlier? Well, we are going to use it now to get the actual names for each category instead of working with the category_id as seen above.

If we look at the structure of the JSON file, we’ll see lists with embedded data frames, which can be a bit difficult to grasp in the beginning.

Well, the good news here is that we are only focusing on the ID and title, so we are going to be ok..! (for now 😬..jk)
Anyway, we use the flatten() function, which takes data frames nested inside a data frame and turns them into a single flat data frame. (bit confusing..)


Now we select the columns that we are interested in, which are id and snippet.title, and save this new data as a CSV file.
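The flattening and column selection can be sketched as follows. I am assuming the flatten() in question is the one from jsonlite, and the JSON string is a tiny stand-in for the real category file. The output file name is hypothetical.

```r
library(jsonlite)

# Tiny stand-in for the category file; the real one has more fields and items.
json_text <- '{
  "items": [
    {"id": "10", "snippet": {"title": "Music", "assignable": true}},
    {"id": "24", "snippet": {"title": "Entertainment", "assignable": true}}
  ]
}'

# "items" is parsed into a data frame whose "snippet" column is itself a data frame.
items <- fromJSON(json_text)$items

# flatten() lifts the nested columns to the top level as snippet.title etc.
categories_flat <- jsonlite::flatten(items)

categories <- categories_flat[, c("id", "snippet.title")]

# Hypothetical output path, written to a temp folder so nothing is overwritten.
write.csv(categories, file.path(tempdir(), "categories.csv"), row.names = FALSE)

categories
```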

Ok. So now we have two CSV files and we want to perform a left join where category_id = id. This operation can be done with many different tools, such as SQL, Hive, R, Python, Tableau, etc. In this case I used Tableau, simply because I will also use it to plot some visualizations faster.
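For anyone staying in R, the same left join can be sketched with dplyr. The data frames below are toy stand-ins. Note the type conversion: ids read from JSON are typically strings, while category_id in the CSV is typically an integer.

```r
library(dplyr)

# Toy stand-ins for the two CSV files.
trending_video <- data.frame(
  category_id = c(10L, 24L, 99L),
  n_videos    = c(2L, 5L, 1L)
)
categories <- data.frame(
  id       = c("10", "24"),
  category = c("Music", "Entertainment"),
  stringsAsFactors = FALSE
)

joined <- trending_video %>%
  left_join(
    categories %>% mutate(id = as.integer(id)),  # align key types before joining
    by = c("category_id" = "id")
  )

joined  # category_id 99 has no match, so its category is NA
```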

Data Visualizations on Trending Videos by Category
Let’s begin by visualizing the number of trending videos per category, per month.


Keep in mind that the first and last months are only partially covered, since the first date in the data is 2017-11-14 and the last is 2018-06-14. Taking that into account, the average number of trending videos per month is around 5.8k.
It’s also very evident that the Entertainment category has the largest number of trending videos. (I know right.. shocking!!)
Now let’s visualize the average monthly views per category



So, although Entertainment has the biggest number of trending videos per month, the Music category holds the biggest average of views per month.
Now let’s focus a bit more on likes. As we can see, the distribution of likes across categories follows a pattern similar to the distribution of views above.

The bar plot below shows the average number of views per like for each of the categories.

Comedy has the lowest rate of views per like (20.7), while Shows need an average of 162.4 views for each like given.
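For reference, one way to compute a views-per-like rate in R is sketched below. It divides total views by total likes within each category, which is only one of several reasonable definitions and not necessarily the one behind the plot. The numbers are made up.

```r
library(dplyr)

# Made-up rows, not the real data.
videos <- data.frame(
  category = c("Comedy", "Comedy", "Shows", "Shows"),
  views    = c(1000, 3000, 5000, 3000),
  likes    = c(50, 150, 40, 10),
  stringsAsFactors = FALSE
)

views_per_like <- videos %>%
  group_by(category) %>%
  summarise(views_per_like = sum(views) / sum(likes), .groups = "drop")

views_per_like
```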
A Bit of Text Mining
Well, our data is not always structured (such as tables in a CSV file), and sometimes we are more interested in data such as text. Text mining focuses on techniques to obtain information from such data.
In this story I will not focus on semantic techniques, mostly because I think there are already other metrics, such as likes, dislikes and views, that can represent most of the ideas I wanted to show here. Instead, I will focus on simple word counts to understand what the most common words describing the trending videos were.
Okay, so let’s start by loading the libraries we will use.
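A minimal sketch of this step, assuming dplyr and tidytext and a tags column where tags are separated by "|". The toy rows and the output column name word are my assumptions.

```r
library(dplyr)
library(tidytext)

# Toy rows; tags are assumed to be pipe-separated, some of them quoted.
videos <- data.frame(
  category = c("Music", "Comedy"),
  tags     = c('pop|"new music"|Beyonc\u00e9', 'funny|"stand up"'),
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the tags column into one lowercase token per row
# and drops punctuation such as the pipes and quotes.
tag_words <- videos %>%
  unnest_tokens(word, tags)

tag_words
```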

So as you all can see, I’m focusing on the Tags feature. I’m using the tidytext library to do the text mining. It is very simple, and if you’re interested you can look up the following link:
Following the command above we have a table like this:

As you all can see, some of the words contain accented characters that are not used in English. Since we will be focusing on English tags, we will remove those words.

The first block of code introduces a new column called “special_char”, where we use a regex to indicate whether an accented character is present.
In the second block we filter out the words we want and remove the column “special_char”.
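The two blocks could be sketched like this. The regex here flags any non-ASCII character, which is a broader test than "accented" and only an assumption about the pattern that was actually used.

```r
library(dplyr)

# Toy tokens; \u00e9 is an accented e.
tag_words <- data.frame(
  word = c("music", "beyonc\u00e9", "funny", "\u00e9t\u00e9"),
  stringsAsFactors = FALSE
)

# Block 1: flag tokens containing any character outside the ASCII range.
flagged <- tag_words %>%
  mutate(special_char = grepl("[^\\x01-\\x7F]", word, perl = TRUE))

# Block 2: keep the unflagged tokens and drop the helper column.
english_only <- flagged %>%
  filter(!special_char) %>%
  select(-special_char)

english_only
```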
Just a quick note: Although the column is named “Words” we are actually talking about tokens. Tokens are meaningful units of text, so they can be a word, a number, a root of a word. It will all depend on your data and what you want to analyse.
OK! Now that we have removed accented characters, we also want to remove “stop_words”, i.e. words that are commonly used but do not necessarily carry any meaning. Some examples are: a, the, above, all, almost, else, etc, every, for, has, I.
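With tidytext this is usually an anti_join() against its built-in stop_words table. A small sketch on toy tokens:

```r
library(dplyr)
library(tidytext)

# Toy tokens; "the" and "for" are in tidytext's stop_words table.
tag_words <- data.frame(
  category = c("Music", "Music", "Comedy", "Comedy"),
  word     = c("the", "music", "for", "funny"),
  stringsAsFactors = FALSE
)

# anti_join() keeps only the rows whose word does NOT appear in stop_words.
tag_words_clean <- tag_words %>%
  anti_join(stop_words, by = "word")

tag_words_clean
```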

Finally we want to count the words that repeat for each category and then get the top 10 words.
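A sketch of the counting step with dplyr, on toy tokens. slice_max() needs dplyr 1.0 or later, and with_ties = FALSE is my choice to cap each category at exactly ten rows.

```r
library(dplyr)

# Toy tokens after cleaning.
tag_words <- data.frame(
  category = c("Music", "Music", "Music", "Music", "Music", "Music", "Comedy", "Comedy"),
  word     = c("pop", "pop", "pop", "live", "live", "rock", "funny", "funny"),
  stringsAsFactors = FALSE
)

top_words <- tag_words %>%
  count(category, word, sort = TRUE) %>%        # adds a count column named n
  group_by(category) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%   # at most 10 rows per category
  ungroup()

top_words
```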

All set, let’s now plot some visualizations. Again I’m using Tableau, but you can do the same using R, Python, or another program.


Now let’s visualize how the top tags changed each month. For that we are going to use a process similar to the one we used before.
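The monthly version only swaps the grouping variable. A sketch on placeholder tokens, assuming a month column has already been derived from trending_date:

```r
library(dplyr)

# Placeholder tokens with an assumed month column.
tag_words <- data.frame(
  month = c("2018-03", "2018-03", "2018-03", "2018-06", "2018-06"),
  word  = c("alpha", "alpha", "beta", "gamma", "gamma"),
  stringsAsFactors = FALSE
)

top_tag_by_month <- tag_words %>%
  count(month, word) %>%
  group_by(month) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # single most frequent tag per month
  ungroup()

top_tag_by_month
```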




Final Considerations
Well, the idea here was to get familiar with text mining and visualization techniques. From what we analyzed, some interesting information can be obtained:
- The median of the average views for trending videos is around 940K;
- Entertainment is the category with the largest number of trending videos, but it is fifth in the number of views;
- Comedy, Music and Gaming need fewer views to receive likes;
- Indian themes are very prominent in this data, with tags such as Punjabi and sridevi;
- In March 2018 the top tag was “sridevi”, due to the famous actress’s death at the end of February 2018 (Link- https://en.wikipedia.org/wiki/Sridevi)
- In June 2018 the top tag was “Roseanne”, most probably due to the cancellation of the show on May 29, 2018 (Link- https://en.m.wikipedia.org/wiki/Roseanne)
Well that’s it for now. There is a bunch of other stuff we can do with the data, but ..
Thanks again!
