YouTube Data Analysis Using Hadoop Tools

Data Analysis using Apache Hive & Apache Pig

Rahul Pathak
Aug 2, 2020 · 5 min read
Photo by William Iven on Unsplash

In today's world, the four V's of Big Data (Volume, Variety, Velocity and Veracity) are growing rapidly, so organizations need to derive meaningful insights from their data in order to grow their business.

In this post, we will perform a YouTube data analysis using Big Data tools, see how to handle semi-structured data, and come up with valuable insights.

The data was originally scraped. You can download the dataset from the YouTube Dataset download link. The data consists of multiple attributes, and the data definition is as follows:

  • Column 1: video id

For this use case, the analysis was done on Cloudera CDH running in a virtual machine (VirtualBox). It can also be done on VMware. The data was imported from Windows into Cloudera.

If you have Cloudera installed and are using VirtualBox, you can download an extension pack that lets you move data from Windows to your virtual machine using a USB drive.

The link is as follows:
Extension Pack for Virtualbox.

After the data is downloaded, we are ready to perform our analysis and come up with some insights about the data. We will use Apache Pig and Apache Hive for the analysis.

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Apache Hive

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage. A command-line tool and JDBC driver are provided to connect users to Hive.

Once the data has been loaded onto your local machine (the virtual machine), open a terminal and import the .txt file from the local file system into HDFS. The commands are given below. Make sure that you provide the correct path to your file; in our case, the .txt file is on the Desktop (Cloudera).

Open a terminal and type:
hdfs dfs -ls
hdfs dfs -copyFromLocal 'youtube.txt'

I. Start the Pig shell: open a terminal and type pig. The Pig shell (Grunt) starts, and you can begin writing your Pig script.

II. Load the .txt file into the Pig shell. The command is as follows:

youtube = load 'youtube.txt' using PigStorage('\t') as (videoid:chararray,uploader:chararray,age:int,category:chararray,length:int,views:int,rate:int,rating:int,comments:int,related_id:chararray);
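To make the schema in the load statement concrete, here is a minimal Python sketch of what reading one tab-separated row into typed fields looks like. The field names come from the Pig schema above; the sample row, the helper names, and the choice to map unparseable integers to None are assumptions made for illustration, and Pig's own casting rules may differ in detail.

```python
# Field names taken from the Pig schema; everything else here is illustrative.
FIELDS = ["videoid", "uploader", "age", "category", "length",
          "views", "rate", "rating", "comments", "related_id"]
INT_FIELDS = {"age", "length", "views", "rate", "rating", "comments"}


def to_int(text):
    """Return int(text), or None when the value is missing or not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_line(line):
    """Split one tab-separated line into a dict keyed by the schema's field names."""
    values = line.rstrip("\n").split("\t")
    # Pad short rows with None and ignore any columns beyond the tenth.
    values = (values + [None] * len(FIELDS))[:len(FIELDS)]
    return {name: to_int(value) if name in INT_FIELDS else value
            for name, value in zip(FIELDS, values)}


# Hypothetical sample row, not taken from the real dataset.
row = parse_line("abc123\tuser1\t700\tMusic\t215\t1000\t4\t50\t12\txyz789\n")
print(row["category"], row["views"])
```

Rows with fewer than ten columns simply end up with None in the missing positions, which is worth keeping in mind when the later steps sort or sum these fields.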

III. Apply transformations to the data.

  1. Find the top 10 rated videos, together with their category.
s1 = foreach youtube generate videoid,category,rating;
-- sort by rating, highest first, then keep the first 10 rows
s2 = order s1 by rating desc;
s3 = limit s2 10;
dump s3;
fig(1)
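The order-then-limit pattern used for the top rated videos can be sketched in plain Python as follows. This is only an illustration of the logic, assuming rows shaped like (videoid, category, rating); the function name and the sample rows are hypothetical.

```python
def top_n(rows, n=10):
    """Return the n rows with the highest rating, highest first."""
    # Equivalent of: order ... by rating desc
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    # Equivalent of: limit ... n
    return ordered[:n]


# Hypothetical (videoid, category, rating) rows.
sample = [("v1", "Music", 120), ("v2", "Comedy", 300), ("v3", "Music", 45)]
for row in top_n(sample, n=2):
    print(row)
```

Because all rows are sorted together, the result is a single overall ranking in which each row carries its category.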

2. Find the top 10 most viewed videos, together with their category.

s4 = foreach youtube generate videoid,category,views;
-- sort by views, highest first, then keep the first 10 rows
s5 = order s4 by views desc;
s6 = limit s5 10;
dump s6;
fig(2)
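The script above sorts all videos together, so it yields one overall top 10 rather than a separate top 10 for each category. If a per-category ranking is what you need, the idea is to group by category first and then take the top rows inside each group (in Pig this is typically written as a nested FOREACH over a GROUP). The Python sketch below illustrates that logic only; the function name and sample rows are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_per_category(rows, n=10):
    """Group (videoid, category, views) rows by category and keep the n most viewed in each."""
    by_category = defaultdict(list)
    for videoid, category, views in rows:
        by_category[category].append((videoid, category, views))
    # heapq.nlargest returns the n largest items, sorted from largest to smallest.
    return {category: heapq.nlargest(n, items, key=lambda row: row[2])
            for category, items in by_category.items()}


# Hypothetical (videoid, category, views) rows.
sample = [("v1", "Music", 500), ("v2", "Comedy", 900),
          ("v3", "Music", 700), ("v4", "Comedy", 100)]
for category, rows in top_n_per_category(sample, n=1).items():
    print(category, rows)
```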

3. Find the sum of video ratings, views, and comments in each category.

s7=group youtube by category;
s8 = foreach s7 generate group, SUM(youtube.rating), SUM(youtube.views), SUM(youtube.comments);
dump s8;
fig(3)
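The group-and-sum step can be pictured with a small Python sketch: one running total per category for ratings, views, and comments. The function name and sample rows are hypothetical, and missing values (None) are skipped here on the assumption that they should not contribute to a total.

```python
from collections import defaultdict


def sum_by_category(rows):
    """Sum rating, views and comments per category for (category, rating, views, comments) rows."""
    totals = defaultdict(lambda: [0, 0, 0])
    for category, rating, views, comments in rows:
        running = totals[category]
        for position, value in enumerate((rating, views, comments)):
            if value is not None:  # skip missing values instead of failing
                running[position] += value
    return {category: tuple(values) for category, values in totals.items()}


# Hypothetical (category, rating, views, comments) rows.
sample = [("Music", 10, 1000, 5), ("Comedy", 3, 200, 1), ("Music", 7, None, 2)]
print(sum_by_category(sample))
```

Each category produces exactly one output row of (category, total ratings, total views, total comments), so the result has as many rows as there are distinct categories.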

IV. Store the output of the above scripts in an HDFS directory:

STORE s3 INTO 'hdfs/localhost/Pig_Data/Top_Rated' USING PigStorage(',');
STORE s6 INTO 'hdfs/localhost/Pig_Data/Top_Views' USING PigStorage(',');
STORE s8 INTO 'hdfs/localhost/Pig_Data/Sum_rate_view_comment' USING PigStorage(',');

V. Use the get command to copy the Pig output from HDFS to the local file system and confirm that it was stored. All three outputs are named part-r-00000, so each command writes to a different local file name:

hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Rated/part-r-00000 top_rated.csv
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Views/part-r-00000 top_views.csv
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Sum_rate_view_comment/part-r-00000 sum_rate_view_comment.csv
fig(4)

VI. Import the Pig output files into Hive for analysis. First create the tables in Hive, with columns and a field delimiter that match the Pig output:

create table top_rated(videoid string, category string, rating int) row format delimited fields terminated by ',';
create table top_views(videoid string, category string, views int) row format delimited fields terminated by ',';
create table sum_rate_view_comment(category string, rating bigint, views bigint, comments bigint) row format delimited fields terminated by ',';
fig(5)
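The field delimiter declared for a Hive table needs to match the one used in the Pig STORE statement, which is a comma here. The short Python sketch below shows why, using a hypothetical output line: splitting on the right delimiter yields one value per column, while splitting on the wrong one leaves the whole line in the first column. To my understanding Hive would then read the remaining columns as NULL, but treat that as an assumption to verify on your own setup.

```python
def split_row(line, delimiter):
    """Split one stored output line into column values using the given delimiter."""
    return line.rstrip("\n").split(delimiter)


# Hypothetical line as written by PigStorage(',') for (videoid, category, rating).
line = "abc123,Music,5\n"

print(split_row(line, ","))  # one value per column
print(split_row(line, ":"))  # the whole line ends up in a single field
```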

VII. Load the data into these tables from the HDFS directory where the Pig output files were stored, giving the correct path for each:

load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Rated/part-r-00000' into table top_rated;
load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Views/part-r-00000' into table top_views;
load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Sum_rate_view_comment/part-r-00000' into table sum_rate_view_comment;
fig(6)

VIII. Now view the Pig output in Hive by running the following queries:

select * from top_rated;
select * from top_views;
select * from sum_rate_view_comment;

Execute the above queries one at a time. The output of the third query is as follows:

fig(7)

Once the data is in the Hive database, you can go ahead and visualize the above tables in data visualization tools such as Power BI and Tableau. We have visualized them in Power BI. To connect the Hive database to Power BI, please refer to the following link:

Connecting Apache Hive to Microsoft Power BI

The above blog has the details to connect your Hive Database to Power BI.

Here are a few visualizations I created in Power BI:

Photo by Rahul Pathak on Medium

You can also read my other articles here:

Thank you very much! If you have any feedback, please do let me know! :)

The Startup

Medium's largest active publication, followed by +771K people. Follow to join our community.

Written by Rahul Pathak

Writer | writing about technology, self-improvement | Masters in Data Science | India | Live.Love.Laugh | contact_id: https://www.linkedin.com/in/rahulpathakmit/