YouTube Data Analysis Using Hadoop Tools
In today's world, as the 4 V's of Big Data (Volume, Variety, Velocity, and Veracity) grow rapidly, it has become essential to derive meaningful insights from data to help organizations grow their business.
In this post, we will perform a YouTube data analysis using Big Data tools and see how to handle semi-structured data and come up with valuable insights.
Initially, the data was scraped. You can download the dataset here: YouTube Dataset Download link. The data consists of multiple attributes, and the data definition is as follows:
- Column 1: video id
- Column 2: uploader
- Column 3: age
- Column 4: category
- Column 5: length
- Column 6: views
- Column 7: rate
- Column 8: rating
- Column 9: comments
- Columns 10–29: related video IDs
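To make the schema concrete, here is a minimal Python sketch that parses one tab-separated record according to the data definition above (the field names and the sample line are illustrative, not taken from the actual dataset):

```python
# Parse one tab-separated record of the YouTube dataset into a dict.
# Field names follow the data definition above; the sample line is invented.
FIELDS = ["video_id", "uploader", "age", "category", "length",
          "views", "rate", "rating", "comments"]

def parse_record(line):
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, parts[:9]))
    # Columns 10-29 hold IDs of related videos.
    record["related_ids"] = parts[9:]
    # Cast the numeric columns.
    for key in ("age", "length", "views", "rating", "comments"):
        record[key] = int(record[key])
    record["rate"] = float(record["rate"])
    return record

sample = "LKh7zAJ4nwo\tTheReceptionist\t653\tEntertainment\t424\t13021\t4.34\t315\t489\trjnbgpPJUks"
rec = parse_record(sample)
print(rec["category"], rec["views"])  # -> Entertainment 13021
```

This mirrors what PigStorage('\t') will do for us later: split each line on tabs and project the fields onto a schema.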
For this particular use case, the analysis was done on Cloudera CDH running in a virtual machine (VirtualBox); it can also be done on VMware. The data was imported from Windows into the Cloudera VM.
If you have Cloudera installed on VirtualBox, you can download an extension pack that lets you move data from Windows to your virtual machine using a USB drive.
The link is as follows:
Extension Pack for VirtualBox.
After the data is downloaded, we are ready to perform our analysis and come up with some insights about the data. We will use Apache Pig and Apache Hive for our analysis.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage. A command-line tool and JDBC driver are provided to connect users to Hive.
Once the data has been loaded onto your local machine (the virtual machine), open the terminal and import the .txt file from the local filesystem into HDFS. The commands are given below; make sure you provide the correct path to your file. In our case, the .txt file is on the Desktop (Cloudera).
Open the terminal and type:
hdfs dfs -ls
hdfs dfs -copyFromLocal youtube.txt
I. Start the Pig shell: open the terminal and type pig. The Pig shell starts, and you can begin writing your Pig script.
II. Load the .txt file into the Pig shell. The command is as follows:
youtube = load 'youtube.txt' using PigStorage('\t') as (videoid:chararray, uploader:chararray, age:int, category:chararray, length:int, views:int, rate:float, rating:int, comments:int, related_id:chararray);
III. Apply transformations to the data.
1. Find the top 10 highest-rated videos.
s1 = foreach youtube generate videoid, category, rating;
s2 = order s1 by rating desc;
s3 = limit s2 10;
dump s3;
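Note that the script above ranks videos globally, not within each category. To rank per category (what Pig would express with GROUP and a nested TOP), the logic looks like this plain-Python sketch, run on hypothetical (videoid, category, rating) tuples:

```python
import heapq
from collections import defaultdict

def top_n_per_category(rows, n=10):
    """rows: iterable of (videoid, category, rating) tuples.
    Returns {category: [(videoid, rating), ...]} with the n best per group."""
    by_category = defaultdict(list)
    for videoid, category, rating in rows:
        by_category[category].append((videoid, rating))
    # Within each category, keep only the n highest-rated videos.
    return {cat: heapq.nlargest(n, vids, key=lambda v: v[1])
            for cat, vids in by_category.items()}

# Toy data (made up) with two categories.
rows = [("a", "Music", 5), ("b", "Music", 3), ("c", "Comedy", 4),
        ("d", "Music", 4), ("e", "Comedy", 2)]
print(top_n_per_category(rows, n=2))
```

The same group-then-rank pattern applies to the "most viewed" query below, just keyed on views instead of rating.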
2. Find the top 10 most viewed videos.
s4 = foreach youtube generate videoid, category, views;
s5 = order s4 by views desc;
s6 = limit s5 10;
dump s6;
3. Find the sum of video ratings, views, and comments in each category.
s7 = group youtube by category;
s8 = foreach s7 generate group, SUM(youtube.rating), SUM(youtube.views), SUM(youtube.comments);
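The GROUP/SUM step can be sanity-checked with an equivalent plain-Python sketch (the rows below are toy values, made up for illustration):

```python
from collections import defaultdict

def sum_per_category(rows):
    """rows: iterable of (category, rating, views, comments) tuples.
    Returns {category: (sum_rating, sum_views, sum_comments)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for category, rating, views, comments in rows:
        t = totals[category]
        t[0] += rating   # SUM(youtube.rating)
        t[1] += views    # SUM(youtube.views)
        t[2] += comments # SUM(youtube.comments)
    return {cat: tuple(t) for cat, t in totals.items()}

rows = [("Music", 10, 100, 5), ("Music", 20, 300, 7), ("Comedy", 1, 50, 2)]
print(sum_per_category(rows))  # -> {'Music': (30, 400, 12), 'Comedy': (1, 50, 2)}
```

Each output row corresponds to one line that s8 will emit: the group key followed by the three sums.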
IV. Store the output of the above scripts into a directory in HDFS:
STORE s3 into 'hdfs/localhost/Pig_Data/Top_Rated' USING PigStorage(',');
STORE s6 into 'hdfs/localhost/Pig_Data/Top_Views' USING PigStorage(',');
STORE s8 into 'hdfs/localhost/Pig_Data/Sum_rate_view_comment' USING PigStorage(',');
V. Use the get command to copy the Pig output from HDFS to the local directory and confirm it exists:
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Rated/part-r-00000
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Views/part-r-00000
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Sum_rate_view_comment/part-r-00000
VI. Import the Pig output files into Hive for analysis by creating tables in Hive:
Create table top_rated (videoid string, category string, rating int) row format delimited fields terminated by ',';
Create table top_views (videoid string, category string, views int) row format delimited fields terminated by ',';
Create table sum_rate_view_comment (category string, total_rating bigint, total_views bigint, total_comments bigint) row format delimited fields terminated by ',';
VII. Load data into the newly created tables from the HDFS directory where the output files were stored, giving the correct path:
Load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Rated/part-r-00000' into table top_rated;
Load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Top_Views/part-r-00000' into table top_views;
Load data inpath 'hdfs://quickstart.cloudera:8020/user/cloudera/hdfs/localhost/Pig_Data/Sum_rate_view_comment/part-r-00000' into table sum_rate_view_comment;
VIII. Now view the Pig output in Hive by running the following queries:
select * from top_rated;
select * from top_views;
select * from sum_rate_view_comment;
Execute the above queries one at a time. The output for the third query is as follows:
Once the data is in the Hive database, you can go ahead and perform visualizations on the above tables in data visualization tools such as Power BI and Tableau. We visualized the above tables in Power BI. To connect the Hive database to Power BI, please refer to the following link:
The above blog has the details to connect your Hive Database to Power BI.
Here are a few visualizations I created in Power BI:
You can also read my other articles here:
Rahul Pathak - Medium
Thank you very much! If you have any feedback, please do let me know! :)