Beyond MapReduce: Navigating Hadoop with Hive

Vaishnavi Nandakumar
4 min readDec 8, 2023

--

Introduction

In my previous tutorial series, I explored the creation of Hadoop MapReduce programs in Java. The comprehensive five-part series addressed diverse problem statements, spanning from a counter application to employing distributed cache for joins. Some topics were also conceptually explored and presented in a QnA format.

In this blog post, I’ll shift the focus to Hive — a data warehouse system designed for analyzing structured data atop Hadoop. We will revisit the same problem statements, but this time, we’ll tackle them using Hive queries.

Data

The ml-100k dataset consists of 100,000 movie ratings (1–5) from 943 users on 1682 movies. You can download the dataset from here.

Movie Ratings data in u.data

User data in u.user

Movie Data in u.item

Inserting Data

Before we start implementing our queries, let us create table users to accommodate our data.

hive> create table users(id int, age int, gender string, occupation string, zipcode int) row format delimited fields terminated by '|';

Given that my local copy is stored at /Users/vaishnavink/Downloads/ml-100k/u.user , we can load data using the following query.

hive> load data local inpath '/Users/vaishnavink/Downloads/ml-100k/u.data' overwrite into table ratings;

If your data is stored in HDFS, you can load the file via

load data inpath '/user/vaishnavink/u.user' overwrite into table users;

To view your table -

Executing Queries

I have linked the MapReduce program we had written for Hadoop for each of the problem statements here.

Analyze rating popularity in movie reviews

Given that you already have a ratings table defined.

hive> select rating, count(*) from ratings group by rating order by rating;

Output

Generate a sorted list of the most popular movies

hive> select movieid, count(*) as count from ratings group by movieid order by count desc limit 10;

Output

Generate a sorted list of the most highly rated movies

hive> hive> select movieid, avg(rating) as avgRating from ratings group by movieid order by avgRating desc limit 50;

Output

Find the age distribution of the movie reviewers

hive> select age, count(*) from ratings r join users u on id = userid group by age;

Output

Let’s say you want to go one step further and create customized buckets out of the age groups.

hive> select age_bucket, count(*) from 
> (select case
> when (age <= 10) then "0-10"
> when (age > 10 and age <= 20) then "10-20"
> when (age > 20 and age <= 30) then "20-30"
> when (age > 30 and age <= 40) then "30-40"
> when (age > 40 and age <= 50) then "40-50"
> when (age >= 50) then "50+"
> end as age_bucket from users join ratings on id=userid ) as age_bucket_value group by age_bucket;

Produce a list of the top movies for every age


SELECT t1.age, t1.movieid, t2.mc, m.movietitle FROM
( SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
)t1
JOIN
(
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t2
GROUP BY age
) t2
JOIN movies m
ON t1.age=t2.age and t1.mcount=t2.mc and m.movieid=t1.movieid;

Conclusion

Having covered various examples, we’ve successfully demonstrated the translation of our MapReduce programs into HiveQL queries. I intend to add to this post by incorporating additional statements and complex queries. Feel free to suggest optimizations or offer improvements if you believe any of these can be written more effectively.

--

--