It’s just a Piece of cake… Hive Partitioning & Bucketing

R RAMYA
7 min readMay 13, 2022

--

You all might be aware of sharing your food equally with your siblings… You do it by partitioning it right ???

But Why we wanna partition it ??? How it relates to Hive? Don’t worry we’ll learn about it. It’s just a piece of cake…

image

Partitions ? What’s that ?

Hive Partitions is a way to organizes tables into partitions by dividing tables into different parts based on partition keys.

→ Partition is helpful when the table has one or more Partition keys. Partition keys are basic elements for determining how the data is stored in the table.

Why we wanna do partitioning of data ?

In partitioning the data is stored in slices, the query response time becomes faster.

→ As we know that Hadoop is used to handle the huge amount of data, it is always required to use the best approach to deal with it. The partitioning in Hive is the best example of it.

Before getting into Partitioning types, It’s good to know the concept of HIVE. Do check my blog on it “HIVE

➥ Partitioning in Hive

→ Hive is a data warehouse system which is used to analyze structured data. It is built on the top of Hadoop.

image

▶ Types of Partitioning

Static Partitioning

➼ In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading the data into the table. Hence, the data file doesn’t contain the partitioned columns.

➼ In Static Partition there is a two types.

1.Using existing column — getting in with columns that’s already present in the table.

2.Using new column — In this type of partition helps in add new use new partition column.

Existing Column:

(i) Creation of Table:

➥ Creating partition using new column instead of existing table column

Syntax:

create table pp_patient1(pid int,pname string,gender string,amt int)
partitioned by (drug string);

(ii) Describe the table:

➥Let’s retrieve the information associated with the table

desc pp_patient1;

(iii)Insert the data :

➥Inserting the data and to select drug partition that contains only ‘Para’

insert overwrite table pp_patient1 partition(drug=’Para’)
select pid,pname,gender,tot_amt from patient where drug=’Para’;

In browser → https://localhost:50070

Similarly, with drug as ‘hcq’

New column:

(i)Creating a table with new column:

➥ Creating partition using new column instead of existing table column.

(ii)Describe the table:

➥ Let’s retrieve the information associated with the table

desc pp_patient;

(iii)Insert the data :

➥ Insert the value into the table where the drug=’Para’

In browser → https://localhost:50070

◘ Let’s see the another example,

Data in the table ‘pp_patient’:

Dynamic Partitioning

⮞ In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass the values of partitioned columns manually.

⮞ Enable the dynamic partition by using the following commands:

set hive.exec.dynamic.partition.mode=nonstrict;

(i) Create a table:

create table p_patientdy(pid int,pname string,amt int)
partitioned by (drug string,gender string );

(ii) Insert the data:

insert overwrite table p_patientdy partition(drug,gender)
select pid,pname,tot_amt,drug,gender from patient ;

In Browser:

➥ Inside each partition we have another sub partition based on gender.

➥ Let’s see an example Here, inside the “drug = Para” we can able to see the sub partition.

>>sub partition

Bucketing:

➽ The bucketing in Hive is a data organizing technique. It is similar to partitioning in Hive with an added functionality that it divides large datasets into more manageable parts known as buckets.

➽ So, we can use bucketing in Hive when the implementation of partitioning becomes difficult. However, we can also divide partitions further in buckets.

Working of Bucketing in Hive:

The concept of bucketing is based on the hashing technique.

➥ Here, modules of current column value and the number of required buckets is calculated (let say, F(x) % 3).

➥ Now, based on the resulted value, the data is stored into the corresponding bucket.

(i) Create a table:

➥ Create a table to store the data.

In Browser the ‘bucket_patient1’ has been created

(ii) Enable the bucketing by using the following command:

(iii) Insert the Data:

➥ Insert the data of in the created table into the bucketed table.

In Browser:

(iv) Select the data of the table:

select * from bucket_patient1 TABLESAMPLE(BUCKET 1 OUT OF 4);

Hive Partitioning Advantages:

Partitioning in Hive distributes execution load horizontally.

In partition faster execution of queries with the low volume of data takes place. For example, search population from Vatican City returns very fast instead of searching entire world population.

Hive Partitioning Disadvantages:

There is the possibility of too many small partition creations- too many directories.

Partition is effective for low volume data. But there some queries like group by on high volume of data take a long time to execute. For example, grouping population of China will take a long time as compared to a grouping of the population in Vatican City.

There is no need for searching entire table column for a single record.

Hope you got the concept of partition in cake, I mean Hive …..

Any Questions? Ping me…

Happy Day!!!

Cheers!!!

Ramya R…

--

--