Apache Spark SQL Partitioning & Bucketing

Sandhiya M
5 min readMay 13, 2022

--

Today Let’s get to know Spark Partitioning Concepts..!!

Hello Everyone…! Are You all excited to know what it is….? Come on let’s get into it…!!!

I’ll Start with a Short Example:

➡We all might have seen an encyclopedia in our school or college library. It is a set of books that will give you information about almost anything.

➡So, Do you know what is the best thing about the encyclopedia?

➡Yes, as you guessed ,the words are arranged alphabetically. For example, if you have a word in mind say “Data Science”. You will directly go and pick up the book with the title “D”.

➡Can you imagine how tough would the task be to search for a single book if they were stored without any order?

➡Here storing the words alphabetically represents indexing, but using a different location for the words that start from the same character is known as bucketing.

Similar kinds of storage techniques like partitioning and bucketing are there in Apache spark so that we can get faster results for the search queries. In this article, we will see what is partitioning and bucketing, and when to use which one?

⦿ SO what is partitioning? and what it does.?

➊ It allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally.

Example

Partitioning like this, the data gives us performance benefits and also helps us in organizing the data. Now, let’s see when to use the partitioning in the spark.

Note:

⫸The difference between HIVE and SPARK is, Hive makes usage of MapReduce for storage purpose whereas spark will have its own engine. But both results the same output⫷

➽ Types of Partitioning

Let’s get in-depth and see how it works:

STATIC PARTITIONING:

➊ Using existing column:

In this method, we are creating the partition based on a particular column that already exists in the table.

Step 1: Check the available databases and tables.

Step 2: Now Create a table

create table p_patient2(pid int,pname string,gender string,amt int)
partitioned by (drug string);

➡To describe the patient1 table give the below command

Step 3: Insert the data and to select drug partition that contains only ‘Para’

insert overwrite table p_patient2 partition(drug=’Para’)
select pid,pname,gender,tot_amt from patient where drug=’Para’;

insert overwrite table p_patient2 partition(drug=’hcq’)
select pid,pname,gender,tot_amt from patient where drug=’hcq’;

➋ Using new column:

In the previous method we used a existing column to create a partition, But in this method we are going to create a new column by which the partition is going to take place.

Step 1: Creating a table with new column.

create table p_patient1(pid int,pname string,gender string,amt int)
partitioned by (d string);

Step-2: Insert the value into the table where the ‘ drug=’Para’ or drug =’Crocin’;

Data in our table p_patient1:

➡Dynamic Partitioning :

  • Dynamic partitioning can be performed on the hive external table and managed table.
  • In Dynamic partitioning, there is no requirement of the where clause.
  • If you want to perform partition on the tables without knowing the number of columns in that case you can use Dynamic partitioning.
  • Let’s just ensure it by using the following code.

Step 1: Create a table

Step 2: Insert the data from patient without where clause

Inside each partition we will be having another sub partition based on the gender.

Now let’s look into our Default (Bucketing) Partition.

Default Partition:(Bucketing)

➡ It is similar to partitioning in Hive with an added functionality that it divides large datasets into more manageable parts known as buckets.

  • The concept of bucketing is based on the hashing technique.
  • Here, modules of current column value and the number of required buckets is calculated (let say, F(x) % 3).
  • Now, based on the resulted value, the data is stored into the corresponding bucket.

Step 1: Create Table.

Step 2: Before passing on the data into our table we have set some properties.

Step 3: Insert data from patient.

Step 4- Select the data of buckets.

We will be able to view contents of a particular Bucket using the below code.

Hope this blog will help you a lot to understand what exactly is partition , what is Static partitioning , What is Dynamic partitioning . We have also covered various advantages and disadvantages of spark partitioning.

If you have any query related to spark Partitions, so please leave a comment. We will be glad to solve them….!!

Will catch you all in the next blog …..!

--

--