Partitioned Delta Lake: Part 3

Aravinth
3 min read · Jan 8, 2020

--

Welcome to the third part of this series on how to use partitioning in Delta Lake efficiently.

The previous posts can be found here:

Introduction to Delta Lake

Delta Time Travel for Data Lake

What is a Partition?

A big problem is easier to solve when it is chopped into several smaller sub-problems, and that is exactly what partitioning does. It divides a large database, including its data, metrics, and indexes, into smaller, more manageable slices called partitions. Partitioned tables can be used by SQL queries directly, without any alteration. Once the database is partitioned, data definition language statements can work on the smaller partitioned slices instead of handling the giant database as a whole. This is how partitioning cuts down the problems of managing large database tables.

The partitioning key consists of one or more columns that determine the partition in which each row will be stored. Spark organizes the data into partitions using these partition keys.

Partitioning Techniques


Create a Partitioned Table:

Method to create Partitioned table
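The original code screenshot is not reproduced here; a minimal PySpark sketch of creating a date-partitioned Delta table might look like the following. It assumes a Spark session with the Delta Lake package on the classpath, and the CSV path and table location are placeholders:

```python
# Sketch: write a Delta table partitioned by the `date` column.
# Assumes the Delta Lake package is available to this Spark session;
# file paths are placeholders for the demo.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Load the demo CSV (schema inferred from the header row).
df = spark.read.option("header", "true").csv("/tmp/sales_test.csv")

# partitionBy("date") tells Delta to lay the files out in one
# sub-directory per distinct date value.
(df.write
   .format("delta")
   .partitionBy("date")
   .mode("overwrite")
   .save("/tmp/delta/sales"))
```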

Read a Partitioned Table:

Method to read Partitioned Table
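In place of the screenshot, here is a hedged sketch of reading the table back; the path is a placeholder. Filtering on the partition column lets Spark prune entire date folders instead of scanning the whole table:

```python
# Sketch: read a partitioned Delta table (placeholder path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

sales = spark.read.format("delta").load("/tmp/delta/sales")

# A predicate on the partition column triggers partition pruning:
# only the date=2019-10-01 folder is scanned.
one_day = sales.where("date = '2019-10-01'")
one_day.show()
```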

Creating and Reading table from Delta lake

I have used a sales_test.csv file for this demo, creating a table with date as the partition column and reading it back using the methods above.

When we look at the storage location of the table, we can see that it is partitioned by date: a separate folder is created for each date.
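The directory layout can be illustrated with plain Python, no Spark needed. Delta uses Hive-style partition folders, so a table partitioned by `date` gets one `date=<value>` sub-directory per distinct value (the dates below are just sample values):

```python
# Mimic the Hive-style folder layout of a table partitioned by `date`:
# each distinct value gets its own `date=<value>` directory.
import os
import tempfile

root = tempfile.mkdtemp()
for day in ["2019-10-01", "2019-10-02", "2019-10-03"]:
    os.makedirs(os.path.join(root, f"date={day}"))

partitions = sorted(os.listdir(root))
print(partitions)
# ['date=2019-10-01', 'date=2019-10-02', 'date=2019-10-03']
```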

Updating a Partitioned Table:

I am going to modify the number of units for a specific date. When I update the partitioned table, Delta Lake rewrites only the data that matches the predicate on the partition column.
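A sketch of such an update using the Delta Lake Python API is below. It assumes the delta-spark package is installed; the table path, the date value, and the `units` column name are placeholders for the demo data:

```python
# Sketch: update rows in a single partition of a Delta table.
# Assumes delta-spark is installed; path, date, and the `units`
# column name are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

table = DeltaTable.forPath(spark, "/tmp/delta/sales")

# Because `date` is the partition column, Delta only rewrites files
# under the date=2019-10-01 folder; all other partitions are untouched.
table.update(
    condition=F.col("date") == "2019-10-01",
    set={"units": F.lit(100)},
)
```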

Best Practices for Partitioned Tables:

Choose the right partition column:

You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on what column to partition by:

  • If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and there can be 1M distinct user IDs, that is a bad partitioning strategy.
  • Amount of data in each partition: partition by a column only if you expect each partition to hold at least 1 GB of data.
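These two rules of thumb can be combined into a quick back-of-the-envelope check: divide the table size by the partition column's distinct-value count and see whether each partition would hold roughly 1 GB or more. The table sizes and cardinalities below are made-up illustration values:

```python
# Estimate average data per partition for a candidate partition column.
def avg_partition_size_gb(table_size_gb: float, distinct_values: int) -> float:
    """Average GB per partition if we partition on a column with
    `distinct_values` distinct values (assumes roughly even spread)."""
    return table_size_gb / distinct_values

# A 500 GB table partitioned by date over ~365 days: ~1.4 GB each -> OK.
print(round(avg_partition_size_gb(500, 365), 2))   # 1.37

# The same table partitioned by 1M user IDs: ~0.5 MB each -> too fine-grained.
print(avg_partition_size_gb(500, 1_000_000))       # 0.0005
```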

In the upcoming part of this series, I will explain upsert and insert in Delta Lake.

Thanks for reading!

See you soon :)
