Delta Cloning in Azure Databricks

Unmesh Joshi · Published in Globant · Dec 12, 2022

When we work with data, various scenarios require us to create a copy of it: data archival, unit tests, and training and test data for machine learning are a few use cases. So far, you may have used various ways to create copies of your data, such as CTAS (CREATE TABLE AS SELECT), INSERT SELECT, INSERT OVERWRITE, etc. These are vanilla approaches that don't provide any advantage beyond producing a replica of the data.

In this article, I am going to walk you through a feature provided by Azure Databricks to create a clone of your data.

Delta Cloning

Let's imagine a feature that automatically carries over the full table metadata, including constraints and partitioning, and maintains versions of your replicas at different points in time. What if I told you it can also refresh your replicas incrementally, without doing a full load every time?

Yes, Delta Cloning is the way to achieve this. Delta tables on Databricks provide an efficient and organized way of creating table copies.

There are two types of delta clones: Shallow Clone and Deep Clone. Let's discuss each one in detail.

Shallow Clone

In a Shallow Clone, a copy of the source table's metadata is made, but the clone refers to the data files of the source table itself; no files are copied to the new table's location. Hence, to use a shallow-cloned target table at any point in time, the source table's files must still be available. If the source table's files are deleted or vacuumed, the clone becomes unusable.

In the case of shallow clones, when you execute a SELECT statement on the cloned table, it reads the source table's data files and fetches the output for you. However, updating the cloned table with UPDATE statements creates new files in the cloned table's directory; it does not touch the source table's data files.

Is this expected? Yes, because the shallow clone concept is designed for short-lived use cases where we can delete the clone once our work is complete. If updates were written into the source table's data files instead, we couldn't simply delete the clone afterwards without losing source data. Hence, the altered files are written to the cloned table's location. When you no longer need these files, delete them and you will be good.

Let's see a practical example of a shallow clone. Consider the source table below, named student. It contains five records, as shown below:

Student table's data
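For readers who want to follow along, the source table can be recreated with something like the following. The schema and most of the names are my assumptions for illustration; the article only shows that the table has name and age columns and a student named Adam.

%sql
-- Hypothetical recreation of the source table used in this walkthrough.
CREATE TABLE IF NOT EXISTS student (name STRING, age INT);

-- Separate INSERTs so each record lands in its own data file, matching the
-- five files we will see the deep clone copy later on.
INSERT INTO student VALUES ('Adam', 25);
INSERT INTO student VALUES ('Bella', 26);
INSERT INTO student VALUES ('Chris', 24);
INSERT INTO student VALUES ('Diana', 27);
INSERT INTO student VALUES ('Evan', 23);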

Let's create a database where all the shallow clone (backup) tables will reside (this is the recommended approach). In a real cloud scenario, this would be a database located in a different region.

%sql
CREATE DATABASE db_shallow_clone LOCATION '/FileStore/shallow_clone'

Now, let's create a shallow clone of the student source table using the command below and observe the output:

%sql
CREATE TABLE IF NOT EXISTS db_shallow_clone.student
SHALLOW CLONE student
The output of shallow clone query

As discussed earlier, executing the query below on the shallow-cloned table will read the source table's data and give you the output.

%sql
SELECT *, input_file_name() FROM db_shallow_clone.student
Data from shallow cloned table

If you study the above output, all the records point to the source table's files. In our case, the source table was created in the default database, which points to the Hive warehouse path. If you list the directory of the student table inside the clone path, you will see that it contains no data files, only the _delta_log. This is because no files were copied into the new table's directory during cloning; the clone table still refers to the source table's files.

%fs ls FileStore/shallow_clone/student
Delta log directory of shallow cloned table

Now, let's see what the transaction log looks like. If you analyze this log carefully, a couple of entries are of interest.

"isShallow": true

This tells us that it is a shallow clone.

"numCopiedFiles": "0"

This tells us that no files were copied during the clone operation.

"operation": "CLONE"

This tells us that a clone operation was carried out on this table. When we look at the history of this table later, we will see why this is important.

%fs head FileStore/shallow_clone/student/_delta_log/00000000000000000000.json
Log file content
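If you are reading along without the notebook, the commitInfo entry in that log file looks roughly like this. This is an abridged, hand-reconstructed sketch; only the three fields discussed above come from the actual output, and the exact nesting of fields can vary across Delta versions:

{
  "commitInfo": {
    "operation": "CLONE",
    ...
    "isShallow": true,
    ...
    "numCopiedFiles": "0",
    ...
  }
}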

Let's say we come across a scenario where we need to update a student's age to 30 for some action to be executed. Can we do that directly on the source table? No! Our Mr. Clone says, "Hey buddy! I am here to help you." Let's do justice to Mr. Clone's proactiveness and update a record, setting the age to 30 for the student named Adam.

%sql
UPDATE db_shallow_clone.student SET age = 30 WHERE name = 'Adam'
Update query

If you observe, we now have one data file inside the student directory under shallow_clone. Remember what we discussed earlier? Updating the cloned table with UPDATE statements creates new files in the cloned table's directory and does not update the source table's data files.

This is precisely what is happening. Although your clone table was referring to the source table's data files, once the update was executed on the clone, the updated file was written to the clone table's directory instead of modifying the source table's data file on the fly. This way, the source table remains intact, and accidental updates to the actual data are avoided. Don't you think this is a convenient approach when testing against production data?

%fs ls FileStore/shallow_clone/student/
Shallow cloned table's directory

Let us observe the file names corresponding to the records. Our first four records still point to the source table's data files, but the updated record points to the data file in the shallow clone directory. Only the data touched by the operation is copied, not all of the files. This approach saves a lot of storage, leading to cost savings.

%sql
SELECT *, input_file_name() FROM db_shallow_clone.student
Shallow cloned table content
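This is also where the "operation": "CLONE" entry from earlier pays off. The clone's own history records both the clone and the update, so you can see exactly how the table diverged from its source:

%sql
DESCRIBE HISTORY db_shallow_clone.student
-- version 1 -> operation: UPDATE
-- version 0 -> operation: CLONE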

Shallow clones are used for short-lived cases, such as testing in dev, or in prod where you don't want the main tables to be impacted. Deep clones, by contrast, are essentially CTAS copies with the added functionality of version history and incremental loads, which is what sets them apart from plain CTAS. Because a shallow clone doesn't copy any data to the cloned directory, it is usually quicker to create and start working with than a deep clone.
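Before moving on, here is a minimal sketch of the failure mode mentioned at the start: because a shallow clone borrows the source table's files, vacuuming the source can break the clone. The zero-hour retention and the disabled safety check below are purely for demonstration; never do this on real data.

%sql
-- Rewrite every row so the source's original files become unreferenced.
UPDATE student SET age = age;

-- Disable the retention safety check (demonstration only).
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Physically delete the unreferenced files, which are the very files
-- the shallow clone still points to.
VACUUM student RETAIN 0 HOURS;

-- SELECT * FROM db_shallow_clone.student now fails with a
-- file-not-found style error because the underlying files are gone.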

Deep Clone

In a Deep Clone, the source table's data files are copied to the target table along with the metadata. Hence, an independent copy of the data files is created. As a result, the target table behaves as an independent table, and deleting or vacuuming the source table's data doesn't affect it.

Let us see a practical example of a Deep Clone. We will start by creating a database where all the deep clone (backup) tables are stored.

%sql
CREATE DATABASE db_deep_clone LOCATION '/FileStore/deep_clone'

The below command will help us create a deep clone of our data:

%sql
CREATE TABLE IF NOT EXISTS db_deep_clone.student
DEEP CLONE student
The output of the deep clone query

If you look at the output here, it tells us that five files, 5155 bytes in total, were copied to the cloned directory. Let us list the directory of this cloned table to verify this:

%fs ls FileStore/deep_clone/student
Deep clone directory content

There you go; five files are inside the directory, unlike with a shallow clone, where the directory would contain no data files.

Now, if you select data from the deep-cloned student table and check the corresponding files, the paths point to the deep-cloned delta table and not the source table. If you guessed that it is because the files were copied, then you are right 😊.

%sql
SELECT *, input_file_name() FROM db_deep_clone.student
Deep clone table content

If you enter one more record in the source table and execute the clone command again, only the new record will be copied instead of copying all the records over again. This is where the deep clone handles the data load incrementally, maintaining different versions of the data at different intervals.
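A minimal sketch of that incremental behavior (the new record is hypothetical, and CREATE OR REPLACE is what re-syncs an existing clone):

%sql
-- Add one new record to the source table.
INSERT INTO student VALUES ('Frank', 28);

-- Re-running the clone as CREATE OR REPLACE copies only the new file(s).
CREATE OR REPLACE TABLE db_deep_clone.student
DEEP CLONE student;

-- DESCRIBE HISTORY db_deep_clone.student will show a second CLONE entry
-- whose numCopiedFiles reflects only the newly added data.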

One of the best ways to leverage this feature is to copy data from one environment to another.

Let's say you want data in a test environment to carry out testing, and your data is stored in an Azure storage container. One of the best approaches is to mount the dev storage container in the test environment and execute the command below:

%sql
CREATE OR REPLACE TABLE db_clone.clone_table_name
DEEP CLONE delta.`/mount/path/to/table/data`

If you schedule this command to run every hour or two, it will copy the updated data into the clone table, which the test team can then access for testing. The schedule interval can be adjusted to best fit your case.

This entire process can also act as a disaster recovery mechanism if you create a deep clone of your Azure Databricks table in a different region.
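As a sketch, assuming db_dr is a database whose location points at storage in your secondary region (the database name is a placeholder):

%sql
-- db_dr is assumed to live on storage in the secondary region.
CREATE OR REPLACE TABLE db_dr.student
DEEP CLONE student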

Summary

Now we know how to use shallow and deep clones to create replicas of our data. So, the next time you need to create a copy, head to this wonderful feature provided by Databricks.
