Facilitating Business Demands with Databricks Lakehouse

Arujit Das
8 min read · Aug 1, 2020


Before I go through the Databricks Lakehouse and discuss its significance in today’s business, let us take a step back and realize the need for a data lake. A data lake makes our lives easier by storing massive volumes of data from diverse sources so that further insights can be gained from that data.

Although data lakes have been in place for a couple of years and meet many business demands, there is still a need for the following capabilities:

1. ACID Transaction

2. Schema Enforcement and Evolution

3. Time Travel

4. Unified Batch and Streaming

5. Full DML support

Here comes the beauty of the Lakehouse. In this article, I will try to explain the above capabilities with a hands-on use case. A typical data lake looks much like the one depicted below. We will try to create a similar data lake, but with the help of Databricks Delta Lake, utilizing the benefits listed above.

The complete source code is available on my GitHub. Feel free to download it and share your feedback.

https://github.com/arujitdas/Databricks_Deltalake

I will also try to cover a few best practices for developing a data lake.

Let us first create our landing zone. For the sake of simplicity, I will leave it to the developers/architects to ingest the data into the landing zone using their preferred orchestration mechanism. For the time being, I will treat Azure Blob Storage as my landing zone and upload a file there. This is the file that will be ingested into the different zones after manipulation/transformation, as needed. I am considering a single file, but this could just as easily be an automated process that loads multiple files and terabytes of data and moves them from one zone to another.

Prerequisite:

1. Azure blob storage is configured

2. Azure Databricks workspace is created, and a notebook is associated with a cluster

3. For this use case, I have downloaded the Covid-19 data for 25th July 2020 (07-25-2020.csv) from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports and uploaded it into my Azure blob storage

Note: A different nomenclature also exists, namely the bronze, silver, and gold zones, corresponding to the raw, trusted, and refined zones respectively. We will use these terms going forward for our hands-on exercise.

Delta Lake is an open source storage layer that runs on top of the existing data lake.

Now let us try to access the file in Azure blob storage from an Azure Databricks notebook so that the data can be loaded into the raw zone. A typical option is to connect via the blob storage account key, which exposes the key in the notebook. The best practice, however, is to keep the secrets in Azure Key Vault and create a secret scope in Azure Databricks. Once this is done, we can mount the Azure blob storage in the Databricks notebook and all workspace members can access the files. This is a secure approach where the credentials are never exposed.

In my case, I have below services:

1. Blob storage:

a. Name: covidstgaccount

b. Container name: covidcontainer

c. Account Key: xxxxxxxxxx

d. File name: 07-25-2020.csv

2. Key vault:

a. Name: covidkeyvaluevault

b. Secret name: dbkeyvault

i. DNS name: https://covidkeyvaluevault.vault.azure.net/

ii. Resource ID: /subscriptions/xxxxxxxxxxxxxxxxxxxxx/resourceGroups/DBResourceGroup/providers/Microsoft.KeyVault/vaults/covidkeyvaluevault

3. Databricks workspace

a. Name: covidworkspace

b. Secret scope name: dbscope

c. Cluster: Standard, Databricks Runtime: 6.5, worker and driver type: Standard_DS3_v2

d. Notebook name: covidAnalytics

Note: There is no option in the Databricks workspace UI to create the secret scope. To create it, open a separate browser tab and enter a URL such as:

https://<location>.azuredatabricks.net/?o=<orgID>#secrets/createScope

Provide the secret scope name and fill in the DNS name and resource ID from 2.b.i and 2.b.ii respectively, as gathered above.

Now, let us open our notebook and start the hands-on exercise to see how to create a Lakehouse utilizing the benefits of Databricks Delta Lake.

1. We need to first mount the blob storage to the Databricks notebook so that it is accessible without having to create the integration point again.
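
A minimal sketch of this mount, using the account, container, secret scope, and secret names listed in the prerequisites; the mount point /mnt/covid is an illustrative choice:

```python
# Mount the blob container using the Key Vault-backed secret scope,
# so the account key is never exposed in the notebook.
dbutils.fs.mount(
    source="wasbs://covidcontainer@covidstgaccount.blob.core.windows.net",
    mount_point="/mnt/covid",  # illustrative mount point
    extra_configs={
        "fs.azure.account.key.covidstgaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="dbscope", key="dbkeyvault")
    }
)
```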

2. Now, we can browse the file from the mount location as if we were using local storage. Let us take a quick peek at the data.
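
For example, using the mount point from the sketch above:

```python
# Browse the mount point like local storage and peek at the raw CSV file.
display(dbutils.fs.ls("/mnt/covid"))
print(dbutils.fs.head("/mnt/covid/07-25-2020.csv"))
```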

3. Databricks is smart enough to infer the schema; however, it is best practice to specify the schema explicitly. In this case the result will be the same, but let us try both methods.

Infer schema:
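
A sketch of the inferred-schema read (mount point and file name as above):

```python
# Let Spark infer the column types from the CSV header and data.
df_inferred = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/mnt/covid/07-25-2020.csv"))
display(df_inferred)
```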

Explicit Schema:
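
And a sketch with an explicit schema; the column list reflects the JHU daily-report layout and should be adjusted to match the file exactly:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType)

# Explicitly declare the schema instead of inferring it.
# Last_Update is kept as a string here for simplicity.
covid_schema = StructType([
    StructField("FIPS", StringType(), True),
    StructField("Admin2", StringType(), True),
    StructField("Province_State", StringType(), True),
    StructField("Country_Region", StringType(), True),
    StructField("Last_Update", StringType(), True),
    StructField("Lat", DoubleType(), True),
    StructField("Long_", DoubleType(), True),
    StructField("Confirmed", IntegerType(), True),
    StructField("Deaths", IntegerType(), True),
    StructField("Recovered", IntegerType(), True),
    StructField("Active", IntegerType(), True),
    StructField("Combined_Key", StringType(), True),
])

df = (spark.read
    .format("csv")
    .option("header", "true")
    .schema(covid_schema)
    .load("/mnt/covid/07-25-2020.csv"))
display(df)
```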

4. Let us start moving the data from the landing zone to the raw/bronze zone, i.e. moving the data into a bronze Delta table. This is the zone where attribute processing or metadata creation takes place, usually renaming columns as per the business needs.

To create the Delta tables, simply replace the parquet format with delta.
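
A sketch of the bronze write; the column renames and the bronze path are illustrative, and only the format string differs from a plain Parquet write:

```python
# Rename columns as needed for the business, then persist to the bronze zone.
bronze_df = (df
    .withColumnRenamed("Country_Region", "Country")
    .withColumnRenamed("Province_State", "State"))

# A Parquet data lake would use .format("parquet"); Delta only needs "delta".
(bronze_df.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/covid/bronze"))
```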

We can also directly access the data from the path.
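
For example:

```python
# Query the Delta files directly by path, without registering a table first.
display(spark.read.format("delta").load("/mnt/covid/bronze"))
```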

5. Let us now create the database and the respective bronze, silver, and gold delta tables.

Note: Cmd 16 in the notebook is nothing but a markdown cell, which can easily be created using the %md command.
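
One way to sketch this step, assuming a database named covid_db; the bronze table is registered over the path written above, and the silver and gold tables can be registered the same way once their data is written in the following steps:

```python
# Create the database and register the bronze Delta table over its storage path.
spark.sql("CREATE DATABASE IF NOT EXISTS covid_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS covid_db.covid_bronze
    USING DELTA
    LOCATION '/mnt/covid/bronze'
""")
```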

6. Now, let us create the silver zone and the silver Delta table. Silver Delta tables could be refined tables, or tables produced after joining with reference tables. For this use case, I am joining my bronze table with the Databricks-provided country lookup table.
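
A sketch of the silver step; the lookup path is a placeholder for the Databricks-provided country lookup table, and the join assumes the lookup shares the renamed Country column from the bronze step:

```python
# Placeholder path -- point this at the country lookup table used in the notebook.
lookup_path = "/databricks-datasets/..."

country_lookup = (spark.read
    .format("csv")
    .option("header", "true")
    .load(lookup_path))

# Enrich the bronze records with the lookup and persist to the silver zone.
silver_df = spark.table("covid_db.covid_bronze").join(country_lookup, "Country", "left")

(silver_df.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/covid/silver"))

spark.sql("""
    CREATE TABLE IF NOT EXISTS covid_db.covid_silver
    USING DELTA
    LOCATION '/mnt/covid/silver'
""")
```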

Now that we have the silver zone, let us peek into its records.

7. The gold zone is our final zone, where aggregated data is usually made available. This zone can be used for analytics and for gathering insights.
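
A sketch of the gold step; summing confirmed cases and deaths per country is an illustrative aggregation:

```python
from pyspark.sql import functions as F

# Aggregate the silver data and persist it to the gold zone.
gold_df = (spark.table("covid_db.covid_silver")
    .groupBy("Country")
    .agg(F.sum("Confirmed").alias("TotalConfirmed"),
         F.sum("Deaths").alias("TotalDeaths")))

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/covid/gold"))

spark.sql("""
    CREATE TABLE IF NOT EXISTS covid_db.covid_gold
    USING DELTA
    LOCATION '/mnt/covid/gold'
""")
```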

The above steps explained how we created a Databricks Delta Lake. Now we need to investigate the benefits that Delta Lake provides.

ACID Transaction

The Delta Lake transaction log is the answer to how ACID transactions are achieved. Every transaction performed on the Delta Lake is recorded under the _delta_log subdirectory, and all operations are collected and recorded as atomic commits. If we investigate the Databricks File System for the zones we created above, we can see .json files ending with xxxxx000.json, xxxxx001.json, and so on. This list keeps growing as we modify the table, providing atomicity and isolation.
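
We can list the log for any of the zones, for example the bronze path used above:

```python
# Each numbered JSON file under _delta_log is one atomic commit.
display(dbutils.fs.ls("/mnt/covid/bronze/_delta_log"))
```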

Whenever multiple writes happen simultaneously, Databricks usually reconciles them silently and seamlessly using a concept called ‘mutual exclusion’. If two users try to commit at the same time, the moment the first user’s commit is accepted, Databricks does not throw an error at the second user; it silently updates the table state and retries the second user’s commit against the new version.

Schema Enforcement and Evolution

Delta Lake also provides an option to specify and enforce a schema, preventing bad data from being ingested and helping us avoid corrupt data in the table.

If we add a new column and try to save the records, we will get an error because schema enforcement is in place.

It is also possible to evolve the schema using Databricks Delta Lake’s mergeSchema mechanism.
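
A sketch of both behaviours against the bronze table; the extra Source column is purely illustrative:

```python
from pyspark.sql import functions as F

# A DataFrame with a column the bronze table does not have.
new_df = spark.table("covid_db.covid_bronze").withColumn("Source", F.lit("JHU CSSE"))

# Schema enforcement: a plain append fails with an AnalysisException.
# new_df.write.format("delta").mode("append").save("/mnt/covid/bronze")

# Schema evolution: mergeSchema adds the new column to the table schema.
(new_df.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/mnt/covid/bronze"))
```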

Full DML support

Delta Lake also has full support for DML operations like UPDATE, DELETE and MERGE. With Delta Lake, we can delete records directly, which is not possible with plain Parquet; the same goes for updating records, which is easy to achieve in Delta. Let us evaluate this with an example.
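
A sketch of DELETE and UPDATE on the bronze table; the predicates are illustrative:

```python
from delta.tables import DeltaTable

bronze_table = DeltaTable.forPath(spark, "/mnt/covid/bronze")

# DELETE: remove rows with no confirmed cases.
bronze_table.delete("Confirmed = 0")

# UPDATE: normalise a country name in place.
bronze_table.update(
    condition="Country = 'Korea, South'",
    set={"Country": "'South Korea'"}
)
```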

Merge Operation
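
A sketch of an upsert; the source DataFrame and the Combined_Key join key are illustrative:

```python
from delta.tables import DeltaTable

bronze_table = DeltaTable.forPath(spark, "/mnt/covid/bronze")
updates_df = spark.table("covid_db.covid_bronze").limit(10)  # stand-in for a new batch

# Upsert: update matching rows, insert the rest.
(bronze_table.alias("t")
    .merge(updates_df.alias("s"), "t.Combined_Key = s.Combined_Key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```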

Unified Batch and Streaming

Delta Lake provides a unique capability where it is easy to write both batch and streaming data into the same Delta table. Best of all, Delta tables can act as both a streaming source and a sink.
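
A sketch that reads the bronze Delta table as a stream and writes into another Delta table; the checkpoint and sink paths are illustrative:

```python
# A Delta table as a streaming source...
stream_df = spark.readStream.format("delta").load("/mnt/covid/bronze")

# ...and another Delta table as the streaming sink.
(stream_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/covid/_checkpoints/bronze_stream")
    .start("/mnt/covid/silver_stream"))
```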

Time Travel/Data snapshots

Data professionals often face challenges with respect to change data capture. Auditing data is crucial for understanding its history and what has happened over time. It is also important to be able to roll back to a previous stable state if there is a data quality issue. Delta Lake solves these issues with its Time Travel feature. We can easily investigate the history by running the DESCRIBE HISTORY command or through the UI.
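
For example, against the bronze table:

```python
# Every write, update, delete, and merge shows up as a table version.
display(spark.sql("DESCRIBE HISTORY covid_db.covid_bronze"))
```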

We can audit the changes by travelling back in time.
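
A sketch of reading an earlier snapshot; the version number is illustrative:

```python
# Read the bronze table as it was at version 0.
old_df = (spark.read
    .format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/covid/bronze"))
display(old_df)
```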

If required, we can roll back the updates using any of the approaches below.
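
One sketch of a rollback, assuming version 0 is the stable state to return to; newer Delta Lake releases also offer a RESTORE TABLE command for the same purpose:

```python
# Roll back by overwriting the current table with an earlier snapshot.
stable_df = (spark.read
    .format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/covid/bronze"))

(stable_df.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/covid/bronze"))
```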

Wrapping Up

Good-quality, reliable data is key to analytics and gaining insights. The Lakehouse provides one way to achieve this, and as explained above, Databricks Delta Lake also provides several other key components for modern data architecture and data engineering.

Arujit Das

Senior Architect — Data and Analytics