Vinod Indira
Litmus7 Systems Consulting
Sep 30, 2022


Delta Lake — 3rd generation data analytics platform for enterprises

Backdrop

Back in the day, when enterprises started with business intelligence reporting and descriptive analytics, the data warehouse evolved as the go-to solution. However, with the digital revolution, data started to grow tremendously, not just in volume but also in the type of data (variety) and the speed at which it was generated (velocity), and thus arose the need for data lakes. Data lakes (Apache Hadoop based) are file-based, low-cost storage systems that hold data in query-enabled file formats such as Apache Parquet, ORC and sometimes Avro. Data lakes, with their ‘schema-on-read’ approach, proved to be the best option for storing data at speed and scale at a relatively low cost. Storing data in raw format also enabled machine learning and AI. However, for conventional BI reporting, data still needed to be transformed and loaded from the data lake into a data warehouse. Towards the latter part of the last decade, cloud data stores (object stores like S3, Azure Blob and GCS) started to be preferred over on-premise Hadoop-based data lakes.

Why won't data warehouses and data lakes suffice?

Challenges with Data warehouse

  • Support only structured data
  • Warehouse ETL is built for batch processing of data, with minimal support for streaming
  • SQL-friendly but largely limited to SQL, with no direct support for advanced analytics such as ML
  • Performance degrades and scaling is difficult as workloads increase
  • Data warehouses are costly

Challenges of Data lake

  • There is no consistency guarantee: a job failure caused by corrupt data, or a hardware or software failure, can leave data in a partially written state
  • Small files are another challenge: when files are dumped at short intervals, large numbers of small files are created
  • Operations like update, delete and merge are difficult to perform
  • Data validation and quality enforcement are poor
  • Cloud data lakes on object stores such as S3, Azure Blob Storage or GCS store data as key-value pairs, which makes ACID transactions [https://en.wikipedia.org/wiki/ACID] difficult to achieve

To circumvent these problems, most organizations follow a two-tier approach (the 2nd generation of data analytics platforms) where initial ingestion happens onto a data lake and an ETL process then pushes the data into a data warehouse. However, even this approach has its fair share of issues, especially around reliability, data staleness and data pipeline complexity, which increase the number of fault points and entry points for data issues.

How do we address these issues?

Around 2020, the concept of the Lakehouse started becoming common: a new paradigm that combines the best of both worlds, data lakes and data warehouses. The paper by Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia (Databricks, UC Berkeley and Stanford University) [3] covers this concept in detail, explaining the need for such a solution and how it solves the issues present in today's data platforms. Delta Lake and Apache Iceberg are two software systems that help implement a Lakehouse. Lakehouses are enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse, directly on the kind of low-cost storage used for data lakes [9].

Databricks provides integration with Delta.io [https://delta.io/], which helps achieve the lakehouse paradigm. In this blog we will cover Delta Lake in the Databricks environment, where data can be stored as Delta tables.

How does Delta Lake support ACID properties?

The Delta log is the key component of a Delta table: it acts as a transaction log that ensures atomicity, consistency, isolation and durability of user-initiated transactions.

Delta Lake provides ACID support, so if a job fails during an overwrite operation, data is not lost because the changes are never committed to the log file in the _delta_log directory. Also, since Delta Lake does not remove old files during an overwrite, the old state of the data is preserved and there is no data loss.

Atomicity & Consistency:

DDL/DML operations on a Delta table are logged in a transaction log JSON file created in the _delta_log directory, which maintains the latest commit info. A failed transaction never has its data made visible in the table. This approach ensures that data will not be corrupted and hence guarantees atomicity. Also, Delta Lake's strong schema checks ensure consistency.
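As a minimal sketch of how a commit lands in the transaction log (the table path and sample DataFrame below are illustrative, and spark/dbutils are the objects a Databricks notebook already provides), a Delta write followed by a listing of _delta_log might look like this:

# Illustrative path; spark and dbutils are pre-defined in a Databricks notebook
table_path = "/tmp/delta/atomicity_demo"

df = spark.createDataFrame([(1, "grocery"), (2, "bakery")], ["department_id", "department"])

# Each successful write adds exactly one new commit (JSON file) to _delta_log
df.write.format("delta").mode("overwrite").save(table_path)

# List the transaction log: 00000000000000000000.json, then one file per later commit
for f in dbutils.fs.ls(table_path + "/_delta_log"):
    print(f.name)

If the write fails midway, no commit file is added, so readers never see partial data.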

Isolation & Durability:

Isolation means the ability to support multiple concurrent transactions without them interfering with each other. In Delta Lake, when a transaction starts it takes a snapshot of the current Delta log, and when it completes, Delta Lake checks whether the Delta log has been updated by any other transaction in the meantime. The commit log files are numbered sequentially, based on the order in which transactions complete.

All transactions made on Delta Lake tables are written directly to disk. This satisfies the ACID property of durability, meaning the data will persist even in the event of a system failure.

A comparison is given below:

Delta Lake Demonstration

For the purpose of the demo, we have extracted sample datasets from Kaggle.com. Details of the datasets are as follows:

Departments, Products and Orders are the datasets used.

Departments data

Stored to DBFS via batch processing.

Schema: department_id (int), department (str)

Products data

Stored to DBFS via batch processing.

Schema: product_id (int), product_name(str), Price(int), department_id(int)

Orders data

Stored to DBFS via Structured Streaming.

Schema: order_id(int), product_id(int), add_to_cart_order(int), reordered(int)
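A minimal sketch of how these three loads could be wired up (the source file locations, formats and checkpoint path below are assumptions for illustration, not the exact ones used in the demo):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

dept_schema = StructType([
    StructField("department_id", IntegerType()),
    StructField("department", StringType())])

prod_schema = StructType([
    StructField("product_id", IntegerType()),
    StructField("product_name", StringType()),
    StructField("Price", IntegerType()),
    StructField("department_id", IntegerType())])

order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("product_id", IntegerType()),
    StructField("add_to_cart_order", IntegerType()),
    StructField("reordered", IntegerType())])

# Batch loads for Departments and Products into Bronze Delta paths
(spark.read.schema(dept_schema).option("header", True).csv("/FileStore/raw/departments.csv")
    .write.format("delta").mode("overwrite").save("/delta/bronze/departments"))

(spark.read.schema(prod_schema).option("header", True).csv("/FileStore/raw/products.csv")
    .write.format("delta").mode("overwrite").save("/delta/bronze/products"))

# Structured Streaming load for Orders, also landing as a Bronze Delta table
(spark.readStream.schema(order_schema).option("header", True).csv("/FileStore/raw/orders/")
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/bronze/orders/_checkpoint")
    .start("/delta/bronze/orders"))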

  • Moving on to the folder structure, Bronze, Silver and Gold layers are created as Delta layers.
  • Raw data from the source is loaded into Bronze layer tables.
  • Silver layer tables are created by combining all the datasets and contain cleansed data.
  • Gold layer tables house the data in the final format needed to address the functional requirements.

For our demo, what we tried to do is compute product-wise and department-wise total orders from the raw data that was created.
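A sketch of how the Silver join and the Gold aggregation could look, using the Bronze paths and column names assumed in the earlier sketch (the demo's exact transformations may differ):

from pyspark.sql import functions as F

departments = spark.read.format("delta").load("/delta/bronze/departments")
products = spark.read.format("delta").load("/delta/bronze/products")
orders = spark.read.format("delta").load("/delta/bronze/orders")

# Silver: cleansed combination of all three datasets
silver = (orders
    .join(products, "product_id")
    .join(departments, "department_id")
    .dropDuplicates(["order_id", "product_id"]))
silver.write.format("delta").mode("overwrite").save("/delta/silver/order_details")

# Gold: product- and department-wise total orders
gold = (silver.groupBy("department", "product_name")
    .agg(F.countDistinct("order_id").alias("total_orders")))
gold.write.format("delta").mode("overwrite").save("/delta/gold/orders_by_product_and_department")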

On the Products table, we inserted data twice with different order dates. This created two versions of files; Delta in turn creates a transaction log entry for each load.

Below is a snapshot of the _delta_log folder, which contains a JSON file for each version.

Another important feature that Delta Lake provides is the ability to traverse older versions of the data, in other words time travel. The snippet below shows the time travel functionality of a Delta table.
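A sketch of the time travel reads, assuming the Products Delta path from the earlier sketch and the two loads described above (Delta versions are numbered from 0):

# Read specific versions of the table using time travel
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/bronze/products")
v1 = spark.read.format("delta").option("versionAsOf", 1).load("/delta/bronze/products")

v0.show()
v1.show()

# The commit history, one entry per version, can also be inspected
spark.sql("DESCRIBE HISTORY delta.`/delta/bronze/products`").show(truncate=False)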

Version 1

Version 2

Schema Evolution

To address situations where columns change over a period of time, Delta tables have a schema evolution feature. In an RDBMS, if the schema of the incoming data changes, the loading process fails unless the table structure is changed prior to the load.

Below is an example of the schema evolution feature with a simple dataset.

Here we create two datasets with different schemas and store them into the same location as a Delta table.

df1 is created with the below schema.

Below is the result of df1.

Saving df1 as Delta

df2 is created with a different schema, and the number of columns is also different.

Now let's append df2 to the same path with the mergeSchema option.

As a final step, we create a table on top of the location where we stored df1 and df2.

When we run a select query on top of the Delta table, we can see that the schema has evolved and captures both df1 and df2 data.
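Since the notebook screenshots are not reproduced here, the following is a sketch of the flow just described; the column names and path for df1 and df2 are made up for illustration:

path = "/tmp/delta/schema_evolution_demo"

# df1 with two columns
df1 = spark.createDataFrame([(1, "pen"), (2, "pencil")], ["id", "item"])
df1.write.format("delta").mode("overwrite").save(path)

# df2 with a different schema: an extra 'price' column
df2 = spark.createDataFrame([(3, "notebook", 120)], ["id", "item", "price"])

# Append with mergeSchema so the new column is added instead of the write failing
df2.write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Create a table on top of the location and query it
spark.sql(f"CREATE TABLE IF NOT EXISTS schema_evolution_demo USING DELTA LOCATION '{path}'")
spark.sql("SELECT * FROM schema_evolution_demo").show()
# Rows from df1 show NULL for the new 'price' column; rows from df2 carry all three columns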

Delta Lake Index

One of the critical auxiliary data elements in an RDBMS is the index. Indexes are used to improve query performance and data access. Since Delta tables primarily function as data warehouse entities, query performance is important. In Delta Lake, indexes are available at the table level and are stored as key/value metadata in JSON.

One of the index types provided for Delta tables is the Bloom filter index.

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.

Configuration for the index:

SET spark.databricks.io.skipping.bloomFilter.enabled = true;

set delta.bloomFilter.enabled = true;

This creates a delta_index folder for Delta indexing, and the part files are saved as Parquet files, as you can see in the screenshot below.
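On Databricks, the Bloom filter index itself is created per table and column with a SQL statement; a hedged sketch (the table and column names are illustrative, and the OPTIONS values are just examples) could look like this:

# Enable Bloom filter based data skipping for the session
spark.sql("SET spark.databricks.io.skipping.bloomFilter.enabled = true")

# Create a Bloom filter index on a text-heavy column of an existing Delta table
spark.sql("""
  CREATE BLOOMFILTER INDEX ON TABLE products_delta
  FOR COLUMNS (product_name OPTIONS (fpp = 0.1, numItems = 1000000))
""")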

References

[1] https://towardsdatascience.com/delta-lake-with-spark-what-and-why-6d08bef7b963

[2] https://www.chaossearch.io/blog/why-data-lakes-fail

[3] https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

[4] https://www.databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf

[5] https://www.netapp.com/data-storage/storagegrid/what-is-object-storage/

[6] https://resources.useready.com/blog/s3-vs-hdfs-comparing-technologies-in-the-big-data-ecosystem/

[7] https://www.firebolt.io/blog/the-real-meaning-of-a-data-lake

[8] https://www.databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf

[9] https://www.databricks.com/blog/2020/09/10/diving-deep-into-the-inner-workings-of-the-lakehouse-and-delta-lake.html

[10] https://www.databricks.com/discover/data-lakes/introduction

[11] https://dbdb.io/db/delta-lake#:~:text=Delta%20Lakes%20have%20indexes%20on,of%20the%20column%20and%20partition.
