Spark Delta Lake

ACID on Spark, along with a few upgrades.

Abid Merchant
Analytics Vidhya
4 min read · Nov 17, 2019


Hey Fellas,

Hope you are doing well. Today I will tell you about Delta Lake!

This is the first part of my article, in which I will explain the issues in Apache Spark and the need for Delta Lake; the next article will contain the interesting part, the hands-on…

One of the keynotes of this year’s Spark Summit was the announcement of the Delta Lake open source project. The main idea behind Delta Lake is bringing ACID features back to Big Data, something that seemed like a long-lost thing of the past. For all of Spark’s robustness and speed, it lacked ACID guarantees, which seemed of little importance at the start, but gradually their absence became a hindrance for Spark projects. Many Spark Big Data projects that went to production failed because of issues like:

  • Lack of schema enforcement
  • No support for delta load
  • Corrupted data in Data Lake

With Delta Lake we will be able to overcome these issues. Now, the most important aspect of Delta Lake is bringing ACID features back, but why are ACID features important, and what were the drawbacks of not having them in Apache Spark?

Let me get back to basics a bit and walk you through the issues we faced in Apache Spark.

  • Atomicity: The job should either complete fully or have no effect at all. As per the Spark documentation, save modes are not atomic; in practice, however, atomicity is achieved in Apache Spark through FileOutputCommitter version 1 (for HDFS).
  • Consistency: In Spark, when performing an overwrite, the existing data is deleted before the new data is written. This means there is a time span in which the data is completely gone, and an exception during that window results in data loss, which is a bit scary, as we might end up with no data at all (see the sketch after this list).
  • Isolation: Read and write operations on the same dataset should be isolated from each other. Due to the lack of atomicity in write operations, Apache Spark does not guarantee isolation.
  • Durability: This feature means “once committed, never lost”. HDFS helps maintain durability in Spark, but issues may still arise if Spark does not commit the job properly.
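
To make that consistency gap concrete, here is a minimal sketch of a plain-Spark overwrite; the path and data are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-demo").getOrCreate()

# Initial write to a hypothetical Data Lake location.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("/tmp/data_lake/events")

# Overwrite mode deletes the existing files first, then writes the new
# ones. If the job dies between the delete and the write, the old data
# is gone and the new data never arrived -- we end up with nothing.
new_df = spark.createDataFrame([(3, "c")], ["id", "value"])
new_df.write.mode("overwrite").parquet("/tmp/data_lake/events")
```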

Apart from the ACID issues, the lack of schema enforcement is also a major impediment. Spark follows schema-on-read, so during delta loads, if a column datatype mismatch happens, we can still write the data to the Data Lake without any error, but this ends up corrupting our data, as sketched below.
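
Here is a small sketch of that failure mode, assuming a hypothetical orders path where the amount column changes type between loads:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Initial load: 'amount' is an integer.
spark.createDataFrame([(1, 100)], ["id", "amount"]) \
    .write.mode("overwrite").parquet("/tmp/data_lake/orders")

# Delta load: 'amount' now arrives as a string. Plain Parquet happily
# accepts the append -- nothing validates the schema at write time.
spark.createDataFrame([(2, "oops")], ["id", "amount"]) \
    .write.mode("append").parquet("/tmp/data_lake/orders")

# The failure only surfaces later, at read time, when the mixed files
# are scanned: the read may fail or return inconsistent results.
spark.read.parquet("/tmp/data_lake/orders").show()
```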

So, Databricks decided to put a full stop to these issues and announced Delta Lake! Delta Lake is built on top of the Parquet format, the most commonly used format in Big Data projects.

Delta Lake brings us following features:

  • ACID transactions on Data Lake
  • Schema Enforcement
  • Time Travel Capabilities

With ACID transactions possible in Spark, we will be able to perform operations like update, delete, and merge. Previously there was no straightforward way to update existing data or merge in new data, but with Delta Lake all of this becomes possible with a single command. Moreover, a transaction log is maintained for every change.
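
As a small taste of the next article, here is a minimal sketch of Delta Lake’s update, delete, and merge APIs, assuming the delta-spark package is installed and a Delta table already exists at a hypothetical path:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-dml").getOrCreate()

# Bind to an existing Delta table (hypothetical path).
target = DeltaTable.forPath(spark, "/tmp/delta/events")

# Update rows in place -- one command, recorded in the transaction log.
target.update(condition="id = 1", set={"value": "'updated'"})

# Delete rows that match a predicate.
target.delete("value = 'stale'")

# Merge (upsert) new data into the existing table.
updates = spark.createDataFrame([(1, "new"), (9, "fresh")], ["id", "value"])
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```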

With the schema enforcement capability, schema-on-write is followed. This means any change in datatype or schema is caught red-handed at write time, stopping it from reaching the Data Lake, corrupting our data, and throwing the job-time exceptions that used to give us nightmares to rectify.
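
Here is a sketch of enforcement in action, reusing the hypothetical orders table from earlier, now stored as Delta:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-enforce").getOrCreate()

# Create the Delta table with 'amount' as an integer.
spark.createDataFrame([(1, 100)], ["id", "amount"]) \
    .write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# The same bad delta load from before -- 'amount' arrives as a string.
# Delta validates the schema at write time and rejects the append, so
# the corrupt rows never reach the Data Lake.
try:
    spark.createDataFrame([(2, "oops")], ["id", "amount"]) \
        .write.format("delta").mode("append").save("/tmp/delta/orders")
except Exception as e:
    print("Write rejected:", e)
```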

The best and coolest part of Delta Lake is its time travel capability, which will be of utmost importance to fellow Data Scientists, because Delta Lake maintains a version of the dataset for each operation we perform on it. So with time travel, overwriting or appending to our Data Lake does not actually mean losing the old data in its original form. We just have to query the dataset with the version we want, and the dataset as of that version is returned to us. Cool, isn’t it? So, while trying out datasets for a prediction model, we can time travel to older datasets as well, which might have given us better results.
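
A minimal sketch of time travel, assuming a Delta table at a hypothetical path that has been written to at least twice (so more than one version exists):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel").getOrCreate()

# Read the table as of an earlier version number...
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta/orders"))

# ...or as of a timestamp.
old = (spark.read.format("delta")
       .option("timestampAsOf", "2019-11-01")
       .load("/tmp/delta/orders"))

v0.show()
```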

So, this was a short overview of Databricks Delta Lake. In the next article I will show you the hands-on along with the installation of Delta Lake in PySpark. Till then, good bye, ta ta, Vanakkam.

Click the link below for the follow-up article containing the code and hands-on of Delta Lake.

Do check out the video of the Delta Lake announcement at Spark Summit 2019 by Databricks.
