Delta Lake (Part 2) — Overcoming the challenges of a Data Lake

Md Sarfaraz Hussain
8 min read · Sep 5, 2021


We saw the problems with traditional Data Warehouses and Data Lakes in Part 1 of this series (I recommend reading that first and then coming back here). But what would we love to have in order to address all of those problems?

We simply want the best of both worlds.

So, as a result of the above considerations, we have Delta Lake, which provides the scale of a Data Lake, the reliability and performance of a Data Warehouse, and low-latency access that can be used for streaming as well.

WHAT is a Delta Lake?

According to the official definition: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake

Let’s try to understand how Delta Lake can solve the challenges of the previous architecture (discussed in Part 1) and simplify it.

So, we had this earlier -

Source: Databricks

In that setup, we spent all our time thinking about systems problems. With Delta Lake, we can change the architecture to something like the one below, where we think about the data flow instead –

Source: Databricks

Let’s dive into the above Delta Lake-based architecture and see HOW it works –

Source: Databricks

We have different data quality levels: Bronze, Silver, and Gold tables (these are not specific to Delta Lake; it’s just a naming convention, and you can drop any of the layers based on the use case, i.e., there are no hard and fast rules). The idea behind having different layers of data based on quality is that, rather than going directly from raw data to the final result, we incrementally improve the quality of the data.

Source: Databricks

Starting at the beginning, we have our Bronze layer, where we dump raw data. This is just whatever comes out of Kafka, Kinesis, a Data Lake, Spark, etc. It’s the dumping ground for raw data and can hold many years of historical data.

Let me know in the comments: why do we store the raw data coming from the sources rather than performing ETL directly on the incoming data?

Source: Databricks

The next step is to move to the Silver table, where we perform some cleaning, parsing, joins, lookups, etc. We materialize this intermediate data so that it can serve multiple purposes –

a. This clean data can be used by multiple people in different downstream applications.

b. Training ML models on it.

c. Used for debugging. When something goes wrong in the final analysis/result, we always have this table, which we can query with full SQL support.

Use the comments section to add more advantages of the Silver table.

Source: Databricks

Finally, we move on to the Gold layer, which holds high-level business aggregates used for reporting, streaming analytics, or training ML models for AI. Any downstream application (Spark, Presto, Hive, etc.) can read this data as well.

Source: Databricks
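To make the flow above a little more concrete, here is a minimal batch-style sketch of the Bronze → Silver → Gold progression in Scala. It assumes a SparkSession named spark with the Delta Lake package on the classpath (the setup is shown later in this post); the paths, column names and the daily-count aggregation are illustrative placeholders, not something prescribed by Delta Lake.

import org.apache.spark.sql.functions._

// Bronze: the raw dump, read as a Delta table (path is a placeholder).
val bronze = spark.read.format("delta").load("/delta/bronze/events")

// Silver: clean and parse the raw records (illustrative rules).
val silver = bronze
  .filter(col("event_id").isNotNull)                  // drop corrupted rows
  .withColumn("event_date", to_date(col("event_ts")))
silver.write.format("delta").mode("overwrite").save("/delta/silver/events")

// Gold: business-level aggregates for reporting / ML.
val gold = silver.groupBy("event_date").count()
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_event_counts")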

Some out-of-the-box offerings of Delta Lake

a. Handling SCD (Slowly Changing Dimensions)

b. Handling CDC (Change Data Capture)

c. Support for INSERT/UPDATE/DELETE operations

d. Supports UPSERT/MERGE/OVERWRITE operations

e. TIME TRAVEL
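As a quick, hedged illustration of the last point: time travel is exposed through the versionAsOf / timestampAsOf read options (the path below is just a placeholder).

// Read the table as it exists right now.
val latest = spark.read.format("delta").load("/delta/events")

// Time travel: read an older snapshot by version number or by timestamp.
val firstVersion = spark.read.format("delta").option("versionAsOf", "0").load("/delta/events")
val lastWeek = spark.read.format("delta").option("timestampAsOf", "2021-08-29").load("/delta/events")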

WHY do we use Delta Lake (high-level features) –

1. Full ACID Transactions — Focus on your data flow instead of worrying about failures.

- It acts as a database.

- Atomicity: When a job runs it either runs completely or fails completely, i.e., it never leaves partial results.

- Consistency: Provides consistent results even if multiple people modify the same data at the same time. (We will talk about this in detail in an upcoming blog)

- Isolation: If we read data and meanwhile someone modifies the data, we will continue to see a consistent snapshot of the data.

- Durability: All transactions made on Delta Lake tables are persisted directly to the underlying Data Lake storage (S3/ADLS/GCS/HDFS). This satisfies the durability property, meaning the data will survive even a system failure.

- If something goes wrong, the transaction can be rolled back and we don’t have to do any extra clean-up (see the small sketch after this list).

2. Open Standards, Open Source — Store petabytes of data without worrying about lock-in; the data is stored in the open Parquet file format. Delta Lake can be integrated with Apache Spark, Presto, Apache Hive, Pandas, etc.

3. Deeply integrated with Spark — The combination of Delta Lake and Apache Spark unifies both streaming and batch processing. We can convert existing jobs to run on Delta Lake with very minimal modifications.
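To make point 1 a bit more concrete, here is a small sketch of atomicity and isolation in practice, assuming a DataFrame named recomputedDf and a placeholder path; this is just one illustration of how the guarantees show up.

// Replacing the table contents is a single atomic commit in the Delta log.
// A reader that loaded this path before the commit keeps seeing the old
// snapshot; after the commit it sees only the new one, never a partial mix.
recomputedDf.write
  .format("delta")
  .mode("overwrite")
  .save("/delta/silver/events")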

How to use Delta Lake with Apache Spark?

1. Locally using shell
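For example, assuming Spark 3.x and Delta Lake 1.0.0 (the versions and configs below follow the Delta Lake quickstart and are illustrative), the Scala shell can pull Delta in as a package:

spark-shell --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"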

2. With project –

Maven:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>1.0.0</version>
</dependency>

SBT:

libraryDependencies += "io.delta" %% "delta-core" % "1.0.0"

More details can be found in the official Delta Lake documentation.

3. Changes in Code –

Current:

dataframe.write.format("parquet").save("/data")

New:

dataframe.write.format("delta").save("/data")

And everything else works the same as before, but now we have the power of ACID transactions and more. :)
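Putting it together, a minimal end-to-end sketch in Scala might look like this (the app name, table path and sample data are made up for illustration; the two spark.sql.* settings follow the Delta Lake quickstart):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-quickstart")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Write a small DataFrame as a Delta table (creates the transaction log).
spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save("/tmp/delta/numbers")

// Append more data; every write is an ACID commit, i.e. a new table version.
spark.range(5, 10).toDF("id").write.format("delta").mode("append").save("/tmp/delta/numbers")

// Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/numbers").show()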

Summary of the lessons learned

With Delta Lake, we have turned the earlier complex architecture into a much simpler and cleaner three-layer design (Bronze, Silver, Gold). The Bronze layer is used to dump raw ingested data; this is the least cleaned-up data, including corrupted records, missing values, etc. As we move from left to right, we incrementally clean the data to improve its quality. The Silver table consists of the cleaned-up, filtered data, augmented with more information, and the final Gold table holds the aggregates on top of the Silver table that are directly consumed for reports and ML models.

So, this makes our entire pipeline unified between batch and streaming, as well as much simpler to reason about end-to-end, because across the entire pipeline we have full ACID transaction guarantees. We can focus on our data instead of worrying about its consistency, failures, etc.

Additionally, we can remove the batch jobs and run the entire pipeline as streaming, following the Kappa Architecture. Streaming doesn’t only mean real-time, end-to-end processing at second-level intervals; we can also run streaming jobs periodically, for example as a one-hour job, so that streaming takes care of all the exactly-once guarantees, fault tolerance, etc. in an end-to-end manner.

Delta Lake has full support for all the standard DML operations like DELETE, UPDATE, and MERGE, with full transactional guarantees, so we can focus on running the query and not worry about what happens if it fails in the middle.
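As a hedged sketch of what those DML operations look like with the Delta Lake 1.0 Scala API (the table path, column names and the updatesDf DataFrame are placeholders invented for this example):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{expr, lit}

val table = DeltaTable.forPath(spark, "/delta/silver/events")

// DELETE rows matching a predicate.
table.delete("event_type = 'test'")

// UPDATE a column for matching rows.
table.update(expr("country = 'UK'"), Map("country" -> lit("United Kingdom")))

// MERGE (upsert) a DataFrame of changes into the table, transactionally.
table.as("t")
  .merge(updatesDf.as("u"), "t.event_id = u.event_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()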

Hence, we have achieved all the goodness of Data Warehouse and Data Lake!!! :)

Advantages of Delta Lake with Streaming jobs rather than Batch

People often use streaming to move data through the Delta Lake layers instead of batch jobs, because streaming is about incremental computation over an ever-growing dataset, and every business has queries that must run on newly arriving data (queries that may themselves change over time). The problems streaming solves are the same problems we would otherwise have to solve manually with a batch job, namely -

a. Identifying which data is new and which has already been processed.

b. How do I pick only the new data, process it exactly once, and commit it downstream transactionally?

c. How do I checkpoint the processing so that I can recover if the system crashes?

These are all handled by streaming jobs out of the box. We can use micro-batching in Structured Streaming with Delta Lake to get these guarantees automatically.

Many use cases have datasets arriving from the sources every hour/day/week. In these cases we can use Trigger Once: the stream starts up, processes everything that is available, and then shuts down. If we are running in the cloud, this is an added advantage because we can exploit its elasticity, i.e., when the stream processing is done we can shut the cluster down and stop paying for it, so we save processing costs by taking advantage of streaming.
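A hedged sketch of such a Trigger Once stream moving data from the Bronze table to the Silver table (the paths, checkpoint location and cleaning step are assumptions):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("delta")
  .load("/delta/bronze/events")                 // source: Bronze Delta table
  .filter(col("event_id").isNotNull)            // incremental cleaning step
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/checkpoints/bronze_to_silver")
  .trigger(Trigger.Once())                      // process what is available, then stop
  .start("/delta/silver/events")                // sink: Silver Delta table

query.awaitTermination()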

So, streaming allows us to focus only on the data flow rather than the control flow of processing, and can dramatically simplify the work.

We have discussed all the challenges we face with a Data Lake except one: Reprocessing. Let’s talk about it in a nutshell -

If we keep all the raw data in the Bronze layer and use streaming to do the processing, it becomes very easy to handle cases where we have made a mistake (a bug in the existing code) or we come up with a new business rule that we want to compute from scratch. The reason this becomes convenient with streaming is that —

When we start a stream with a fresh checkpoint, it computes the same result as a batch job over the same dataset. The way streams work is this: when a stream first starts, it takes a snapshot of the Delta table at that moment, breaks the snapshot into a bunch of individual pieces, and processes them incrementally. Once it reaches the end of the snapshot, it starts tailing the transaction log to process only the new data that has arrived since the stream began.
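In practice, a reprocessing run is just the same stream started with a fresh checkpoint location, so it recomputes from the initial snapshot of the Bronze table; the optional maxFilesPerTrigger source option controls how that snapshot is broken into pieces. The paths and values below are placeholders.

import org.apache.spark.sql.functions.col

val fixed = spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "1000")         // size of each incremental piece of the snapshot
  .load("/delta/bronze/events")
  .filter(col("event_id").isNotNull)            // the corrected / new business logic goes here

fixed.writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/checkpoints/silver_v2")  // fresh checkpoint: recompute from scratch
  .start("/delta/silver/events_v2")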

This is another great scenario where we can take advantage of the elasticity of the cloud. For the initial back-fill, we can spin up a thousand-node cluster and process it quickly, and when we get to the incremental processing at the end, we can scale the cluster down to just 10 machines and reduce cost significantly.

That’s all for this blog. In the next part of the series, we will find out how Delta Lake works under the hood.

I hope you enjoyed reading this blog!! Share your thoughts in the comment section. You can connect with me over LinkedIn for any queries.

Thank you!! :) Keep Learning!! :)


Md Sarfaraz Hussain

Sarfaraz Hussain is a Big Data enthusiast working as a Data Engineer with 4+ years of experience. His core competencies are in Spark, Scala, Kafka, Hudi, etc.