Delta Lake (Part 1) — What’s the need?

Md Sarfaraz Hussain
6 min read · Aug 28, 2021


Building a data platform that can store real-time data from many different sources and extract insights from it is hard. A typical data pipeline has many different components, and each of them is often hard to get right –

1. Extraction of data
2. Cleaning of data
3. Transforming and loading the data
4. Machine learning and advanced analytics

Let’s try to find out, step by step, how Delta Lake solves these pain points. But first, let’s try to understand the problems with the Data Warehouse and the Data Lake.

Working with Traditional Data Warehouse

Source: Databricks

For many years, and still in some cases today, the way ETL has been done is roughly this. Assume we have a complex ETL process that does the following:

a. Cleans the data.
b. Converts the data from unstructured/semi-structured to structured.
c. Writes it into a data warehouse.
d. Runs further levels of processing on top of the warehouse, such as SQL queries and reporting, to get insights out of the data.

Now, the problems that people faced with traditional data warehouses were –

a. Creating and maintaining complex ETL pipelines.
b. No support for streaming, so ETL was done on periodic dumps of data.
c. Limited to SQL only, with no direct support for advanced analytics like ML.
d. Poor performance that cannot easily be scaled with increasing workloads.
e. Not cost-efficient; data warehouses are costly.

Working with Data Lake

Source: Databricks

Moving on to the next generation of data storage, the Data Lake, which provides the following advantages over a traditional Data Warehouse –

a. Supports seamless scaling up and down.
b. Provides support for different processing frameworks, like MapReduce, Hive and Spark.
c. Data can be stored in different formats like ORC, Parquet, JSON, Avro, etc. as per need (a small sketch follows this list).
d. Provides support for ML workloads (though not very efficiently) to get insights out of the data.
e. Cost-efficient; Data Lakes are cheap.
f. Can store data in all forms, i.e. structured, semi-structured and unstructured.
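
To illustrate point (c), here is a minimal Spark (Scala) sketch of persisting the same data in a few of these open formats. The bucket and paths are hypothetical placeholders, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lake-formats").getOrCreate()

// Read semi-structured input (hypothetical path).
val events = spark.read.json("s3a://my-bucket/raw/events/")

// Persist the same data in whichever open format the downstream need calls for.
events.write.mode("overwrite").parquet("s3a://my-bucket/lake/events_parquet/")
events.write.mode("overwrite").orc("s3a://my-bucket/lake/events_orc/")
events.write.mode("overwrite").json("s3a://my-bucket/lake/events_json/")
```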

Working with Data Lake and Spark

Source: Databricks

Since Spark is a unified processing engine, we can do ETL in both batch and streaming fashion, alongside Spark SQL, Spark ML, etc. In other words, having the same APIs for batch and streaming essentially simplifies ETL.

Benefits –

1. We can build our ETL logic, test and fine-tune it on batch dumps of data, and then run the same code in a streaming fashion for continuous ETL (see the sketch after this list).

2. With streaming support, we can analyze the same unstructured/semi-structured raw data within minutes.

3. We are no longer limited to SQL only; we also have Spark ML and other complex transformation operations at our disposal.

4. We can use commodity hardware, which further reduces cost (though maintenance becomes a hassle).
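
As a rough sketch of benefit 1, assuming Spark with Scala: the ETL logic lives in one function over the DataFrame API, is first tuned on a batch dump, and is then reused unchanged in a Structured Streaming job. The `errorLogsOnly` function, the paths and the schema handling are all illustrative assumptions, not a prescribed implementation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("unified-etl").getOrCreate()

// The ETL logic is written once against the DataFrame API.
def errorLogsOnly(logs: DataFrame): DataFrame =
  logs.filter(col("level") === "ERROR")
      .select("timestamp", "service", "message")

// 1) Batch: fine-tune the logic on a historical dump of raw logs.
val rawBatch = spark.read.json("s3a://my-bucket/dumps/logs/")
errorLogsOnly(rawBatch).write.mode("overwrite").parquet("s3a://my-bucket/lake/error_logs/")

// 2) Streaming: reuse the exact same function for continuous ETL.
val rawStream = spark.readStream.schema(rawBatch.schema).json("s3a://my-bucket/landing/logs/")
errorLogsOnly(rawStream).writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/lake/error_logs_streaming/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/error_logs/")
  .start()
```

The point is that `errorLogsOnly` never changes between the two modes; only the read/write plumbing around it does.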

Now, let’s talk about some limitations of this Data Lake + Spark model.

Slowly, as demand grew for building more and more complex use cases and pipelines, people started realizing that the Data Lake doesn’t solve everything; it has its own challenges in practice. So, let’s look at the problems Data Engineers actually face.

Challenges –

Let us try to understand the challenges with a scenario.

Assume there is a service running that is generating logs and we are collecting those logs using Apache Kafka. We want to do 3 things –

1. Process the logs in a streaming fashion for real-time analytics, to see what’s going on right now.
2. Perform historical (batch) analysis; for that, we want to store the logs in the Data Lake.
3. Do daily reporting as well.

Uff.. a lot of work. Gosh!!

Source: Databricks

Let’s see how we do this –

1. For the streaming analytics, we run a pipeline using Spark Structured Streaming to process the data coming from Kafka.

2. Since we cannot retain data in Kafka for very long, for historical analysis we run a second Spark Structured Streaming pipeline that stores the same data in the Data Lake (S3/ADLS/GCS/HDFS) using an open file format like Parquet (a sketch of this job follows).

3. From the data stored in the Data Lake, we do the reporting with a third pipeline, a Spark batch job.
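
A minimal sketch of pipeline 2 (the Kafka-to-lake archival job), assuming Spark Structured Streaming with the Kafka source; the broker address, topic name, paths and trigger interval are made-up placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-lake").getOrCreate()

// Read the raw service logs from Kafka (broker and topic names are placeholders).
val kafkaLogs = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("subscribe", "service-logs")
  .load()

// Keep the payload as a raw string; parsing/cleaning happens downstream.
val rawLogs = kafkaLogs.selectExpr("CAST(value AS STRING) AS json", "timestamp")

// Continuously append the data to the lake as Parquet for historical analysis.
rawLogs.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/lake/raw_logs/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/raw_logs/")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()
  .awaitTermination()
```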

Source: Databricks

This is the typical Lambda Architecture.

Now that we have seen the Data Lake + Spark solution, let’s talk about the challenges and where it fails to solve the problem efficiently –

1. We might get corrupted logs in Kafka or in flight.
2. Some columns might go missing in the JSON data from the source.
3. Data types might change.

So now our pipelines start having problems, and we have to add additional validation steps (one such step is sketched below).
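
One possible shape of such a validation step, sketched with Spark and Scala under assumed column names, schema and paths: parse the raw JSON payload against the schema we expect, and quarantine anything that doesn't conform.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("validate-logs").getOrCreate()

// The schema we expect from the source (illustrative).
val logSchema = new StructType()
  .add("timestamp", TimestampType)
  .add("service", StringType)
  .add("level", StringType)
  .add("message", StringType)

// Raw payloads previously landed in the lake as strings.
val raw = spark.read.parquet("s3a://my-bucket/lake/raw_logs/")

// Parse the JSON payload; rows that don't match the schema come back as null.
val parsed = raw.withColumn("log", from_json(col("json"), logSchema))

// Split valid records from invalid ones and quarantine the latter for reprocessing.
val valid   = parsed.filter(col("log.timestamp").isNotNull && col("log.level").isNotNull).select("log.*")
val invalid = parsed.filter(col("log.timestamp").isNull || col("log.level").isNull)

valid.write.mode("append").parquet("s3a://my-bucket/lake/clean_logs/")
invalid.write.mode("append").parquet("s3a://my-bucket/lake/quarantine/")
```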

Source: Databricks

Without validation, this will also lead to problems in the reconciliation process, if there is one. It essentially means we need to add reprocessing logic to the pipeline for whenever the validation step finds an issue with the data. The reprocessing step clears/fixes those corruptions and adds the columns that were missing from that day’s/hour’s partition (a sketch follows). Reprocessing causes additional complications that we have to deal with.
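
A minimal sketch of what such a reprocessing step might look like, again with Spark and Scala; the partition layout, the example fix (filling in a missing column) and all paths are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("reprocess-partition").getOrCreate()

// Re-read only the affected partition (illustrative date and layout).
val badDay = spark.read.parquet("s3a://my-bucket/lake/clean_logs/date=2021-08-27/")

// Example fix: add a column that was missing from that day's data.
val repaired = badDay.withColumn("region", lit("unknown"))

// Write the repaired partition to a staging path; swapping it into place without
// breaking concurrent readers is exactly the kind of complication described above.
repaired.write.mode("overwrite").parquet("s3a://my-bucket/lake/_staging/clean_logs/date=2021-08-27/")
```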

Source: Databricks

Other problems –

1. Consistency guarantees — There is no consistency guarantee because we are just dumping Parquet files.

2. If we try to dump the data into the Data Lake at shorter intervals, so that the data quickly becomes available for analysis/reporting, it eventually leads to the creation of lots of small files. These small-file issues slow down the entire processing system.

-- So, we have to add additional processes that compact the tiny files into larger files daily/hourly (see the compaction sketch after this list).

3. Since we are running a compaction job, our reporting has to be delayed and scheduled for a certain time.

4. Performing Update, Delete and Merge operations on the data stored in the Data Lake is a challenge. We also need to ensure that these operations don’t affect the reporting job, i.e. that the two jobs do not run concurrently, because otherwise we might get inconsistent results in the reports. Hence, the reporting process gets delayed even further.
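
For point 2, a compaction job can be as simple as the following Spark (Scala) sketch: re-read a partition’s many small Parquet files and rewrite them as a few larger ones. The path layout and target file count are illustrative, and, as points 3 and 4 note, the rewrite has to be scheduled so it does not overlap with the reporting or update jobs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Read one partition that has accumulated many small Parquet files (illustrative path).
val partition = spark.read.parquet("s3a://my-bucket/lake/clean_logs/date=2021-08-27/")

// Rewrite it as a handful of larger files into a staging path, then swap it in
// once no other job is reading or writing that partition.
partition
  .coalesce(8) // illustrative target file count
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/lake/_compacted/clean_logs/date=2021-08-27/")
```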

Source: Databricks

Summarizing the challenges of Data Lake –

1. Reliability –

a. Any job failure can leave behind corrupted or partial files, and it is very tedious to recover from them.

b. There is a lack of consistency and isolation guarantees whenever multiple workloads are simultaneously reading and writing to the same location.

c. Lack of schema validation, because we can dump files with different schemas into the same location. This means there is no quality enforcement, and once invalid data is dumped, the downstream applications are left with the complication of cleaning it up.

d. Lack of atomicity, because when we write to the data lake, we do not know whether the entire operation succeeded or not.

2. Performance –

a. When we have a large number of small files, or a few very large files, processing engines like Spark don’t deal with them very well.

That’s all for this blog. In the next part of this blog series, we will find out what Delta Lake is and how it solves these challenges of the Data Lake.

I hope you enjoyed reading this blog!! Share your thoughts in the comment section. You can connect with me over LinkedIn for any queries.

Thank you!! :) Keep Learning!! :)


Md Sarfaraz Hussain

Sarfaraz Hussain is a Big Data fan working as a Data Engineer with 4+ years of experience. His core competencies are Spark, Scala, Kafka, Hudi, etc.