Why Delta Lake? How Change Data Capture (CDC) Benefits from Delta Lake

How Delta Lake overcomes the drawbacks of a traditional data lake

Karthikeyan Siva Baskaran
The Startup

--

Data Lake with Powerful features [Delta File Format]

Introduction

Enterprises have been spending millions of dollars getting data into data lakes with Apache Spark, with the aspiration of performing Machine Learning and building Recommendation engines, Fraud Detection, IoT & Predictive Maintenance applications, etc. But the fact is that the majority of these projects fail because they cannot get reliable data.

Challenges with the traditional data lake

  • Failed production jobs leave the data in a corrupted state, and recovering from it is tedious: you need custom scripts to clean up the partial output and revert the transaction.
  • Lack of schema enforcement lets inconsistent, low-quality data slip into the lake.
  • Lack of read consistency: while a concurrent write is in progress, readers see inconsistent results until the Parquet files are fully updated. When a streaming job performs multiple writes, downstream applications reading this data see inconsistent snapshots because there is no isolation between writes.
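The first challenge comes down to atomicity: a plain Parquet write that dies mid-job leaves visible partial files. A log-based commit protocol avoids this by making data visible only after a commit record lands in an ordered log. Below is a toy sketch of that idea in plain Python; the class and file names are invented for illustration and this is not Delta Lake's actual implementation:

```python
import json
import os
import tempfile

class ToyTransactionLog:
    """Toy illustration of log-based atomic commits (not Delta Lake's code).

    Data files are written first; a write becomes visible only when a
    numbered commit file appears in the log directory. If the job crashes
    before the commit, readers never see the orphaned data file.
    """

    def __init__(self, table_dir):
        self.table_dir = table_dir
        self.log_dir = os.path.join(table_dir, "_txn_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def write(self, filename, rows, fail_before_commit=False):
        # Step 1: write the data file (still invisible to readers).
        with open(os.path.join(self.table_dir, filename), "w") as f:
            json.dump(rows, f)
        if fail_before_commit:
            raise RuntimeError("job crashed after writing data, before commit")
        # Step 2: record the commit in the log; only now is the data visible.
        version = len(os.listdir(self.log_dir))
        with open(os.path.join(self.log_dir, f"{version:020d}.json"), "w") as f:
            json.dump({"add": filename}, f)

    def read(self):
        # Readers reconstruct the table only from committed files.
        rows = []
        for commit in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, commit)) as f:
                entry = json.load(f)
            with open(os.path.join(self.table_dir, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTransactionLog(tempfile.mkdtemp())
table.write("part-0.json", [{"id": 1}])
try:
    table.write("part-1.json", [{"id": 2}], fail_before_commit=True)
except RuntimeError:
    pass  # the crashed write left a data file but no commit
print(table.read())  # only the committed row: [{'id': 1}]
```

No cleanup script is needed for the crashed write: its orphaned data file is simply never referenced by the log, which is the essence of how an ACID transaction log removes the manual-recovery burden described above.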

“Delta Lake overcomes the above challenges”
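One of those remedies is schema enforcement: every incoming batch is validated against the table's schema, and a mismatched write is rejected before anything lands, while deliberate schema evolution is still allowed through an explicit opt-in. Here is a minimal sketch of that check in plain Python; the names are invented for illustration and this is not Delta Lake's implementation:

```python
class SchemaMismatchError(Exception):
    pass

class ToyTable:
    """Toy illustration of schema enforcement (not Delta Lake's code)."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def append(self, rows, merge_schema=False):
        # Validate the whole batch first; nothing lands if any row fails.
        for row in rows:
            for col, value in row.items():
                if col not in self.schema:
                    if not merge_schema:
                        raise SchemaMismatchError(
                            f"column {col!r} not in table schema; "
                            "pass merge_schema=True to evolve the schema"
                        )
                    self.schema[col] = type(value)  # explicit evolution
                elif not isinstance(value, self.schema[col]):
                    raise SchemaMismatchError(
                        f"column {col!r} expects {self.schema[col].__name__}, "
                        f"got {type(value).__name__}"
                    )
        self.rows.extend(rows)  # reached only if every row validated


events = ToyTable({"id": int, "amount": float})
events.append([{"id": 1, "amount": 9.99}])
try:
    events.append([{"id": 2, "amount": "oops"}])  # wrong type: rejected
except SchemaMismatchError:
    pass
print(len(events.rows))  # 1 -- the bad batch never landed
```

In real Delta Lake the analogous opt-in is the `mergeSchema` write option, and a schema violation fails the write before any file is committed, which is what keeps low-quality data out of the table.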
