Data Lake vs Delta Lake
DATA LAKE:
Data Lake is a storage repository that cheaply stores a vast amount of raw data in its native format.
It Consists of current and historical data dumps in various formats including XML, JSON, CSV, Parquet, etc.
Drawbacks in Data Lake:
- No atomicity: there is no "all or nothing" guarantee, so a failed job can leave corrupt, partially written data behind.
- No quality enforcement: inconsistent and unusable data accumulates over time.
- No consistency/isolation: readers and appenders cannot safely operate while an update is in progress.
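The atomicity drawback is easy to demonstrate. The sketch below (plain Python, not Data Lake or Delta code; the file names are made up for illustration) shows how a naive direct write that crashes midway leaves a truncated, corrupt file, while a write-to-temp-then-rename pattern gives an all-or-nothing result:

```python
import csv
import os
import tempfile

DATA = [("id", "value"), ("1", "alpha"), ("2", "beta")]

def naive_write(path, rows, fail_after=None):
    """Write rows directly to the target file; a crash mid-write
    leaves a partially written (corrupt) file behind."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("job crashed mid-write")
            writer.writerow(row)

def atomic_write(path, rows):
    """Write to a temp file, then atomically rename it into place:
    readers see either the old file or the complete new one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp)
        raise

# The naive write crashes after 2 of 3 rows, leaving a truncated file:
try:
    naive_write("naive.csv", DATA, fail_after=2)
except RuntimeError:
    pass
print(sum(1 for _ in open("naive.csv")))   # → 2

# The atomic write publishes all rows or nothing:
atomic_write("atomic.csv", DATA)
print(sum(1 for _ in open("atomic.csv")))  # → 3
```

This rename trick is the same basic idea Delta Lake builds on at a larger scale with its transaction log.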
DELTA LAKE:
Delta Lake lets us incrementally improve data quality until it is ready for consumption. Data flows through Delta Lake from one stage to the next (Bronze -> Silver -> Gold).
- Delta Lake brings full ACID transactions to Apache Spark: a job either completes entirely or leaves no changes behind at all.
- Delta Lake is open source (originally created by Databricks, now hosted by the Linux Foundation). You can store a large amount of data without worrying about locking.
- Delta Lake is deeply integrated with Apache Spark, which means existing Spark jobs (batch or streaming) can be converted to Delta without being rewritten from scratch.
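Delta Lake achieves ACID guarantees through an ordered transaction log of commit files. The sketch below is a simplified conceptual model in plain Python, not Delta's actual `_delta_log` format (the directory name, file naming, and action schema here are illustrative assumptions): each commit is published atomically, and readers reconstruct the table state by replaying only committed files, so an in-flight or failed write is invisible to them:

```python
import json
import os

LOG_DIR = "tx_log"  # stand-in for Delta's _delta_log directory
os.makedirs(LOG_DIR, exist_ok=True)

def commit(version, actions):
    """Publish a commit atomically: the commit file either exists
    in full or not at all, so readers never observe a partial write."""
    tmp = os.path.join(LOG_DIR, f"{version:020d}.json.tmp")
    final = os.path.join(LOG_DIR, f"{version:020d}.json")
    with open(tmp, "w") as f:
        json.dump(actions, f)
    os.replace(tmp, final)  # commit point: atomic rename

def snapshot():
    """Readers replay committed files in order to build table state."""
    files = set()
    for name in sorted(os.listdir(LOG_DIR)):
        if not name.endswith(".json"):
            continue  # skip temp / uncommitted files
        with open(os.path.join(LOG_DIR, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["path"])
                elif action["op"] == "remove":
                    files.discard(action["path"])
    return files

# Version 0 adds a data file; version 1 compacts it into a new file.
commit(0, [{"op": "add", "path": "part-000.parquet"}])
commit(1, [{"op": "add", "path": "part-001.parquet"},
           {"op": "remove", "path": "part-000.parquet"}])
print(snapshot())  # → {'part-001.parquet'}
```

Because each version's commit file is the single source of truth, concurrent readers always see a consistent snapshot while a writer is working, which is exactly the isolation a plain data lake lacks.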
Delta Lake Architecture:
Bronze Tables:
Data may come from various sources and can be dirty. Bronze tables therefore act as a dumping ground for raw data.
Silver Tables:
Consist of intermediate data with some cleanup applied.
They are queryable, which makes debugging easy.
Gold Tables:
Consists of clean data, which is ready for consumption.
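The Bronze -> Silver -> Gold refinement can be sketched end to end. The example below uses plain Python with made-up records (in a real pipeline these stages would be Delta tables transformed by Spark jobs): the bronze stage holds raw, dirty rows; the silver stage trims fields, drops unparseable rows, and deduplicates; the gold stage aggregates into a consumption-ready result:

```python
# Hypothetical raw dump: mixed whitespace, a corrupt row, a duplicate.
bronze = [
    {"user": " alice ", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},  # corrupt record
    {"user": " alice ", "amount": "10.5"},      # duplicate
    {"user": "carol", "amount": "4.0"},
]

def to_silver(rows):
    """Intermediate stage: trim fields, drop unparseable rows, dedupe."""
    seen, silver = set(), []
    for r in rows:
        try:
            rec = (r["user"].strip(), float(r["amount"]))
        except ValueError:
            continue  # quality enforcement: discard corrupt records
        if rec not in seen:
            seen.add(rec)
            silver.append(rec)
    return silver

def to_gold(rows):
    """Consumption-ready stage: aggregate total amount per user."""
    totals = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount
    return totals

silver = to_silver(bronze)
print(to_gold(silver))  # → {'alice': 10.5, 'carol': 4.0}
```

Each stage stays queryable on its own, so a bad gold number can be traced back through silver to the original bronze row.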