Data Lake Vs Delta Lake

Harun Raseed Basheer
Dec 3, 2021

DATA LAKE:

A data lake is a storage repository that cheaply stores vast amounts of raw data in its native format.
It consists of current and historical data dumps in various formats, including XML, JSON, CSV, and Parquet.

Drawbacks of a Data Lake:

  • No atomicity: writes are not all-or-nothing, so a failed job can leave partial or corrupt data behind.
  • No quality enforcement: nothing validates incoming data, so inconsistent and unusable data can accumulate.
  • No consistency or isolation: reads and appends cannot be done reliably while an update is in progress.
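
The atomicity problem can be illustrated with a small sketch. It uses only the Python standard library and a temporary directory standing in for the lake; the function name `naive_dump`, the file name, and the sample records are hypothetical and chosen purely for illustration. A job that fails partway through still leaves a partial file behind:

```python
import json
import os
import tempfile


def naive_dump(records, path):
    """Write records line by line straight into the destination file."""
    with open(path, "w") as f:
        for record in records:
            # json.dumps raises TypeError for values it cannot serialize
            f.write(json.dumps(record) + "\n")


lake_dir = tempfile.mkdtemp()  # stand-in for a data lake folder
path = os.path.join(lake_dir, "events.json")

# The third record is deliberately bad, so the job fails midway.
records = [{"id": 1}, {"id": 2}, {"id": object()}]

try:
    naive_dump(records, path)
except TypeError:
    pass  # the job "crashed", but nothing rolled back the earlier writes

with open(path) as f:
    leftover = f.read().splitlines()

print(len(leftover))  # two of the three records are already in the lake
```

Any reader that picks up `events.json` now sees an incomplete dump with no indication that the write failed.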

DELTA LAKE:

Delta Lake lets us incrementally improve the quality of the data until it is ready for consumption. Data flows like water through Delta Lake from one stage to the next (Bronze -> Silver -> Gold).

  • Delta Lake brings full ACID transactions to Apache Spark, which means a job either completes fully or has no effect at all.
  • Delta Lake was open-sourced by Databricks and is now a Linux Foundation project. You can store a large amount of data without worrying about locking, since concurrent writes are handled with optimistic concurrency control.
  • Delta Lake is deeply powered by Apache Spark, which means existing Spark jobs (batch or streaming) can be converted to use it without rewriting them from scratch.
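
The all-or-nothing behavior described above can be sketched with a toy transaction log. This is a simplified illustration in plain Python, not Delta Lake's actual implementation or API: the class `ToyTable`, the `_log` folder, and the file naming are all assumptions made for the example. The idea it borrows is that readers only trust data files listed in committed log entries, so a write that fails before its log entry is published has no visible effect:

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table: data files become visible only once a log entry names them."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def commit(self, records):
        version = len(os.listdir(self.log_dir))
        data_name = f"part-{version}.json"
        # Step 1: write the data file. Readers ignore it until it is logged.
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(records, f)
        # Step 2: publish the commit by atomically renaming the log entry
        # into place.
        tmp = os.path.join(self.root, f".tmp-{version}")
        with open(tmp, "w") as f:
            json.dump({"add": data_name}, f)
        os.replace(tmp, os.path.join(self.log_dir, f"{version:05d}.json"))

    def read(self):
        rows = []
        # Replay the log in order; only committed files are ever opened.
        for entry in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, entry)) as f:
                data_name = json.load(f)["add"]
            with open(os.path.join(self.root, data_name)) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp())
table.commit([{"id": 1}, {"id": 2}])

try:
    # object() cannot be serialized, so this job fails during step 1.
    table.commit([{"id": 3}, {"id": object()}])
except TypeError:
    pass

rows = table.read()
print(rows)  # only the first, fully committed batch is visible
```

The failed job may leave a stray partial file on disk, but because no log entry points to it, readers never see it.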

Delta Lake Architecture:

Bronze Tables:
Data may come from various sources and can be dirty, so Bronze tables serve as a dumping ground for raw data.

Silver Tables:
Consist of intermediate data with some cleanup applied.
They are queryable for easy debugging.

Gold Tables:
Consist of clean data that is ready for consumption.
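
The Bronze -> Silver -> Gold flow described above can be sketched in a few lines of plain Python. Everything here is assumed for illustration: the field names `user` and `amount`, the cleanup rules, and the choice of a per-user total as the consumption-ready output. A real pipeline would typically do the same steps with Spark jobs over Delta tables:

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived, including dirty ones.
bronze = [
    {"user": "ana", "amount": "10.5"},
    {"user": "bob", "amount": "n/a"},  # unparseable amount
    {"user": None, "amount": "3"},     # missing user
    {"user": "ana", "amount": "4.5"},
]


def to_silver(rows):
    """Apply basic cleanup: drop rows without a user or a numeric amount."""
    silver = []
    for row in rows:
        if not row.get("user"):
            continue
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        silver.append({"user": row["user"], "amount": amount})
    return silver


def to_gold(rows):
    """Aggregate the cleaned rows into a total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)

print(silver)
print(gold)
```

Keeping the intermediate `silver` result around mirrors the point made about Silver tables: when a Gold number looks wrong, the cleaned rows can be queried directly to find out why.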
