Data Lake vs Delta Lake
DATA LAKE:
Data Lake is a storage repository that cheaply stores a vast amount of raw data in its native format.
It Consists of current and historical data dumps in various formats including XML, JSON, CSV, Parquet, etc.
Drawbacks in Data Lake:
- No atomicity: there is no "all or nothing" guarantee, so a failed job can leave corrupt, partially written data behind.
- No quality enforcement: inconsistent and unusable data accumulates over time.
- No consistency/isolation: readers and appenders cannot safely operate while an update is in progress.
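The atomicity drawback is easy to demonstrate. The sketch below (plain Python, not Data Lake or Delta code; the file names are made up for illustration) shows how a naive direct write that crashes midway leaves a truncated, corrupt file, while a write-to-temp-then-rename pattern gives an all-or-nothing result:

```python
import csv
import os
import tempfile

DATA = [("id", "value"), ("1", "alpha"), ("2", "beta")]

def naive_write(path, rows, fail_after=None):
    """Write rows directly to the target file; a crash mid-write
    leaves a partially written (corrupt) file behind."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("job crashed mid-write")
            writer.writerow(row)

def atomic_write(path, rows):
    """Write to a temp file, then atomically rename it into place:
    readers see either the old file or the complete new one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp)
        raise

# The naive write crashes after 2 of 3 rows, leaving a truncated file:
try:
    naive_write("naive.csv", DATA, fail_after=2)
except RuntimeError:
    pass
print(sum(1 for _ in open("naive.csv")))   # → 2

# The atomic write publishes all rows or nothing:
atomic_write("atomic.csv", DATA)
print(sum(1 for _ in open("atomic.csv")))  # → 3
```

This rename trick is the same basic idea Delta Lake builds on at a larger scale with its transaction log.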
DELTA LAKE:
Delta Lake lets us incrementally improve data quality until it is ready for consumption. Data flows through Delta Lake from one stage to the next (Bronze -> Silver -> Gold).
- Delta Lake brings full ACID transactions to Apache Spark: a job either completes entirely or leaves no changes behind at all.
- Delta Lake is open source (originally created by Databricks, now hosted by the Linux Foundation). You can store a large amount of data without worrying about locking.
- Delta Lake is deeply integrated with Apache Spark, which means existing Spark jobs (batch or streaming) can be converted to Delta without being rewritten from scratch.
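Delta Lake achieves ACID guarantees through an ordered transaction log of commit files. The sketch below is a simplified conceptual model in plain Python, not Delta's actual `_delta_log` format (the directory name, file naming, and action schema here are illustrative assumptions): each commit is published atomically, and readers reconstruct the table state by replaying only committed files, so an in-flight or failed write is invisible to them:

```python
import json
import os

LOG_DIR = "tx_log"  # stand-in for Delta's _delta_log directory
os.makedirs(LOG_DIR, exist_ok=True)

def commit(version, actions):
    """Publish a commit atomically: the commit file either exists
    in full or not at all, so readers never observe a partial write."""
    tmp = os.path.join(LOG_DIR, f"{version:020d}.json.tmp")
    final = os.path.join(LOG_DIR, f"{version:020d}.json")
    with open(tmp, "w") as f:
        json.dump(actions, f)
    os.replace(tmp, final)  # commit point: atomic rename

def snapshot():
    """Readers replay committed files in order to build table state."""
    files = set()
    for name in sorted(os.listdir(LOG_DIR)):
        if not name.endswith(".json"):
            continue  # skip temp / uncommitted files
        with open(os.path.join(LOG_DIR, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["path"])
                elif action["op"] == "remove":
                    files.discard(action["path"])
    return files

# Version 0 adds a data file; version 1 compacts it into a new file.
commit(0, [{"op": "add", "path": "part-000.parquet"}])
commit(1, [{"op": "add", "path": "part-001.parquet"},
           {"op": "remove", "path": "part-000.parquet"}])
print(snapshot())  # → {'part-001.parquet'}
```

Because each version's commit file is the single source of truth, concurrent readers always see a consistent snapshot while a writer is working, which is exactly the isolation a plain data lake lacks.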
Delta Lake Architecture:
Bronze Tables:
Data may come from various sources and can be dirty. Bronze tables therefore act as a dumping ground for raw data.
Silver Tables:
Consist of intermediate data with some cleanup applied.
They are queryable, which makes debugging easy.
Gold Tables:
Consists of clean data, which is ready for consumption.
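The Bronze -> Silver -> Gold refinement can be sketched end to end. The example below uses plain Python with made-up records (in a real pipeline these stages would be Delta tables transformed by Spark jobs): the bronze stage holds raw, dirty rows; the silver stage trims fields, drops unparseable rows, and deduplicates; the gold stage aggregates into a consumption-ready result:

```python
# Hypothetical raw dump: mixed whitespace, a corrupt row, a duplicate.
bronze = [
    {"user": " alice ", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},  # corrupt record
    {"user": " alice ", "amount": "10.5"},      # duplicate
    {"user": "carol", "amount": "4.0"},
]

def to_silver(rows):
    """Intermediate stage: trim fields, drop unparseable rows, dedupe."""
    seen, silver = set(), []
    for r in rows:
        try:
            rec = (r["user"].strip(), float(r["amount"]))
        except ValueError:
            continue  # quality enforcement: discard corrupt records
        if rec not in seen:
            seen.add(rec)
            silver.append(rec)
    return silver

def to_gold(rows):
    """Consumption-ready stage: aggregate total amount per user."""
    totals = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount
    return totals

silver = to_silver(bronze)
print(to_gold(silver))  # → {'alice': 10.5, 'carol': 4.0}
```

Each stage stays queryable on its own, so a bad gold number can be traced back through silver to the original bronze row.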