Fundamentals of Data Lake

Published in Beginner’s Guide for Data Science, May 21, 2020

Hi everyone, hope you are all doing well. Today we will try to understand the basics of a data lake: what a data lake is, why it was introduced, and so on.

As we know, companies frequently gather millions of records from a variety of sources, typically in a variety of formats including CSV, JSON and XML. Data analysts and data scientists then extract insights from this data.

The classic approach to querying this data is to load it into a central database called a data warehouse. But this process involves a lot of time-consuming work: designing the schema for the central database, extracting data from the different data sources, transforming the data to fit the warehouse schema, and loading it into the central database.
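To make those four steps concrete, here is a minimal sketch using only the Python standard library. The two inline sources, the `sales` table and its columns are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs from two sources, in two different formats.
CSV_SOURCE = "id,name,amount\n1,alice,10.5\n2,bob,7\n"
JSON_SOURCE = '[{"id": 3, "customer": "carol", "total": "12.25"}]'


def extract():
    """Extract: read each source in its own format."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both sources onto the single warehouse schema."""
    rows = [(int(r["id"]), r["name"], float(r["amount"])) for r in csv_rows]
    rows += [(int(r["id"]), r["customer"], float(r["total"])) for r in json_rows]
    return rows


def load(conn, rows):
    """Load: the schema has to be designed before any row can be stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(*extract()))
    return conn
```

Notice that every new source needs its own mapping in `transform`, and any change to the `sales` schema touches all of them. That is the upfront effort in miniature.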

The classic data warehouse approach works well, but it requires a great deal of upfront effort to design and populate the schema.

An alternative to this approach is the data lake.

What is a Data Lake?

It’s a storage repository that cheaply stores a huge amount of raw data in its native format. It consists of current and historical data dumps in various formats including XML, JSON, CSV, Parquet etc. It may also contain operational relational databases with live transactional data.
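As a rough illustration of "raw data in its native format", the sketch below lands three files in a temporary directory without converting them and only interprets them when a question is asked. This is often called schema on read. The file names, the `amount` field and the helper functions are hypothetical, and a local folder stands in for real lake storage.

```python
import csv
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def dump_raw_files(lake_dir: Path) -> None:
    """Land each source in the lake as-is, without converting it first."""
    (lake_dir / "orders.csv").write_text("id,amount\n1,10.5\n2,7\n")
    (lake_dir / "orders.json").write_text('[{"id": 3, "amount": 12.25}]')
    (lake_dir / "orders.xml").write_text(
        '<orders><order id="4" amount="5"/></orders>'
    )


def read_amounts(lake_dir: Path) -> List[float]:
    """Interpret the raw files only at query time."""
    amounts: List[float] = []
    for path in sorted(lake_dir.iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="") as f:
                amounts += [float(r["amount"]) for r in csv.DictReader(f)]
        elif path.suffix == ".json":
            amounts += [float(r["amount"]) for r in json.loads(path.read_text())]
        elif path.suffix == ".xml":
            root = ET.parse(str(path)).getroot()
            amounts += [float(o.get("amount")) for o in root.iter("order")]
    return amounts


def demo() -> List[float]:
    """Dump the raw files into a temporary 'lake' and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        dump_raw_files(lake)
        return read_amounts(lake)
```

Storing the data took no schema design at all. The cost moves to read time, where each format needs its own parsing logic.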

Disadvantages of a Data Lake:

A data lake is suitable for storing data, but it lacks some critical features. Data lakes do not support ACID transactions (Atomicity, Consistency, Isolation, and Durability) and do not enforce data quality, and their lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming jobs.
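The missing atomicity is easy to reproduce on a small scale. In this sketch a job appends rows straight into a lake file and fails part-way through its second batch. The file name, the rows and the `fail_after` parameter are hypothetical and only there to simulate a crash.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def append_batch(path: Path, rows: List[str],
                 fail_after: Optional[int] = None) -> None:
    """Append rows directly to a lake file; optionally crash part-way through."""
    with path.open("a") as f:
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated job failure")
            f.write(row + "\n")
            f.flush()  # each row becomes visible to readers immediately


def demo_partial_write() -> List[str]:
    """Write one good batch, then one batch that fails after its first row."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "events.csv"
        append_batch(path, ["1,ok", "2,ok"])
        try:
            append_batch(path, ["3,ok", "4,ok", "5,ok"], fail_after=1)
        except RuntimeError:
            pass  # the job is gone, but its partial output stays behind
        return path.read_text().splitlines()
```

After the failure the file contains row 3 but not rows 4 and 5, so any reader sees half of a batch and has no way to tell.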

How to Overcome the Disadvantages of a Data Lake:

To overcome these problems, a data lake should be built with guaranteed consistency. One technology that can resolve this issue is Delta Lake.

What is Delta Lake?

Delta Lake is a technology used to build robust data lakes, and it is one component of a cloud data platform.
It’s a storage solution that is specifically designed to work with Apache Spark.
A data lake built using Delta Lake has ACID properties, which means the data stored inside the data lake has guaranteed consistency.
That is the reason Delta Lake is considered a robust data store, whereas the traditional data lake is not.
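To get a feel for what "guaranteed consistency" buys you, here is a toy sketch of an all-or-nothing write using only the Python standard library. It is not Delta Lake's API or implementation. It only illustrates the atomicity idea, and the function names, file names and `fail_after` parameter are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def commit_batch(table_dir: Path, name: str, rows: List[str],
                 fail_after: Optional[int] = None) -> None:
    """Write a batch to a hidden staging file, then publish it in one step."""
    staging = table_dir / f".{name}.tmp"
    try:
        with staging.open("w") as f:
            for i, row in enumerate(rows):
                if fail_after is not None and i == fail_after:
                    raise RuntimeError("simulated job failure")
                f.write(row + "\n")
        # On POSIX systems os.replace is atomic within one filesystem:
        # the batch becomes visible either completely or not at all.
        os.replace(staging, table_dir / f"{name}.csv")
    finally:
        # Clean up the staging file if the job failed before publishing.
        if staging.exists():
            staging.unlink()


def read_table(table_dir: Path) -> List[str]:
    """Readers only look at published files, never at staging files."""
    rows: List[str] = []
    for path in sorted(table_dir.glob("*.csv")):
        rows += path.read_text().splitlines()
    return rows


def demo_atomic_commit() -> List[str]:
    """Commit one good batch, then attempt one batch that fails midway."""
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        commit_batch(table, "batch-0001", ["1,ok", "2,ok"])
        try:
            commit_batch(table, "batch-0002", ["3,ok", "4,ok", "5,ok"],
                         fail_after=1)
        except RuntimeError:
            pass  # the failed batch was never published
        return read_table(table)
```

Here the failed batch leaves no trace, so readers only ever see complete batches. A real system also has to handle concurrent writers, isolation and durability, which this sketch does not attempt.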

Summary:

This is just a 30,000-foot view of the data lake. I hope it gives you an initial understanding of data lakes.

Stay tuned for upcoming posts. Thanks!