What Is a Data Lake?

Sunny Kusawa
3 min read · Sep 19, 2022


The future of the data world

A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

You can store your data in its raw format, without any preprocessing, formatting, or cleaning.
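As a minimal sketch of what "raw format" means in practice, the snippet below writes incoming bytes into a lake folder exactly as received. The folder layout (`raw/<source>/ingest_date=<date>/`), the `land_raw` helper, and the `webshop` source name are assumptions made for illustration; a real lake would usually sit on object storage rather than a local directory.

```python
from datetime import date
from pathlib import Path
import tempfile


def land_raw(lake_root: Path, source: str, filename: str,
             payload: bytes, ingest_date: date) -> Path:
    """Write payload byte-for-byte under raw/<source>/ingest_date=<date>/.

    Hypothetical helper: the layout is an assumption, not a standard.
    """
    target_dir = lake_root / "raw" / source / f"ingest_date={ingest_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, cleaning, or reformatting
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the lake's storage.
    with tempfile.TemporaryDirectory() as tmp:
        path = land_raw(Path(tmp), "webshop", "orders.json",
                        b'{"id": 1, "amount": "19.99"}', date(2022, 9, 19))
        print(path.relative_to(tmp))
```

Because nothing is transformed on the way in, the original bytes remain available for any later use, including uses nobody planned for at ingestion time.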

Image Source: lakefs

Data can be in any format

Structured data: well-defined, formatted data held in tables, databases, and similar stores.

Semi-structured data: data with some organizing structure but no rigid schema, such as email, CSV files, XML, and JSON. In short, it sits somewhere between structured and unstructured data.

Unstructured data: data that is not modeled or organized in a predefined manner, such as large text files, server logs, website content, social media posts, audio, images, and video.
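To make the three categories concrete, here is a small sketch using only the Python standard library. The `orders` table, the event records, and the log lines are invented sample data, not taken from any real system.

```python
import json
import sqlite3

# Structured: fixed columns in a table, queryable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 19.99)")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
conn.close()

# Semi-structured: JSON records carry their own field names,
# and the fields can differ from one record to the next.
raw_events = ['{"user": "a", "tags": ["x"]}', '{"user": "b"}']
events = [json.loads(line) for line in raw_events]
users = [event["user"] for event in events]

# Unstructured: free text with no predefined model; any meaning
# has to be extracted by the reader, here with a simple substring match.
log_text = "2022-09-19 ERROR disk full\n2022-09-19 INFO backup done"
error_lines = [line for line in log_text.splitlines() if " ERROR " in line]

print(total, users, error_lines)
```

The amount of work needed before the data is usable grows from top to bottom: a query for the table, a parse step for JSON, and custom extraction logic for free text.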

A data lake can store any data a business produces and make it available as needed.

Why Data Lake?

Here are the main reasons to use a data lake:

· Low-cost Storage

· Centralized storage for all kinds of data

· Continuous data engineering is possible

· Analytics and predictive modeling with ML

· Easy access to large volumes of data

· Data Governance

What data can companies store in a Data Lake?

A data lake lets businesses store data from their business applications, websites, IoT devices, social media platforms, and mobile apps, as well as customer data, transaction data, product data, monitoring data, audit data, and any other data the business wants to keep.

Data Warehouse vs Data Lake

A data warehouse is used to store structured data that business professionals can access to build dashboards and visualizations.

Its storage is more expensive, and it is less agile than a data lake, but it is more mature and stable.

A data lake is used to store data of any kind, including unstructured data, in raw formats. This data is mostly used for analytics and predictive modeling by data scientists and ML engineers.

Data lake storage costs less, and a data lake is more agile and reconfigurable than a data warehouse.

The technology is still growing and maturing day by day.

Challenges in a Data Lake

Architecture: identifying data sources, collecting data, and making it available for use through ETL

Security: managing data security

Catalog management: indexing huge amounts of data so it can be accessed easily

Governance: data replication and administration

Access control: a large user base will access data from the data lake, so appropriate access control management is critical

Consistency: data keeps growing and may be updated in real time, so if multiple targets use the same data, it must be consistent for every target
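The catalog and consistency challenges above can be sketched in a few lines. The example below walks a lake directory and records each file's path, size, and SHA-256 checksum, so files can be found by source and two consumers can check that they read identical bytes. The `build_catalog` and `find_by_source` helpers and the `<source>/<file>` layout are assumptions for illustration; production lakes normally rely on a dedicated catalog service.

```python
import hashlib
import tempfile
from pathlib import Path


def build_catalog(lake_root: Path) -> list[dict]:
    """Index every file under lake_root by path, size, and SHA-256 checksum."""
    entries = []
    for path in sorted(lake_root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(lake_root)
            data = path.read_bytes()
            entries.append({
                "path": relative.as_posix(),
                "source": relative.parts[0],  # assumes <source>/<file> layout
                "size_bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return entries


def find_by_source(catalog: list[dict], source: str) -> list[dict]:
    """Return the catalog entries that belong to one data source."""
    return [entry for entry in catalog if entry["source"] == source]


if __name__ == "__main__":
    # A temporary directory with two sample files stands in for the lake.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "webshop").mkdir()
        (root / "webshop" / "orders.json").write_bytes(b'{"id": 1}')
        (root / "sensors").mkdir()
        (root / "sensors" / "temperature.log").write_bytes(b"21.5\n")

        catalog = build_catalog(root)
        for entry in find_by_source(catalog, "sensors"):
            print(entry["path"], entry["size_bytes"], entry["sha256"][:12])
```

Reading every file in full to hash it is only reasonable for a toy example; at real lake scale the checksum would typically be computed once at ingestion and stored alongside the entry.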

A data lake can harness more data, from more sources, in less time. By empowering users to collaborate and analyze data in different ways, it leads to better, faster decision-making for businesses.


References:

https://en.wikipedia.org/wiki/Data_lake

https://www.interviewbit.com/blog/data-lake-architecture/

https://lakefs.io/data-lakes/

https://www.qubole.com/blog/why-do-you-need-a-data-lake

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
