What is a Data Lake?
The future of the data world
A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.
You can store your data in raw format, without doing any data preprocessing, formatting, or cleaning.
Data can be in any format:
Structured data: well-defined, formatted data held in tables, databases, etc.
Semi-structured data: data that has some structure but is organized more loosely, such as email, CSV files, XML, and JSON. In short, it sits somewhere between structured and unstructured data.
Unstructured data: data that is not modeled or organized in a predefined manner, such as large text files, server logs, website content, social media posts, audio, images, and video.
A data lake can store any data a business produces and make it available as needed.
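To make the idea of "store it raw, in any format" concrete, here is a minimal sketch that uses a local folder as a stand-in for the lake. The folder layout (`raw/<source>/<date>/<file>`), the function names `classify` and `ingest`, and the extension groups are all assumptions made for illustration; real data lakes usually sit on object storage and use their own conventions.

```python
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical extension groups that mirror the three categories above.
STRUCTURED = {".sqlite", ".db", ".parquet"}
SEMI_STRUCTURED = {".csv", ".xml", ".json", ".eml"}


def classify(filename: str) -> str:
    """Guess the data category from the file extension alone."""
    suffix = Path(filename).suffix.lower()
    if suffix in STRUCTURED:
        return "structured"
    if suffix in SEMI_STRUCTURED:
        return "semi-structured"
    return "unstructured"  # logs, text, audio, images, video, ...


def ingest(lake_root: Path, source: str, filename: str,
           payload: bytes, day: date) -> Path:
    """Write the payload byte for byte: no parsing, cleaning, or reformatting."""
    target = lake_root / "raw" / source / day.isoformat() / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        day = date(2024, 1, 15)
        samples = [
            ("webapp", "events.json", b'{"user": 1, "action": "login"}'),
            ("iot", "sensor.csv", b"ts,temp\n1,21.5\n"),
            ("servers", "app.log", b"INFO service started\n"),
        ]
        for source, name, payload in samples:
            path = ingest(lake, source, name, payload, day)
            print(classify(name), "->", path.relative_to(lake).as_posix())
```

Nothing is validated or transformed on the way in. Any cleanup happens later, when someone reads the data for a specific purpose.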
Why a Data Lake?
Below are the reasons why we need a data lake:
· Low-cost storage
· Centralized storage for all kinds of data
· Continuous data engineering is possible
· Analytics and predictive modeling with ML
· Easy access to large size data
· Data Governance
What data can companies store in a Data Lake?
It allows businesses to store data from their business applications, websites, IoT devices, social media platforms, and mobile apps, along with customer data, transaction data, product data, monitoring data, audit data, and any other data the business wants to keep.
Data Warehouse vs Data Lake
A data warehouse is used to store structured data that business professionals can access to create dashboards and visualizations.
Its storage is more costly and it is less agile than a data lake, but it is more mature and stable.
A data lake is used to store data, including unstructured data, in its raw format. This data is mostly used for analytics and predictive modeling by data scientists and ML engineers.
Data lake storage costs less, and a data lake is more agile and reconfigurable than a data warehouse.
Data lake technology is still growing and maturing day by day.
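The contrast can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite table plays the role of the warehouse, a plain Python list plays the role of the lake, and the `orders` table, the sample `RECORDS`, and the two loader functions are made up for the example.

```python
import json
import sqlite3

# Hypothetical incoming records: one fits the table, two do not.
RECORDS = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2, "amount": 5.0, "coupon": "SPRING"}',
    "free-text complaint: package arrived late",
]


def load_warehouse(conn: sqlite3.Connection, raw_records: list[str]) -> list[str]:
    """Insert only records that match the fixed schema; return the rejects."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    rejected = []
    for raw in raw_records:
        try:
            row = json.loads(raw)
            if not isinstance(row, dict) or set(row) != {"order_id", "amount"}:
                raise ValueError("record does not match the orders schema")
            conn.execute(
                "INSERT INTO orders VALUES (?, ?)",
                (row["order_id"], row["amount"]),
            )
        except ValueError:  # also covers json.JSONDecodeError
            rejected.append(raw)
    return rejected


def load_lake(raw_records: list[str]) -> list[str]:
    """Keep every record exactly as it arrived."""
    return list(raw_records)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    rejected = load_warehouse(conn, RECORDS)
    stored = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    print("warehouse rows:", stored, "rejected:", len(rejected))
    print("lake records:", len(load_lake(RECORDS)))
```

The warehouse side is tidy and ready for dashboards, but anything that does not fit the table is left out. The lake side keeps everything, and the work of making sense of it is deferred to whoever reads it.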
Challenges in a Data Lake
Data lake architecture: identifying data sources, collecting data, and making it available for use through ETL
Security: managing data security
Catalog management: indexing a huge amount of data so it can be accessed easily
Governance: creating replicas of data and administering it
Access control: a large user base will access data from the data lake, so appropriate access control management is critical
Consistency: data keeps growing and may be updated in real time, so if multiple targets use the same data, it should be consistent for every target
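Two of the challenges above, catalog management and access control, can be pictured with a tiny in-memory sketch. The `CatalogEntry` and `Catalog` classes, the tag and role names, and the paths are all hypothetical and exist only for this example; production lakes rely on dedicated catalog and permission services.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    path: str
    data_format: str
    tags: set[str] = field(default_factory=set)
    allowed_roles: set[str] = field(default_factory=set)


class Catalog:
    """A toy index of datasets with a role check on every lookup."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> list[str]:
        """Return the names of all datasets that carry the given tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def open_path(self, name: str, role: str) -> str:
        """Return the dataset path only if the role is allowed to read it."""
        entry = self._entries[name]
        if role not in entry.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {name!r}")
        return entry.path


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(CatalogEntry(
        "clickstream", "raw/website/2024-01-15/clicks.json", "json",
        tags={"web", "events"}, allowed_roles={"data_scientist", "admin"},
    ))
    catalog.register(CatalogEntry(
        "audit_log", "raw/audit/2024-01-15/audit.log", "text",
        tags={"audit"}, allowed_roles={"admin"},
    ))
    print(catalog.find_by_tag("web"))
    print(catalog.open_path("clickstream", "data_scientist"))
```

Without an index like this, files in the lake are hard to find. Without the role check, everyone who can find a dataset can also read it.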
A data lake can harness more data, from more sources, in less time. Empowering users to collaborate and analyze that data in different ways leads to better, faster decision-making for businesses.