Data Lakes: Inspired by Lakes

Basic understanding of Data Lakes

Deeksha Kukreti
Your Data Universe
2 min read · Mar 19, 2023

Data lakes, inspired by lakes
Image credit: https://www.hurriyetdailynews.com

Recently, I came across a discussion about data lakes, which have become increasingly common. These days, many large companies such as Google, LinkedIn and Facebook use them. In this article I want to share why the concept emerged, what a data lake is, and how it differs from traditional data systems.

The term "data lake" was coined by James Dixon, CTO of Pentaho, in his blog post https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/. There he says that if a data mart is like bottled water (cleansed, packaged and structured for easy consumption), the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source, and the various users of the lake can later come to examine it, dive in or take samples.

A data lake is a storage architecture in which data from heterogeneous sources is ingested and kept in its raw form so that it can later be used for advanced analysis. It provides a large, centralised repository that serves various business use cases and multiple workloads, such as machine learning, AI, data engineering and data warehousing.
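To make the idea of "raw ingestion" concrete, here is a minimal sketch in Python. It is only an illustration: a temporary local folder stands in for the lake's storage, and the `ingest_raw` helper, the `raw/<source>/` layout and the sample records are all assumptions of mine, not a standard.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: str) -> Path:
    """Store a payload exactly as received under raw/<source>/ (hypothetical layout)."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(payload, encoding="utf-8")
    return target


# A temporary directory stands in for the lake's storage (assumption)
lake = Path(tempfile.mkdtemp())

# Three sources, three formats, and no shared schema is imposed on any of them
ingest_raw(lake, "crm", "customers.csv", "id,name\n1,Asha\n2,Ravi\n")
ingest_raw(lake, "web", "clicks.jsonl", json.dumps({"user": 1, "page": "/home"}) + "\n")
ingest_raw(lake, "support", "ticket_42.txt", "Customer cannot log in.")

# List everything that landed in the lake, relative to its root
stored = sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file())
print(stored)
```

Nothing is cleaned or reshaped on the way in. Each consumer decides later how to interpret the files it needs.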

To understand the need for a data lake from a business perspective, it is important to know who its users are. Teams building Business Intelligence and analytics can run scheduled jobs that pull data from the lake. Data scientists can use it for exploratory analysis and machine learning. It also supports self-service: business users can find and use datasets for their own analysis without help from, or waiting on, IT teams. In all of these respects, data lakes differ significantly from traditional databases.

Data lakes can hold unstructured or semi-structured data, whereas traditional databases require structured data with a predefined schema. Data lakes are also built to process big data and large batches, whereas traditional databases are generally suited to smaller, well-defined datasets. Data lakes are more flexible, since they can be scaled up and down according to the data ingested or required; the rigid schema structure of traditional databases makes them harder to maintain and limits them to specific kinds of data. Because of this complexity, the cost of running and maintaining traditional databases tends to be higher than that of data lakes.
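The schema difference is easier to see in a small example. The sketch below is only an illustration under my own assumptions: an in-memory SQLite table plays the traditional database, a plain list of JSON lines plays the lake, and the records and the `read_with_schema` helper are made up for the purpose.

```python
import json
import sqlite3

records = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi", "plan": "premium"},  # carries a field the table does not know
]

# Traditional database: the schema is fixed before any data is written
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

rejected = []
for rec in records:
    columns = ", ".join(rec)
    placeholders = ", ".join("?" for _ in rec)
    try:
        db.execute(
            f"INSERT INTO customers ({columns}) VALUES ({placeholders})",
            tuple(rec.values()),
        )
    except sqlite3.OperationalError:
        # The unknown "plan" column makes the insert fail
        rejected.append(rec["id"])

# Data lake style: store every record as-is, apply a schema only when reading
raw_lines = [json.dumps(rec) for rec in records]


def read_with_schema(lines, fields):
    """Project each raw record onto the fields the reader cares about."""
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]


projected = read_with_schema(raw_lines, ["id", "plan"])
print(rejected)
print(projected)
```

The table refuses the record that does not fit its schema, while the raw store keeps both records and lets each reader choose the fields it wants. The trade-off is that missing fields only show up at read time, here as `None`.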

Overall, the evolution of the data lake is driven by the need for flexibility, scalability and efficiency in managing big data. As it evolves, we can expect to see more focus on data governance and data quality, alongside the development of general and domain-specific advanced analytics.

What are your thoughts? Do share them, and we can connect to talk about it.

Deeksha Kukreti
Your Data Universe

Technology Enthusiast | Data Architect | Scientist | 2 X AWS Certified | Microsoft | Data Wizard