What is a Data Lake?

Nadeem Khan (NK)
Published in LearnWithNK
4 min read · Aug 27, 2023

In this blog, I will introduce the Data Lake. It is intended for anyone trying to step into the Data Analytics industry.

With data growing in so many different formats (CSV, JSON, docs, PDF, Parquet, MP4, PNG and many more), Data Lake is becoming a buzzword in the Data Analytics industry. Before this new era of data, almost everyone in the industry used SQL databases (MySQL, MS SQL Server, PostgreSQL, Oracle, Db2) for almost everything.

AI and Data Lake

In the last year, everyone has heard about ChatGPT, AI, OpenAI and so many other buzzwords. Data Lake is also going to play a crucial role in this AI era. To build an AI solution, we need to store and process a lot of data in different formats. A Data Lake is well suited to this because it can store and process exactly that kind of data at a low cost.

Simplification of Data Lake

The simplest way to understand a Data Lake is to picture a dumb machine sitting in the cloud with storage and a little bit of compute.

For people coming from a tech background, think of a data lake as someone’s hard drive in the cloud or a network folder in the cloud.

For people coming from a non-tech background, think of a Data Lake as a Google Drive or OneDrive.

Every cloud provider has its own offering for the Data Lake: Azure has Azure Data Lake Storage Gen2 (ADLS Gen2) and its predecessor Azure Data Lake Storage Gen1 (ADLS Gen1); Amazon has S3 buckets.

How is Data Stored?

Normal cloud storage is an object store with a flat namespace, while a Data Lake such as ADLS Gen2 uses a hierarchical namespace to store data. Let's try to simplify the flat namespace and the hierarchical namespace.

Hierarchical Namespace

Our laptop's file explorer is a perfect analogy for hierarchical namespace storage. Just as we create folders, subfolders and files in the file explorer, hierarchical namespace storage organizes data in the same way.

As the name suggests, there is a hierarchy in the files because files are sitting in some folder or subfolder.
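To make the folder idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any cloud service is implemented: the file paths and the helper names `build_tree` and `rename_folder` are made up for this example. Folders are modelled as nested dictionaries, so renaming a folder is one operation on the folder itself, no matter how many files sit under it.

```python
from pathlib import PurePosixPath


def build_tree(paths):
    """Turn file paths into a nested dict: folders are dicts, files are None."""
    tree = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        node = tree
        # Walk (or create) every folder on the way down to the file.
        for folder in parts[:-1]:
            node = node.setdefault(folder, {})
        node[parts[-1]] = None
    return tree


def rename_folder(tree, parent_parts, old_name, new_name):
    """Rename one folder; everything inside it moves along with it."""
    node = tree
    for part in parent_parts:
        node = node[part]
    node[new_name] = node.pop(old_name)


# Hypothetical example paths.
paths = [
    "sales/2023/08/orders.csv",
    "sales/2023/08/refunds.csv",
    "logs/app.json",
]

tree = build_tree(paths)
rename_folder(tree, ["sales"], "2023", "fy2023")
print(tree)
```

Both CSV files now sit under `sales/fy2023/08`, even though the code touched only the `2023` folder entry.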

Flat Namespace

On the contrary, a flat namespace has no hierarchy. Files and folders are organized in a flat structure.

To understand this, imagine there is only one folder, and every file sits in that same folder. Since every file in that folder is at the same level, there is no hierarchy. For a hierarchy to exist, different files need to sit at different levels (in a folder or subfolder).
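A flat namespace can be sketched as a single Python dictionary from file name to content. In flat object stores, a "folder" is usually just a shared prefix in the name. The store contents and the helper names `list_prefix` and `rename_prefix` below are assumptions made for illustration, not a real storage API.

```python
def list_prefix(store, prefix):
    """Return every key that starts with the prefix, i.e. a pretend folder listing."""
    return sorted(key for key in store if key.startswith(prefix))


def rename_prefix(store, old_prefix, new_prefix):
    """'Rename a folder' by rewriting every key under it, one by one."""
    moved = 0
    for key in list_prefix(store, old_prefix):
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)
        moved += 1
    return moved


# One flat dictionary: every file lives at the same level.
flat_store = {
    "sales/2023/08/orders.csv": b"order rows",
    "sales/2023/08/refunds.csv": b"refund rows",
    "logs/app.json": b"log lines",
}

moved = rename_prefix(flat_store, "sales/2023/", "sales/fy2023/")
print(moved)
print(list_prefix(flat_store, "sales/"))
```

The slashes in the names are only characters here. Renaming the pretend folder has to rewrite every matching key, so the work grows with the number of files.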

The similarity between Data Lake and Actual Lake

There is a reason why a Data Lake is called a "Data Lake". Nature is an inspiration for so many inventions and discoveries. We often steal concepts from nature and God and build something out of them. This is how "Data Lake" as a concept was born.

  1. Vast Storage Capacity: Just as a natural lake can hold a large volume of water, a Data Lake is designed to store massive amounts of structured and unstructured data. This data can include raw data, semi-structured data, and even large binary files.
  2. Variety of Sources: Similar to how rivers, streams, and rain contribute to the water in a natural lake, a Data Lake can accept data from various sources, such as IoT devices, weblogs, social media, databases, and more.
  3. Flexible Data Structure: Data Lakes allow for a flexible data structure. In a natural lake, water from different sources can mix without being constrained to a specific shape. Similarly, in a Data Lake, data can be ingested without the need for immediate normalization or strict schema.
  4. Analytics Potential: Just as a lake can be a hub of diverse ecosystems due to its resources, a Data Lake can serve as a hub for various analytical processes. Data stored in a Data Lake can be used for multiple purposes, including data exploration, machine learning, business intelligence, and more.
  5. Scalability: Natural lakes can expand based on factors like rainfall and water inflow. Data Lakes are also scalable, allowing organizations to expand their storage and processing capabilities as their data volume grows.
  6. Challenges: Both natural lakes and Data Lakes come with their challenges. Natural lakes can face pollution, ecological imbalances, and water quality issues. Data Lakes can encounter data quality issues, security concerns, and the challenge of managing and organizing data effectively.
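The "Flexible Data Structure" point above is often called schema-on-read: data lands in the lake as-is, and a structure is applied only when someone reads it. Here is a minimal sketch, assuming a tiny in-memory "lake" with made-up file names and records. The names `raw_lake`, `read_temperatures` and `read_customers` are hypothetical.

```python
import csv
import io
import json

# Raw files stored exactly as they arrived, in different formats.
# The second JSON record carries an extra "battery" field; nothing rejects it.
raw_lake = {
    "iot/readings.jsonl": (
        '{"device": "t1", "temp_c": "21.5"}\n'
        '{"device": "t2", "temp_c": "19.0", "battery": 87}\n'
    ),
    "crm/customers.csv": "id,name\n1,Asha\n2,Ben\n",
}


def read_temperatures(lake):
    """Apply a schema at read time: keep device and temp_c, cast temp_c to float."""
    rows = []
    for line in lake["iot/readings.jsonl"].splitlines():
        record = json.loads(line)
        rows.append((record["device"], float(record["temp_c"])))
    return rows


def read_customers(lake):
    """Parse the CSV on read and cast the id column to int."""
    reader = csv.DictReader(io.StringIO(lake["crm/customers.csv"]))
    return [{"id": int(row["id"]), "name": row["name"]} for row in reader]


print(read_temperatures(raw_lake))
print(read_customers(raw_lake))
```

Nothing was validated or normalized when the files were stored. Each reader decides which fields it needs and how to type them.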

Is Data Lake going to replace SQL?

The short answer is NO.

Data Lake is going to complement SQL. SQL will continue to work as the backend for websites and apps to store transactional data. Data Lake will help generate insights from the data sitting in SQL.
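As a rough sketch of this division of work, the example below uses Python's built-in `sqlite3` as a stand-in for the app's SQL database and a temporary local folder as a stand-in for the lake. The `orders` table, the `order_date=...` folder layout, and the helper names `export_orders` and `revenue_per_day` are all assumptions for illustration. A real pipeline would use a cloud storage account and an analytics engine.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path


def export_orders(conn, lake_root):
    """Copy orders out of the SQL database into one CSV file per order date."""
    rows = conn.execute(
        "SELECT id, order_date, amount FROM orders ORDER BY id"
    ).fetchall()
    by_date = defaultdict(list)
    for row in rows:
        by_date[row[1]].append(row)
    for order_date, day_rows in by_date.items():
        folder = Path(lake_root) / "orders" / f"order_date={order_date}"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "part-0.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "order_date", "amount"])
            writer.writerows(day_rows)


def revenue_per_day(lake_root):
    """Read the exported files back and total the amount for each day."""
    totals = {}
    for csv_path in sorted(Path(lake_root).glob("orders/order_date=*/part-0.csv")):
        with open(csv_path, newline="") as f:
            for record in csv.DictReader(f):
                day = record["order_date"]
                totals[day] = totals.get(day, 0.0) + float(record["amount"])
    return totals


# The transactional side: the app keeps writing orders into SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2023-08-26", 10.0), ("2023-08-26", 5.5), ("2023-08-27", 20.0)],
)

# The analytical side: export to the lake, then compute insights from the files.
with tempfile.TemporaryDirectory() as lake_root:
    export_orders(conn, lake_root)
    totals = revenue_per_day(lake_root)

print(totals)
```

The database keeps serving the app, while the analysis runs on the copied files and never touches the live tables.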

I hope this blog helps you understand Data Lake.

Please comment and provide feedback.

Follow me on LinkedIn, GitHub, and Medium. Thanks for reading. Happy learning!

Nadeem Khan(NK)
LearnWithNK

Lead Technical Architect specializing in Data Lakehouse Solutions with Azure Synapse, Python, and Azure tools. Passionate about data optimization and mentoring.