What is Azure Data Lake?

Nadeem Khan (NK)
Published in LearnWithNK
3 min read · Aug 29, 2023

In the last blog, we learned about the data lake as a concept. In this blog, we introduce the data lake offered by Microsoft Azure: Azure Data Lake Storage.

What?

Azure Data Lake is a storage service that can hold structured and semi-structured data (tables, Excel, CSV, JSON) as well as unstructured data (docs, PPT, images). It is built to scale massively for your data analytics solutions at a very low cost.

Azure Data Lake Storage is very similar to Azure Blob Storage. Azure Blob Storage is cloud storage; think of it as something like Google Drive.

However, there is one major difference between them. Let’s find out.

Azure Data Lake Storage is also referred to as ADLS Gen2. There is also ADLS Gen1, but it will be retired on Feb 29, 2024. In this blog, we will only learn about Gen2.

Azure Data Lake Storage vs. Azure Blob Storage

Azure Data Lake Storage is built on top of an Azure Blob Storage account. The major difference between them is that Azure Data Lake Storage uses a hierarchical namespace, whereas Azure Blob Storage uses a flat namespace.

Hierarchical Namespace

To understand this, think of your laptop’s File Explorer. File Explorer follows a hierarchical namespace, which means every file lives inside a real folder, and a folder exists on its own even when it is empty.
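To make the idea concrete, here is a minimal sketch of a hierarchical namespace in plain Python. It is only an illustration, not how ADLS Gen2 is implemented: the Directory class, its methods, and the folder names are all made up for this example. The point it shows is that folders are real objects, so renaming one is a single operation that never touches the files inside it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Directory:
    """A real directory object that owns its children (hypothetical model)."""

    name: str
    dirs: dict[str, Directory] = field(default_factory=dict)
    files: dict[str, bytes] = field(default_factory=dict)

    def mkdir(self, name: str) -> Directory:
        # Create the child directory if it is missing, then return it.
        if name not in self.dirs:
            self.dirs[name] = Directory(name)
        return self.dirs[name]

    def rename_dir(self, old: str, new: str) -> None:
        # One entry changes in the parent; the files inside are not touched.
        child = self.dirs.pop(old)
        child.name = new
        self.dirs[new] = child

    def walk(self, prefix: str = "") -> list[str]:
        # Collect the full path of every file under this directory.
        paths = [f"{prefix}/{f}" for f in sorted(self.files)]
        for d in sorted(self.dirs):
            paths += self.dirs[d].walk(f"{prefix}/{d}")
        return paths


root = Directory("")
raw = root.mkdir("raw")
raw.mkdir("2023").files["sales.csv"] = b"id,amount\n1,10\n"
root.mkdir("staging")  # an empty folder can exist on its own

root.rename_dir("raw", "bronze")
print(root.walk())
```

After the rename, the file is reachable under the new folder name even though only one entry in the parent directory changed.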

Flat Namespace

A flat namespace doesn’t have physical folders; it only has logical folders. Let’s see what that means:

learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

In the above URL, ‘/’ acts as a separator. Now imagine that the last segment of the URL, “data-lake-storage-best-practices”, is a file that contains best practices for data lake storage. Treat everything before it as the folder path that contains that file. Those folders don’t actually exist; they are only logical folders implied by the file’s name.
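The same idea can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Blob Storage API: the blobs dictionary, the file names, and the two helper functions are invented for illustration. A flat store is just a mapping from full names to contents, and a "folder" is whatever you get by splitting those names on ‘/’.

```python
from __future__ import annotations

# A flat store: every key is a full name, there are no folder objects.
blobs = {
    "raw/2023/sales.csv": b"...",
    "raw/2023/returns.csv": b"...",
    "raw/2024/sales.csv": b"...",
    "readme.txt": b"...",
}


def list_folder(store: dict[str, bytes], prefix: str) -> list[str]:
    """Return the immediate children of a logical folder.

    `prefix` is either "" (the top level) or a path ending with '/'.
    """
    children = set()
    for name in store:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            head, sep, _ = rest.partition("/")
            children.add(head + sep)  # a trailing '/' marks a logical folder
    return sorted(children)


def rename_folder(store: dict[str, bytes], old: str, new: str) -> int:
    """Rename a logical folder by rewriting every name that starts with it."""
    moved = [name for name in store if name.startswith(old)]
    for name in moved:
        store[new + name[len(old):]] = store.pop(name)
    return len(moved)


print(list_folder(blobs, "raw/"))               # ['2023/', '2024/']
print(rename_folder(blobs, "raw/", "bronze/"))  # 3
```

Notice that renaming the logical folder had to rewrite three names, one per file. With real folders the same rename would be a single change, which is one practical reason the namespace type matters for analytics workloads that move whole directories around.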

Basic Terminology

In ADLS Gen2, we use the terms below very often, so it is good to familiarize ourselves with them:

  • File System: It is equivalent to the “C: Drive”, “D: Drive”, etc. on your laptop. It is also called a container.
  • Directory: It is equivalent to the folders you create inside those drives.
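Here is a small sketch of how these two terms show up in practice. The first helper builds the abfss:// path that analytics tools commonly use to address a file in ADLS Gen2; the second uses the azure-storage-file-datalake package to create a file system and a directory. The account name, file system name, and paths are placeholders I made up, and the second function is an untested outline that needs the package installed and valid credentials.

```python
def abfss_uri(account: str, file_system: str, *path_parts: str) -> str:
    """Build an abfss:// URI: file system first, then directories and file."""
    cleaned = [p.strip("/") for p in path_parts]
    path = "/".join(p for p in cleaned if p)
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path}"


def create_layout(account_url: str, credential, file_system: str, directory: str):
    """Create a file system and one directory inside it (sketch only)."""
    # Imported here so the helper above works without the SDK installed.
    # Requires: pip install azure-storage-file-datalake
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(account_url=account_url, credential=credential)
    fs_client = service.create_file_system(file_system=file_system)  # the "drive"
    return fs_client.create_directory(directory)  # the "folder"


# Placeholder names, not a real account.
print(abfss_uri("mystorageacct", "sales", "raw/2023", "orders.csv"))
# abfss://sales@mystorageacct.dfs.core.windows.net/raw/2023/orders.csv
```

Reading the URI left to right mirrors the laptop analogy: the file system is the drive, and everything after the host name is the directory path down to the file.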

Different Tiers in ADLS Gen2

ADLS Gen2 offers two types of tiers. A tier is like a subscription plan. In the higher-end plan, you get more benefits than in the lower-end plan. But that doesn’t mean everyone should buy the higher-end plan. For some people, a lower-end plan works best.

Security

Authentication and Authorization:

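  • Azure Active Directory (now Microsoft Entra ID) integration allows you to manage access to your storage accounts using Azure AD identities.
  • Role-based access control (RBAC) lets you assign roles to users, groups, and service principals at a coarse scope such as the subscription, resource group, storage account, or file system (container). Finer-grained permissions on individual directories and files are handled by ACLs.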

Shared Access Signatures (SAS):

  • SAS tokens allow you to generate time-limited URLs with specific permissions for accessing files and folders, without sharing the storage account’s keys.

Data Lake Storage Gen2 ACLs:

  • Access Control Lists (ACLs) provide more granular access control for directories and files within the storage account. It is like assigning roles to users for specific tables in SQL.
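ACLs in ADLS Gen2 follow a POSIX-like style, where each entry has the shape type:id:permissions and the permissions are some combination of r, w, and x. The sketch below parses such a string and checks one permission for a caller who is not the owner. It is a simplified illustration and not Azure's exact evaluation algorithm: the function names are mine, the IDs are placeholders (real entries use directory object IDs), and owner and owning-group handling is left out.

```python
from __future__ import annotations


def parse_acl(acl: str) -> dict[tuple[str, str], str]:
    """Parse 'type:id:perms' entries; an empty id means owner or owning group."""
    entries = {}
    for item in acl.split(","):
        kind, ident, perms = item.strip().split(":")
        entries[(kind, ident)] = perms
    return entries


def is_permitted(acl: str, user_id: str, group_ids: set[str], wanted: str) -> bool:
    """Check one permission letter ('r', 'w' or 'x') for a non-owner caller."""
    entries = parse_acl(acl)
    mask = entries.get(("mask", ""), "rwx")

    def grants(perms: str, use_mask: bool = True) -> bool:
        return wanted in perms and (not use_mask or wanted in mask)

    # 1. A named-user entry, if present, decides the outcome (mask applies).
    if ("user", user_id) in entries:
        return grants(entries[("user", user_id)])
    # 2. Otherwise any matching named-group entry may grant access.
    for (kind, ident), perms in entries.items():
        if kind == "group" and ident in group_ids and grants(perms):
            return True
    # 3. Fall back to the 'other' entry.
    return grants(entries.get(("other", ""), "---"), use_mask=False)


# Placeholder IDs for illustration only.
acl = "user::rwx,user:alice-id:r-x,group::r-x,group:analysts-id:rw-,mask::r-x,other::---"
print(is_permitted(acl, "alice-id", set(), "r"))          # True
print(is_permitted(acl, "bob-id", {"analysts-id"}, "w"))  # False, blocked by the mask
```

The second call shows why the mask matters: the analysts group is listed with write permission, but the mask caps named users and groups at read and execute.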

I hope this blog introduced you to Azure Data Lake.

Please comment and provide feedback.

Follow me on LinkedIn, GitHub, and Medium. Thanks for reading. Happy learning!

Nadeem Khan(NK)
LearnWithNK

Lead Technical Architect specializing in Data Lakehouse Solutions with Azure Synapse, Python, and Azure tools. Passionate about data optimization and mentoring.