Dumbing Down the Azure Databricks Delta Lake Architecture

In the modern data world, the importance of data has increased exponentially, and organizations are spending a vast amount of time and money on new technologies that allow them to quickly process and make sense of their data.

With increasing data volumes, data processing (ETL: Extract, Transform, and Load, or ELT: Extract, Load, and Transform) and analysis (data analytics, data science, and machine learning) are becoming more and more time-consuming, and companies are looking beyond traditional data architectures to meet their on-demand analytical needs.

Delta Lake is one such solution, providing a massive improvement over traditional data architectures. It is an open-source storage layer that brings ACID transactions and scalable metadata handling, and it unifies batch and streaming data for building near-real-time analytics.

Here are a few key advantages of Delta Lake:

• Handles high volumes of data (terabytes or even petabytes) with ease
• Unifies batch and stream processing with ACID (Atomicity, Consistency, Isolation, Durability) transactions
• Allows data writers to delete, update, and upsert records easily, without interfering with scheduled jobs that are reading the same data set
• Records every action performed on a Delta Lake table since its creation, enabling users to query an older snapshot of the data (time travel, or data versioning); a short sketch of these operations follows this list
• Enforces the schema and prevents bad writes
• Supports multiple programming languages and APIs
• Delta Lake is the foundation of a cost-effective, highly scalable lakehouse
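
To make a couple of these points concrete, here is a minimal PySpark sketch of in-place updates, deletes, and time travel on a Delta table. The table path and column names are illustrative assumptions, not taken from this article, and the snippet assumes it runs in a Databricks notebook where spark is already defined.

from delta.tables import DeltaTable

# Point at an existing Delta table (hypothetical path).
sales = DeltaTable.forPath(spark, "/mnt/silver/sales")

# Update and delete rows in place; readers keep seeing a consistent snapshot.
sales.update(condition="Country = 'USA'", set={"Country": "'United States'"})
sales.delete("Quantity <= 0")

# Time travel: read an older version of the same table by version number.
version_zero = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/silver/sales")
version_zero.show(5)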

So let's jump into the architecture and look at each stage in detail (from left to right).

Azure Databricks Delta Lake Architecture

Components of the Delta Lake Architecture

Now let's take a sample data file and see how the data transforms at each stage of the architecture.

We will use this CSV file and see how the data transitions from its raw state (Bronze) → curated state (Silver) → more meaningful state (Gold).

• An Azure Data Factory pipeline copies the '.csv' file from the on-premises file system to Azure Data Lake Storage (Bronze)
• Mount the Azure Data Lake Storage container to the Databricks File System (DBFS); a mounting sketch follows this list
• Azure Databricks reads the .csv file from Bronze, applies the transformations, and writes the result to a Delta Lake table (Silver); see the second sketch below
• From Silver, read the Delta Lake table, apply the aggregations, and write the result to a Delta Lake table (Gold); see the third sketch below
• Now users can connect to either the Silver or the Gold tables for their data analysis (BI reporting or machine learning)
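
For reference, mounting an ADLS Gen2 container to DBFS typically looks like the sketch below. The storage account, container, tenant ID, and secret scope names are placeholders I have made up, and the snippet assumes a service principal that has access to the storage account.

# Hypothetical names: replace <storage-account>, <tenant-id>, and the secret scope.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the Bronze container so notebooks can reference it as /mnt/bronze.
dbutils.fs.mount(
    source="abfss://bronze@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/bronze",
    extra_configs=configs,
)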
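
The Bronze-to-Silver step might look roughly like this: read the raw CSV from the mounted Bronze path, apply a few cleanup transformations, and write the result as a Delta table. The file name and column names (OrderDate, Quantity) are assumptions for illustration, not taken from the sample file.

from pyspark.sql import functions as F

# Read the raw CSV landed in the Bronze zone.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/bronze/sales.csv"))

# Light curation: drop duplicates, fix types, filter obviously bad rows.
curated = (raw
           .dropDuplicates()
           .withColumn("OrderDate", F.to_date("OrderDate"))
           .filter(F.col("Quantity").isNotNull()))

# Write as a Delta table in the Silver zone and register it for SQL access.
curated.write.format("delta").mode("overwrite").save("/mnt/silver/sales")
spark.sql("CREATE TABLE IF NOT EXISTS sales_silver USING DELTA LOCATION '/mnt/silver/sales'")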
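
The Silver-to-Gold step then aggregates the curated table into a smaller, analysis-ready table that BI reports or ML notebooks can read directly. Again, the grouping columns and measures are made-up examples.

from pyspark.sql import functions as F

# Read the curated Silver table.
silver = spark.read.format("delta").load("/mnt/silver/sales")

# Aggregate to the grain the business actually reports on.
gold = (silver
        .groupBy("Country", "OrderDate")
        .agg(F.sum("Quantity").alias("TotalQuantity"),
             F.sum("Revenue").alias("TotalRevenue")))

# Write the aggregated result to the Gold zone.
gold.write.format("delta").mode("overwrite").save("/mnt/gold/sales_summary")

# Downstream consumers (BI or ML) read the Gold table directly.
spark.read.format("delta").load("/mnt/gold/sales_summary").show(5)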

I hope you enjoyed reading this article. We will see Delta Lake in action in my next article; until then, have a happy Thanksgiving and enjoy the holiday season.
