Dumb Down Azure Databricks Delta Lake Architecture
In today's data-driven world, the importance of data has grown enormously, and organizations are spending vast amounts of time and money on new technologies that let them process data quickly and make sense of it.
As data volumes grow, data processing (ETL: Extract, Transform, Load, or ELT: Extract, Load, Transform) and analysis (data analytics, data science, and machine learning) are becoming more and more time-consuming, and companies are looking beyond traditional data architectures to meet their on-demand analytical needs.
Delta Lake is one such solution, and it offers a significant improvement over traditional data architectures. It is an open-source storage layer that provides ACID transactions and scalable metadata handling on top of a data lake. It also unifies batch and streaming data for building near-real-time analytics.
Here are a few key advantages of Delta Lake:
• Handles high volumes of data (terabytes or even petabytes) with ease
• Unifies batch and stream processing with ACID (Atomicity, Consistency, Isolation, Durability) transactions
• Delta allows data writers to delete, update, and upsert records easily without interfering with scheduled jobs that read the same data set
• Delta records every action performed on a Delta Lake table since its creation. This lets users query an older snapshot of the data (time travel or data versioning)
• Delta enforces the table schema and rejects writes that do not match it
• Supports multiple programming languages and APIs
• Delta Lake is the foundation of a cost-effective, highly scalable lakehouse
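To make the ideas of an action log, time travel, and schema enforcement a bit more concrete, here is a minimal plain-Python sketch. It is a toy model only and not how Delta Lake is implemented: the class name `ToyVersionedTable`, its methods, and the sample rows are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Toy in-memory table with an append-only log of commits.

    Illustration only: a hypothetical stand-in, not Delta Lake's real design.
    """
    schema: dict                               # column name -> expected Python type
    log: list = field(default_factory=list)    # one entry per committed version

    def _validate(self, row):
        # Schema enforcement: reject missing, extra, or mistyped columns.
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")

    def append(self, rows):
        # Validate the whole batch first so a bad batch commits nothing.
        for row in rows:
            self._validate(row)
        self.log.append(("append", [dict(r) for r in rows]))
        return len(self.log) - 1  # version number of this commit

    def delete(self, predicate):
        # Record the delete as a new commit; earlier versions stay readable.
        self.log.append(("delete", predicate))
        return len(self.log) - 1

    def snapshot(self, version=None):
        # Replay the log up to and including `version` (latest if None).
        last = len(self.log) - 1 if version is None else version
        rows = []
        for action, payload in self.log[: last + 1]:
            if action == "append":
                rows.extend(dict(r) for r in payload)
            else:  # "delete"
                rows = [r for r in rows if not payload(r)]
        return rows


# Hypothetical sample data
table = ToyVersionedTable(schema={"id": int, "city": str})
v0 = table.append([{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Pune"}])
v1 = table.delete(lambda r: r["id"] == 1)

print(table.snapshot())            # latest version: the deleted row is gone
print(table.snapshot(version=v0))  # "time travel": both original rows are back
```

Because every change is stored as a new log entry instead of overwriting earlier ones, reading an older version is just a matter of replaying fewer entries. A write with the wrong type, such as `{"id": "3", "city": "Lima"}`, raises an error before anything is added to the log.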
So let's jump into the architecture and look at each stage in detail (from left to right).
Components of the Delta Lake Architecture
Now let's take a sample data file and see how the data is transformed at each stage of the architecture.
We will use this CSV file and see how the data moves from its raw state (Bronze) → curated state (Silver) → more meaningful, aggregated state (Gold).
• An Azure Data Factory pipeline copies the ‘.csv’ file from the on-premises file system to Azure Data Lake Storage (Bronze)
• The Azure Data Lake is mounted to the Databricks File System (DBFS)
• Azure Databricks reads the ‘.csv’ file from Bronze, applies the transformations, and writes the result to Delta Lake tables (Silver)
• From Silver, Databricks reads the Delta Lake table, applies the aggregations, and writes the result to Delta Lake tables (Gold)
• Users can now connect to either the Silver or Gold tables for their data analysis (BI reporting or machine learning)
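The shape of the Bronze → Silver → Gold transitions can be sketched in a few lines of plain Python. This is only a stand-in: on Azure Databricks these steps would run as Spark jobs that read from the mounted storage and write to Delta Lake tables. The CSV content, the column names, and the functions `to_silver` and `to_gold` are hypothetical, since the article's actual sample file is not reproduced here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw file content standing in for the Bronze '.csv' file.
BRONZE_CSV = """order_id,region,amount
1,East,100.50
2,west,20
3,East,
4,West,79.50
"""


def to_silver(raw_text):
    """Bronze -> Silver: parse, drop incomplete rows, normalize types and casing."""
    silver = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        if not row["amount"]:
            continue  # drop rows with a missing amount
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().title(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Silver -> Gold: aggregate the total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(BRONZE_CSV)
gold = to_gold(silver)
print(silver)
print(gold)  # {'East': 100.5, 'West': 99.5}
```

Silver keeps one cleaned row per valid input row, which suits detailed analysis or machine learning features, while Gold holds the much smaller aggregated result that a BI report would typically query.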
I hope you enjoyed reading this article. We will see Delta Lake in action in my next article. Until then, have a happy Thanksgiving and enjoy the holiday season.