Azure Implementation of the Data Lakehouse

Anu
Published in data-surge · Oct 4, 2021

In my last post, I addressed what ‘Data Lakehouse’ really means. The theoretical concepts are one thing; implementation is a completely different ballgame. We have successfully completed a few lakehouse implementations for our clients in a very short amount of time, and I wanted to share one of the architectures that serves as a building block for the Lakehouse Architecture. Here are the services we used for each of the components within the Lakehouse Architecture:

  1. Data Ingestion Service: Azure Data Factory
  2. Data Governance Service: Azure Purview
  3. Analytics Services: MLflow
  4. Data Security Service: Azure Active Directory
  5. Data Processing Service: Azure Databricks
  6. Storage Layer: Data Lake
  7. Data Serving Layer: Power BI
Figure: Lakehouse Architecture on Azure

Let’s walk through some of the services within this architecture:

Azure Data Factory

Azure Data Factory is a data orchestration service that helps automate data movement and data transformation. It allows you to create data pipelines to transform, normalize, and move data, and it gives you the ability to schedule your pipeline runs. You can use the Azure portal, the Azure CLI, Python, or other languages to build your pipelines. There are a few features that I find particularly useful:

  1. Schema Validation
  2. Inherent version control
  3. Metadata-driven copy tool
  4. Data monitoring API to view your data pipelines

You can create alerts that report on pipeline success or failure using the monitoring API, or with Logic Apps triggered from the pipeline via webhooks.
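As a rough illustration, here is a minimal sketch of defining and running a copy pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are placeholders, and the two datasets are assumed to already exist in the factory:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    BlobSource,
    BlobSink,
)

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# A single copy activity that moves raw files from the source dataset into the lake.
copy_raw = CopyActivity(
    name="CopyRawToLake",
    inputs=[DatasetReference(reference_name="RawSourceDataset")],
    outputs=[DatasetReference(reference_name="BronzeDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish the pipeline, then trigger a run.
pipeline = PipelineResource(activities=[copy_raw])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "IngestToBronze", pipeline
)
run = adf_client.pipelines.create_run(
    "<resource-group>", "<factory-name>", "IngestToBronze", parameters={}
)
print(run.run_id)
```

In practice you would schedule the pipeline with a trigger rather than starting runs by hand, but the same SDK objects apply.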

Azure Purview

Azure Purview is designed to bring all of your data governance and data management needs into one place. Purview gives you coherent, searchable data lineage across all of your data pipelines. Because Azure Purview builds on Apache Atlas, you get reliable and transparent lineage tracking from various services, for example the copy activity in Azure Data Factory, and you can also create custom lineage. You can find more information on this service here, written by one of our team members.
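As a sketch of what custom lineage can look like, here is a hypothetical example using the open-source pyapacheatlas library to register two data lake assets and a process that links them. The Purview account, service principal values, storage paths, and entity names are all placeholders:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity, AtlasProcess

auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Source and target assets; negative guids are temporary ids resolved on upload.
source = AtlasEntity(
    name="raw_orders",
    typeName="azure_datalake_gen2_path",
    qualified_name="https://<storage-account>.dfs.core.windows.net/raw/orders",
    guid=-1,
)
target = AtlasEntity(
    name="silver_orders",
    typeName="azure_datalake_gen2_path",
    qualified_name="https://<storage-account>.dfs.core.windows.net/silver/orders",
    guid=-2,
)

# A custom lineage "process" that links the source asset to the target asset.
lineage = AtlasProcess(
    name="clean_orders_notebook",
    typeName="Process",
    qualified_name="notebooks/clean_orders",
    inputs=[source],
    outputs=[target],
    guid=-3,
)

client.upload_entities(batch=[source, target, lineage])
```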

Power BI

Power BI is Microsoft’s visualization and reporting service, and it serves as the reporting layer in this architecture. You can find more information on this service here, written by one of our team members.

Azure Databricks

Databricks is the foundational component of this architecture, providing the backbone for all of its data processing needs.

The data storage concepts remain the same. The Bronze tier is the raw ingestion tier: it maintains all of your historical data, with no transformation logic applied. The Silver tier holds cleansed, enriched data with business logic applied. The Gold tier is the final tier, containing business-level aggregates and data of the highest quality.
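To make the tiers concrete, here is a minimal PySpark sketch of that flow on Databricks. The paths and the orders feed are hypothetical, and spark is the session a Databricks notebook already provides:

```python
from pyspark.sql import functions as F

# Bronze: land the raw data as-is, no business logic.
raw = spark.read.json("/mnt/lake/raw/orders/")
raw.write.format("delta").mode("append").save("/mnt/lake/bronze/orders")

# Silver: cleanse and enrich with business rules.
bronze = spark.read.format("delta").load("/mnt/lake/bronze/orders")
silver = (
    bronze.dropDuplicates(["order_id"])
    .filter(F.col("order_status").isNotNull())
    .withColumn("order_date", F.to_date("order_timestamp"))
)
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/orders")

# Gold: business-level aggregates ready for the serving layer (e.g., Power BI).
gold = silver.groupBy("order_date").agg(F.sum("order_amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/daily_revenue")
```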

Databricks brings your data together and makes it easy to use by combining the benefits of an open format with the advantages of a data warehouse. The platform is highly performant, reliable, and easy to use. This all sounds simple conceptually but is hard in practice, and Databricks drives it with the structured transactional layer it provides as part of Delta Lake. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.

Here is what Delta Lake provides to you at a high level:

  1. ACID compliance, with tried-and-tested Spark still under the hood. Transactions make it reliable: no failed jobs leaving partial results, and no issues with concurrent reads and writes.
  2. Atomicity: you get all of a write or none of it. The transaction log is the mechanism through which Delta Lake offers this guarantee. Delta Lake uses the _delta_log metadata directory to maintain an ordered record of every transaction on a Delta table. Metadata is not treated separately from data; metadata is viewed as data. It is stored in JSON and Parquet format, which allows it to be scanned quickly by parallelizing the work across the cluster, and when the metadata is small enough there is a shortcut to cache all of it in memory on a single node of an optimized cluster. This removes the need to keep a catalog and the data in sync. Files that are no longer referenced by the table can be removed with VACUUM. Delta Lake saves a checkpoint file in Parquet format in the same _delta_log subdirectory once a total of 10 commits to the transaction log have been made. In other words, checkpoints offer the Spark reader a sort of “shortcut” to fully reproducing a table’s state, allowing Spark to avoid reprocessing what could be thousands of tiny, inefficient JSON files.
  3. Time Travel: as you write to a Delta table or directory, every operation is automatically versioned. Using a simple SELECT statement, you can go back in time with either a timestamp or a version number in the query (see the sketch after this list).
  4. Multi-dimensional clustering using Z-order: colocate related information in the same set of files by using Z-ordering. This reduces the amount of data Delta Lake has to read, and it is most useful when you expect a column to appear frequently in query predicates and that column has high cardinality. The query is simple: OPTIMIZE events WHERE (condition) ZORDER BY (eventType)
  5. Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data.
  6. Schema evolution: you can automatically evolve the schema by using the mergeSchema option. This allows Delta to continue reading old Parquet files without rewriting them when a table’s schema changes. There are guardrails built in that make it much easier to control data quality.
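Here is a short, hypothetical PySpark sketch of a few of these features (time travel, Z-ordering, schema evolution, and VACUUM). The paths, the events data, and the incoming batch are placeholders:

```python
# A hypothetical Delta path used throughout; on Databricks, `spark` is the
# notebook's pre-defined SparkSession.
events_path = "/mnt/lake/silver/events"

# 3. Time travel: read an earlier snapshot by version number or by timestamp.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(events_path)
as_of = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-10-03")
    .load(events_path)
)

# 4. Z-ordering: colocate files on a high-cardinality predicate column.
spark.sql(f"OPTIMIZE delta.`{events_path}` ZORDER BY (eventType)")

# 6. Schema evolution: append a batch whose schema may add new columns.
new_batch = spark.read.json("/mnt/lake/raw/events/")  # hypothetical incoming data
(
    new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(events_path)
)

# Remove files no longer referenced by the table (default retention is 7 days).
spark.sql(f"VACUUM delta.`{events_path}`")
```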

If you would like us to evaluate and review your current progress with your Data Architecture or you have more questions after reading this blog, please email us at info@datasurge.com or complete the form on our contact us page.

References:

  1. http://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf
  2. https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
  3. https://docs.databricks.com/delta/best-practices.html
  4. https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
  5. https://databricks.com/discover/diving-into-delta-lake-talks/schema-enforcement-evolution
  6. https://www.mssqltips.com/sqlservertip/6862/databricks-delta-schema-evolution/
