Building a Data Lakehouse Using Azure HDInsight

Aitor Murguzur
Microsoft Azure
Apr 14, 2023

As data volumes continue to grow, organizations are facing new challenges in managing and processing their data. The traditional approach of maintaining separate systems for data warehousing and data lakes is no longer sustainable, as it leads to increased complexity, higher costs, and slower time to insights. To address these challenges, many organizations are turning to a new data architecture called the data lakehouse.

A lakehouse is a modern data architecture that combines the best of data warehousing and data lake technologies. It provides a unified platform for storing, managing, and processing data at scale, while also offering the flexibility and agility of a data lake.

Most people are familiar with building a lakehouse using Azure Databricks or Azure Synapse. In this article, I’ll show you how to build a data lakehouse using Azure HDInsight. So, let’s get started!

Ingest Data

The first step in building a lakehouse is to ingest data into it. Data is typically ingested in its raw form, without any transformation or processing. This allows you to store all your data in a single location and apply different processing and analysis techniques as needed.

To ingest data into your lakehouse, you can use a variety of tools and technologies, following different ingestion paths:

  • (1) Events are ingested in real time into a cloud-based gateway, in this case HDInsight Kafka, with data coming from different sources (e.g., operational databases) and in different formats (e.g., JSON, Avro).
  • (2) Azure Data Factory (ADF) is used for batch ingestion. In real-time scenarios it is common for ADF to also handle the cold path, keeping raw data for reprocessing needs.
  • (3) Spark Structured Streaming reads data from Kafka topics and writes it into the Bronze layer with low latency (see the sketch after this list).
  • (4) ADF stores incoming batch data (e.g., incremental loads) in its raw form. This is the as-is/raw/landing data.
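As a rough illustration of step (3), the sketch below shows a Spark Structured Streaming job on HDInsight Spark reading a Kafka topic and appending the raw events to a Bronze Delta table on ADLS Gen2. The broker addresses, topic name, storage account, and paths are placeholders, and the cluster is assumed to have the Spark–Kafka connector and Delta Lake packages available.

```python
from pyspark.sql import SparkSession

# Assumes the HDInsight Spark cluster has the spark-sql-kafka and Delta Lake
# packages on the classpath; all endpoints and paths below are placeholders.
spark = (SparkSession.builder
         .appName("kafka-to-bronze")
         .getOrCreate())

raw_events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "<kafka-broker-1>:9092,<kafka-broker-2>:9092")
              .option("subscribe", "orders")          # hypothetical topic name
              .option("startingOffsets", "earliest")
              .load())

# Keep the payload as-is (raw/landing data); parsing and cleansing happen in Silver.
bronze = raw_events.selectExpr("CAST(key AS STRING) AS key",
                               "CAST(value AS STRING) AS value",
                               "topic", "partition", "offset", "timestamp")

query = (bronze.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation",
                 "abfss://lakehouse@<account>.dfs.core.windows.net/_checkpoints/bronze/orders")
         .start("abfss://lakehouse@<account>.dfs.core.windows.net/bronze/orders"))

query.awaitTermination()
```

Keeping the payload unparsed at this stage matches the raw/landing intent of the Bronze layer; schema is applied downstream.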

Process and Analyze Data

With data ingested into the lakehouse, the next step is to process and analyze it following the medallion architecture (Bronze, Silver, and Gold layers), using a table format such as Delta Lake. Delta Lake provides ACID transactions, schema enforcement, and other table-level features on top of files in the data lake.
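To make those features concrete, here is a minimal sketch (the path and table contents are made up, and Delta Lake is assumed to be configured on the HDInsight Spark cluster) showing that each write is a transaction and that schema enforcement rejects an append whose schema doesn’t match the table:

```python
from pyspark.sql import SparkSession

# Minimal sketch; assumes the Delta Lake package is available on the cluster.
spark = (SparkSession.builder
         .appName("delta-basics")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "abfss://lakehouse@<account>.dfs.core.windows.net/demo/customers"  # placeholder

# Each write is an ACID transaction recorded in the Delta transaction log.
spark.createDataFrame([(1, "Contoso"), (2, "Fabrikam")], ["id", "name"]) \
     .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a DataFrame with an extra column fails with an
# AnalysisException unless schema evolution (mergeSchema) is explicitly enabled.
try:
    spark.createDataFrame([(3, "Tailwind", "EU")], ["id", "name", "region"]) \
         .write.format("delta").mode("append").save(path)
except Exception as e:
    print("Schema enforcement rejected the write:", e)
```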

To process and analyze data in the lakehouse, you could use Apache Spark or Apache Hive on HDInsight. As per the diagram above:

  • (5) The Spark batch job on HDInsight Spark reads data from the Bronze layer and creates filtered, cleansed, and enriched data for the Silver layer, giving a more refined view of the data.
  • (6) With data now cleansed, filtered, and enriched, a Spark batch job on HDInsight Spark reads the Silver data and creates the Gold layer by summarizing data and adding business-level aggregations for ML and AI needs. The Gold layer provides a high degree of quality and data integrity and is ready for enterprise consumption (a sketch of steps 5 and 6 follows this list).
  • (7) The data serving layer is logical, built on the Gold data stored in ADLS Gen2. It exposes Hive Tables, Spark Tables, and ADLS Gen2 storage for consumption.
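For steps (5) and (6), here is a simplified sketch of a Spark batch job that parses the raw Bronze events into a cleansed, de-duplicated Silver table and then derives a Gold-level business aggregate. The schema, column names, and paths are hypothetical and carried over from the ingestion sketch above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("bronze-to-gold").getOrCreate()

base = "abfss://lakehouse@<account>.dfs.core.windows.net"   # placeholder account

# (5) Bronze -> Silver: parse the raw JSON payload, drop bad records, de-duplicate.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_ts", TimestampType()),
])

silver = (spark.read.format("delta").load(f"{base}/bronze/orders")
          .select(F.from_json("value", order_schema).alias("o"))
          .select("o.*")
          .where(F.col("order_id").isNotNull() & (F.col("amount") > 0))
          .dropDuplicates(["order_id"]))

silver.write.format("delta").mode("overwrite").save(f"{base}/silver/orders")

# (6) Silver -> Gold: business-level aggregation ready for BI / ML consumption.
gold = (silver.groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("daily_revenue"),
             F.count("order_id").alias("order_count")))

gold.write.format("delta").mode("overwrite").save(f"{base}/gold/daily_revenue")
```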

Build Data Products and Applications

With data processed and curated in the lakehouse, the final step is to build data products and applications that deliver insights and value. These can include dashboards, reports, and visualizations that provide a high-level view of your data, as well as machine learning models that make predictions and recommendations based on your data.

To build data products and applications, you could also use a variety of tools and technologies such as Power BI or Azure Machine Learning:

  • (8) Data consumption can happen in multiple ways depending on the end consumer’s needs. For example, Power BI can use the Interactive Query connector or the Spark connector for interactive query workloads, and ODBC for non-interactive workloads.
  • (9) Data scientists using Azure Machine Learning, as well as other data consumers, can also consume data directly from Hive/Spark Tables.
  • (10) Additionally, other data consumers such as APIs can read data from ADLS Gen2 directly, with fine-grained ACLs granted to individual consumers.
  • (11) A Spark job streams insight events to HDInsight Kafka for downstream consumption needs; any external system can consume these events from Kafka for business needs (a sketch follows this list).
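For step (11), a rough sketch of a Spark job that publishes curated Gold records back to an HDInsight Kafka topic as JSON insight events for downstream systems; the topic name, broker address, and columns are assumptions carried over from the earlier sketches.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold-to-kafka").getOrCreate()

base = "abfss://lakehouse@<account>.dfs.core.windows.net"   # placeholder account

# Read the curated Gold table and shape each row as a JSON "insight" event.
insights = (spark.read.format("delta").load(f"{base}/gold/daily_revenue")
            .select(F.col("customer_id").cast("string").alias("key"),
                    F.to_json(F.struct("customer_id", "order_date",
                                       "daily_revenue", "order_count")).alias("value")))

# Publish the events to a Kafka topic for external consumers.
(insights.write
 .format("kafka")
 .option("kafka.bootstrap.servers", "<kafka-broker-1>:9092")
 .option("topic", "insight-events")     # hypothetical topic name
 .save())
```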

Monitor and Manage Data Access

To manage monitoring and data access in Azure HDInsight, you can use Log Analytics and Apache Ranger, respectively.

  • (12) Data access to Hive Tables is natively integrated with Apache Ranger (table- and column-level policies). Data access to Spark Tables is managed via ADLS Gen2 ACLs (a sketch follows this list). Microsoft Purview integration is not yet supported for HDInsight.
  • (13) Observability of the HDInsight clusters is enabled via logs, metrics, and traces, integrated with Azure Monitor (Log Analytics).
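For the ADLS Gen2 ACL part of step (12), here is a hedged sketch using the azure-identity and azure-storage-file-datalake Python packages to grant one consumer read access to the Gold zone. The storage account, container, directory, and the consumer’s Azure AD object ID are all placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Sketch of granting a consumer read access to the Gold zone via ADLS Gen2 ACLs.
# Account, container, path, and the consumer's object ID are placeholders.
credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net", credential=credential)

directory = service.get_file_system_client("lakehouse").get_directory_client("gold")

# POSIX-style ACL string: read + execute for one principal, plus a default ACL
# so newly created files under gold/ inherit the same permission.
acl = ("user:<consumer-object-id>:r-x,"
       "default:user:<consumer-object-id>:r-x")
directory.update_access_control_recursive(acl=acl)
```

Hive Table access, by contrast, would be governed through Ranger policies on the cluster, as noted in (12).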

For any suggestions or questions, feel free to reach out :)

Tech choices

The same pattern can be implemented using other tech choices on Azure:

  • Alternatives to HDInsight Kafka include: Event Hubs with the Kafka API, Apache Kafka on AKS, or Apache Kafka on Confluent Cloud (on Azure).
  • Alternatives to HDInsight Spark include: Azure Synapse Spark, Azure Databricks, or Apache Spark on AKS.
  • Alternatives to Azure Data Factory include: Azure Synapse Pipelines, Azure Data Factory Managed Airflow, or Apache Airflow on AKS.
  • Alternatives to HDInsight Hive include: the Spark alternatives above, or Azure Synapse serverless SQL pools.
  • Alternatives to Delta Lake include: Apache Hudi or Apache Iceberg.

Hence, a mapping architecture for Azure Synapse could look like this:
