Million Dollar Question in Microsoft Fabric — When to Use Lakehouse vs Data Warehouse — Part 2

FABRIC SERIES: FUNDAMENTALS 06 — DATA WAREHOUSE VS LAKEHOUSE — ARCHITECTURAL PATTERNS

RK Iyer
Microsoft Azure
6 min read · May 20, 2024

--

❑ Overview

Lakehouse vs Data Warehouse

In the last blog, we covered the fundamental differences between a Data Warehouse and a Lakehouse. In this blog, we will delve deeper into this topic and look at the key decision points, architectural patterns, and considerations that influence choosing a Lakehouse or a Warehouse for your solution.

❑ Data Warehouse Pattern

A data warehouse architectural pattern is a structured framework used to manage and organize large volumes of structured data for analysis and reporting purposes.

◪ Key considerations for employing this architecture pattern:

🗹 All data sources are structured (databases, CSV files, JSON files) or semi-structured (logs, sensors), and future data sources are expected to remain so.

🗹 The development team primarily possesses T-SQL and Data Warehousing skills.

🗹 If the system is being migrated from an existing Data Warehouse like Synapse Dedicated Pool or Redshift, it might be more straightforward to continue using a Data Warehouse to leverage existing investments and minimize disruption.

🗹 Data transformation is performed using stored procedures.

🗹 There is a need to support features like multi-table transactions, which are exclusive to the Data Warehouse.

🗹 Although transactional workloads can use the Data Warehouse, they can also utilize the Lakehouse depending on the specific requirements of the application or ETL/ELT processes.

🗹 Users and developers require T-SQL DDL/DML capabilities.

🗹 The workload demands that users be able to modify data even after it has been normalized or transformed.
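Multi-table transactions mean that related writes across several tables either all commit or all roll back. As a minimal, engine-agnostic sketch of that guarantee, the snippet below uses SQLite as a stand-in for the warehouse engine (the tables and values are hypothetical, not Fabric objects):

```python
import sqlite3

# Conceptual sketch: SQLite stands in for the warehouse engine to show what
# a multi-table transaction guarantees. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_audit (order_id INTEGER, action TEXT);
""")

# One transaction spanning both tables: commits atomically.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 99.50)")
    conn.execute("INSERT INTO order_audit VALUES (1, 'created')")

# A transaction that fails midway rolls back writes to BOTH tables.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (2, 10.00)")
        # Deliberate error: three values for a two-column table.
        conn.execute("INSERT INTO order_audit VALUES (2, 'created', 'extra')")
except sqlite3.Error:
    pass  # the rollback leaves no partial writes behind

rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rows)  # only order 1 exists; the failed transaction left nothing
```

The same atomicity is what the Warehouse provides natively across its tables, which matters for ETL steps that must keep fact and audit tables consistent.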

The figure below represents the key components of a data warehouse architecture:

Data Warehouse Architectural Pattern

Data Sources

  • Incorporates data from other Azure Data Services, various cloud platforms, on-premises sources, and more.

Pipelines & Dataflows

  • Fabric pipelines offer the ability to ingest data into your warehouse with 200+ native connectors.
  • You can use copy activities in pipelines to land the data, or dataflows to land it with in-flight transformations if needed.

Prep & Ingest

  • It is still advisable to maintain a clear separation of data zones within your Data Warehouse design, including a landing zone, a staging zone, and a production zone.

Data Transformations

  • SQL stored procedures orchestrated through Fabric pipelines are recommended.

Store

  • Once the data is transformed, it is stored in tables for consumption. Additionally, views and functions can be created to customize and control the end user’s experience and access, similar to a traditional SQL environment familiar to many users.
  • Using SQL granular permissions, you can specify the access level for each user or security group for each table. Object-level, row-level, and column-level security can be applied, along with dynamic data masking to protect sensitive information.
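Dynamic data masking hides sensitive values from non-privileged readers while leaving the stored data unchanged. In the Fabric Warehouse this is configured declaratively in T-SQL; the snippet below is only a hypothetical Python illustration of the behavior (the mask pattern and function names are assumptions):

```python
# Conceptual illustration of dynamic data masking: privileged users see raw
# values, everyone else sees masked ones. Function names are hypothetical.
def mask_email(value: str) -> str:
    """Mimic a partial email mask: first character + XXX + domain."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}XXX@{domain}"

def read_column(value: str, user_has_unmask: bool) -> str:
    # The engine applies the mask at query time based on the reader's permissions.
    return value if user_has_unmask else mask_email(value)

print(read_column("alice@contoso.com", user_has_unmask=True))   # alice@contoso.com
print(read_column("alice@contoso.com", user_has_unmask=False))  # aXXX@contoso.com
```

The key point is that masking is applied at read time per user, so no separate "masked copy" of the table needs to be maintained.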

Serve & Present

  • Users and teams will consume data through Power BI via reports, datasets, dashboards, apps, etc., sourced from the gold warehouse tables.
  • Users and teams can import data into various reporting tools and even perform cross-database queries between Lakehouses (if any) and Data Warehouses to meet different reporting needs. This approach supports exploration, ad hoc analysis, and the execution of DDL/DML statements by both business users and SQL analysts/developers.
  • Third-party reporting tools or other processes may require functionality that is only available through the Data Warehouse endpoint.
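The cross-database query idea, joining a table in one database with a table in another from a single query, can be sketched with SQLite's `ATTACH`, standing in for Fabric's ability to query across a Lakehouse SQL endpoint and a Warehouse (the table names are hypothetical):

```python
import sqlite3

# Conceptual sketch: SQLite's ATTACH stands in for Fabric's cross-database
# querying between a Warehouse and a Lakehouse SQL endpoint.
conn = sqlite3.connect(":memory:")          # plays the "warehouse"
conn.execute("ATTACH ':memory:' AS lakehouse")  # plays the "lakehouse"

conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE lakehouse.products (product_id INTEGER, name TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 250.0)")
conn.execute("INSERT INTO lakehouse.products VALUES (1, 'Widget')")

# One query joining tables that live in two different databases.
row = conn.execute("""
    SELECT p.name, s.amount
    FROM sales AS s
    JOIN lakehouse.products AS p ON p.product_id = s.product_id
""").fetchone()
print(row)  # ('Widget', 250.0)
```

In Fabric the equivalent is a three-part-name T-SQL query across items in the same workspace, with no data movement between the two stores.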

❑ Lakehouse Pattern

A lakehouse architecture is a hybrid data management framework that combines the capabilities of data lakes and data warehouses to handle both structured and unstructured data. It allows for scalable storage, efficient data processing, and supports advanced analytics and reporting. Typically, a medallion architecture is used in a Lakehouse, segregating the data into Bronze, Silver, and Gold layers.
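The medallion flow can be sketched end to end with plain Python records. In a real Fabric Lakehouse these steps would run in Spark over Delta tables; the record shapes and validation rules below are hypothetical:

```python
# Minimal medallion sketch: Bronze (raw) -> Silver (validated) -> Gold (business-ready).

# Bronze: raw data landed as-is, including duplicates and bad values.
bronze = [
    {"id": "1", "temp_c": "21.5"},
    {"id": "1", "temp_c": "21.5"},   # duplicate record
    {"id": "2", "temp_c": "bad"},    # unparseable reading
    {"id": "3", "temp_c": "19.0"},
]

# Silver: validated, deduplicated, and typed.
seen, silver = set(), []
for rec in bronze:
    try:
        temp = float(rec["temp_c"])
    except ValueError:
        continue  # drop records that fail validation
    if rec["id"] not in seen:
        seen.add(rec["id"])
        silver.append({"id": rec["id"], "temp_c": temp})

# Gold: a business-ready aggregate for reporting.
gold = {"avg_temp_c": round(sum(r["temp_c"] for r in silver) / len(silver), 2)}
print(gold)  # {'avg_temp_c': 20.25}
```

Each layer only ever reads from the one before it, which is what makes the zones easy to reason about and reprocess independently.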

◪ Key Considerations for Choosing Lakehouse Architecture Pattern

🗹 Data sources are mostly unstructured (images, videos) or semi-structured (logs, sensors), with some structured sources as well, and future data sources are expected to remain so.

🗹 The development team primarily has skills in Spark.

🗹 There is no need for advanced Data Warehouse capabilities, such as multi-table transactions and dynamic data masking.

🗹 Users and developers do not require T-SQL DDL/DML capabilities. (Honestly, I have seen many startups who feel cool about Spark and don’t prefer SQL)

Lakehouse Architectural Pattern

▣ Data Sources:

Data can come from various sources, including files, streaming data, on-premises locations, Azure, other cloud providers, shortcuts, or within Fabric itself.

▣ Prepare and Transform:

The main aim of ingestion is to land the data in the Bronze Lakehouse (raw layer) through Fabric Pipelines, Shortcuts or even jobs.

Bronze Lakehouse:

  • Whenever feasible, retain the data in its original format; if not, convert it to Parquet or Delta Parquet using the “Files” section of the Lakehouse.
  • The data in the Bronze layer is then pushed into the next layer for further processing.

Silver Lakehouse:

  • In this layer you start to verify the quality of your data: enriching it by combining other data sources, transforming existing data, cleansing, and so on.
  • The data in the Silver layer is stored as managed Delta tables to take advantage of Delta Parquet and V-Order write-time optimization, which allows extremely fast reads by the different compute engines available in Fabric, be it the SQL engine, the Power BI engine, or the Spark engine.

Gold Lakehouse:

  • In this layer the data is in “business-ready” form, to be consumed by various business users.
  • The data in the Gold layer is also stored as managed Delta tables to take advantage of Direct Lake mode, with the ability to analyze large data volumes in Power BI. The data in the Gold layer is typically denormalized into a star schema data model, with business logic applied, so the data is ready for consumption.
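A star schema centers a fact table on small dimension tables keyed by surrogate keys. The sketch below is a hypothetical Python illustration of the shape such Gold-layer data takes and the kind of slice-and-aggregate query a report would run against it:

```python
# Hypothetical star schema sketch: a fact table keyed to dimension tables.
dim_product = {1: {"name": "Widget", "category": "Gadgets"}}
dim_date = {20240520: {"year": 2024, "month": 5}}

fact_sales = [
    {"product_key": 1, "date_key": 20240520, "amount": 250.0},
    {"product_key": 1, "date_key": 20240520, "amount": 100.0},
]

# A typical analytical query: revenue by product category and year.
totals = {}
for row in fact_sales:
    cat = dim_product[row["product_key"]]["category"]
    year = dim_date[row["date_key"]]["year"]
    totals[(cat, year)] = totals.get((cat, year), 0.0) + row["amount"]

print(totals)  # {('Gadgets', 2024): 350.0}
```

Keeping the facts narrow and the descriptive attributes in dimensions is what makes this layout fast to scan and friendly to Power BI semantic models.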

Direct Lake mode is a semantic model capability for analyzing very large data volumes in Power BI. Direct Lake is based on loading parquet-formatted files directly from a data lake without having to query a lakehouse or warehouse endpoint, and without having to import or duplicate data into a Power BI model.

▣ Consume

Users will consume and analyze the data through the SQL analytics endpoint of the Gold Lakehouse, where business users can query the data using T-SQL or consume it through Power BI via reports, datasets, dashboards, apps, etc.

❑ Combining both the Lakehouse and Data Warehouse into a single architecture

In this architectural model, the combination of Lakehouse & Data Warehouse is used to take advantage of the best of both worlds.

Lakehouse & Data Warehouse

With this architecture, the consumption layer (via Power BI or warehouse endpoint) can fully utilize the Data Warehouse, including DDL/DML support for users and developers at the gold layer, as well as providing a Data Warehouse endpoint for other reporting tools and processes. This approach ensures optimal performance and avoids data duplication across different consumption methods.

❑ Decision Tree

The figure below represents the key points to consider while choosing a Data Warehouse, a Lakehouse, or a combination of both.

Decision tree for Lakehouse vs Warehouse selection

❑ Conclusion

I hope this blog helped you understand the different architectural patterns for the Warehouse and the Lakehouse, along with the key considerations for choosing between them. There is still more to come…Happy Learning!!!

❑ Reference

Fabric decision guide — choose a data store — Microsoft Fabric | Microsoft Learn

Lakehouse vs Data Warehouse vs Real-Time Analytics/KQL Database: Deep Dive into Use Cases, Differences, and Architecture Designs | Microsoft Fabric Blog | Microsoft Fabric

Learn about Direct Lake in Power BI and Microsoft Fabric — Microsoft Fabric | Microsoft Learn

Please Note — All opinions expressed here are my personal views and not of my employer.

Thought of the moment:

“Innovation distinguishes between a leader and a follower.” — Steve Jobs


RK Iyer
Microsoft Azure

Architect@Microsoft, Technology Evangelist, Sports Enthusiast! All opinions here are my personal thoughts and not my employers.