Million Dollar Question in Microsoft Fabric — When to Use Lakehouse vs Data Warehouse — Part 1

FABRIC SERIES: FUNDAMENTALS 05 — DATA WAREHOUSE vs LAKEHOUSE

RK Iyer
Microsoft Azure
8 min read · May 13, 2024


❑ Overview

Lakehouse vs Data Warehouse

One of the most common debates I hear between my son & his friend is: who is better, Messi or Ronaldo? Similarly, one of the most common questions we get from our customers is: should I choose Lakehouse or Data Warehouse in Fabric? Which one is better? In this blog we will try to find the answer by understanding the underlying architecture & going through the differences between Lakehouse & Data Warehouse in Fabric. In Part 2 of this blog we will look at different architectural patterns for using Lakehouse & Warehouse.

❑ Underlying architecture

Let's understand what lies beneath the Lakehouse & the Data Warehouse. Under the hood, both store their data in OneLake in the open-source Delta Parquet format.

What is Delta? Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.

If you want to understand Delta in more depth, please refer to my earlier blog, Delta Lake 101.

Lakehouse and Warehouse Components

❑ Lakehouse

Microsoft Fabric Lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. The default file format is Delta Parquet. You can store data in two physical locations that are provisioned automatically: Files (unmanaged) or Tables (managed Delta tables).

Microsoft Fabric Lakehouse

Lakehouses are composed of two top-level folders:

The Tables folder — represents the managed portion of the Lakehouse. You can store tables in various formats here: not only Delta, but Parquet and CSV as well. The main limitation of the non-Delta formats is that those tables will not be available for querying via the SQL analytics endpoint provided by the Fabric Lakehouse (discussed a few paragraphs below), nor for Direct Lake scenarios with Power BI reports.

The Files folder — represents the unmanaged portion of the Lakehouse. You can store any type of file here: not only CSV and Parquet files, but also images, videos, and so on.

The concepts of managed & unmanaged are the same as the table types in Spark.

In Spark there are two types of tables available:

Managed tables: These are internally managed by Spark, meaning that Spark oversees both the data and metadata. The data is stored in a designated directory, while metadata, containing details about databases, tables, views, partitions, etc., is stored in a metastore. It’s crucial to note that when you delete a managed table, you lose both the data and metadata altogether.

Unmanaged (external) tables: With these tables, Spark only manages the metadata. You’re responsible for specifying the data’s storage location. Unlike managed tables, dropping an unmanaged table removes only the metadata; your data remains intact in the specified external location.
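As a minimal sketch in Spark SQL (the table names and the storage path below are illustrative, not from the original post), the two table types look like this:

```sql
-- Managed table: Spark controls both data and metadata;
-- DROP TABLE removes the underlying files as well.
CREATE TABLE sales_managed (id INT, amount DOUBLE) USING DELTA;

-- Unmanaged (external) table: Spark tracks only the metadata;
-- DROP TABLE leaves the files at the given location untouched.
CREATE TABLE sales_external (id INT, amount DOUBLE)
USING DELTA
LOCATION 'Files/external/sales';
```

Dropping `sales_managed` deletes its data; dropping `sales_external` removes only the metastore entry.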

Similarly, in Microsoft Fabric, the key distinction between managed and unmanaged (external) tables is that if you drop an external table, only the metadata is removed; the “real” data stays untouched in the external location.

■ Lakehouse SQL analytics endpoint

Lakehouse SQL Analytics Endpoint

Microsoft Fabric introduces a SQL-centric approach tailored for Lakehouse delta tables, known as the SQL analytics endpoint. With this endpoint, you can explore delta table data using T-SQL commands, save functions, construct views, and apply SQL security.

When you establish a Lakehouse, it automatically sets up a SQL analytics endpoint linked to the delta table storage within the Lakehouse. Subsequently, any delta table created within the Lakehouse becomes instantly accessible for querying via the SQL analytics endpoint.
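For example, assuming a delta table named dbo.salesorders already exists in the Lakehouse (the table and view names here are illustrative), you could define a view over it through the SQL analytics endpoint:

```sql
-- A T-SQL view over a Lakehouse delta table, created via the
-- SQL analytics endpoint; the underlying data stays read-only.
CREATE VIEW dbo.vw_large_orders
AS
SELECT customer_id, amount
FROM dbo.salesorders
WHERE amount > 1000;
```

The view is a metadata object on the endpoint; it does not copy or modify the delta data.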

The SQL analytics endpoint operates in read-only mode over Lakehouse delta tables, similar to a view. You can read the data, but you cannot alter tables, add columns, or perform updates, inserts, or deletes.

You can set object-level security for accessing data through the SQL analytics endpoint. These security rules apply only when the data is accessed via the SQL analytics endpoint.
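As an illustrative sketch, object-level security on the SQL analytics endpoint uses standard T-SQL permission statements (the user and table names below are hypothetical):

```sql
-- Allow a user to read one table via the SQL analytics endpoint only;
-- Spark access to the underlying delta files is not affected.
GRANT SELECT ON OBJECT::dbo.salesorders TO [analyst@contoso.com];

-- Block the same user from reading another table through the endpoint.
DENY SELECT ON OBJECT::dbo.customers TO [analyst@contoso.com];
```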

Automatic table discovery and registration — This Lakehouse feature provides a fully managed file-to-table experience for data engineers and data scientists. You can drop a file into the managed area of the Lakehouse, and the system automatically validates it against the supported structured formats and registers it into the metastore.

❑ Data warehouse

Synapse Data Warehouse in Fabric — High-Level Overview

In Microsoft Fabric, the Warehouse is a traditional data warehouse that supports transactions, DDL and DML queries. It also supports views and table-valued functions, and provides several data ingestion methods: COPY INTO, Pipelines, Dataflows, and cross-database ingestion options such as CTAS, INSERT…SELECT, or SELECT INTO.
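A quick sketch of two of these ingestion methods (the storage account URL and table names are hypothetical placeholders):

```sql
-- COPY INTO: bulk-load external parquet files into a Warehouse table
COPY INTO dbo.staging_sales
FROM 'https://myaccount.blob.core.windows.net/data/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- CTAS: create a new table directly from the result of a query
CREATE TABLE dbo.sales_summary
AS
SELECT customer_id, SUM(amount) AS total_amount
FROM dbo.staging_sales
GROUP BY customer_id;
```

COPY INTO suits file-based bulk loads, while CTAS and INSERT…SELECT suit transformations over data already reachable from the Warehouse.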

Data in the Warehouse is stored in the Parquet file format and published as Delta Lake logs, enabling ACID transactions and cross-engine interoperability that can be leveraged through other Microsoft Fabric workloads such as Spark, Pipelines, Power BI, and Azure Data Explorer.

Compute and storage are decoupled in a Warehouse which enables customers to scale near instantaneously to meet the demands of their business. The compute is a serverless infrastructure to allow for infinite scaling with dynamic resource allocation, instant scaling up or down with no physical provisioning involved, and physical compute resources are assigned within milliseconds to jobs. This enables multiple compute engines to read from any supported storage source with robust security and full ACID transactional guarantees.

■ Data Processing Architecture

Fabric Data Warehouse — Data Processing Architecture

As data is retrieved or ingested, it leverages a distributed engine built for both small and large-scale data and computational functions.

  1. When a query is submitted, the SQL frontend (FE) performs query optimization to determine the best plan based on the data size and complexity, and hands the plan to the Distributed Query Processing (DQP) engine.
  2. The DQP orchestrates distributed execution of the query by splitting it into smaller queries that are executed on backend compute nodes.
  3. Each small query is called a task and represents a distributed execution unit. It reads file(s) from OneLake, joins results from other tasks, groups, or orders data retrieved from other tasks. For ingestion jobs, it also writes data to the proper destination tables.
  4. When data is processed, results are returned to the SQL frontend for serving back to the user or calling application.

■ Key differences that make the Fabric Synapse Data Warehouse architecture better

Elasticity and resiliency —

Online Scaling

The backend compute capacity gains advantages from a swift provisioning architecture. While there’s no Service Level Agreement (SLA) regarding resource allocation, usually, new nodes are obtained in just a few seconds. As demand for resources grows, newly added workloads utilize the expanded capacity. Scaling occurs seamlessly without interrupting query processing, ensuring uninterrupted operations.

Scheduling and resourcing —

The distributed query processing scheduler operates at a task level. Queries are represented to the scheduler as a directed acyclic graph (DAG) of tasks. A DAG allows for parallelism and concurrency as tasks that do not depend on each other can be executed simultaneously or out of order.

As queries arrive, their tasks are scheduled based on first-in-first-out (FIFO) principles. If there is idle capacity, the scheduler might use a “best fit” approach to optimize concurrency.

When the scheduler identifies resourcing pressure, it invokes a scale operation. Scaling is managed autonomously, and the backend topology grows as concurrency increases. When pressure subsides, the backend topology scales back down and releases resources back to the region.

Ingestion isolation —

Query & Ingestion Isolation

Warehouses in Microsoft Fabric leverage an industry-leading distributed query processing engine, which gives customer workloads a natural isolation boundary. Loading activities get resource isolation from analytical workloads. This improves performance and reliability, as ingestion jobs run on dedicated nodes that are optimized for ETL and do not compete with other queries or applications for resources. There are no knobs to turn: resources are allocated and relinquished autonomously, offering best-in-breed performance with automatic scale and concurrency built in.

Can we access data interchangeably between Lakehouse & Warehouse?

To access Lakehouse from Warehouse -

To query Lakehouse data from Warehouse, you need to have both artifacts in the same Fabric workspace. Fabric workspaces have a cross-database query capability that enables you to query Lakehouse data from your Warehouse using T-SQL. You do not need to use COPY INTO to load data into Warehouse if your data is already in the same workspace. If your data is in a table in Lakehouse and in Delta Lake format, you can query it directly from your Warehouse.

-- Cross-database query: joining a Warehouse table (customers) with a
-- Lakehouse delta table (RKLakehouse.dbo.salesorders) in the same workspace
SELECT c.customer_id, c.name, SUM(o.amount) AS total_amount
FROM customers c
JOIN RKLakehouse.dbo.salesorders o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_amount DESC;

Using cross-database queries, you can load data from Lakehouse tables into Warehouse tables, but you cannot load data from Lakehouse files. If you want to load data using Lakehouse files as the source, you can do so with pipelines and dataflows.
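A cross-database load can be sketched as follows, reusing the RKLakehouse name from the earlier query (the Warehouse table dbo.salesorders_wh is a hypothetical destination assumed to already exist):

```sql
-- Cross-database load: Lakehouse delta table -> Warehouse table
INSERT INTO dbo.salesorders_wh (customer_id, amount)
SELECT customer_id, amount
FROM RKLakehouse.dbo.salesorders;
```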

To access Warehouse from Lakehouse -

We can create a shortcut to a Data Warehouse table in the Lakehouse.

❑ Differences of Lakehouse & Data Warehouse

Differences of Lakehouse & Data Warehouse

The table above shows the key differences between Lakehouse & Data Warehouse. Please also refer to the official documentation: Fabric decision guide — choose a data store — Microsoft Fabric | Microsoft Learn.

❑ Conclusion

I hope this blog helped you understand the difference between the Data Warehouse & the Lakehouse. For architects, some of the most common questions are: Should I use Lakehouse or Data Warehouse? Can I use both? Are there performance benefits to choosing one? We will continue this discussion in the next blog, where we will look at the different architectural patterns customers use with Lakehouse & Warehouse. There is still more to come… Happy Learning!!!

❑ Reference

What is Delta Lake? — Azure Synapse Analytics | Microsoft Learn

Workload management — Microsoft Fabric | Microsoft Learn

Synapse Data Warehouse in Microsoft Fabric — Microsoft Fabric | Microsoft Learn

Data Engineering in Microsoft Fabric documentation — Microsoft Fabric | Microsoft Learn

Fabric decision guide — choose a data store — Microsoft Fabric | Microsoft Learn

Please Note — All opinions expressed here are my personal views and not of my employer.

Thought of the moment-

“You have to dream before your dreams can come true” — APJ Abdul Kalam


Architect@Microsoft, Technology Evangelist, Sports Enthusiast! All opinions here are my personal thoughts and not my employers.