Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric

Patrick Pichler · Published in Creative Data · Oct 23, 2023 · 9 min read

A comparison from a data architecture and workload perspective

Photo by Patrick McManaman on Unsplash

Introduction

It’s no secret that a large part of Microsoft Fabric is a re-branding centered around Azure Synapse Analytics and Power BI. Numerous existing services have been consolidated, offering users a unified and streamlined experience. In addition to this powerful re-branding initiative, however, some exciting enhancements and new features have also been introduced. One of them is the re-architecture of Polaris, the distributed SQL engine powering the Fabric Data Warehouse. It can now fully operate on the Delta format stored in a data lake (OneLake on ADLS Gen2). As such, it’s the first SQL engine to provide fully ACID-compliant, multi-table transactional workloads on top of an open table format, which is truly kind of revolutionary.

But wait, isn’t that exactly what the Data Lakehouse architecture has been promising for years? And what is actually the difference from the Fabric Data Lakehouse offering, which operates on the same storage system and the same open table format? Why are there two offerings in the first place, and which one should you choose?

This is exactly what this article sheds some light on. It also highlights the importance of focusing on data workloads when choosing between the two, not only on personas and skill sets.

Don’t opt for the Data Warehouse just because your data team is only familiar with SQL.

Fabric Data Warehouse vs. Data Lakehouse

To begin with, many available resources place significant emphasis on personas and skill sets when comparing the two offerings, as does the high-level decision guide by Microsoft, which nevertheless provides an excellent first overview. However, while expertise is undeniably important and should factor into your decision-making, it should not overshadow the primary objective: establishing a robust and sustainable data architecture that guarantees data quality and aligns with internal SLAs. For this reason, it’s essential to compare the two offerings from a data workload perspective and analyze the technological shortcomings and benefits of each tool. Doing so, most of it ultimately boils down to the computing engines behind the scenes: Microsoft’s proprietary Polaris engine powers the Data Warehouse, while the open-source Spark engine, possibly with some slight modifications by Microsoft, is leveraged by the Data Lakehouse. Comparing these underlying computing engines gives you a better understanding of how each offering handles data workloads, what scalability and performance characteristics they offer, and how well they align with your specific use cases and requirements.

POLARIS — Data Warehouse

Let’s start with the Polaris distributed engine, which has already been around for several years. In fact, it has been powering the Azure Synapse Analytics serverless SQL pool since the beginning. Its roots might even go back to SQL Server 2016, when Microsoft began offering the data virtualization technology PolyBase. It was the first proprietary Microsoft technology that allowed querying data in files and object stores based on metadata definitions and functions such as OPENROWSET. This was also the beginning of bringing data warehousing and big data workloads closer together. Until now, however, everything in Polaris worked in a schema-on-read fashion, meaning you couldn’t WRITE data.
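To make the schema-on-read style concrete, here is the kind of ad-hoc query the serverless SQL pool has been answering for years; the storage account and path below are made up for illustration:

```sql
-- Query raw Parquet files directly from the lake, no table definition required
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```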

This has now changed: the Polaris engine has been re-architected from the ground up to also support the above-mentioned ACID-compliant transactional workloads. It has been enhanced by integrating the best of the Azure Synapse Analytics dedicated SQL pool engine as well as SQL Server. This allows you to write T-SQL code containing multiple read/write statements across multiple Delta tables without fear of introducing data inconsistencies. Behind the scenes, it makes use of COMMIT and ROLLBACK just like atomic transactions in relational databases: everything is either committed or, in case of an error, rolled back. Furthermore, the tables stored in the Data Warehouse can only be MODIFIED by Polaris, not by any other external tool or process, not even by yourself via the file system. They are completely locked down to prevent data corruption, giving you more of a relational database feeling, even though you can still have a look at the Delta table structure in the file system. It is, however, technically possible to READ this data outside of the Data Warehouse using any other tool that supports the Delta format, such as the Fabric Data Lakehouse.
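As a minimal sketch of such a multi-statement transaction (table names are hypothetical), either both statements below take effect or neither does:

```sql
BEGIN TRANSACTION;

    -- Load the latest staged rows into the fact table ...
    INSERT INTO dbo.FactSales (SaleDate, ProductKey, Amount)
    SELECT SaleDate, ProductKey, Amount
    FROM   stg.Sales;

    -- ... and advance the watermark in a second Delta table.
    UPDATE dbo.LoadControl
    SET    LastLoadedAt = GETDATE()
    WHERE  TableName = 'FactSales';

-- Both changes become visible together; on any error, ROLLBACK TRANSACTION undoes both.
COMMIT TRANSACTION;
```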

All these characteristics make the Data Warehouse a perfect fit for typical batch-oriented data warehousing workloads accommodated in stored procedures, requiring robustness and full ACID transactional guarantees. In this regard, you nowadays also often hear the terms multi-statement transaction or multi-table transaction, also mentioned in the Microsoft documentation. The terminology emerged with the Data Lakehouse architecture and open table formats, but it’s basically nothing more than the “A” in ACID, which stands for Atomicity (read more here and in the next section). As a final point, POLARIS has also proven to outperform the SPARK engine in both reading and writing data, even though both engines apply the V-Order optimization by default. Caution though: at the time of this writing, the Data Warehouse provides limited T-SQL functionality, and certain T-SQL commands can cause warehouse corruption (read more here). Further, table constraints such as primary or foreign keys cannot be enforced; they might result in a better execution plan but don’t guarantee uniqueness or referential integrity (read more here).
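To illustrate that last point, this is roughly how a primary key is declared in the Fabric Data Warehouse (table and column names are made up): it has to be NONCLUSTERED and NOT ENFORCED, so duplicate keys are not rejected.

```sql
-- The constraint is metadata only; it may help the optimizer but enforces nothing
ALTER TABLE dbo.DimCustomer
ADD CONSTRAINT PK_DimCustomer
    PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;

-- The warehouse will happily accept a duplicate key afterwards
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName)
VALUES (1, 'Contoso'), (1, 'Contoso duplicate');
```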

Common scenario

A very typical use case for the Data Warehouse could involve the integration of just one or possibly two data sources from databases or cloud services, at relatively small scale and with infrequent data updates. In such a scenario, Data Factory offers lots of built-in connectors allowing you to ingest data with ease directly into the Data Warehouse’s “Data Staging” layer, potentially into a separate schema. Subsequently, still within this layer, you would normally merge the new data with the existing data and possibly establish historical records while maintaining the integrity of your raw data. This staging area should be a consistent layer from which you can always recreate the entire logic downstream, because requirements will definitely change over time. Moving forward, you would normally enrich and cleanse your data in the same step as you bring it into an analytical data model, the “Data Warehouse” layer. This is supposed to be a consolidated and permanent storage area for data in a rather denormalized format with most of the business logic already applied. Finally, you would create business-ready views in your “Data Marts” layer, incorporating relevant security access measures and maybe even more business logic. In some cases, you might want to materialize the data into this last layer, depending on the requirements.
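A hedged sketch of that staging merge step, assuming a hypothetical landing schema filled by Data Factory and a simple delete-and-insert upsert pattern wrapped in one transaction:

```sql
BEGIN TRANSACTION;

    -- Remove staged customers that arrived again in the latest load ...
    DELETE FROM staging.Customer
    WHERE CustomerId IN (SELECT CustomerId FROM landing.Customer);

    -- ... and (re)insert the newly landed rows with a load timestamp.
    INSERT INTO staging.Customer (CustomerId, CustomerName, LoadedAt)
    SELECT CustomerId, CustomerName, GETDATE()
    FROM   landing.Customer;

COMMIT TRANSACTION;
```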

Data Warehouse in Microsoft Fabric
Image by Author

SPARK — Data Lakehouse

Let’s now switch over to the Spark engine behind the Fabric Data Lakehouse. It’s essentially the same engine as in Azure Synapse Analytics, with the difference that the key performance features are turned on by default, especially with regard to the Delta format and V-Order. It also auto-compacts when writing data, so there is no difference from the Data Warehouse in terms of storage format and structure. However, it is crucial to note that the level of atomicity described above is not entirely guaranteed within the Data Lakehouse when writing data, as one might typically expect. In fact, it’s not supported by any other Spark-backed service available at the time of this writing either, including Databricks. In these scenarios, transactions are managed solely at the table level, and grouping multiple operations into a single transaction is not possible. Consequently, these operations will always be committed serially, completely separated from each other, even if they are written within a single SQL-based notebook cell and may appear to be a single transaction. The only way to combine inserts, updates, and deletes into a single transaction is to use the MERGE statement.
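For illustration, here is what such a MERGE could look like in a SQL notebook cell against the Lakehouse (the Delta table names are made up); the delete, update, and insert against the target table are committed as one atomic operation:

```sql
MERGE INTO customers AS tgt
USING customer_updates AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.is_deleted = true THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET tgt.name = src.name, tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, updated_at)
    VALUES (src.customer_id, src.name, src.updated_at);
```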

All right, now let’s set aside the negative aspects of Spark. Spark’s ecosystem is tremendous and growing every day. You gain access to different programming languages (most importantly Python and R) along with their extensive lists of libraries, and you can seamlessly blend them with SQL. Additionally, Spark offers robust capabilities for data streaming and machine learning, one of the main reasons for the DATA LAKEHOUSE classification as opposed to DATA WAREHOUSE. You can effortlessly transition between batch and streaming modes, and the MLflow integration gives you access to the industry-standard platform for managing the complete machine learning lifecycle (MLOps). Spark has also proven to be production-ready over the last couple of years, especially in very large-scale scenarios.

The Fabric Data Lakehouse further enhances data integration with so-called OneLake shortcuts, complementing Spark’s already vast set of built-in connectors, which brings me to the integration of Spark into the Fabric UI in general. It is very well done, especially the Lakehouse explorer, which lets you navigate and preview your tables as well as files in the same view, streamlining the entire development process. It also scans your files and creates Delta tables automatically for you, known as automatic table discovery and registration. Finally, you can query your Data Lakehouse tables either through the SQL endpoint (exposed with a little delay) or through Direct Lake mode (only supported for Data Lakehouse tables at the time of this writing). The former is backed by POLARIS, giving you improved read performance compared to SPARK as well as advanced access control such as RLS. The latter is Power BI querying the Delta tables (Parquet files) directly from the data lake, giving you even better performance and access to the most current data.

Common scenario

Considering all these factors, you should opt for the Data Lakehouse when your use case requires more than just basic reporting and a once-per-night batch job. It allows you to write the most complex ETL pipelines, switch seamlessly between streaming and batch mode, and tackle even the most advanced analytics use cases, although it may pose some additional challenges in ensuring complete data accuracy. A typical use case involves the Apache Spark Structured Streaming API alongside a dedicated “Landing Zone”. This design achieves a critical separation of concerns, effectively decoupling data integration from data processing. It gives you a unified, scalable, and central entry point for data arriving from diverse sources, including IoT devices, applications, databases, and other external systems. The Structured Streaming API then takes charge of managing incoming data stored in various file formats, including Delta tables. It incorporates an automated checkpointing mechanism with multiple trigger options. This flexibility empowers you to execute the same data pipeline, whether it’s coded in Python, SQL, or another language, in either streaming or batch mode. Think of it as a fault-tolerant data pipeline with built-in Change Data Capture (Delta CDF) that runs in the background or can be triggered at any time. Additionally, you can achieve near real-time analytics by using Direct Lake mode or Power BI’s push datasets, in addition to the SQL endpoint.

Data Lakehouse in Microsoft Fabric
Image by Author

Better together

Well, now let’s just put two and two together. What if you want to combine both, providing near real-time analytics and machine learning functionality while ensuring the highest data accuracy, even involving numerous multi-statement transactions? In this case, your architectural blueprint would resemble the following. The only change here is the final step: transitioning from the SPARK engine to the POLARIS engine, which adds the enhanced robustness and transactional guarantees, while the SPARK engine keeps providing flexibility and scalability in the data integration process. Of course, you might want to extend this architecture and also include Data Factory for some parts of the data ingestion.

Blending Data Warehouse and Data Lakehouse in Microsoft Fabric
Image by Author
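As a rough sketch of that final hand-over (Lakehouse and table names are made up, and both items are assumed to sit in the same workspace), the Warehouse can read the Lakehouse tables via cross-database querying and persist the result under its own transactional guarantees:

```sql
BEGIN TRANSACTION;

    -- Read the cleansed Delta table exposed through the Lakehouse SQL endpoint ...
    INSERT INTO dbo.FactSensorReadings (DeviceId, ReadingTime, ReadingValue)
    SELECT DeviceId, ReadingTime, ReadingValue
    FROM   MyLakehouse.dbo.sensor_readings
    WHERE  ReadingTime >= DATEADD(day, -1, GETDATE());

    -- ... and log the load as part of the same transaction.
    INSERT INTO dbo.LoadAudit (SourceTable, LoadedAt)
    VALUES ('sensor_readings', GETDATE());

COMMIT TRANSACTION;
```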

Conclusion

In conclusion, I hope this article made its point that both engines have their benefits and that it’s important not to get off on the wrong foot when starting with Fabric. Otherwise, a migration or redesign is just a matter of time. Therefore, opting for the Data Warehouse just because your data team is only familiar with SQL might not be the best approach. In fact, you can develop data pipelines in Spark with up to 90% of the code written in SQL. Besides, what’s wrong with learning a little Python (PySpark)? It seems it will accompany us for a very long time in our data and analytics journey. In the end, to maximize the potential of Fabric, a hybrid approach that combines the Data Warehouse and Data Lakehouse is likely the most effective solution. Moreover, we can even hope to see a unification of these two offerings soon, possibly with Polaris serving as a plug-in query execution engine for Spark.
