Microsoft Fabric: Power Query vs T-SQL and PySpark
A simple table-join case comparison
Recently, while developing a Power BI semantic model, I ran into trouble with some simple join operations. Joining tables inside Power BI (in the Power Query editor) is not considered a best practice by some, but I wanted to give it a try while waiting for official back-end support. Of course, the operation raised some problems and I couldn’t move on with the semantic model.
Based on this experience, I decided to run some tests in Microsoft Fabric, performing different joins and comparing the results. In particular, this article presents a comparative analysis of Power Query, T-SQL, and PySpark for this kind of operation.
1. Problem statement
In the scenario I was involved in, I needed to join two tables whose key columns had a high cardinality (with a many-to-many relationship, just to make everything “easier” :D) and whose values were built as a concatenation of three numeric codes separated by “||” (e.g. “61003952||60013161||20240302”). The operation was a simple inner join, nothing more than that. A similar inner join also had to be performed to replicate the Qlik where exists clause. To conduct the tests, I used:
- Power Query (online on Power BI Service using a Dataflow, and on Power BI Desktop).
- A Microsoft Fabric Lakehouse (created to store the source tables to be joined).
- A Microsoft Fabric Warehouse (created to use T-SQL to join the two tables and create a new one).
- A Microsoft Fabric Notebook (to use PySpark to perform the join operation).
For the following tests, I created a Microsoft Fabric workspace on Power BI Service using a Fabric Trial license (with no additional settings).
A. Power Query (on Desktop and online)
After uploading the two tables I needed, I used both the Power Query editor in Power BI Desktop and Power Query Online in a Dataflow to perform the join operation (the performance of a classic Dataflow and a Dataflow Gen2 is similar). A first alarm bell rang when I set up the merge: the match estimation did not show anything, even after a long wait.
After that, during the evaluation process, no data preview was visible, and eventually the Dataflow returned an error message.
In Power BI Desktop, I forced the editor to apply the changes (without waiting for the preview to show up), but I ended up with an endless “Evaluating…”.
That was the problem I encountered the first time, and it forced me to find a different way to perform the join operation. But, before moving on, here is a short list of reasons why I couldn’t perform such an operation in Power BI Desktop:
- Power Query is powerful, but only up to a point: the editor is well suited for simple operations and transformations on relatively small tables. When merges and complex transformations join the party, the classic advice is to “push everything to the back-end”, i.e. let another tool do the heavy lifting and use Power BI only to read the final result, avoiding additional heavy operations.
- String key columns: in general, string columns take up more space than numeric ones. The Power Query editor in Power BI Desktop works on uncompressed data, so it has to handle the full size of those string columns, resulting in a very slow join operation (data compression occurs later, after loading the data, when the VertiPaq storage engine compresses everything and creates dictionaries for fast reading and data retrieval).
- High cardinality: a large number of distinct values means that Power Query needs to scan a larger number of strings and make a larger number of comparisons. Moreover, the many-to-many relationship produces an even larger number of rows, making everything worse. Below is a simple query (run against the SQL analytics endpoint of the Fabric Lakehouse) showing the number of distinct values in the two key columns (~3 million vs. ~4 million):
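The query was along these lines (the Lakehouse, table, and column names below are placeholders, not the real ones):

```sql
-- Count the distinct key values in the two source tables, run against the
-- Lakehouse SQL analytics endpoint. All object names are placeholders.
SELECT
    (SELECT COUNT(DISTINCT KeyColumn) FROM MyLakehouse.dbo.TableA) AS DistinctKeysTableA,
    (SELECT COUNT(DISTINCT KeyColumn) FROM MyLakehouse.dbo.TableB) AS DistinctKeysTableB;
```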
B. CTAS in Fabric Warehouse using T-SQL
At this point, after the complete failure of Power Query, I needed something different and more powerful to perform the join. After I created a Warehouse, I ran the following CTAS using the available Fabric cross-warehouse query to join the two tables contained in the Lakehouse:
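The statement looked roughly like this (schema, table, and column names are placeholders; the Lakehouse tables are referenced through Fabric’s cross-database three-part naming):

```sql
-- CTAS in the Fabric Warehouse: join the two Lakehouse tables and persist
-- the result as a new Warehouse table. All object names are placeholders.
CREATE TABLE dbo.JoinedTable AS
SELECT
    a.*,
    b.SomeOtherColumn
FROM MyLakehouse.dbo.TableA AS a
INNER JOIN MyLakehouse.dbo.TableB AS b
    ON a.KeyColumn = b.KeyColumn;
```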
The join was successful! It took 6 minutes to create a big table of ~48 million rows (remember the many-to-many relationship) without getting stuck in an endless operation.
Just like that, I was very happy, but I wanted to do more and test the Fabric Notebooks.
C. Fabric Notebook & PySpark
I created a Notebook and connected it to the Lakehouse containing the two tables I wanted to join.
Using the interactivity of the Notebook, I first loaded the two tables using Spark and a classic SELECT *, which took ~20 seconds (~30 seconds in total, minus ~9 seconds for the Spark session to start):
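The loading step looked something like this (the table names are placeholders, and the Notebook is assumed to have the Lakehouse attached as its default):

```python
# Load the two source tables from the attached Lakehouse into Spark DataFrames.
# "table_a" and "table_b" are placeholder names; "spark" is the SparkSession
# that a Fabric Notebook provides out of the box.
df_a = spark.sql("SELECT * FROM table_a")
df_b = spark.sql("SELECT * FROM table_b")
```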
After that, I performed the inner join using the common key column. The preview (1,000 rows) took ~40 seconds to show up, unlike Power Query, which was not able to produce one even after several minutes.
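The join and the preview boil down to two lines (the key column name is a placeholder):

```python
# Inner join on the shared key column; "KeyColumn" is a placeholder name.
df_joined = df_a.join(df_b, on="KeyColumn", how="inner")

# display() in a Fabric Notebook renders a limited preview of the DataFrame.
display(df_joined)
```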
Finally, the last operation was to write the results (i.e. the joined table) in the Lakehouse using the write command:
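Something like the following, with a placeholder target table name:

```python
# Persist the joined DataFrame as a new Delta table in the Lakehouse;
# "joined_table" is a placeholder name.
df_joined.write.mode("overwrite").format("delta").saveAsTable("joined_table")
```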
It took four and a half minutes to store the new table! Combining everything (i.e. table loading, merge, and storage), the total time was about five and a half minutes, slightly better than the CTAS operation performed inside the Warehouse.
A comparable timing was obtained when I ran a more explicit query (i.e. without loading the two tables first):
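In that case the whole operation can be expressed as a single spark.sql() call (all names are placeholders):

```python
# Join the two Lakehouse tables in one explicit SQL statement and write the
# result, without materializing the source DataFrames first. Names are placeholders.
df_joined = spark.sql("""
    SELECT a.*, b.SomeOtherColumn
    FROM table_a AS a
    INNER JOIN table_b AS b
        ON a.KeyColumn = b.KeyColumn
""")
df_joined.write.mode("overwrite").format("delta").saveAsTable("joined_table")
```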
So, in the end, where Power Query miserably failed, T-SQL and PySpark were absolutely successful, making them valuable tools for back-end operations such as the one described in this article.
Now, after these first tests, I wanted to conduct an additional experiment related to the translation of the Qlik where exists clause (for more details, see exists function). This operation corresponds to an inner join and retrieves only the values already present in another column of another loaded table. For the test, I still had string-type key columns but a lower cardinality (i.e. 350k distinct values in the left table, and 3k in the right table).
Despite its simplicity, I witnessed another Power Query failure...
What about T-SQL and PySpark?
D. Qlik “where exists” using T-SQL
The first test was inside the Fabric Warehouse. I ran the following T-SQL script, which was incredibly fast:
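One way to express it is a CTAS with a correlated EXISTS, keeping only the rows of Table A whose key is also present in Table B (all object names are placeholders):

```sql
-- Replicate the Qlik "where exists": keep the rows of Table A whose key
-- appears in Table B. All object names are placeholders.
CREATE TABLE dbo.FilteredTableA AS
SELECT a.*
FROM MyLakehouse.dbo.TableA AS a
WHERE EXISTS (
    SELECT 1
    FROM MyLakehouse.dbo.TableB AS b
    WHERE b.KeyColumn = a.KeyColumn
);
```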
Just 3 seconds to do something that Power Query was not able to do (or even show in the preview), producing a small table of ~22k rows (values present in both Table A and Table B).
E. Qlik “where exists” using PySpark
A plain inner join would produce a bigger table, because the operation retrieves every matching row from the right table for each key in the left table. To reproduce the where exists operation, I needed to apply dropDuplicates() to the joined table, which I knew would take a few extra seconds. The final result was achieved in almost 20 seconds:
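A sketch of that approach (table and column names are placeholders):

```python
# Keep only the rows of the left table whose key exists in the right table:
# inner join on the key, then dropDuplicates() to remove the row multiplication
# caused by multiple matches on the right side. Names are placeholders.
df_left = spark.sql("SELECT * FROM left_table")
df_right_keys = spark.sql("SELECT DISTINCT KeyColumn FROM right_table")

df_exists = (
    df_left.join(df_right_keys, on="KeyColumn", how="inner")
           .dropDuplicates()
)

df_exists.write.mode("overwrite").format("delta").saveAsTable("filtered_left_table")
```

(As a side note, PySpark's left semi join, i.e. how="leftsemi", would express the same filter directly; here I stick to the join + dropDuplicates() approach described above.)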
However, using an explicit SQL query (without loading the tables) led to a faster response (less than 10 seconds):
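Roughly like this (names are again placeholders):

```python
# Same "where exists" filter expressed as a single SQL statement; Spark SQL
# supports correlated EXISTS subqueries. Names are placeholders.
df_exists = spark.sql("""
    SELECT a.*
    FROM left_table AS a
    WHERE EXISTS (
        SELECT 1
        FROM right_table AS b
        WHERE b.KeyColumn = a.KeyColumn
    )
""")
df_exists.write.mode("overwrite").format("delta").saveAsTable("filtered_left_table")
```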
Conclusions
- Power Query is good for simple transformations (e.g. renaming columns, creating a new column with simple logic, replacing values), both in Power BI Desktop and in a Dataflow. Moreover, some of those operations can lead to query folding, “pushing” everything to the source to optimize query execution (see Query Folding).
- Power Query can’t handle complex transformation operations, such as joining tables (such operations are not “pushed” to the source by the query folding process), especially when dealing with very large tables and high-cardinality columns (on the order of millions of distinct values). For this reason, the best practice is to let the back-end handle those operations and leave only the final step (i.e. reading the table) to Power BI.
- Microsoft Fabric can be the solution to many of these problems: the same operations are handled perfectly using T-SQL (in a Warehouse) and/or PySpark (in a Notebook), joining columns with millions of values in a matter of minutes.
- Fabric Warehouses and Notebooks can be the ace up your sleeve for hosting and managing a back-end environment with ease.