Best Practices For Using Power BI In DirectQuery Mode With Snowflake

Power BI has two fundamental ways of working with data: Import mode and DirectQuery mode. In Import mode the data for your reports is stored inside Power BI’s own native database engine; in DirectQuery mode no data is stored inside Power BI, and every time you run a report Power BI queries your data source to get the data it needs. While Import mode will give you the best performance in many scenarios, we also know some customers — including many who use Power BI with Snowflake — need to use DirectQuery. Examples of when you should consider using DirectQuery include when you are working with datasets that are too large to fit in memory, when you are seeking to leverage entitlements defined within Snowflake, or when you need to see near real-time data in your reports. This blog post summarizes the best practices for using Power BI in DirectQuery mode with Snowflake.

Update December 2023:

Please also see Keith Smith’s recent article: Snowflake and Power BI: Best Practices and Recent Improvements

The first thing to point out is that there is already extensive documentation for using Power BI in DirectQuery mode here:

https://docs.microsoft.com/en-us/power-bi/guidance/directquery-model-guidance

Almost all of this advice is relevant to using Power BI in DirectQuery mode with Snowflake, and we strongly recommend that you read it before proceeding. However, there are several topics particularly relevant to Snowflake users that need to be added or expanded on.

How you model your data is one of the most important factors determining how well Power BI works, and unlike many other BI tools Power BI is optimized for working with dimensional models and star schemas rather than flat, wide tables. Again, there is already detailed documentation on data modeling for Power BI here:

https://docs.microsoft.com/en-us/power-bi/guidance/star-schema

It may be tempting to skip the data modeling part of your project and use Power BI with your data however it is currently modeled, but while that may work in the short term you will inevitably run into problems. Dimensional modeling and star schemas have disadvantages of their own: for example, joins between large dimension tables and fact tables may be slow. Nonetheless, our extensive experience with Power BI projects shows that a star schema is the best overall approach and that its advantages outweigh these disadvantages. You should also avoid using Power BI features such as the Power Query Editor, or creating calculated columns, to work around deficiencies in the way your data is modeled: all of your data modeling work should be done in Snowflake before you do any work in Power BI.

Once you have modeled your data properly there are several things you can do inside Snowflake to optimize the performance of the SQL queries that Power BI generates:

● You should have a dedicated warehouse for Power BI queries and size it appropriately. Auto-scale (multi-cluster warehouses) can help with variable concurrency; setting a maximum cluster count that is too low can affect query performance. A warehouse sketch follows this list.

● Materialized views can be used to improve query performance. Remember that a materialized view does not need to be referenced in the SQL queries that Power BI generates in order for the Snowflake query optimizer to use it; a sketch follows this list.

● The Snowflake Search Optimization Service may also improve performance when working with high-cardinality dimension columns; enabling it takes a single statement, shown after this list.

● Setting a clustering key on large fact tables is important. Consider including the columns that are used in joins or in the GROUP BY or ORDER BY clauses of SQL queries. Automatic clustering makes finding the optimal configuration much easier; see the table DDL sketch after this list.

● Always declare columns as NOT NULL if they will never contain null values. If Power BI can see that a column is not nullable, it will simplify the SQL that it generates in some cases (this is also shown in the table DDL sketch below).

● Try to ensure that referential integrity exists between the columns used to join dimension tables and fact tables. If it does, you can set the “Assume referential integrity” property on your relationships in Power BI (as detailed in the best practices documentation linked to above), which in turn means Power BI will generate SQL queries with inner joins rather than left outer joins; the informational foreign keys in the table DDL sketch below are one way to document this.

● Since Power BI will frequently run the same query against Snowflake, it is important to ensure that result caching is taking place; a quick check is shown after this list.
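
To make the warehouse point concrete, here is a minimal sketch of a dedicated warehouse definition. The name, size and cluster counts are illustrative assumptions, not recommendations, and should be tuned to your own workload:

```sql
-- Hypothetical dedicated warehouse for Power BI queries; all values are
-- illustrative. Multi-cluster (auto-scale) settings require Enterprise Edition.
CREATE WAREHOUSE IF NOT EXISTS POWERBI_WH
  WAREHOUSE_SIZE    = 'MEDIUM'   -- sized for single-query performance
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4          -- too low a maximum can queue concurrent queries
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300        -- suspend after 5 minutes of inactivity
  AUTO_RESUME       = TRUE;
```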
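
For the materialized view point, a view like the following (the table and column names are hypothetical) can be used by Snowflake’s optimizer even though the SQL Power BI generates only ever references the base table:

```sql
-- Illustrative materialized view (an Enterprise Edition feature). Queries
-- that aggregate FACT_SALES in a compatible way can be rewritten by the
-- optimizer to read from it automatically.
CREATE MATERIALIZED VIEW SALES_BY_DAY AS
SELECT ORDER_DATE, PRODUCT_KEY, SUM(SALES_AMOUNT) AS TOTAL_SALES
FROM FACT_SALES
GROUP BY ORDER_DATE, PRODUCT_KEY;
```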
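
Enabling search optimization is a one-liner; this sketch assumes a hypothetical high-cardinality column:

```sql
-- Illustrative: speeds up equality lookups on a high-cardinality column.
ALTER TABLE DIM_CUSTOMER ADD SEARCH OPTIMIZATION ON EQUALITY(CUSTOMER_EMAIL);
```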
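
The clustering, NOT NULL and referential integrity points can all be expressed in the table DDL. The sketch below uses hypothetical names; note that Snowflake enforces NOT NULL but treats foreign key constraints as informational only, so you must still verify that the data really is clean before setting “Assume referential integrity” in Power BI:

```sql
-- Illustrative fact table DDL (hypothetical names).
CREATE TABLE FACT_SALES (
  ORDER_DATE   DATE   NOT NULL,  -- enforced; lets Power BI simplify its SQL
  PRODUCT_KEY  NUMBER NOT NULL REFERENCES DIM_PRODUCT (PRODUCT_KEY),   -- informational only
  CUSTOMER_KEY NUMBER NOT NULL REFERENCES DIM_CUSTOMER (CUSTOMER_KEY), -- informational only
  SALES_AMOUNT NUMBER(18,2)
)
CLUSTER BY (ORDER_DATE, PRODUCT_KEY);  -- columns used in joins and GROUP BY/ORDER BY
```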
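
Finally, result caching is on by default but can be disabled at the account, user or session level, so it is worth checking:

```sql
-- USE_CACHED_RESULT controls result caching; TRUE is the default.
SHOW PARAMETERS LIKE 'USE_CACHED_RESULT' IN ACCOUNT;
ALTER ACCOUNT SET USE_CACHED_RESULT = TRUE;  -- re-enable if it was switched off
```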

In addition, when working in Power BI there are several other points to note:

● When creating a connection from Power BI to Snowflake you should always connect using an identity that can only access the tables or views needed for your dataset. This will usually involve specifying a Snowflake role and database in the advanced options section of the connection dialog. This is important because it reduces the scope of the queries Power BI generates to retrieve metadata from Snowflake; if the identity used to connect has access to a large number of objects, these metadata queries can be slow. A sketch of such a minimal-privilege role follows this list.

● Enabling SSO on the connection between Power BI and Snowflake means that the identity of the user running the report is also used when the SQL queries generated by Power BI are sent to Snowflake, so Power BI reports respect any security rules you have defined in Snowflake. Instructions for enabling SSO, including when connecting via an On-Premises Data Gateway, can be found in the Microsoft and Snowflake documentation.

● In order to reduce the effects of network latency and bandwidth, it is important to consider the location of your Power BI tenant or Premium capacity relative to the location of your data in Snowflake. If you are using Snowflake on Azure try to store your Snowflake data in the same Azure Region as your Power BI tenant’s default region; if you have a Power BI Premium capacity remember that you have the option of storing your capacity’s data in a different Azure Region using the multi-geo capacity feature.

● Although the connector gives you the option of using your own SQL query as the source for a table in your dataset, we recommend that you do not do this: it can lead to performance problems while developing in Power BI Desktop and it can also make your solution harder to maintain, since it hard-codes SQL inside your dataset definition. Going back to the point about modeling your data properly before you start developing in Power BI, you should always use tables or views as the source of the tables in your dataset rather than ad-hoc SQL; a view sketch follows this list.

● In some cases where you need more control over how Power BI generates its SQL, you can use the dynamic M parameters feature to pass selections made in slicers and filters in your report back to Snowflake, bypassing Power BI’s SQL generation layer. We recommend using Snowflake tabular UDFs with this feature rather than raw SQL SELECT statements; a sketch follows this list.

● As mentioned in the DirectQuery guidance documentation, the “Maximum connections per data source” property controls the number of connections that your dataset can have open to Snowflake. The default value of 10 is the maximum that can be used for a dataset when you are not using Power BI Premium; this value can be increased to 30 for datasets published to a Power BI Premium capacity. Increasing the number of connections can make a noticeable difference to performance when a report page has many visuals or when some visuals generate slow queries. At present this property only takes effect when your dataset is published to the Power BI Service, so we recommend that you always test the performance of DirectQuery reports in the browser after they have been published rather than in Power BI Desktop — you may find that your reports perform noticeably slower in Power BI Desktop than in the Power BI Service.

● Following on from the previous point, you may find that connecting from Power BI to Snowflake via an On-Premises Data Gateway allows increased parallelism and therefore better performance. Gateways adjust many settings, including the number of parallel DirectQuery queries allowed, depending on the CPU capacity of the machine the gateway is running on; you can also adjust some of these settings manually, as described in the gateway documentation.

● If you are connecting from Power BI to Snowflake via a VNet data gateway then creating high availability clusters will allow load balancing of requests to Snowflake over multiple VNet gateways, improving performance when there are a large number of concurrent users. Each individual VNet gateway can support six concurrent queries so you should take this into account when planning your clusters.

● If you are using Power BI Premium it may be necessary to change some workload settings on your capacity that are relevant to datasets. In particular the “Max intermediate rowset count” setting controls the maximum number of rows that can be returned by a SQL query in DirectQuery mode; you may encounter this limit when you have modeled many-to-many relationships in your dataset or are writing complex DAX expressions, but rather than increasing the value of this setting (which may just postpone when you hit the error, or cause other performance problems) you should try to remodel your data or rewrite your DAX expression to work around the problem.

● Power BI’s aggregations feature can be used to add tables containing pre-aggregated data to your dataset, which the Power BI engine can then use to improve query performance. Aggregation tables can be stored in Import mode or DirectQuery mode; DirectQuery mode aggregation tables solve many of the same problems as Snowflake materialized views, so you may want to test both to see which gives the greater performance benefit. Power BI automatic aggregations can also be used with Snowflake, although not when SSO is enabled. A sketch of building a pre-aggregated table in Snowflake follows this list.

● Going back to where this post started, remember that Import mode may give you better performance than DirectQuery mode. Power BI allows you to mix and match storage modes, so even if you need to use DirectQuery mode for a large fact table you may still be able to use Dual mode for some or all of your dimension tables: Dual mode allows Power BI to switch between Import mode and DirectQuery mode for a table, and it is generally better for performance than combining Import mode dimension tables with DirectQuery fact tables. More information on Dual mode can be found in the Power BI documentation. Similarly, Power BI’s “hybrid tables” feature allows a single table to use Import mode and DirectQuery mode for different date ranges; instructions for setting this up are also in the documentation.
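
As a sketch of the minimal-privilege identity described in the first bullet above, you might create a role along these lines (all object names here are hypothetical):

```sql
-- Minimal-privilege role for Power BI connections (illustrative names).
-- Because the role can see only these objects, Power BI's metadata
-- queries have far less to enumerate.
CREATE ROLE IF NOT EXISTS POWERBI_READER;
GRANT USAGE  ON DATABASE SALES_DB                      TO ROLE POWERBI_READER;
GRANT USAGE  ON SCHEMA   SALES_DB.REPORTING            TO ROLE POWERBI_READER;
GRANT SELECT ON VIEW SALES_DB.REPORTING.V_FACT_SALES   TO ROLE POWERBI_READER;
GRANT SELECT ON VIEW SALES_DB.REPORTING.V_DIM_PRODUCT  TO ROLE POWERBI_READER;
GRANT USAGE  ON WAREHOUSE POWERBI_WH                   TO ROLE POWERBI_READER;
```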
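
Rather than hard-coding a SQL statement in the connector, the same query can live in Snowflake as a view; a minimal sketch, again with hypothetical names:

```sql
-- Wrap would-be ad-hoc SQL in a view and point the Power BI table at the view.
CREATE OR REPLACE VIEW SALES_DB.REPORTING.V_DIM_PRODUCT AS
SELECT PRODUCT_KEY, PRODUCT_NAME, CATEGORY
FROM SALES_DB.RAW.PRODUCT
WHERE IS_CURRENT = TRUE;  -- hypothetical current-row flag
```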
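
For the dynamic M parameters point, here is a minimal sketch of the kind of SQL tabular UDF such a parameter could call; the names, columns and REGION filter are all assumptions for illustration:

```sql
-- A tabular UDF that accepts a slicer selection as a parameter.
CREATE OR REPLACE FUNCTION SALES_FOR_REGION(REGION_NAME VARCHAR)
RETURNS TABLE (ORDER_DATE DATE, SALES_AMOUNT NUMBER(18,2))
AS
$$
  SELECT F.ORDER_DATE, F.SALES_AMOUNT
  FROM FACT_SALES F
  JOIN DIM_CUSTOMER C ON F.CUSTOMER_KEY = C.CUSTOMER_KEY
  WHERE C.REGION = REGION_NAME
$$;

-- A dynamic M parameter bound to a slicer would then drive a query such as:
SELECT * FROM TABLE(SALES_FOR_REGION('EMEA'));
```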
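
One way to build a pre-aggregated table in Snowflake to register with Power BI’s aggregations feature is a simple CREATE TABLE AS SELECT; the grain and names below are illustrative:

```sql
-- Pre-aggregated table to use as a Power BI aggregation table alongside
-- the detailed FACT_SALES table (illustrative grain and names).
CREATE OR REPLACE TABLE AGG_SALES_BY_MONTH AS
SELECT DATE_TRUNC('MONTH', ORDER_DATE) AS ORDER_MONTH,
       PRODUCT_KEY,
       SUM(SALES_AMOUNT) AS SALES_AMOUNT,
       COUNT(*)          AS SALES_COUNT
FROM FACT_SALES
GROUP BY 1, 2;
```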

Microsoft and Snowflake work together closely on the Snowflake connector for Power BI and we are committed to providing the best possible experience for our mutual customers. There are several features in the pipeline, both for the Snowflake connector and DirectQuery in general, that will significantly improve the performance of DirectQuery mode for Snowflake.

[Thanks to Chris Webb, Liping Huang, Akshai Mirchandani, Bob Zhang, Anantha Ganjam, Mahesh Prakriya, Paul DeFeo, Josh Crittenden, Nassim Kasdali and Miquella de Boer for their help with this post]
