Working with large amounts of data in Power BI
By Fredrik Bonnevie Dahler (Senior Technology Consultant), Ulrika Axelsson (Technology Analyst) and Mikkel Boye Ahlgren (Senior Technology Consultant) at BearingPoint
Many companies have large amounts of data in their data warehouses. To deliver a good user experience, reports need both high performance and the granularity to drill into the details of the data. However, the need for near-instant access to large amounts of data is often difficult to reconcile with the size limitations of in-memory Power BI datasets. Delivering high-performance reports on large amounts of data is possible, but it requires a combination of good design choices and sound modeling principles.
We will also delve into some aspects of a potential game-changer that came with the recent introduction of Fabric: the Direct Lake mode for data access.
In this article, we highlight some important aspects to keep in mind when working with large amounts of data in Power BI:
- The implications of using Import or DirectQuery, and the promise of Direct Lake mode.
- Understanding how storage in Power BI works.
- Rules of thumb for reducing the model size.
Choosing between Import and DirectQuery is a trade-off between performance needs, modelling flexibility and required storage optimization efforts. Is Direct Lake the catalyst?
It is not uncommon that companies have transaction datasets with millions of rows of data per day, containing numerous columns, many of which might have high cardinality (a large set of unique values). One of the fundamental design choices for a Power BI dataset is whether to use DirectQuery or Import mode for the fact tables.
An aspect that makes this a hard decision is that neither Import nor DirectQuery is a flawless solution. Depending on the subscription type, Import mode sets a limit on the amount of data we can use without significant pre-aggregation efforts. DirectQuery, on the other hand, has no size limitations on the underlying source data but comes with drawbacks such as higher response times, limited modeling flexibility, and limits on the number of rows that can be returned by a query. Below are some considerations when choosing between the two options:
DirectQuery avoids loading most data into Power BI but has limitations on performance and modelling flexibility
Using fact tables with DirectQuery comes with several implications. Here is a list of some of the most important ones:
- Data is only loaded into Power BI when the report queries the underlying data source.
- Data caching is limited, and each visualization object sends a query directly to the database.
- Each interaction with the report (sorting, using slicers, etc.) triggers a new query.
- Limits exist on the number of concurrent queries that can be sent and the number of rows that can be returned.
- Modelling limitations exist in both DAX and Power Query when using DirectQuery (for further insight, see this article: https://radacad.com/directquery-connection-in-power-bi-how-does-it-work-limitations-and-advantages).
- High costs of querying the underlying data sources: solutions such as BigQuery and Snowflake incur costs based on the amount of data queried. Depending on how the Power BI dataset is set up, the amount of data queried, and the number of users, a DirectQuery setup might incur high data warehouse costs.
Benefits of DirectQuery:
- Avoids data loading and storage issues with large models.
- Data in the report is immediately updated along with changes in the source.
Drawbacks of DirectQuery:
- Report performance depends on the source query speed.
- Report performance is reduced when visualizing large amounts of data.
- With a high number of concurrent users, the user experience suffers if the dataset approaches its limit on the number of concurrent queries.
- Lower modelling flexibility. The effects include longer lead time to insights, decreased autonomy for teams, analysts and data generalists, a higher load on data teams, and, more generally, more difficult data democratization.
To increase performance, we can make use of dynamic Power Query parameters (M parameters), which allow filters to be applied at the innermost level of the source queries, as sketched below.
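As a minimal sketch (not a drop-in solution), the Power Query below assumes a hypothetical SQL source and a dynamic M parameter named SelectedRegion bound to a report field; because the filter references the parameter, it can fold into the query sent to the warehouse instead of being applied in Power BI.

```
// Sketch of a DirectQuery source using a dynamic M parameter.
// "SelectedRegion" is a hypothetical parameter bound to a slicer field;
// the server, database, schema and table names are placeholders.
let
    Source = Sql.Database("warehouse.example.com", "Sales"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    // Because the filter references the parameter, it folds into the
    // generated source query, so only matching rows are scanned.
    Filtered = Table.SelectRows(Transactions, each [Region] = SelectedRegion)
in
    Filtered
```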
Aggregation tables can be created as imported versions of frequently used queries (see the sketch below). While useful for improving model performance, this technique does not work well for dynamic reports where users frequently drill down into highly granular data.
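The sketch below illustrates one way such an aggregation table could be built, reusing the same hypothetical Transactions source; the grouped table would be set to Import mode and mapped to the DirectQuery fact table via the Manage aggregations dialog. All names are assumptions for illustration.

```
// Sketch of an imported aggregation table over a hypothetical
// DirectQuery fact table (names are placeholders).
let
    Source = Sql.Database("warehouse.example.com", "Sales"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    // Pre-aggregate to one row per date and store; this small table is
    // imported and answers most visual queries instead of the source.
    Aggregated = Table.Group(
        Transactions,
        {"TransactionDate", "StoreID"},
        {
            {"TotalAmount", each List.Sum([Amount]), type number},
            {"TransactionCount", each Table.RowCount(_), Int64.Type}
        }
    )
in
    Aggregated
```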
Import mode has high performance and modelling flexibility, but has limitations on model size and data update time
- There are limitations on the dataset size. Premium Per User has a 100 GB size limit, and a P2 Premium capacity has a 50 GB size limit (as does S2 for Azure Analysis Services). In practice, the usable size is about half of these limits, since additional capacity needs to be allocated during model refresh (https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-large-models#considerations-and-limitations). A workaround is to split the data into smaller datasets, but this may not be ideal as it requires maintaining multiple datasets.
- Importing data increases the lead time for new data. Refreshing an import dataset adds lead time to the process of getting new data to end users; in some cases, we have experienced refresh times of up to 2.5 hours. Partitioning can be a relevant tool for reducing refresh time when data volume is the main contributor (a sketch of a partition-ready query is shown below this list), but in some instances column compression and indexing turn out to be the primary drivers. DirectQuery avoids this issue.
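As a hedged illustration, the query below prepares a hypothetical fact table for incremental refresh by filtering on the reserved RangeStart and RangeEnd datetime parameters that Power BI uses to generate partitions, so that only partitions covering new data need to be refreshed. Server, table and column names are assumptions.

```
// Sketch of a query prepared for incremental refresh / partitioning.
// RangeStart and RangeEnd are the reserved datetime parameters Power BI
// uses to generate partitions; source and column names are placeholders.
let
    Source = Sql.Database("warehouse.example.com", "Sales"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    // Filter on the reserved parameters so each partition only loads its
    // own date range; the filter folds into the source query.
    Filtered = Table.SelectRows(
        Transactions,
        each [TransactionDate] >= RangeStart and [TransactionDate] < RangeEnd
    )
in
    Filtered
```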
Let us suppose that we prefer modelling flexibility and performance, and accept that we need to optimize our Power BI datasets to cater for the Import mode storage limitations. What should we consider? There are a few techniques and tools that can be leveraged to do so. In the following sections, we will first explain the basics of how storage in Power BI Datasets works and later discuss techniques and tools for optimizing them.
Direct Lake promises both fast performance and live connections to parquet-files
While DirectQuery mode is slow but real-time, Import mode is fast but requires duplicating data, refreshing datasets to bring in new data, and maintaining Power BI datasets in general. Direct Lake mode promises to deliver the best of both worlds.
Direct Lake mode reads parquet files directly from OneLake (which is built on top of ADLS Gen2). This removes the need for maintaining Power BI datasets and enables report updates as soon as the parquet file changes in OneLake. Furthermore, there is no translation to other query languages or execution in other database systems, which according to Microsoft results in performance comparable to Import mode.
Source: Microsoft (https://learn.microsoft.com/en-us/power-bi/enterprise/directlake-overview)
Fabric and Direct Lake are still in public preview at the time of writing. Furthermore, in order to use this feature, you need a Premium capacity or Microsoft Fabric license. See details here: https://learn.microsoft.com/en-us/power-bi/enterprise/directlake-overview#prerequisites.
It remains to be seen whether Direct Lake can eliminate all the drawbacks of the two current options, or whether there will be use cases where each mode keeps an advantage over the other. While Direct Lake is worth monitoring closely, the currently fully available solution requires balancing Import and DirectQuery mode. In the next sections, we dive deeper into how data is stored in Power BI and how to handle storage issues when the amount of data grows large.
Power BI Datasets optimize storage under the hood, but tools also exist to facilitate manual storage optimization
How data is stored in Power BI is somewhat opaque. Tuning the storage and performance of Power BI datasets requires specific knowledge of how Power BI works under the hood, a topic often not covered in tutorials, certification curricula, or the Power BI documentation. This makes it difficult for data analysts and engineers to know what to consider when tuning Power BI datasets to handle large amounts of data.
Power BI data is stored in VertiPaq, a columnar database, which provides several advantages. Each column has its own structure and is physically separated from the other columns. This separation allows effective compression techniques to be applied independently to each column: depending on the values within a specific column, VertiPaq selects the most suitable compression algorithm to minimize storage requirements.
Another important aspect is that VertiPaq is an in-memory database. Data is loaded and processed directly in memory, resulting in faster data retrieval and analysis, but this is also why Import mode has storage limitations.
A great tool for optimizing Power BI Datasets is VertiPaq Analyzer (https://www.sqlbi.com/tools/vertipaq-analyzer/). This powerful tool provides in-depth analysis and optimization recommendations for Power BI data models. By examining the VertiPaq engine’s performance and data compression, it helps identify potential bottlenecks and suggests improvements to enhance query response times, reduce memory consumption, and optimize data model design for better overall performance.
There are several techniques for optimizing storage in Power BI datasets. An in-depth elaboration on approaches and considerations can be found in this brilliant article: “How to reduce your Power BI model size by 90%!” (https://towardsdatascience.com/how-to-reduce-your-power-bi-model-size-by-90-76d7c4377f2d). The sections below summarize the essence and mention relevant examples:
There are several ways to reduce the model size: Reducing the number of columns and cardinality are important levers
Here are some general rules for reducing the data model size:
- Include only the necessary columns and rows in the report.
- Optimize column cardinality and evaluate the benefits of splitting columns.
- Minimize the use of calculated columns, as they are not optimally compressed.
- Use appropriate data types based on the required granularity.
Focusing on rules 1 and 2:
1. Include only the necessary columns and rows.
- Import only the columns that are genuinely needed. This reduces the file size and optimizes memory consumption.
- In a significant example from our experience, we found a “Transaction-ID” column that wasn’t used in any of the visualizations on top of the dataset. The column contained a distinct string value for each transaction and accounted for 40% of the total model size.
2. Reduce column cardinality.
- Higher cardinality makes it harder for the data storage engine to compress the data optimally.
- Split columns or modify data to reduce cardinality when possible. A typical example is splitting a timestamp column into Date and Time. Whereas a timestamp column over a year has a maximum cardinality of 31 536 000 (365 * 24 * 60 * 60), splitting it into separate Date and Time columns reduces the cardinality to 86 765 (365 + 24 * 60 * 60), a reduction of more than 99% (see the sketch after this list).
- Splitting integer columns or decimal values can result in significant savings.
- Optimize date/time columns by removing unnecessary granularity if it meets the users’ requirements.
- Aggregate data whenever possible to reduce cardinality and the number of rows.
- Disable the Auto Date/Time option for data loading to eliminate unnecessary date tables (https://towardsdatascience.com/tiq-part-1-how-to-destroy-your-power-bi-model-with-auto-date-time-8fec32b22aff).
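To make rules 1 and 2 concrete, here is a small Power Query sketch that drops a hypothetical unused Transaction-ID column and replaces a timestamp with separate Date and Time columns before the data reaches the model; the inline table merely stands in for a real warehouse source, and all column names are assumptions.

```
// Sketch illustrating rules 1 and 2: remove unneeded columns and split a
// high-cardinality timestamp. The inline table stands in for a real source.
let
    Source = #table(
        type table [TransactionID = text, Timestamp = datetime, Amount = number],
        {
            {"T-0001", #datetime(2023, 5, 17, 14, 32, 5), 125.0},
            {"T-0002", #datetime(2023, 5, 17, 14, 32, 9), 310.5}
        }
    ),
    // Rule 1: drop the unused, high-cardinality Transaction-ID column.
    Trimmed = Table.RemoveColumns(Source, {"TransactionID"}),
    // Rule 2: replace the timestamp with separate Date and Time columns.
    WithDate = Table.AddColumn(Trimmed, "TransactionDate", each Date.From([Timestamp]), type date),
    WithTime = Table.AddColumn(WithDate, "TransactionTime", each Time.From([Timestamp]), type time),
    Result = Table.RemoveColumns(WithTime, {"Timestamp"})
in
    Result
```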
In conclusion, managing large volumes of data in Power BI requires careful design and implementation choices. Import mode is attractive because it offers performance and modeling flexibility, but it necessitates optimizing the dataset size. To achieve this, the VertiPaq Analyzer tool and the rules of thumb above can help ensure that your dataset remains within its size limitations.
About the authors
Fredrik Bonnevie Dahler is a Senior Technology Consultant at BearingPoint in Oslo, with experience in data and analytics from various industries including the financial industry, retail, and data products. He has focused on a business-centric approach to deliver insights for stakeholders. Before joining BearingPoint in 2019, Fredrik obtained a master’s degree in mechanical engineering from the Norwegian University of Science and Technology. Email address: fredrik.bonnevie.dahler@bearingpoint.com
Ulrika Axelsson is a Technology Consultant in the Data & Analytics team at BearingPoint in Oslo, with experience working with Power BI on several projects. Ulrika obtained a master’s degree in Industrial Engineering and Management from Umeå University in 2020. Email address: ulrika.axelsson@bearingpoint.com
Mikkel Ahlgren is a Senior Consultant in the Data & Analytics team at BearingPoint in Oslo. He has broad experience in creating commercial value by leveraging data and analytics. He has a Master’s degree from NTNU in Industrial Economics and Technology Management, specialized in computer science and optimization, and a Bachelor’s degree from NHH in economics and business administration. Email address: mikkel.ahlgren@bearingpoint.com