Demystifying Power BI — Data Management
A summary and overview of data management and supporting components in the Power BI ecosystem
Power BI provides enterprise-grade data management capabilities and supports deployments where experts can centrally curate data and report users, and analysts can access it securely. Data management and hosting in Power BI allows enterprises to reduce latency, duplication, and silos, enforce governance and simplify security across data, processes, and reports.
In this article, we will talk about Power BI's data management capabilities. We will discuss the data models, dataset modes, storage, data preparation, refresh options, and supporting components.
This article is a part of the multipart series on the Power BI ecosystem and continues the concepts introduced in the introduction article and refers heavily to the Power BI back-end platforms. We recommend reading through both articles for a complete understanding.
Power BI lets authors connect to and source data from all sorts of data sources and bring it together in one place. Data hosted in Power BI are not databases but data models. Data models represent a source of data ready for reporting and visualization.
Data Model
Data models provide a relational data structure like traditional data views and marts, where the data is gathered from multiple sources and stored in a relational format for reporting. For example, a data model might combine a product table and sales details from different databases or files.
Types of Data Models
There are four primary types of data models supported in Power BI.
- Power BI datasets: Power BI datasets are the native and most frequently used data models created using Power BI Desktop, Power BI service, and Power BI datamarts. Under the hood, every Power BI dataset is an Analysis Services tabular model. A tabular model is a data model consisting of one or more tables, relationships amongst those tables, hierarchies, and calculations.
- External-hosted models: Power BI supports SQL Server Analysis Services and Azure Analysis Services models for reporting and visualization. These models are hosted on their respective infrastructure and outside the Power BI back-end platform. Additionally, Power BI can connect to multidimensional models (cubes) and tabular models using live connection mode.
- Excel workbook model or CSV: Connecting to Excel workbooks and CSV files hosted in a Power BI workspace results in the automatic creation of a model. For example, excel tables and CSV data are imported to create model tables, whereas an Excel workbook data model is transposed to create a Power BI model.
- Streaming datasets: Real-time streaming datasets are helpful for streaming data and updating visuals and dashboards in real time. The source of streaming data can be factory sensors, social media sources, service usage metrics, etc. Power BI provides four types of real-time datasets: Push dataset, Streaming dataset, Hybrid dataset, and PubNub streaming dataset.
The following table summarizes dataset limits by license and platform type.
Detailed reading on datasets in Power BI:
Table Storage Modes
Each table (except calculated tables) in a Power BI dataset has a storage mode property. The storage mode determines whether table data is stored or cached in the model. For example, storage mode can be one of the following:
- Import: Tables with import mode are stored in the dataset. DAX queries submitted to the Power BI dataset could be fulfilled only from the stored data.
- DirectQuery: Tables with DirectQuery aren't stored in the dataset. DAX queries submitted to the Power BI dataset can be fulfilled only by executing on-demand queries to the data source.
- Dual: Tables with dual mode can act as either cached or not cached, depending on the context of the query submitted to the dataset. Power BI determines the most efficient plan, striving to use stored data whenever possible.
Dataset Modes (aka Model Framework)
The storage mode of tables in a dataset determines the dataset mode. Dataset modes apply to the Power BI datasets and externally hosted Analysis Services models.
- Import mode: An import model comprises tables with their storage mode property set to Import. Import mode delivers high-speed performance thanks to in-memory caching and querying.
- DirectQuery mode: A DirectQuery model comprises tables with their storage mode property set to DirectQuery, and all the tables belong to the same source group. DirectQuery models don't import data. Instead, they consist only of metadata defining the model structure. When the model is queried, native queries retrieve data from the underlying data source, for example, SQL. Connecting to other Power BI datasets and Analysis Services models using DirectQuery (in preview) is possible.
- Composite mode: A composite model comprises tables from more than one source group. The composite models can mix Import and DirectQuery modes or integrate multiple DirectQuery data sources.
A source group is a set of model tables that relate to a data source. There are two types:
- Import — Represents all import storage mode tables, including calculated tables. There can only be one import source group in a model.
- DirectQuery — Represents all DirectQuery storage mode tables related to a specific data source. For example, SQL Server and SAP HANA represent two separate DirectQuery source groups in a model.
Dataset Connectivity Modes
Dataset connectivity modes are often interchanged with the dataset modes. Dataset modes refer to how the tables in a dataset are stored, whereas dataset connectivity modes refer to how a Power BI report connects to a dataset.
Suppose a Power BI report contains the dataset (or is built in the same .pbix file). In that case, the dataset connectivity mode and the dataset mode are the same, i.e., Import, DirectQuery, and Composite. In contrast, the connectivity mode differs when a Power BI report sources data from an Analysis Services model or a Power BI dataset hosted in the Power BI service.
- Live Connection (or Exploration mode): A Power BI report can connect to an Analysis Services model or Power BI datasets using Live connection mode. DAX queries are issued via a live connection to the Power BI dataset or Analysis Services whenever the report is rendered. This mode supports both tabular and multidimensional models. However, no transformations, mash-ups, or modeling capabilities are available with live or exploration connectivity mode.
Data Preparation & Cleansing
All the data presented in reports and dashboards originate somewhere. The dataset author inevitably needs to transform the data into a format suitable for Power BI. Some everyday tasks include:
- Integrating data from multiple inconsistent data sources
- De-normalizing the data to make the dataset ideal for reporting
- Filtering out unnecessary columns or rows
- Handling dirty data, incorrect data, or errors
Power BI offers powerful functionality for cleansing, filtering, transforming, and integrating data using Power Query. Power Query comes with a graphical interface for getting data from sources and a Power Query Editor for applying transformations.
- Power Query Editor in Power BI Desktop: In Power BI Desktop, Power Query Editor is used to connect, shape, and transform data from one or multiple sources. The data is then loaded into the Power BI dataset, ready for further modeling and reporting.
- Dataflows: Dataflows provide a flexible and more straightforward solution to create a simple, centralized data repository like a data warehouse, albeit one that is not as sophisticated as a full-fledged data warehouse. Dataflows are helpful in scenarios where the requirement is to query the source system as few times as possible, to reduce the amount of data preparation work being duplicated across many datasets, and to provide a single storage location for integrating data from multiple sources. In addition, tables in a dataflow can't be browsed, queried, or explored without a dataset. Dataflows are available for all users with a Power BI Pro license, with additional features available in the Power BI Premium or Premium Per User (PPU) workspaces.
- Datamarts: Datamarts is a newer alternative to Power BI Dataflows targeted toward self-service scenarios. Datamarts enables users to collect, transform, store and explore data loaded in a relational and fully managed Azure SQL database. Additionally, Datamarts automatically generate a dataset, which can be used to create Power BI reports and dashboards. Unlike Dataflows, Datamarts provides access to tables through a SQL endpoint. Datamarts require a Power BI Premium subscription or Premium Per User (PPU) license.
Dataflows and Datamarts are created using Power Query Online, a web-based version of the same Power Query engine used in Power BI Desktop.
- Streaming dataflows: Streaming dataflows allow authors to connect to, ingest, mash up, model, and build reports based on streaming, near real-time data directly in the Power BI service. Authors can mash streaming data with batch data if required. The final entity produced is a dataflow, which can be consumed in real-time to create a highly interactive, near-real-time reporting solution. Streaming dataflows can only be created and run in a workspace part of a Power BI Premium capacity or Premium Per User (PPU) license.
Data Refresh
Data stored in the Power BI service must be refreshed for reports and dashboards to stay relevant. How frequently data need to be refreshed depends on the data change rate in the underlying sources and the reporting needs. A rule of thumb is to refresh as less as possible to use the resources efficiently.
For imported tables, queries are sent to the underlying data source to refresh data and metadata. Whereas for DirectQuery and Live Connections, only metadata is refreshed.
The number and duration of refreshes are governed and controlled by the capacity type of the hosted workspace and the refresh method. For example, datasets in shared capacity can be refreshed up to eight times per day with a timeout duration of two hours. In comparison, datasets in Premium capacity can be refreshed up to 48 times per day and a timeout duration of five hours.
Types of Data Refresh
Power BI offers two types of refresh operations for datasets, dataflows, and datamarts.
- Full Refresh: In full refresh, the tables are truncated and loaded fully from the data source. That means processing and loading every single row of data. Full refresh is expensive regarding resource consumption, time, and load on the Power BI infrastructure.
- Incremental refresh: Incremental refresh loads only new and changed data from the source. Incremental refresh helps reduce the time to load larger datasets and the query load on source systems.
Methods for Data Refresh
Several options are available to developers and self-service users to refresh datasets, dataflows, and datamarts. For the sake of simplicity, the refresh methods can be divided into two categories:
Within the Power BI service:
- Refresh now: "Refresh now" triggers an on-demand refresh of datasets, dataflows, and datamarts.
- Scheduled refresh: Schedule refresh is helpful to set up periods at which a dataset, dataflow, or datamart will refresh automatically. The Power BI service starts scheduled refreshes at the specified times on a best-effort basis. The target is to initiate the refresh within 15 minutes of the scheduled time, subject to the load on the capacity used to execute the refresh.
Outside of the Power BI service:
- Re-uploading a .pbix file: Re-uploading a .pbix file replaces the existing dataset in the service, thus effectively changing the data in the underlying tables. As a result, the refreshed data is not necessarily the latest compared to the sources.
- REST API: Power BI REST APIs provide a flexible approach to automate the refresh operations for datasets and dataflows. The REST APIs can be executed via external apps or Power BI cmdlets for PowerShell.
- Refresh using XMLA: Datasets hosted in a Power BI Premium, Premium Per User (PPU), and Power BI Embedded workspace support asynchronous refresh with XMLA endpoints. In addition, XMLA endpoints offer fine-grained control over the dataset refreshes. For example, refresh operations through the XMLA endpoint are not limited to 48 refreshes per day, and the scheduled refresh timeout (five hours) is not imposed.
- Power Automate: Power Automate provides an out-of-box automated refresh action for datasets allowing datasets to be refreshed based on external conditions and requirements. Power Automate is the perfect low-code alternative for REST APIs.
It's worth mentioning that all refresh methods respect the incremental refresh policies except the re-uploading of the .pbix file. Additionally, XMLA endpoints make it possible to selectively refresh one or multiple tables in a dataset, whereas all other methods refresh the entire dataset.
Supporting Components
Power BI provides additional components to help users and developers query and control data. Some of these components require users to have an appropriate license and access rights.
- Data source connectors: Connectors enable Power BI to connect and source data from many different sources. Connectors provide a solution for authenticating and querying data sources. In the case of Microsoft data sources, connectors integrate user authentication and data governance with Azure Active Directory.
- Data lineage: Data lineage is an administrative tool in the Power BI service that shows visual relationships between all the artifacts in a workspace and all its external dependencies. The lineage can help data authors to debug problems and tune performance by showing the links between reports, dashboards, datasets, and dataflows. Users need a Power BI Pro license and appropriate workspace roles to switch to lineage view.
- Gateways: The on-premises data gateway acts as a bridge to provide quick and secure data transfer between on-premises data and several Microsoft cloud services such as Power BI, Power Apps, etc. The gateway enables the Power BI service to query locally hosted databases and other data sources and handles encryption, compression, and data transmission. Gateways can be configured for real-time access or scheduled refreshes. Data gateways are available in two different modes, each for use in a different scenario: Standard (aka enterprise) and Personal mode.
- Deployment pipelines: Deployment pipelines is an administrative tool that provides life-cycle control for managing content. Authors can use the pipelines to control the movement of reports, paginated reports, dashboards, datasets, and dataflows through the development, testing, and production stages. Deployment pipelines are only available with a Power BI Premium subscription.
That's all about data management in Power BI! Take a break and answer our following question. Do you know how to refresh a single (or multiple) table(s) in a dataset without refreshing the entire dataset?
We have the answer here:
Conclusion
Power BI offers enterprise-level reporting and data management capabilities. Power BI deployments can be architectured in a hybrid mode where expert data modelers build the standard and certified datasets for broader usage. In contrast, business users and developers can develop datasets to fulfill their reporting needs.
We briefly discussed the data models, dataset modes, table storage, data preparation, refresh options, and supporting components in the Power BI ecosystem.
Next in series
In the next article, we will discuss report types, report design, and data modeling tools in Power BI.
We have another article about the client applications and sharing options for the end-users, APIs, interfaces, and embedding options available to developers in Power BI.