Optimizing Power BI Operations: A Comprehensive Guide to Efficient Scheduling and Orchestration

Akash Kumar Singh
Engineering at Bajaj Health
7 min read · Apr 10, 2024

Every Power BI developer endeavors to ensure the timely delivery of their reports, facilitating the seamless monitoring and operation of the business. This responsibility becomes particularly crucial in environments where real-time data synchronization is imperative.

Developers often manually schedule datasets in an optimized manner. However, as the number of datasets grows, issues related to refreshing become apparent — ranging from timeout errors and prolonged refresh durations to database downtime due to excessive load and inadequate data availability. These challenges can significantly impact business operations.

Hence, there is a compelling need for a scheduling or orchestration process to efficiently manage computational resources and achieve high throughput. Let’s explore some reasons highlighting the necessity of an orchestration process.

Why do we need orchestration in Power BI?

For easy understanding, think of the Power BI Service as a computational service hosted on a cloud platform. It has finite storage, RAM, CPU, and other computational resources, all of which can reach their maximum limits. Once those limits are reached, further computation becomes impossible. While it is possible to increase computational resources, doing so comes at additional cost. The primary objective is therefore to enhance system efficiency within the existing limits.

Moreover, the Power BI Service relies on database engines such as Synapse Analytics and MySQL. Even with expanded storage, CPU, or RAM for the service itself, challenges and bottlenecks may persist because of dependencies on these databases.

Additionally, when attempting to schedule dataset refreshes through the Power BI Service UI, the available options for selecting minutes are limited to 00 or 30. This results in poor utilization of time and computational resources. For instance, if scheduling two datasets, A and B, with the intention of initiating B after A completes its refresh, the only choice is to schedule them consecutively with a 30-minute interval. If A completes its refresh in 10 minutes, the subsequent 20 minutes remain idle.

A good scheduling process can prove beneficial in scenarios where refreshing multiple datasets across various workspaces simultaneously is required. Manual triggers for each dataset can easily lead to timeouts, making automated scheduling a more efficient solution.

What is the solution?

Microsoft offers an array of APIs serving various purposes, including the maintenance of dataset metadata, user access management, refresh triggering, dataset downloads, and more. Utilizing these pre-built APIs allows the creation of a customized program for efficiently orchestrating and scheduling Power BI dataset refreshes.

However, before diving into the technical complexities, let’s first clarify the objective: to develop a program capable of triggering the refresh of multiple datasets across various workspaces, either simultaneously or sequentially.

To address this challenge, we can break it down into two sub-problems. First, there is a need to maintain a table containing information about datasets requiring refresh. This table would include metadata such as dataset ID, name, the workspace or group to which it belongs, and other relevant details.

The second sub-problem involves utilizing the table established in the first step and initiating the refresh of these datasets based on a defined selection logic.

Conquering Sub-Problem 1

Our initial step involves the establishment of a master table to store fundamental information about each dataset, along with additional flags facilitating the enabling or disabling of refresh. Let’s designate this table as “schedule_master.” For the sake of simplicity, we’ll adopt the following schema:

- datasetId (Primary Key)

- datasetName

- workspaceId

- isRefreshEnabled (Flag to indicate refresh enable or disable, Default: 1)

- createdAt

Our objective is to populate and maintain this table. As it is a master-type table, only one entry per dataset is allowed, hence the datasetId is designated as the primary key.
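As a concrete sketch, the table can be created as follows. SQLite is used here purely for illustration; any relational store works the same way, and the sample IDs are taken from the API responses later in this post:

```python
import sqlite3

# Minimal sketch of the "schedule_master" table (in-memory SQLite for demo)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schedule_master (
        datasetId        TEXT PRIMARY KEY,
        datasetName      TEXT NOT NULL,
        workspaceId      TEXT NOT NULL,
        isRefreshEnabled INTEGER NOT NULL DEFAULT 1,
        createdAt        TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

# Inserting the same datasetId twice is rejected by the primary key,
# which enforces the one-entry-per-dataset rule.
conn.execute(
    "INSERT INTO schedule_master (datasetId, datasetName, workspaceId) "
    "VALUES (?, ?, ?)",
    ("cfafbeb1-8037-4d0c-896e-a46fb27ff229", "SalesMarketing",
     "f089354e-8366-4e18-aea3-4cb4a3a50b48"),
)
row = conn.execute("SELECT isRefreshEnabled FROM schedule_master").fetchone()
print(row[0])  # 1 (the default)
```

The primary key on datasetId is what guarantees a single entry per dataset, and the isRefreshEnabled default means new datasets are refresh-enabled unless explicitly switched off.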

The process for table maintenance includes the following steps:

1. Retrieve the list of workspaces.

2. Iterate through each workspace and retrieve the list of datasets within them.

3. For each dataset, verify its existence in the “schedule_master” table.

4. If the dataset already exists, proceed to the next dataset. If not, create an entry for that dataset in the master table with all the relevant details and move on to the next dataset.

5. During the creation of the master table, set a default value for the isRefreshEnabled flag based on whether you wish to enable refresh for new datasets by default or not. If the flag needs to be changed for a specific dataset, it can be done at this stage.
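The five steps above can be sketched as a small sync function. The helper names (`list_datasets`, `insert_row`) are illustrative stand-ins for the Power BI API call and the database insert, not real library functions:

```python
def sync_schedule_master(workspaces, list_datasets, existing_ids, insert_row,
                         default_refresh_enabled=1):
    """Walk every workspace (steps 1-2) and add any dataset not yet
    present in schedule_master (steps 3-5)."""
    added = []
    for ws in workspaces:
        for ds in list_datasets(ws["id"]):
            if ds["id"] in existing_ids:
                continue  # step 4: already tracked, move on
            insert_row({
                "datasetId": ds["id"],
                "datasetName": ds["name"],
                "workspaceId": ws["id"],
                "isRefreshEnabled": default_refresh_enabled,  # step 5
            })
            existing_ids.add(ds["id"])
            added.append(ds["id"])
    return added

# Dry run with fake data: "d1" is already tracked, so only "d2" is added
rows = []
added = sync_schedule_master(
    workspaces=[{"id": "ws1"}],
    list_datasets=lambda ws_id: [{"id": "d1", "name": "Sales"},
                                 {"id": "d2", "name": "Ops"}],
    existing_ids={"d1"},
    insert_row=rows.append,
)
print(added)  # ['d2']
```

Passing the API call and the insert as parameters keeps the loop itself trivially testable without a live Power BI tenant.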

Flow chart for sub problem 1

This master table will subsequently serve as input for addressing the second sub-problem.

APIs that can be used in sub-problem 1

1. To fetch the list of workspaces

Sample Request using python:

import requests

# Assumes `access_token` is a valid Azure AD access token for the Power BI API
header = {'Authorization': f'Bearer {access_token}'}

workspace_url = 'https://api.powerbi.com/v1.0/myorg/groups'
rawJson = requests.get(url=workspace_url, headers=header).json()
print(rawJson)

Sample Response:

{
  "value": [
    {
      "id": "f089354e-8366-4e18-aea3-4cb4a3a50b48",
      "isReadOnly": false,
      "isOnDedicatedCapacity": false,
      "name": "sample group"
    },
    {
      "id": "3d9b93c6-7b6d-4801-a491-1738910904fd",
      "isReadOnly": false,
      "isOnDedicatedCapacity": true,
      "capacityId": "0f084df7-c13d-451b-af5f-ed0c466403b2",
      "defaultDatasetStorageFormat": "Small",
      "name": "marketing group"
    },
    {
      "id": "a2f89923-421a-464e-bf4c-25eab39bb09f",
      "isReadOnly": false,
      "isOnDedicatedCapacity": true,
      "capacityId": "0f084df7-c13d-451b-af5f-ed0c466403b2",
      "defaultDatasetStorageFormat": "Large",
      "name": "contoso",
      "dataflowStorageId": "d692ae06-708c-485e-9987-06ff0fbdbb1f"
    }
  ]
}

Source: https://learn.microsoft.com/en-us/rest/api/power-bi/groups/get-groups

In the response json, “id” is the workspaceId.

2. To fetch the list of datasets in each workspace:

Sample Request using python:

import requests

# `header` carries the bearer token as in the previous example;
# `workspaceId` is the "id" field from the groups response above
dataset_url = 'https://api.powerbi.com/v1.0/myorg/groups/{group_id}/datasets'
rawJson = requests.get(url=dataset_url.format(group_id=workspaceId), headers=header).json()
print(rawJson)

The request URL takes a workspace ID (group ID) as input and returns a JSON object containing all datasets present in that workspace.

Sample Response:

{
  "value": [
    {
      "id": "cfafbeb1-8037-4d0c-896e-a46fb27ff229",
      "name": "SalesMarketing",
      "addRowsAPIEnabled": false,
      "configuredBy": "john@contoso.com",
      "isRefreshable": true,
      "isEffectiveIdentityRequired": false,
      "isEffectiveIdentityRolesRequired": false,
      "isOnPremGatewayRequired": false
    }
  ]
}

Source: https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-datasets-in-group

Conquering Sub-Problem 2

The second sub-problem revolves around initiating the refresh for datasets marked as refresh-enabled in our “schedule_master” table. This can be accomplished through three approaches:

1. Serial Triggers: Refreshing datasets one by one, waiting for the completion of the previous one before triggering the next.

2. Parallel Triggers: Simultaneously triggering refresh for all datasets.

3. Hybrid Trigger: A combination of serial and parallel triggers.

While a serial trigger can speed up the refresh of each individual dataset, it is not recommended because it underutilizes computational resources. Parallel triggers, on the other hand, pose the challenges highlighted in the “Why do we need orchestration in Power BI?” section of this blog. Therefore, a hybrid trigger, combining aspects of both serial and parallel triggers, is more effective.

For a hybrid trigger, it is essential to define the maximum number of datasets to be refreshed concurrently. Let’s assume a limit of refreshing four datasets simultaneously.

The next step involves formulating a logic or algorithm to determine which four datasets to refresh. This selection can be based on factors such as dataset size, average refresh time, priority, and others. Here, we opt for the average refresh duration of datasets over the past seven days. An API can be employed to fetch the refresh history of a dataset and calculate the average refresh duration using historical data.
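As a sketch, the average can be computed from the `value` array that the refresh-history endpoint (GET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes) returns; the timestamp handling below assumes the ISO-8601 format the API uses:

```python
from datetime import datetime, timedelta, timezone

def avg_refresh_minutes(history, days=7):
    """Average duration in minutes of completed refreshes within the
    last `days`, given the `value` list from the refresh-history API."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    durations = []
    for item in history:
        if item.get("status") != "Completed":
            continue  # skip failed or in-progress refreshes
        start = datetime.fromisoformat(item["startTime"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(item["endTime"].replace("Z", "+00:00"))
        if start >= cutoff:
            durations.append((end - start).total_seconds() / 60)
    return sum(durations) / len(durations) if durations else None

# Usage with a synthetic history: one 12-minute success, one failure
now = datetime.now(timezone.utc)
iso = lambda dt: dt.strftime("%Y-%m-%dT%H:%M:%SZ")
sample_history = [
    {"status": "Completed", "startTime": iso(now - timedelta(hours=2)),
     "endTime": iso(now - timedelta(hours=1, minutes=48))},
    {"status": "Failed", "startTime": iso(now - timedelta(hours=5)),
     "endTime": iso(now - timedelta(hours=4))},
]
print(avg_refresh_minutes(sample_history))  # 12.0
```

Returning None for datasets with no completed refresh in the window lets the caller decide where such datasets belong in the queue.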

Subsequently, a refresh queue is prepared, containing the refresh-enabled datasets sorted in ascending order of their average refresh duration. If the queue holds four datasets or fewer, all of them are triggered. If there are more, four are triggered: the two with the least average duration, one from the middle of the queue, and the one with the maximum average duration.
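That selection rule can be expressed as a tiny helper; the two-fastest / one-middle / one-slowest mix below assumes the concurrency limit of four discussed earlier:

```python
def pick_batch(queue, limit=4):
    """`queue` is sorted ascending by average refresh duration.
    If it fits within the limit, take everything; otherwise take the
    two fastest datasets, one from the middle, and the slowest."""
    if len(queue) <= limit:
        return list(queue)
    return [queue[0], queue[1], queue[len(queue) // 2], queue[-1]]

print(pick_batch(["a", "b", "c"]))                      # ['a', 'b', 'c']
print(pick_batch(["a", "b", "c", "d", "e", "f", "g"]))  # ['a', 'b', 'd', 'g']
```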

Every one or two minutes, the system checks whether the triggered datasets have finished refreshing. When one completes, the next dataset in the queue is triggered; otherwise, the system waits another minute or two and checks again. For instance, if the dataset with the least average duration completes while the others are still in progress, the next-shortest dataset is triggered and the wait continues for the remaining ones.

Upon exhaustion of the refresh queue, a status email is dispatched, detailing how many datasets were eligible for triggering, how many were actually triggered, and the number that completed the refresh.
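Putting the trigger-poll-retrigger cycle together, a simplified driver loop might look like the following sketch. `trigger` and `get_status` are stand-ins for the refresh and refresh-status API calls, and the returned counts are what the status email would report:

```python
import time

def drain_queue(queue, get_status, trigger, max_parallel=4, poll_seconds=60):
    """Hybrid trigger sketch: keep up to `max_parallel` refreshes in
    flight, polling every minute or so until the queue is exhausted."""
    in_flight = []
    triggered = completed = 0
    while queue or in_flight:
        # Top up the in-flight set from the front of the queue
        while queue and len(in_flight) < max_parallel:
            ds = queue.pop(0)
            trigger(ds)
            in_flight.append(ds)
            triggered += 1
        # Check which in-flight refreshes have finished
        still_running = []
        for ds in in_flight:
            if get_status(ds) in ("Completed", "Failed"):
                completed += 1
            else:
                still_running.append(ds)
        in_flight = still_running
        if in_flight:
            time.sleep(poll_seconds)  # wait before polling again
    return triggered, completed

# Dry run with fakes: every refresh "completes" immediately
calls = []
t, c = drain_queue(["d1", "d2", "d3", "d4", "d5"],
                   get_status=lambda ds: "Completed",
                   trigger=calls.append, poll_seconds=0)
print(t, c)  # 5 5
```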

Flow chart for sub problem 2

This end-to-end process enables the seamless monitoring and orchestration of refresh triggers for a large number of datasets. In this implementation, concurrent refresh is set to a maximum of four datasets at a time, a parameter easily adjustable based on specific requirements.

APIs that can be used in sub-problem 2

1. To trigger refresh of a dataset

import requests

# `header` carries the bearer token; note this is a POST request
triggerRefresh_url = 'https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes'
response = requests.post(url=triggerRefresh_url.format(groupId=workspaceId, datasetId=datasetId), headers=header)
print(response.headers)

Sample response header

Source: https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset-in-group

The RequestId returned in the response header is the unique identifier for the triggered refresh. It can later be used to check the refresh status.

Important reminder: When utilizing an access token for authentication with Power BI, it is crucial to regenerate and authenticate again once the token expires, especially if the total refresh time of all datasets surpasses the lifespan of the access token.
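A small wrapper that re-acquires the token shortly before expiry keeps long runs safe. Here `acquire` is a stand-in for whatever obtains the token (for example, MSAL's `acquire_token_for_client`), and the five-minute skew is an arbitrary safety margin:

```python
import time

class TokenProvider:
    """Caches an access token and re-acquires it shortly before expiry,
    so a long-running refresh loop never sends a stale token."""
    def __init__(self, acquire, skew_seconds=300):
        self.acquire = acquire        # callable returning the token response
        self.skew = skew_seconds      # renew this many seconds early
        self._token = None
        self._expires_at = 0.0

    def header(self):
        # Re-acquire if we are within `skew` seconds of expiry
        if time.time() >= self._expires_at - self.skew:
            result = self.acquire()   # {"access_token": ..., "expires_in": ...}
            self._token = result["access_token"]
            self._expires_at = time.time() + result["expires_in"]
        return {"Authorization": f"Bearer {self._token}"}

# Dry run with a fake acquirer: the second call reuses the cached token
calls = []
def fake_acquire():
    calls.append(1)
    return {"access_token": f"tok{len(calls)}", "expires_in": 3600}

tp = TokenProvider(fake_acquire)
h1 = tp.header()
h2 = tp.header()
print(h1, len(calls))  # {'Authorization': 'Bearer tok1'} 1
```

Every request in the orchestration loop can then use `tp.header()` instead of a fixed `header` dict.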

Reference Links

1. Power BI APIs : https://learn.microsoft.com/en-us/rest/api/power-bi/

2. Power BI Documentation : https://learn.microsoft.com/en-us/power-bi/
