Hacking Power BI lineage: which reports depend on this table?
Most Data Analytics or BI teams publishing reports or dashboards in Power BI start small, with a handful of reports, and end up with a crowded workspace: many reports, each making use of different tables or data sources. Then, one day, a developer comes and asks:
“I might change the type of the SubscriptionType column in the Users table from Int to String. Which reports are impacted by such a change?”
As the number of reports in a Power BI workspace grows, answering such a simple question becomes surprisingly hard. How many reports depend on the Users table? Do we need to open each report and check the Transform Data section? That is extremely time-consuming and prone to manual mistakes. Can we automate this task?
Limits of Power BI Lineage
Power BI has a section dedicated to Data Lineage. This section summarizes
- which reports depend on which datasets
- which datasets depend on which data sources (e.g. SQL databases, CSV files, etc.)
Unfortunately, this section cannot give us the answer we are looking for. It tells us, for example, which SQL database a report depends on. However, it does not tell us whether a given report depends on a specific table inside that database.
Exploiting Power BI Admin API
This is the moment when we need to do some hacking and start exploiting the Power BI APIs via PowerShell (you can of course use any client of your choice to call the API). Please note that you need an admin role on Power BI to make these API calls. First, install the MicrosoftPowerBIMgmt module (see docs).
Install-Module -Name MicrosoftPowerBIMgmt
The first step is to authenticate our PowerShell session via
Login-PowerBIServiceAccount
We now want to call the Power BI Admin API endpoint (see docs) that gets the metadata of a workspace. The route of the API is admin/workspaces/getInfo. We will set the following values for its optional query parameters:
admin/workspaces/getInfo?lineage=True&datasourceDetails=True&datasetSchema=True&datasetExpressions=True
The body of the HTTP call needs a workspaces field as an array of strings. Each element of the array is the GUID of a workspace we want the metadata for, for example:
"workspaces": ["c56a4180-65aa-42ec-a945-5fd21dec0538"]
So, this PowerShell script will get a token to authenticate to the API and make a first call to start the scanning.
# these are the Power BI API endpoints we're going to call
$startScanUrl = "https://api.powerbi.com/v1.0/myorg/admin/workspaces/getInfo?lineage=True&datasourceDetails=True&datasetSchema=True&datasetExpressions=True"
$statusUrl = "https://api.powerbi.com/v1.0/myorg/admin/workspaces/scanStatus/"
$getScanInfoUrl = "https://api.powerbi.com/v1.0/myorg/admin/workspaces/scanResult/"

# getting the token for authentication
Login-PowerBIServiceAccount
$headers = Get-PowerBIAccessToken
$headers.Add('Content-Type', 'application/json')

# http call to trigger the scan
$body = '{"workspaces": ["c56a4180-65aa-42ec-a945-5fd21dec0538"]}'
$startScan = Invoke-RestMethod -Method Post -Uri $startScanUrl -Body $body -Headers $headers
Now, the scan of the workspace metadata just started. We want to wait for it to complete. We can use the /admin/workspaces/scanStatus/ route to check the status of the request.
### Wait for Scan to Succeed
$headers.Remove('Content-Type')
$statusUrl = $statusUrl + $startScan.id
do {
$statusScan = Invoke-RestMethod -Method Get -Uri $statusUrl -Headers $headers
Start-Sleep -Seconds 3
}
while ($statusScan.status -ne "Succeeded")
Once the scan is completed, we can collect the results by calling the Power BI API route /admin/workspaces/scanResult/. We’ll save the response in a results.json file.
# once the scan is done, we can store the results
$resultUrl = $getScanInfoUrl + $startScan.id
$infoResults = Invoke-RestMethod -Method Get -Uri $resultUrl -Headers $headers
$infoResults.workspaces | ConvertTo-Json -Depth 10 | Out-File -FilePath .\results.json
What does the response look like? Let’s look into our results.json file
[
{
"id": "c56a4180-65aa-42ec-a945-5fd21dec0538",
"name": "OurPBIWorkspace",
"description": "Awesome BI reports",
"type": "Workspace",
"state": "Active",
"isOnDedicatedCapacity": false,
"reports": [...],
"dashboards": [...],
"datasets": [...],
"dataflows": [...],
"datamarts": [...]
}
]
The response is an array of WorkspaceInfo objects describing the workspace and the metadata collected during the scan. The datasets field is an array of WorkspaceInfoDatasets objects. It lists each dataset in the Power BI workspace with metadata like name, creation date, etc. Why is it important for data lineage?
The WorkspaceInfoDatasets object has a tables field containing an array of the tables that are part of the dataset. For each table, the source.expression field is what we are interested in: it contains the full Power Query M expression used to load and transform the data. Example:
"source": [
  {
    "expression": "let\n    Source = Sql.Database(\"myserver.database.windows.net\", \"mydatabase\", [Query=\"SELECT COUNT(DISTINCT UserId) CountOfUsers FROM Users\"])\nin\n    Source"
  }
]
Why is this relevant at all? Do you remember the original question from our developer?
“I might change the type of the SubscriptionType column in the Users table from Int to String. Which reports are impacted by such a change?”
With a simple CTRL+F search in this JSON file, we can quickly find the Power BI datasets where the Users table is used. That saves plenty of time compared with manually digging through the .pbix files of every report.
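The search itself can also be scripted instead of done by hand. Here is a minimal sketch in Python (chosen for convenient JSON handling; the field names mirror the scan-result shape shown above, while the sample data, dataset names, and function name are purely illustrative):

```python
import re

def datasets_using_table(workspaces, table_name):
    """Return names of datasets whose M expressions reference table_name."""
    pattern = re.compile(r"\b" + re.escape(table_name) + r"\b")
    hits = []
    for ws in workspaces:
        for ds in ws.get("datasets", []):
            for table in ds.get("tables", []):
                # each table carries its Power Query M expression in source[].expression
                for src in table.get("source", []):
                    if pattern.search(src.get("expression", "")):
                        hits.append(ds.get("name"))
    return sorted(set(h for h in hits if h))

# minimal sample mirroring the scan-result structure (illustrative names)
sample = [{
    "name": "OurPBIWorkspace",
    "datasets": [{
        "name": "UsersReportDataset",
        "tables": [{
            "source": [{"expression": "let Source = Sql.Database(\"myserver\", \"mydb\", [Query=\"SELECT COUNT(DISTINCT UserId) CountOfUsers FROM Users\"]) in Source"}]
        }],
    }],
}]

print(datasets_using_table(sample, "Users"))  # ['UsersReportDataset']
```

Against the real scan output, you would load the saved file instead, e.g. `workspaces = json.load(open("results.json"))`, and pass in whichever table name the developer plans to change.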
Do you want to hear more about Flowe? Look at all the open positions!