Simplifying Lineage Mapping in Microsoft Purview: A Python Approach

Published in

BI3 Technologies

5 min readMay 30, 2024

Simplifying Lineage Mapping in Microsoft Purview: A Python Approach

Introduction

In today’s data-driven world, understanding data flow is essential for data integrity, compliance, and decision-making. Microsoft Purview provides robust data governance and lineage tracking. This guide explores creating lineage between dimension and fact tables in Purview by extracting Excel lineage details using Python.

Why Lineage Matters

Before delving into the technical aspects, let’s understand the importance of lineage. Lineage offers a clear view of the data’s journey from its origin to its endpoint. It assists in:

Data Quality Assurance: By tracing the data lineage, organizations can ensure data accuracy and identify any discrepancies or anomalies in the data flow.
Regulatory Compliance: Regulations governing data handling and reporting are strict in many businesses. By offering a clear picture of data transit and transformations, lineage aids in proving compliance.
Impact Analysis: Lineage enables organizations to evaluate the potential effects on downstream systems and reports when modifications are made to data sources or transformations.

Extracting Lineage Details from Excel

Microsoft Purview supports various data sources for lineage tracking, including Excel files. Using Python, we can extract lineage details from Excel files and push them to Microsoft Purview. Here’s a step-by-step guide:

Install Required Libraries: Begin by installing the necessary Python libraries such as pandas, JSON, and requests, which will facilitate data extraction and interaction with Microsoft Purview API.

In this context, two distinct APIs have been utilized: one for generating entities and the other for establishing relationships between existing entities.

Lineage creation can be accomplished by following these steps.

Step -1: Read Excel File

Utilize the pandas library to read the Excel file and extract metadata to construct data for transmitting to the API.

Step -2: Define Entities and Relationships

In the “Entity” sheet, align table names with their respective column names. Meanwhile, in the “Dependency” sheet, describe the relationships between tables by indicating the source and destination table names along with the columns from each table utilized for joining.

Step -3: Push Lineage to Purview

Utilize the Purview API, along with authentication tokens, to push the extracted lineage details to Microsoft Purview.

import requests
import json
import pandas as pd

AZ_TENANT_ID="xxx"
AZ_CLIENT_ID="yyy"
AZ_CLIENT_SECRET="zzz"
EXCEL_PATH="C://Users/abcd/lineage.xlsx"

def get_access_token() -> str:
    # Authentication: Obtain an Azure AD token
    token_url = f"https://login.microsoftonline.com/xxxxxxxxxxxxxxx/oauth2/token"
    token_payload = {
        "grant_type": "client_credentials",
        "client_id": AZ_CLIENT_ID,
        "client_secret": AZ_CLIENT_SECRET,
        "resource": "https://purview.azure.net",
    }
    token_response = requests.post(token_url, data=token_payload)
    access_token = token_response.json()["access_token"]
    return access_token

def calculate_col_count(df):
    table_column_counts = df.groupby("TableName")["ColumnName"].nunique()
    result = {}
    for table, column_count in table_column_counts.items():
        result[table] = column_count
    return result

def generate_guid():
    df = pd.read_excel(EXCEL_PATH, sheet_name="Entity")
    grouped = df.groupby("TableName")
    resulttableguid = {}
    idx = 0
    for group_name, group_df  in grouped:
        idx += 1
        resulttableguid[group_name] = "-" + str(idx)
    return resulttableguid

def create_entity():
    API_ENDPOINT_1 = (
        "https://PURVIEW_NAME.purview.azure.com/datamap/api/atlas/v2/entity/bulk"
    )
    access_token = get_access_token()
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    df1 = pd.read_excel(EXCEL_PATH, sheet_name="Entity")
    calculate_col_count(df1)
    resulttableguid=generate_guid()
    iterationcount = 0
    grouped = df1.groupby("TableName")
    data = {"entities": []}
    # Iterate over each group (each distinct table name)
    for table_name, group_df in grouped:
        table_data = {
            "typeName": "hive_table",
            "attributes": {
                "qualifiedName": f"test_lineage.{table_name}",
                "name": f"{table_name}",
            },
            "relationshipAttributes": {"columns": []},
            "guid": f"{resulttableguid[table_name]}",
        }
        for idx, (index, row) in enumerate(group_df.iterrows()):
            column_data = {
                "guid": f"{resulttableguid[table_name]}{idx}",
                "typeName": "hive_column",
            }
            column_full_data = {
                "typeName": "hive_column",
                "attributes": {
                    "qualifiedName": f"test_lineage.{table_name}#{row['ColumnName']}",
                    "name": f"{row['ColumnName']}",
                    "type": f"{row['DataType']}",
                },
                "guid": f"{resulttableguid[table_name]}{idx}",
                "relationshipAttributes": {
                    "table": {
                        "guid": f"{resulttableguid[table_name]}",
                        "typeName": "hive_table",
                    }
                },
            }
            iterationcount += 1
            data["entities"].append(column_full_data)
            table_data["relationshipAttributes"]["columns"].append(column_data)
        data["entities"].append(table_data)
        # Keeping 20 JSON objects in array for API call similar to pagination
        if iterationcount % 20 == 0:
            response1 = requests.post(API_ENDPOINT_1, json=data, headers=headers)
            data = {"entities": []}
        else:
            response1 = requests.post(API_ENDPOINT_1, json=data, headers=headers)
        return response1
    
def add_relationship():
    mappingarr = []
    colmapfinal = ""
    previous_row = None
    API_ENDPOINT_2 = (
        "https://PURVIEW_NAME.purview.azure.com/datamap/api/atlas/v2/relationship"
    )
    access_token = get_access_token()
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    resulttableguid=generate_guid()
    df2 = pd.read_excel(EXCEL_PATH, sheet_name="Dependency")
    for index, row in df2.iterrows():
        mapping = {"Source": f"{row['SourceColumn']}", "Sink": f"{row['SinkColumn']}"}
        if previous_row is not None:
            if (row["SourceTable"] == previous_row["SourceTable"]) and (
                row["SinkTable"] == previous_row["SinkTable"]
            ):
                mappingarr.append(mapping)
            else:
                mappingarr = []
                colmapfinal = ""
                mappingarr.append(mapping)
        else:
            mappingarr.append(mapping)
        previous_row = row
        colmapfinal += json.dumps(mappingarr)
        datatemplate = {
            "typeName": "direct_lineage_dataset_dataset",
            "guid": f"{resulttableguid[row['SourceTable']]}",
            "end1": {
                "typeName": "hive_table",
                "guid": f"{resulttableguid[row['SourceTable']]}",
                "uniqueAttributes": {
                    "qualifiedName": f"test_lineage.{row['SourceTable']}"
                },
            },
            "end2": {
                "typeName": "hive_table",
                "guid": f"{resulttableguid[row['SinkTable']]}",
                "uniqueAttributes": {
                    "qualifiedName": f"test_lineage.{row['SinkTable']}"
                },
            },
            "attributes": {"columnMapping": colmapfinal},
        }
        response2 = requests.post(API_ENDPOINT_2, json=datatemplate, headers=headers)
        return response2

if __name__ == "__main__":
    create_entity()
    add_relationship()
    print("Lineage creation successful!")

Visualizing Lineage

After executing the Python script outlined earlier, we obtained the following lineage diagram representing the data flow between dimension and fact tables within Microsoft Purview:

Note: Open Microsoft Purview Governance Portal (classic)

Access the portal and click “Data catalog” tab.

2. To access lineage information, simply input the table name into the search catalog search box.

3. Now, lineage diagram of the selected table will be displayed under “Lineage” tab.

*Lineage diagram illustrating the flow of data between dimension and fact tables in Microsoft Purview*

4. You can view the complete list of columns in this table within the “properties” tab under the “column” section.

Conclusion:

So far, we’ve explored automating lineage mapping between dimension and fact tables in Microsoft Purview using Python. By leveraging pandas for data extraction and the Purview API for lineage propagation, organizations can simplify their data governance processes. Running the provided Python script ensures smooth execution, resulting in successful lineage relationships within Microsoft Purview. This automation enhances data transparency and compliance, empowering organizations to maximize the value of their data assets.

About Us:

Bi3 has been recognized for being one of the fastest-growing companies in Australia. Our team has delivered substantial and complex projects for some of the largest organizations around the globe, and we’re quickly building a brand that is well-known for superior delivery.

Website: https://bi3technologies.com/
Follow us on,
LinkedIn: https://www.linkedin.com/company/bi3technologies
Instagram: https://www.instagram.com/bi3technologies/
Twitter: https://twitter.com/Bi3Technologies