Injecting lineage and attributes into Microsoft Purview

Nicholas Hurt
Published in Microsoft Azure
14 min read · Jul 27, 2022

Introduction

“How do I easily find the data I’m looking for?”

“How do I know the data comes from a reliable source?”

These are common questions we hear from data consumers, particularly as the organisation starts to scale its data platform across numerous data sources and data producers. Identifying the right datasets across numerous data stores, or within a heap of catalog search results, can be a challenging and time-consuming process. Enriching consumption-worthy assets with additional business and technical metadata forms part of the solution. This blog describes and demonstrates how to easily add lineage and attributes in order to improve the data consumer’s experience.

Current capabilities…

Automatic lineage collection in Microsoft Purview is currently only supported for certain ADF and Synapse activities. No doubt this will expand over time, but you may wish to capture lineage from other engines such as Synapse Spark or Databricks. Whilst there is a Spark-based lineage collector, as well as the Azure Databricks to Purview Lineage Connector based on OpenLineage, you can alternatively inject your own lineage programmatically: for every transformation or process which creates a new dataset, some additional code needs to run in order to create the relationship (linkage) between inputs and outputs. This approach requires data producers to be diligent about adapting their current and future (ETL) processes to “publish” this critical metadata to Purview, and this post aims to demonstrate one of the ways this can be achieved.

Classifications are a good way of describing technical metadata, but consumers often need business understanding/metadata to complement this, in which case you may wish to assign attributes or tags to the data. The workflow engine in Microsoft Purview may well evolve to automate this in the future, but for the time being it has to be either a manually applied or a scripted process, depending on the business rules/logic.

This blog post will show you how to inject both of these metadata elements into your Purview catalog, also known as the Purview Data Map. The full source code for this walkthrough can be found in this Github repository, which includes notebooks with markdown comments, but for the remainder of this post we’ll go into a bit more detail. It assumes you have some understanding of Python and how to interact with Purview using the SDKs/APIs, including prerequisites such as creating a service principal and assigning the necessary roles.

Provenance and Process

To inject lineage between datasets which reside in ADLS (data lake), for example, we will use the superb PyApacheAtlas package by Will Johnson. Adapted from one of the sample notebooks in his repository, the notebook used in this blog can be found here. Ideally, git clone the repo linked above in order to download both notebooks used in this post.

To run the code, we’ll use Python in Synapse Spark, as this might be a common point of entry where data producers/engineers might run their data transformations and will know the inputs and output datasets to their process(es).

When using the Python distributions bundled with Apache Spark, one typically needs to attach additional libraries either to the cluster or to the session, otherwise this happens….

There are a number of ways to manage Python libraries in Synapse Spark, however a simple approach is to create a requirements file…
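For example, a minimal requirements file that pulls in the package used throughout this post could contain just the line below (pin a specific version if you want a reproducible environment):

pyapacheatlas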

and upload it to the spark pool…

Once you have opened/imported the sample notebook into a Synapse Spark notebook, enter the Purview account and authentication details in cells 2 and 3. In summary, you’ll need to define (and optionally create) both the source and target assets/entities (using upload_entities) in cell 4, then in the following cell (5) create the relationship (lineage) between these inputs and outputs. If either the inputs or outputs don’t exist at that point, no relationship can be created. If one of them already exists, such as the input asset, it is not affected by this operation.
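As a rough sketch (the account name and credentials below are placeholders), cells 2 and 3 essentially boil down to creating an authenticated PurviewClient:

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

# Service principal credentials created as part of the prerequisites
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>"
)

# Client used for all the subsequent upload_entities calls
client = PurviewClient(account_name="<purview-account-name>", authentication=auth)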

From a timing perspective, typically the input assets already exist in Purview, discovered via a scan, and a Spark job is going to make use of these to produce new assets or target outputs. For existing (scanned) assets, ensuring the fully qualified name is set correctly in the AtlasEntity section of the notebook (as shown below) avoids creating a duplicate asset. The qualified name can be found in the overview tab. Additionally, set the GUID, which can be found in the URL when viewing the asset in the Purview UI. Alternatively, this information can be obtained via the discovery_query API or the get_entity method, which would make more sense as part of a scripted approach. Enter the entity name and type; a full list of type names can be obtained using the GetEntityDefinition API.

Below is an image of the example input and output assets in the notebook. The input json file is going to be transformed by Spark into a set of parquet files, which we know will be detected as a resource set. At this stage a scan has not been run on the input location, therefore we are using a new negative GUID to let Purview know this is a new asset.
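A minimal sketch of what cell 4 might look like follows; the type names are the standard ADLS Gen2 Purview types and the storage paths are placeholders, not the exact values from the sample notebook:

from pyapacheatlas.core import AtlasEntity

# Input: a raw json file; a negative guid tells Purview this is a new (unscanned) asset
input_entity = AtlasEntity(
    name="raw_data.json",
    typeName="azure_datalake_gen2_path",
    qualified_name="https://<storageaccount>.dfs.core.windows.net/raw/raw_data.json",
    guid="-1000"
)

# Output: the folder of parquet files Spark will write, later detected as a resource set
output_entity = AtlasEntity(
    name="standardised",
    typeName="azure_datalake_gen2_resource_set",
    qualified_name="https://<storageaccount>.dfs.core.windows.net/gold/standardised/",
    guid="-1001"
)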

As mentioned, the upload_entities call ensures that both the inputs and outputs exist in the Data Map so that a relationship can be formed between them. For assets that haven’t been scanned and don’t yet exist in the Data Map, it simply creates a “placeholder” entity which is visible via the Purview UI. Then we can define the lineage artifact using the AtlasProcess class, passing these input(s) and output(s); see the image below of cell 5. Make sure to specify a unique process qualified name (process_qn) for each new relationship, otherwise you will overwrite an existing process with the same name, and the existing lineage for that process will be lost! This means you may want to think about appending a unique GUID or task/backlog item number to the process_qn. The process type name should be “Process” unless you wish to create your own process type using the typedefs API; for reference, here is a type definition which could be used for Databricks types. The name of the process can be anything which describes your process. Finally, the upload_entities call creates this relationship, making the lineage visible in the Purview UI.
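A sketch of cell 5, continuing from the placeholder entities defined above (the qualified name pattern shown is just one way of keeping it unique):

import uuid
from pyapacheatlas.core import AtlasProcess

# A unique qualified name per relationship avoids overwriting existing lineage
process_qn = f"pyapacheatlas://synapse_spark_process_raw/{uuid.uuid4()}"

process = AtlasProcess(
    name="Synapse Spark - process raw",
    typeName="Process",
    qualified_name=process_qn,
    inputs=[input_entity],
    outputs=[output_entity],
    guid="-1002"
)

# Upload the placeholder entities and the process in one batch;
# the lineage then appears on the assets in the Purview UI
results = client.upload_entities(batch=[input_entity, output_entity, process])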

Once you have populated and run cells 1–5 you should see lineage between the assets. If not, hit Refresh!

One can add additional attributes such as a description, and owners/experts using their object IDs from AAD. This will enable your data consumers to know who to contact if they have questions about the process. An example is shown below which defines whether the contact is an Expert or Owner, the AAD object ID, and an info field, which can be any text string such as an extension number.

process = AtlasProcess(name="Synapse Spark - process raw",
    typeName="Process", qualified_name=process_qn,  # unique per relationship, as above
    inputs=[input_entity], outputs=[output_entity],
    attributes={"description": "Spark job to transform raw files into standardised format"},
    contacts={"Expert": [{"id": "aaaaaaa-0000-000a-aa00-0a000a0000a0", "info": "ext 3234"}],
              "Owner": [{"id": "aaaaaaa-0000-000a-aa00-0a000a0000a0", "info": "ext 2553"}]})

Now that you have seen how to inject lineage programmatically, it is time to think about how this could be included as a repeatable step in your business’ data pipelines. Ideally you may want to convert cells 1–5 into a library so that they can be called in a single function call, passing the required information as parameters. As part of a data pipeline, the engineer will most likely be referencing these (fully qualified) data asset paths for source and target, unless they are using mounts, in which case the mount paths will need to be translated (ideally programmatically) to the full path, otherwise you will end up with duplicate assets in the data map: (1) the injected asset and (2) the scanned asset.

In reality you may have a number of source-to-target transformations, and often multiple source datasets producing a single target dataset, which is more representative of a full analytics (ETL) pipeline. You will see an example of this in the last cell of the notebook. Define multiple input sources (and possibly multiple output targets) and, for each new target asset being produced, simply repeat the process of defining inputs/outputs and the AtlasProcess, then make the upload_entities call. You may end up with something like this to represent a more realistic pipeline which standardises some data before joining it to produce a final target dataset.
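For instance, a join step with two inputs and one output only differs in the lists passed to AtlasProcess; the entity names below are hypothetical AtlasEntity objects defined in the same way as before:

# Two standardised inputs joined into one curated output
join_process = AtlasProcess(
    name="Join clients and markets",
    typeName="Process",
    qualified_name=f"pyapacheatlas://join_clients_markets/{uuid.uuid4()}",
    inputs=[clients_entity, markets_entity],
    outputs=[curated_entity],
    guid="-2000"
)
client.upload_entities(batch=[clients_entity, markets_entity, curated_entity, join_process])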

Attributes to enrich your metadata

Managed Attributes in Purview are a group of related key-value pairs that can be assigned against an asset to provide additional business context and understanding. The first step is to create the definition of these related attributes before they can be populated and assigned to an asset. Whilst one can do all of these steps using the Purview UI (public preview at the time of writing), this post will demonstrate how to use the associated APIs in order to automate the process.

To create the definition of the group of attributes, we use the typedefs API, and then to populate the attributes against a particular asset we use the business metadata APIs. Business metadata can also be created using pyapacheatlas as shown in this sample; however, the next section will show you how to utilise the Atlas 2.2 APIs directly. Open the autotagger.ipynb notebook in VS Code or Synapse and follow along as we walk through the cells.

Our scenario is based on a common requirement: we need to add additional context to assets based on some related information, such as the source system or characteristics of the path (fully qualified name). For example, you might have a taxonomy for your data lake whereby a folder name includes an acronym which defines the business unit, region or client. This is what an asset may look like in Purview after the initial scan…

The path (fully qualified name) for this collection of spark files (resource set) has three mysterious acronyms: gold, baa and dz. The attributes we want to apply should be the full business terms rather than the acronyms, so that data consumers are more likely to understand and find the data they’re looking for using words they’re familiar with. These acronyms require some sort of translation, and the lookup information could reside in an application, flat file or database. Data stewards could manually look these up and populate the asset’s attributes using the Purview UI, but as all the information is available programmatically, let’s automate the process...

The associated notebook provides boilerplate code (cells 1 and 2) for both file and database lookups; however, if you are using a database table you will need to ensure the pyodbc driver/library is installed, and you will need to decide whether to use database or AAD authentication, as the connection string will be different.
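Whatever the source, the end result is a pair of lookup objects. For the purposes of this walkthrough, something as simple as the following dictionaries (values taken from the example scenario) would do:

# Acronym-to-business-term lookups, e.g. loaded from a flat file or a database table
client_lookup = {"baa": "Bank of ACME"}
market_lookup = {"dz": "Algeria"}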

Cell 3 requires the account details and service principal credentials to authorise and interact with the Purview (Atlas) APIs. In cell 4 we define the business metadata group; in our example we named it “Data Product” with two main attributes, one for client and the other for market. Additional attributes such as data domain, product grouping or lifecycle (stage) may help define the data product, as shown in our example. These related attributes can be thought of as placeholders, which will later be used when assigning key-value pair attributes to assets, grouped by the business metadata definition.
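A rough sketch of that definition via the Atlas typedefs endpoint is shown below. The attribute names mirror the example, the maxStrLength values are arbitrary, and the token acquisition is one way of authenticating the service principal (cell 3 of the notebook may differ):

import requests
from azure.identity import ClientSecretCredential

# Acquire a bearer token for the Purview data plane using the service principal
credential = ClientSecretCredential(tenant_id="<tenant-id>", client_id="<client-id>",
                                    client_secret="<client-secret>")
token = credential.get_token("https://purview.azure.net/.default").token

purview_endpoint = "https://<purview-account-name>.purview.azure.com"
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# Business metadata group "Data Product" with string attributes applicable to datasets
typedefs = {
    "businessMetadataDefs": [{
        "name": "Data Product",
        "attributeDefs": [
            {"name": "Client", "typeName": "string",
             "options": {"maxStrLength": "100", "applicableEntityTypes": '["DataSet"]'}},
            {"name": "Market", "typeName": "string",
             "options": {"maxStrLength": "100", "applicableEntityTypes": '["DataSet"]'}},
            {"name": "Stage", "typeName": "string",
             "options": {"maxStrLength": "50", "applicableEntityTypes": '["DataSet"]'}}
        ]
    }]
}

resp = requests.post(f"{purview_endpoint}/catalog/api/atlas/v2/types/typedefs",
                     headers=headers, json=typedefs)
resp.raise_for_status()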

As we are not able to assign/apply these attributes during the scan (although this may change in the future with the Workflow feature), for now we need to retrospectively (after a scan) identify the assets we wish to tag, in order to start populating and assigning these attributes.

In the sample notebook we have defined a search query which returns all the resource sets below a certain path in ADLS within a specific collection.
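A sketch of such a query against the discovery (search) API follows; the filter fields, API version and path-matching logic shown here are illustrative, so check the current REST reference for your account (this reuses the purview_endpoint and headers defined above):

# Find resource sets within a given collection, then keep those below the gold container
search_body = {
    "keywords": None,
    "limit": 50,
    "filter": {
        "and": [
            {"collectionId": "<collection-id>"},
            {"entityType": "azure_datalake_gen2_resource_set"}
        ]
    }
}
search_url = f"{purview_endpoint}/catalog/api/search/query?api-version=2022-03-01-preview"
hits = requests.post(search_url, headers=headers, json=search_body).json()["value"]
search_results = [a for a in hits if "/gold/" in a["qualifiedName"]]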

Assume we want to tag all assets ready for consumption, and we know that these are stored in the “gold” container, so all of them should be tagged with the Stage attribute set to “Final”. The logic loops through the assets found below the gold container level and inspects two levels deep into the folder hierarchy: one level for client and the next level down for market. At each level it performs the associated lookup using the dictionary objects for market and client created in the first step.

Iterate through each asset and its folder hierarchy

It then populates any valid lookup results as key-value pairs in the business metadata group called “Data Product” and assigns it to the asset using its GUID. For example, in the screenshot above of asset “dz”, the first folder is the client name, baa, so the client attribute is set to the result of the lookup, which is Bank of ACME; the second folder is the market, dz, so market is set to Algeria.

Apply tags using the business metadata API
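Put together, a minimal sketch of that loop might look like the following. The path parsing is deliberately simplified, and the per-GUID call is the Atlas 2.2 business metadata endpoint:

for asset in search_results:
    # e.g. .../gold/baa/dz/somefolder -> ["baa", "dz", "somefolder"]
    parts = asset["qualifiedName"].split("/gold/")[-1].strip("/").split("/")

    payload = {"Data Product": {"Stage": "Final"}}
    if len(parts) > 0 and parts[0] in client_lookup:
        payload["Data Product"]["Client"] = client_lookup[parts[0]]
    if len(parts) > 1 and parts[1] in market_lookup:
        payload["Data Product"]["Market"] = market_lookup[parts[1]]

    # Apply the attributes to this asset via the Atlas 2.2 business metadata API
    bm_url = f"{purview_endpoint}/catalog/api/atlas/v2/entity/guid/{asset['id']}/businessmetadata"
    requests.post(bm_url, headers=headers, params={"isOverwrite": "true"}, json=payload)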

Now we see these attributes populated in the Purview UI!

Note that the business metadata API is called per asset (GUID), which could take some time if you have a large number of assets; at the time of writing there was no equivalent bulk API which would make the process more efficient. Also note that making numerous API calls may cause the Elastic Data Map to scale up based on the increased throughput whilst the script is running.

Now that you have programmatically added attributes to the assets, your data consumers can view them in the new managed attributes section of the overview tab. These attributes could be further enriched by data stewards/owners simply by editing the asset’s metadata directly in the Purview UI and populating additional attributes. Alternatively, data producers can integrate this process into their data pipeline by creating a placeholder asset as described in the previous section and setting the attributes. If the qualified path is set correctly for the placeholder asset, a subsequent scan will enrich the existing placeholder asset rather than creating a duplicate. Ensure that you grant the appropriate permissions (such as the data curator role) at a collection level to only the data owners/producers, rather than everyone, to avoid accidental or inaccurate metadata creeping in.

A fully curated set of managed attributes

These attributes could be particularly useful when describing a data product, but how does this help data consumers find the right assets in the first place? In the future there may be filters in the search results page, but for now, simply adding these keywords to the search term will boost the ranking of the tagged assets, so long as you switch to the Relevance sort order rather than sorting by name (see the drop down in the top right corner).

Here’s an example: we started by searching for a particular client called BANK OF ACME, and typing in ACME returned a heap of results…

Searching “Bank of ACME” would not be much better, as we know the data producers use acronyms, and what we really wanted was the Bank of ACME clients in Algeria (again identified by a country code which we, the data consumer, could not remember) in the Marketing domain. So long as the metadata has been assigned appropriately, entering all of these terms boosts the correctly tagged asset to the top of the search results. Our asset also happens to be marked as certified, normally meaning the data producer/owner has indicated that this asset meets a certain level of quality and is consumption-worthy for producing analytics.

Without the attributes on that asset, we would have had to know all the acronyms and scan the qualified paths of all the results (this one being BAA and DZ), a potentially time-consuming process! Anyone working with a certain large ERP system may know this dilemma only too well.

Attributes are everywhere…

One aspect of attributes that may not be immediately apparent is that they can be applied to any asset type in Purview. Even the lineage entity we automated above can benefit from attributes. Here’s a scenario: assume I am a data consumer who wants to understand what a process does and what transformations occur. I am not technical, so I don’t want to look at code or open a pipeline diagram; I would rather use a business tool such as Azure DevOps, ServiceNow, or a wiki to find information relating to this development work. Using attributes, we can now empower engineers and developers to associate their transformation logic/lineage process with a work item number, not only to save others time, but also to save themselves from having to explain what their process does in great detail to a multitude of folk around the business!

Just ensure the associated asset type (applicableEntityTypes) is included in the definition of the attribute you’re trying to apply (the definition sketched earlier only listed DataSet, for example); the full list of types can be found in the UI. When a custom type has been used, simply set it to Referenceable.
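To allow the same attribute on lineage processes as well as datasets, the option value would simply include the Process type too. An illustrative attribute definition (the attribute name here is hypothetical) might be:

# Illustrative only: allow the attribute on both datasets and lineage (Process) entities
{"name": "Work Item", "typeName": "string",
 "options": {"maxStrLength": "50", "applicableEntityTypes": '["DataSet", "Process"]'}}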

Conclusion

In this post we have discussed two techniques which inject additional metadata against data assets in order to help data consumers find and trust consumption-ready datasets. Depending on your data sources and transformation engine, these metadata artefacts may become more natively integrated into the Purview service in the future. Whilst incorporating these steps programmatically or manually may seem like an additional overhead or burden for data owners/producers, it is part of becoming a data-driven culture, one where everyone in the organisation takes ownership of ensuring their high quality data products are easily discoverable and trustworthy, for the benefit of the data consumers that rely on them to make data-driven decisions.

— — — — — —

This blog has been written from experience and to the best of our knowledge, but if you spot any errors in this text or code, or if you have any improvements you would like to suggest, please feel free to leave a comment!
