Automatically populate your Microsoft Purview’s term definitions using OpenAI APIs

Jorge G
6 min readFeb 9, 2023

--

photography from Unsplash

This is the second article of a series about OpenAI. In the first article, we created a small flask application that was integrated with Microsoft Teams using Outgoing Webhooks.

In this article, we will be creating a small Streamlit application to automatically extract the terms from Purview Glossaries that do not have a definition using the Atlas APIs and propose a potential definition using the OpenAI API.

Organizations often face difficulties in gaining a complete understanding of their data, including what data they have, where it is located, and how to access it effectively. This can result in data lacking context and metadata, making it difficult for business users to find and use relevant information. Additionally, a lack of data governance can result in underutilization of data and inconsistent data security practices.

Microsoft Purview offers a solution to these challenges. It empowers organizations to gain a comprehensive view of their data, while also maintaining control over its usage. With Purview, organizations can discover, catalog, and gain insights into their data estate. Additionally, they can centralize data governance and regulate access to data, improving data security and agility.

Adopting a data governance solution can pose several challenges, including resistance to change from employees, data silos, lack of resources, complexity in managing complex data structures, difficulties in integrating with existing systems, ensuring data privacy and security, and measuring the success of the initiatives. The implementation of data governance often requires changes to existing processes, which can lead to resistance from employees. Additionally, data is often stored in silos and in different formats, making it difficult to gather and standardize for governance purposes. Implementing a data governance solution also requires dedicated resources such as personnel and budget, which can be a challenge for some organizations. Managing complex data structures, relationships, and metadata can also prove to be a challenge for data governance solutions. Integrating data governance solutions with existing systems and data sources can also be complex. Ensuring data privacy and security while implementing data governance can also pose a challenge, as regulations and data protection laws vary across regions and industries. Finally, measuring the success of data governance initiatives and demonstrating their value to stakeholders can be difficult.

The purpose of this small tutorial is to help enhance the adoption of Microsoft Purview by automating certain tasks! We will be aiming to streamline data stewardship tasks by automating them and suggesting potential definitions for data assets.

Step-by-step guide

1. Creating Microsoft Purview API keys

Firstly we will start going to the Azure Portal to create a service principal. In the portal, you will navigate to Azure Active Directory and then to app registrations. There you will create a new app registration as follows:

Note that for the purpose of our demo, the Redirect URL does not have to be a valid endpoint. Once the Application is registered you will have to navigate to and create a new secret. Navigate then to your purview instance and inside the data catalog you will have to add the necessary roles for the service principal we just created, to be able to access the collections that it will be accessing.

2. Deploying your application

In this second step we will be deploying our python streamlit application to propose new definitions for the Purview terms which would not have one. You will find the code for such an application here bellow:

import streamlit as st
import pandas as pd
import openai
import requests
import json
from azure.purview.catalog import PurviewCatalogClient
from azure.identity import ClientSecretCredential
from azure.core.exceptions import HttpResponseError

openai.api_key = "OPENAI_KEY"
openai.api_base = "https://<OPENAI_WORKSPACE>.openai.azure.com/"
openai.api_type = 'azure'
openai.api_version = '2022-12-01'
deployment_id='OPENAI_DEPLOYMENT'

client_id = "PURVIEW_APP_CLIENT_ID"
client_secret = "PURVIEW_CLIENT_SECRET"
tenant_id = "AZURE_TENANT_ID"
reference_name_purview = "PURVIEW_ACCOUNT_NAME"


def get_credentials():
credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
return credentials

def get_purview_client():
credentials = get_credentials()
client = PurviewCatalogClient(endpoint=f"https://{reference_name_purview}.purview.azure.com", credential=credentials, logging_enable=True)
return client

def update_glossary_term_definition(purview_catalog_client,term_guid, definition):
print("UPDATE GLOSSARY TERM DEFINITION FOR GUID: "+str(term_guid))
partial_updates = {"longDescription": definition}
try:
response = purview_catalog_client.glossary.partial_update_glossary_term(term_guid, partial_updates)
st.success('You successfully saved the definition!')
print(response)
except HttpResponseError as e:
st.error('There was an error while saving the definition.')
print(e.message)

def check_column_exists(df, col_name):

if col_name in df.columns:
return True
else:
return False

def populate_dictionary(df):
print("populate dictionary")
for index, row in df.iterrows():
df.at[index,"Definition"] = openai_call("Act as a dictionary and give me the definition of the term "+ row["Name"])
return df


def openai_call(prompt_text):
response = openai.Completion.create(
engine=deployment_id,
prompt=prompt_text,
temperature=0.9,
max_tokens=150,
top_p=1,
frequency_penalty=0.0,
presence_penalty=0.6,
stop=[" Human:", " AI:"]
)
return response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip()



if __name__ == '__main__':
print("GET connection to Purview")
credential = get_credentials()
purview_catalog_client = get_purview_client()
print(purview_catalog_client)
response = purview_catalog_client.glossary.list_glossaries()
try:
response = purview_catalog_client.glossary.list_glossaries()
print(response)
glossary_dict = {}
glossary_name_list = []
term_list = []

for item in response:
glossary_name_list.append(item['name'])
glossary_dict[item['name']] = item['guid']
st.title('Purview Definition Populator')
selected_glossary = st.selectbox('Which glossary do you want to use?', glossary_name_list)
st.write('You selected glossary:', selected_glossary)

selected_glossary_guid = glossary_dict[selected_glossary]

response = purview_catalog_client.glossary.get_detailed_glossary(glossary_guid=selected_glossary_guid, include_term_hierarchy=True)

for item in response['terms']:
term_definition = response["termInfo"][item['termGuid']].get('longDescription', "No description")
term_list.append({"Name":item['displayText'], "Guid":item['termGuid'],"Definition":term_definition})
df = pd.DataFrame(term_list)
st.header('Non populated glossary')
st.dataframe(df)
df = df[df['Definition'].isin(["No description", "<br>"])]
df["action"]=True
popoulated_dictionary = populate_dictionary(df)
st.header('Populated glossary')

col1, col2, col3 = st.columns([1, 3, 1])
col1.subheader("Name")
col2.subheader("Proposed Definition")
col3.subheader("Action")
for index, row in df.iterrows():
col1, col2, col3 = st.columns(3)
with col1:
st.write(row["Name"])
with col2:
st.write(row["Definition"])
with col3:
disable_status = row['action']
button_type = "Validate" if disable_status else "Validated"
button_phold = col3.empty()
do_action = button_phold.button(button_type, key="_btn"+str(row["Guid"]))
if do_action:
update_glossary_term_definition(purview_catalog_client,row["Guid"],row["Definition"])
button_phold.empty()


except HttpResponseError as e:
print(e)


Once you run the python file you should get a web interface where you can select the glossary you will want to work with, the definitions on this one, and a table with potential definitions for the terms which do not contain one.

If the proposed definition fits the term you can validate it and it will be automatically updated in your purview instance.

For the purpose of this article, we used OpenAI to populate definitions for glossary terms but this could be easily extended and used to autocomplete other types of information, such as synonyms and antonyms, or for other types of assets, such as data assets, using as context for the prompts, the potential metadata extracted by the Purview scanners.

3. Conclusions

This is just an example of how integrating OpenAI into another tool can help efficiency and processes. Automating processes can play a crucial role in promoting the wider adoption of data governance tools such as Microsoft Purview. Automating data management processes using Purview can improve data discovery and classification, reducing the time and effort required to manually manage large amounts of data. With automated data mapping and cataloging, data stewards and analysts can easily understand data relationships and lineage, which can help in making informed decisions about data usage and protection. Automating the creation of definitions for the business glossary terms can improve processes and help data stewards and business focus on the data. Additionally, Purview allows you to automate data access and help organizations enforce data security and privacy regulations by using policies, reducing the risk of data breaches and non-compliance fines. In addition, Purview’s integration with other Microsoft products such as Azure OpenAI, Power Platform, and Dynamics 365 can further streamline data management processes, reducing the time and effort required to manage data across the organization.

Thank you for reading. Do not hesitate to reach out if you would have any questions!

--

--

Jorge G

Cloud Solution Architect in Data & AI at Microsoft