Data Governance 3.0: Augmenting your Data Governance experience with Purview and OpenAI

Jorge G
12 min read · Jul 4, 2023


Image generated with the definition of data governance

Introduction

In the era of data-driven decision-making, data governance has evolved from mere regulatory compliance to a strategic initiative that fuels informed decision-making. In a previous exploration, we delved into the potential of OpenAI APIs to automate the population of Microsoft Purview’s term definitions, thereby enhancing the efficiency of data governance tasks. Today, we’re embarking on a journey beyond the conventional: welcome to Data Governance 3.0, where we seamlessly integrate data governance tools with Large Language Models (LLMs). With their ability to understand and generate human-like text, LLMs are at the forefront of this revolution, automating a multitude of tasks and enhancing the user experience.

LLMs, such as GPT-3 by OpenAI, have revolutionized the way we perceive automation. With their ability to understand and generate human-like text, they have opened up a myriad of opportunities to automate tasks that were traditionally manual and time-consuming. Let’s delve into how LLMs can redefine the landscape of data governance.

Governance 3.0 represents a paradigm shift in the way organizations manage and govern their data. It’s not about replacing traditional governance approaches, but rather about enhancing them. Governance 3.0 leverages the multitude of APIs and SDKs offered by existing governance systems, such as Purview, to extend their base functionalities. Coupled with the power of Large Language Models and artificial intelligence, this approach allows organizations to automate tasks, integrate systems, and improve efficiency. The beauty of Governance 3.0 lies in its ability to augment existing systems, ensure regulatory compliance, improve data quality, and facilitate efficient data management without the need for disruptive overhauls.

Recommending Data Asset Owners: The AI Perspective

LLMs can analyze usage patterns and access rights to recommend the most suitable data asset owner. This process involves analyzing who frequently accesses and modifies a data asset, who has the necessary permissions, and who has the relevant expertise based on their role or past projects.

For instance, if a particular user frequently accesses and updates a data asset, and their role within the organization aligns with the nature of the data, the LLM might recommend them as the data asset owner. This recommendation can then be reviewed and approved by data stewards or managers, ensuring a human in the loop for final decisions.

This AI-driven approach to assigning data asset ownership ensures accountability and fosters responsible data stewardship. It also helps to keep the data governance framework up-to-date, as data asset ownership can be reassessed and updated as roles change or new data assets are created. This not only enhances the efficiency of data governance processes but also ensures that data assets are managed by the most suitable individuals, thereby improving data quality and trust.
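To make the idea concrete, here is a minimal, hypothetical scoring sketch. The weights, the access-log shape, and the `role_match` input are all invented for illustration; in practice an LLM's judgment and a human reviewer would sit on top of a signal like this rather than the heuristic alone:

```python
from collections import Counter

def recommend_owner(access_log, role_match=None):
    """Score candidate owners for a data asset from its access log.

    access_log: list of (user, action) tuples, where action is
    'read' or 'write'. Writes are weighted higher because they
    suggest active stewardship; a role match adds a bonus.
    All weights here are illustrative, not tuned values.
    """
    scores = Counter()
    for user, action in access_log:
        scores[user] += 3 if action == "write" else 1
    if role_match:
        for user in role_match:  # users whose role fits the asset
            scores[user] += 5
    return scores.most_common(1)[0][0] if scores else None

log = [("alice", "write"), ("alice", "read"),
       ("bob", "read"), ("bob", "read")]
print(recommend_owner(log, role_match={"alice"}))  # alice
```

The recommendation would then surface in the governance tool for a data steward to approve, keeping the human in the loop.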

Securing Data: Automated Review of Accesses and Policies

LLMs can interpret security policies, enabling them to automatically review and flag potential access violations. This involves analyzing user roles, access patterns, and the sensitivity of data assets to determine if access rights align with established security policies.

For instance, if a user’s role doesn’t typically require access to a sensitive data asset, but the user frequently accesses it, the LLM might flag this as a potential violation. Similarly, the LLM could recommend modifications to security policies based on observed access patterns and evolving business needs.

This proactive approach not only enhances data security and compliance but also helps to maintain the principle of least privilege, ensuring that users have access only to the data they need. By automating the review of accesses and policies, we can maintain a robust and secure data governance framework in the face of evolving data landscapes and regulatory requirements.
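The policy comparison described above can be sketched deterministically before any LLM is involved; the role-to-asset policy map and event shape below are assumptions for the example, and an LLM would then summarize or triage whatever gets flagged:

```python
def flag_violations(accesses, policy):
    """Return accesses not covered by the role-based policy.

    accesses: list of (user, role, asset) events.
    policy: dict mapping role -> set of assets that role may touch.
    Anything outside the policy is flagged for review, e.g. by an
    LLM that drafts the finding, or directly by a data steward.
    """
    return [(user, asset) for user, role, asset in accesses
            if asset not in policy.get(role, set())]

policy = {"analyst": {"sales_db"}, "finance": {"sales_db", "ledger"}}
events = [("carol", "analyst", "ledger"),   # outside her role
          ("dave", "finance", "ledger")]    # allowed
print(flag_violations(events, policy))      # [('carol', 'ledger')]
```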

Context-Aware Translation: Bridging Language Barriers with AI

In the realm of data governance, language barriers can pose significant challenges, especially for global organizations operating across different regions. This is where the power of LLMs comes into play, enabling context-aware translation that goes beyond literal word-for-word translation. By understanding the context of the text, LLMs can provide more accurate and meaningful translations, ensuring that the essence and nuances of the original text are preserved. For instance, an LLM can translate complex technical definitions into multiple languages while maintaining the integrity of the technical terms and concepts and taking into account any necessary metadata. This capability can significantly enhance cross-cultural collaboration and understanding within an organization, making data governance more inclusive and effective.
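As a sketch of how such a translation request might be framed, the helper below builds a chat prompt that carries the term's metadata as context. The message layout follows the standard OpenAI chat format used later in this article; the system instruction and field names are illustrative:

```python
def translation_prompt(term, definition, target_lang, metadata=None):
    """Build a chat prompt asking an LLM for a context-aware
    translation of a glossary definition. Metadata (domain, related
    terms) is passed as context so technical terms are preserved
    rather than translated literally."""
    context = f"\nMetadata: {metadata}" if metadata else ""
    return [
        {"role": "system",
         "content": "You translate data-governance glossary entries. "
                    "Keep technical terms and code identifiers untranslated."},
        {"role": "user",
         "content": f"Translate the definition of '{term}' into "
                    f"{target_lang}.\nDefinition: {definition}{context}"},
    ]

msgs = translation_prompt("Churn Rate",
                          "Share of customers lost in a period.",
                          "Spanish",
                          metadata={"domain": "CRM"})
# msgs can then be sent via openai.ChatCompletion.create(engine=..., messages=msgs)
```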

Enriching Glossary Terms with Automated Definitions

Building on our previous work, we can extend the use of LLMs to generate definitions and other types of metadata for a vast array of glossary terms. As demonstrated in our previous article, we utilized OpenAI APIs to automatically populate Microsoft Purview’s term definitions. This not only reduces the manual effort required by data stewards but also ensures a consistent understanding across the organization.

Automated Entity Linking: Bridging Glossary Terms and Data Assets

One of the most powerful applications of Large Language Models in data governance is automated entity linking. This process involves identifying relevant connections between glossary terms and data assets, creating a more holistic and interconnected data governance framework.

With automated entity linking, LLMs can analyze the context and content of data assets and link them to the appropriate glossary terms. This not only enhances the metadata of the data assets but also enriches the glossary terms with real-world examples and applications.

For instance, a data asset containing customer transaction information could be automatically linked to glossary terms like “Customer ID”, “Transaction Amount”, or “Purchase Date”. This provides a direct connection between the theoretical definition of the glossary term and its practical implementation in the data asset.

This level of automation significantly reduces the manual effort required to maintain and update these links, ensuring that the data governance framework remains up-to-date and relevant as new data assets are created and existing ones are updated. Moreover, it provides users with a more comprehensive understanding of their data landscape, facilitating more effective and informed data usage and decision-making.
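A simple way to shortlist candidate links before asking an LLM to confirm them is word-overlap matching between column names and glossary term names. The sketch below is illustrative only; a production linker would also feed the LLM sample values and asset descriptions:

```python
import re

def normalize(name):
    """Lower-case and split identifiers like 'CustomerID' or
    'customer_id' into word sets for fuzzy matching."""
    words = re.findall(r"[A-Za-z][a-z]*", name)
    return {w.lower() for w in words}

def link_candidates(columns, glossary_terms):
    """Shortlist glossary terms whose words overlap a column's words.
    In a fuller pipeline an LLM would confirm or reject each
    candidate using the column's context and content."""
    links = {}
    for col in columns:
        col_words = normalize(col)
        links[col] = [t for t in glossary_terms
                      if normalize(t) & col_words]
    return links

print(link_candidates(["CustomerID", "purchase_date"],
                      ["Customer ID", "Purchase Date", "Net Income"]))
```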

Tracing the Lineage: From Code to Insights

LLMs can comprehend code, a skill that can be leveraged to identify the lineage of your custom applications. This can simplify the complex task of tracking data transformations across pipelines and notebooks.

For instance, consider an SQL Stored Procedure that extracts data from multiple tables, performs transformations, and then loads the results into a different table. Traditionally, understanding the lineage of this process would require manual inspection of the code and a deep understanding of the database schema.

CREATE PROCEDURE update_customer_orders
AS
BEGIN
    -- Extract data from the Customers and Orders tables
    SELECT Customers.CustomerName, Orders.OrderID
    INTO #TempTable
    FROM Customers
    JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

    -- Perform a transformation on the data
    UPDATE #TempTable
    SET CustomerName = UPPER(CustomerName);

    -- Load the results into the CustomerOrders table
    INSERT INTO CustomerOrders (CustomerName, OrderID)
    SELECT CustomerName, OrderID FROM #TempTable;

    DROP TABLE #TempTable;
END;

Here is an example of how Azure OpenAI extracted the lineage information and presented it in JSON format:

{
    "Stored_Procedure": "update_customer_orders",
    "Data_Sources": [
        {
            "Table": "Customers",
            "Fields": ["CustomerName", "CustomerID"]
        },
        {
            "Table": "Orders",
            "Fields": ["OrderID", "CustomerID"]
        }
    ],
    "Transformations": [
        {
            "Operation": "UPPER",
            "Field": "CustomerName",
            "Source_Table": "#TempTable"
        }
    ],
    "Data_Destination": {
        "Table": "CustomerOrders",
        "Fields": ["CustomerName", "OrderID"]
    }
}

With LLMs, we can automate this process. The model can read and interpret code in many languages, identifying which tables are being accessed, what transformations are being applied, and where the results are being stored. It can then generate a human-readable description of the data lineage, or even a visual diagram showing the flow of data.

Automated lineage tracking, particularly in scenarios where lineage cannot be directly extracted such as many code-based operations, can be a game-changer for data governance. While tools like Purview excel at tracking lineage in structured, schema-based data sources, they often struggle with code-based operations like Stored Procedures, scripts, or custom applications. Leveraging the power of LLMs can bridge this gap, reducing manual effort and ensuring accurate lineage tracking as code evolves over time. This comprehensive approach to data lineage, covering both schema-based and code-based data operations, is a key aspect of Data Governance 3.0, providing a complete and accurate view of the data landscape.
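One hypothetical way to frame the extraction is a prompt that pins the model to the JSON shape shown earlier. The message builder below is deterministic; the actual call (commented out) would use the same ChatCompletion pattern as the search example later in this post, and the response would still deserve validation before being pushed into a catalog:

```python
def lineage_messages(sql_text):
    """Build a chat prompt asking for lineage as strict JSON with
    Data_Sources, Transformations and Data_Destination keys."""
    schema_hint = ('Respond with JSON only, using the keys '
                   '"Stored_Procedure", "Data_Sources", '
                   '"Transformations" and "Data_Destination".')
    return [
        {"role": "system",
         "content": "You extract data lineage from SQL code. " + schema_hint},
        {"role": "user", "content": sql_text},
    ]

msgs = lineage_messages("CREATE PROCEDURE update_customer_orders AS ...")
# reply = openai.ChatCompletion.create(engine=deployment_id, messages=msgs)
# lineage = json.loads(reply.choices[0]["message"]["content"])
```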

Curating Data Assets: The AI-Driven Approach

LLMs can analyze and generate descriptive metadata for data assets, transforming the way we curate data. This involves understanding the content, context, and usage of data assets, and then generating relevant metadata such as definitions, summaries, keywords, or tags.

For example, an LLM could analyze a dataset of customer transactions, identify key characteristics such as the range of transaction dates, the most common transaction types, and the average transaction amount, and then generate a summary description and relevant tags for the dataset.

This automated curation process not only reduces the manual effort required for data curation but also enhances the discoverability of data assets. By generating rich, accurate, and up-to-date metadata, LLMs make it easier for users to search for and discover relevant data, thereby improving data accessibility and promoting data-driven decision-making.
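The profiling step that feeds such a summary can be plain code; the field names and sample rows below are invented for the example, and an LLM would then phrase the final description and suggest tags from these facts:

```python
from collections import Counter
from statistics import mean

def profile_transactions(rows):
    """Derive the key characteristics mentioned above: date range,
    most common transaction type, and average amount. These facts
    can be handed to an LLM to draft a summary and tags."""
    dates = [r["date"] for r in rows]
    types = Counter(r["type"] for r in rows)
    return {
        "date_range": (min(dates), max(dates)),
        "top_type": types.most_common(1)[0][0],
        "avg_amount": round(mean(r["amount"] for r in rows), 2),
    }

rows = [{"date": "2023-01-05", "type": "card", "amount": 20.0},
        {"date": "2023-03-10", "type": "card", "amount": 40.0},
        {"date": "2023-02-01", "type": "cash", "amount": 30.0}]
print(profile_transactions(rows))
```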

Aligning Ontologies: The AI Bridge

LLMs can align different ontologies or taxonomies within an organization, a task that is crucial in fields like healthcare, manufacturing, finance, and more, where multiple complex ontologies are often in use.

In healthcare, an organization might use both the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) and the Logical Observation Identifiers Names and Codes (LOINC) in different systems. An LLM can recognize the equivalence between terms in these ontologies, such as the LOINC code “2160-0” and the SNOMED CT code “27113001”, both referring to “Serum Creatinine”.

In manufacturing, ontologies might include different classification systems for parts and processes. For example, one system might classify a part as “Bolt, Hex, M8, Steel” while another system refers to the same part as “Steel Hex Bolt, 8mm”. An LLM can understand that these terms refer to the same part and align them in the unified ontology.

In finance, different systems might use different terms for the same financial concepts, such as “Net Income”, “Net Earnings”, or “Bottom Line”. An LLM can recognize these terms as equivalent and align them in the unified ontology.

This seamless integration provides a consistent view of the data landscape, enhancing data interoperability. It ensures that the same concept is referred to consistently across all systems, reducing confusion and improving the accuracy of data analysis and reporting. By automating the process of ontology alignment, LLMs can help organizations in various sectors manage their data more effectively, leading to more informed decisions and improved outcomes.
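Once an LLM has judged which terms are equivalent, the alignment itself can be as simple as a lookup into a canonical mapping. The mapping below hard-codes this article's examples purely for illustration; in practice it would be produced, and periodically refreshed, by the model's equivalence judgments under human review:

```python
def align(term, mappings):
    """Resolve a system-specific term to its canonical concept.
    Unmapped terms pass through unchanged."""
    return mappings.get(term, term)

mappings = {
    "LOINC:2160-0": "Serum Creatinine",
    "SNOMED:27113001": "Serum Creatinine",
    "Net Earnings": "Net Income",
    "Bottom Line": "Net Income",
}
print(align("LOINC:2160-0", mappings))  # Serum Creatinine
print(align("Bottom Line", mappings))   # Net Income
```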

Beyond Keywords: The Era of Semantic Search


LLMs can enable semantic search capabilities within data governance tools, ushering in a new era of data discovery. Unlike traditional keyword-based searches, which simply look for exact matches of the search terms, semantic search understands the meaning and context of the query. This allows users to find relevant data assets even if they don’t know the exact terms used in those assets.

Consider a user who queries “What is the business glossary definition associated with the ‘CustomerID’ column in our sales database?” In a traditional keyword-based search, the system might struggle to return a meaningful result. However, a semantic search powered by LLMs can understand that the user is looking for the business glossary term associated with a specific data asset. It can then query the data governance tool for the ‘CustomerID’ column in the sales database, find the associated business glossary term, and return its definition.

To bring these concepts to life, let’s delve into a very simple example of how we can transform any question into a search query through the Atlas API and generate an answer based on the API response. This approach leverages the power of LLMs to understand the user’s intent, translate that intent into a query that the API can understand, and then interpret the API response to generate a user-friendly answer.

import os

import openai
import streamlit as st
from azure.core.exceptions import HttpResponseError
from azure.identity import ClientSecretCredential
from azure.purview.administration.account import PurviewAccountClient
from azure.purview.catalog import PurviewCatalogClient

openai.api_key = os.environ.get('OPENAI_API_KEY')
openai.api_base = os.environ.get('OPENAI_API_ENDPOINT')  # your endpoint should look like https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_type = 'azure'
openai.api_version = "2023-03-15-preview"  # this may change in the future
deployment_id = 'gpt4-8k'  # the custom name you chose when you deployed a model

client_id = os.environ['CLIENT_ID']
client_secret = os.environ['CLIENT_SECRET']
tenant_id = os.environ['AZURE_TENANT_ID']
reference_name_purview = os.environ['PURVIEW_NAME']


def get_credentials():
    return ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)


def get_purview_client():
    credentials = get_credentials()
    return PurviewCatalogClient(endpoint=f"https://{reference_name_purview}.purview.azure.com", credential=credentials, logging_enable=True)


def get_admin_client():
    credentials = get_credentials()
    return PurviewAccountClient(endpoint=f"https://{reference_name_purview}.purview.azure.com/", credential=credentials, logging_enable=True)


def search(question):
    print("Extract term search")
    system_message = ("You are an assistant that will transform a question into a keyword search "
                      "that will be executed in a Microsoft Purview environment.")
    prompt = ("Given the following question, please provide the search terms that you would use to find "
              "the answer to the question by doing a search with the Microsoft Purview Search API endpoint. "
              "Here is the question: '" + question + "'")
    return ask_gpt4(deployment_id, system_message, prompt)


def generate_answer(question, context):
    print("Generate answer to the question based on the Purview search")
    prompt = """Given the following question and context (in JSON format), please generate an answer to the question. Base your answer only on the context provided, making reference to the data assets or terms you are given. If there are several potential answers, list all of them.
QUESTION: {question}
CONTEXT: {context}
REMEMBER TO ANSWER ONLY BASED ON THE CONTEXT YOU ARE PROVIDED AND MAKE REFERENCES TO THAT CONTEXT!""".format(question=question, context=context)
    system_message = ("You are an assistant that will answer questions based only on the information "
                      "provided to you through an API.")
    return ask_gpt4(deployment_id, system_message, prompt)


def ask_gpt4(engine_model, sys_message, question):
    chatlog = [{'role': 'system', 'content': sys_message}]
    chatlog.append({'role': 'user', 'content': question})
    response = openai.ChatCompletion.create(engine=engine_model, messages=chatlog)
    answer = response.choices[0]['message']['content']
    chatlog.append({'role': 'assistant', 'content': answer})
    return answer


if __name__ == '__main__':
    print("GET connection to Purview")
    purview_catalog_client = get_purview_client()
    st.title('Purview Search Copilot')
    question = st.text_input('Input your question', '')
    if question != '':
        search_terms = search(question)
        try:
            st.write('You searched for:', search_terms)
            body_input = {"keywords": search_terms}
            context = purview_catalog_client.discovery.query(search_request=body_input)
            response = generate_answer(question, context)
            st.write('Results: ', response)
        except HttpResponseError as e:
            print(e)
Following here you will see a screen capture of the Streamlit application for search:

Streamlit app for the code above

Conversing with Data: Purview Copilot

Imagine interacting with data governance tools as you would with a colleague. With the integration of LLMs, this is no longer a futuristic dream but a present reality. LLMs can power conversational interfaces, such as chatbots or voice assistants, enabling users to engage with data governance tools using natural language.

For instance, a user could ask a chatbot, “What is the definition of the ‘CustomerID’ field in our sales database?” or “Who has access to our financial data?”. The chatbot, powered by an LLM, could understand the question, query the data governance tool, and return a clear, concise answer in natural language.

This transformation can be achieved with a straightforward process that leverages the capabilities of Microsoft’s suite of tools. First, you would export a backup of your Microsoft Purview instance. This backup would contain all the valuable metadata and data catalog information that Purview has collected and organized.

Next, you would index this data into Azure Cognitive Search, a powerful AI-powered search service that allows you to search this complex structured and unstructured data in a variety of ways. Azure Cognitive Search can handle natural language queries, making it an ideal platform for integrating with a Large Language Model.

Finally, to streamline this process and ensure best practices, you can leverage one of Microsoft’s Accelerator repositories. These Accelerators are pre-built solutions designed to help you quickly start and implement your projects. They provide code samples, scripts, and other resources that can significantly speed up development and deployment.

By following these steps, you can create a powerful, AI-driven data governance tool that can understand and respond to natural language queries, making data governance more accessible and user-friendly for all users, regardless of their technical expertise.
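As a sketch of the indexing step, the helper below flattens one entity from a hypothetical Purview export into a search document. The field names, the entity shape, and the index name are assumptions for the example; the upload itself is then a single call with the azure-search-documents SDK, shown in the trailing comment:

```python
def to_search_doc(entity):
    """Flatten one entity from a Purview export into a document for
    an Azure Cognitive Search index (field names are illustrative)."""
    attrs = entity.get("attributes", {})
    return {
        "id": entity["guid"],
        "name": attrs.get("name", ""),
        "description": attrs.get("description", ""),
        "qualifiedName": attrs.get("qualifiedName", ""),
    }

entity = {"guid": "abc-123",
          "attributes": {"name": "CustomerOrders",
                         "qualifiedName": "mssql://srv/db/dbo/CustomerOrders"}}
doc = to_search_doc(entity)
print(doc)
# Uploading then takes one call, assuming an index named "purview-assets":
#   from azure.core.credentials import AzureKeyCredential
#   from azure.search.documents import SearchClient
#   SearchClient(endpoint, "purview-assets",
#                AzureKeyCredential(key)).upload_documents(documents=[doc])
```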

Azure OpenAI Chat with your Data Accelerator

This conversational approach democratizes data governance, making it more accessible and user-friendly. It allows users to interact with data governance tools in a more intuitive and natural way, reducing the learning curve and making it easier for non-technical users to find the information they need. By bringing the power of natural language processing to data governance, we can make data governance a more integrated part of everyday business operations.

Conclusion

Governance 3.0, powered by large language models and innovative data catalog solutions like Microsoft Purview, could transform the way organizations manage and govern their data. The integration of Purview into a chat format can be a promising development, offering an enhanced user experience and streamlined data governance processes. As we move forward, we can expect Data Governance to continue evolving, leveraging LLMs and other advanced technologies to meet the ever-changing data governance needs of organizations.


Jorge G

Cloud Solution Architect in Data & AI at Microsoft