Re-Imagine Data Governance with Snowflake Horizon and Powered by AI

In the era of data-driven decision making, establishing a robust Data Governance framework is essential. Yet Data Governance typically involves many manual processes across the governance continuum and tends to become tedious, time-consuming, and prone to human error.

Re-Thinking Data Governance: From Traditional to Transformative

Snowflake’s expanding ecosystem of Data Governance Partners offers comprehensive solutions to this challenge. These partners integrate with Horizon, expand its scope, and address wider data governance requirements across various platforms and sources.

The objective of this blog, however, is to examine how Snowflake Horizon’s Data Governance capabilities (Horizon itself is much broader), combined with the strength of AI/LLMs, can lead to a transformative approach to Data Governance and help tackle the major obstacles faced by Data Governance teams. We’ll further explore how these capabilities help various personas within Enterprise Data Management, including Data Engineers, Data Quality Analysts, Data Governors, Data Stewards, Data Scientists, Data Users, DataOps professionals, and others, providing a holistic and inclusive approach to data governance and management.

Data Governance within Enterprise Data Management

Data Governance is very broad; Gartner defines it as “the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption and control of data and analytics.” In essence, it is the 3 Ps: People, Processes and Product (Technology) working together.

Within Enterprise Data Management, the typical roles that engage with data include Data Engineer, Data Quality Analyst, Data Steward, Data User, Data Scientist, Compliance & Security, and others. The data and process workflow looks more or less like the figure below.

Enterprise Data Management Workflow

Standards and policies are set up by the DG Committee and then put into action by the rest of the personas. Once the data is loaded into the landing area, the Data Quality Analyst analyzes, profiles, and pinpoints data quality issues that require rectification through DQ rules. Following this, a Data Engineer loads the corrected data into Snowflake. Any business-centric issues that fail the governance rules are escalated to a Data Steward, who scrutinizes the data for business quality problems and coordinates with the stakeholders of the data sources to address them. Subsequently, the refined data becomes accessible to end users, who utilize it for their specific needs. Finally, Data Scientists engage with the data, employing it to build and refine predictive models.

About Snowflake Horizon

Before we go any further, let’s introduce the star of the show — Horizon. Horizon is Snowflake’s built-in governance solution with a unified set of compliance, security, privacy, interoperability and access capabilities in the Data Cloud. Horizon’s governance umbrella covers a wide range of areas, with data governance being just one aspect that we will explore in this blog. Horizon is much broader and encompasses the governance of applications, models/AI and more.

Snowflake Horizon — Easy, Powerful and Actionable

With Horizon, Customers can:

  • Protect and audit their data with compliance, business continuity, data quality monitoring and lineage
  • Secure their environment with continuous risk monitoring and protections, RBAC and granular authorization policies
  • Unlock the value of their sensitive data with advanced privacy policies and data clean rooms
  • Integrate with other Apache Iceberg-compatible catalogs and engines, in addition to data catalog and data governance partners
  • Classify, share, discover and immediately act on data, apps and more across regions and clouds

For more information on Horizon, see here.

Using GenAI to lead Data Governance

In Snowflake, you can fine-tune and deploy your very own LLMs via Snowpark Container Services (SPCS) (PuPr). Further, SPCS wrapped in a Snowflake Native App enables customers to seamlessly access LLMs through the Snowflake Marketplace as well. For the scope of this blog, however, we will use the Snowflake External Access (PrPr) feature to integrate with a third-party hosted model, the OpenAI service.

Snowflake and LLM

An external LLM means that the data will be egressing Snowflake. While the demo here uses mock data, in the real world you must ensure that robust security measures are in place before transferring sensitive data. You can find additional resources here.

If OpenAI is not your favorite, Snowflake has made it easy to switch: if you would like to try Amazon Bedrock, use the instructions here. If instead you want to eliminate data egress altogether and fine-tune models such as Llama 2 7B on NVIDIA GPUs entirely within Snowflake using Snowpark Container Services, read here. You will conclude that the only limit is your willingness to try more. Ok, that was cheesy.

Transformative Data Governance — AI + Horizon

It is important to note that the quality of responses from a large language model (LLM) depends on the quality and relevance of the training data and the context provided for analysis. Therefore, verification of responses is crucial before relying on them. However, an LLM can also serve as a valuable starting point for investigation and data analysis, helping to identify potential issues and patterns more efficiently than manual methods. By leveraging LLMs for the “Where should I begin” approach, we can provide personas with a solid starting point and accelerate the discovery of insights that might otherwise require extensive manual effort.

With that context, let’s begin our journey, starting with the Data Security persona. We’ll use patient data as our example, since it belongs to one of the regulated industries (HCLS) and is easy to follow. As you can imagine, the same approach can easily be leveraged for other verticals as well.

Data security professionals are continually monitoring for questionable data occurrences, a laborious and exhausting process that also depends heavily on the expertise of the security analyst. This is an ideal scenario for AI to contribute by providing guidance and support to alleviate the workload and enhance the process.

Below is an example where AI performs analysis on Snowflake’s Query_History and Access_History views, condensing the results into a concise summary. This proactive approach empowers the team to initiate investigations with valuable insights already in hand, rather than dedicating extensive time to sorting through raw data.
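A minimal sketch of the idea (the SP_INVOKE_OPENAI procedure is built a little later in this post; the staging table, time window and prompt wording are illustrative): stage a recent slice of query activity, then ask the model to summarize anything unusual.

--Stage a recent slice of query activity (illustrative; adjust window and columns as needed)
create or replace temporary table RECENT_QUERY_ACTIVITY as
select user_name, role_name, query_text, start_time, execution_status, error_code
from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
where start_time > dateadd(day, -1, current_timestamp());

--Ask the model for a security-oriented summary of that activity
call SP_INVOKE_OPENAI('RECENT_QUERY_ACTIVITY',
'Summarize any unusual or suspicious query activity in this log, grouped by user');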

As you can see, security analysts can easily eliminate noise and dedicate their time to investigating relevant events. Let’s see how easy it is to build this framework for invoking AI from within Snowflake. Once ready, it can be reused for the rest of the sections as well. For demo purposes, I used a mock raw Patient data table.

Patient Table

Assuming the reader has a good understanding of Snowflake concepts, let’s begin. We will create a stored procedure that takes two inputs, a table name and a prompt/question, and invokes OpenAI.

Let’s set up the environment first: set your database, schema, and warehouse, and create them if you don’t already have them.

--Set your database, schema and Warehouse.  Create one if it does not exist for your demo.
USE DATABASE DG_DB;
USE SCHEMA DATA;
USE WAREHOUSE LLM_WH;

Create the Network Rule and the Secret for the External Access.

--CREATE A NETWORK RULE FOR YOUR EXTERNAL ACCESS INTEGRATION
CREATE OR REPLACE NETWORK RULE openai_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.openai.com');

--CREATE A SECRET WITH YOUR OPEN AI KEY
CREATE OR REPLACE SECRET openai_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'xx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

--CREATE THE EXTERNAL ACCESS INTEGRATION
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION openai_access_integration
ALLOWED_NETWORK_RULES = (openai_network_rule)
ALLOWED_AUTHENTICATION_SECRETS = (openai_api_key)
ENABLED = true;

Now, let’s create a stored procedure that calls OpenAI.

--Stored procedure that returns the data in a table as a text string
CREATE OR REPLACE PROCEDURE get_table_data_as_string(table_name STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
// Fetch the column names and rows (the table name is concatenated directly, so pass trusted input only)
var stmt = snowflake.createStatement({sqlText: "SELECT * FROM " + TABLE_NAME});
var rs = stmt.execute();
var cols = [];
for (var i = 1; i <= stmt.getColumnCount(); i++) { cols.push(stmt.getColumnName(i)); }
var resultString = cols.join(",") + "\n";
// Iterate over rows and append data
while (rs.next()) {
  var vals = [];
  for (var j = 1; j <= stmt.getColumnCount(); j++) { vals.push(rs.getColumnValueAsString(j)); }
  resultString += vals.join(",") + "\n";
}
return resultString;
$$;

--The stored procedure that invokes OpenAI
CREATE OR REPLACE PROCEDURE SP_INVOKE_OPENAI(table_name STRING, prompt_ques STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
HANDLER = 'SP_INVOKE_OPENAI'
EXTERNAL_ACCESS_INTEGRATIONS = (openai_access_integration)
PACKAGES = ('snowflake-snowpark-python', 'openai', 'aiohttp', 'aiosignal', 'async-timeout', 'attrs', 'certifi',
'charset-normalizer', 'colorama', 'frozenlist', 'idna', 'multidict', 'requests', 'tqdm', 'urllib3', 'yarl')
SECRETS = ('openai_api_key'=openai_api_key)
AS
$$
import _snowflake
import openai

def SP_INVOKE_OPENAI(session, table_name, prompt_ques) -> str:

    # Pull the table contents as a single string and prepend the user's question
    ret_text = session.sql(f"call get_table_data_as_string('{table_name}')").collect()
    entire_payload = prompt_ques + '\n' + ret_text[0][0]

    # Read the OpenAI key from the Snowflake secret
    openai.api_key = _snowflake.get_generic_secret_string('openai_api_key')

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": entire_payload}],
        temperature=1,
    )

    return response.choices[0].message.content
$$;

Great! Now you can execute the stored procedure and engage OpenAI by simply providing the table name and a clear prompt specifying the assistance you require. Let’s try an example that creates a glossary for a table.
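A call along the following lines does the trick (the prompt wording is illustrative):

call SP_INVOKE_OPENAI('PATIENT',
'Create a business glossary for this table: provide a short business definition for each field');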

As seen, the glossary for the table with the definition of each field is provided. By running the procedure on all key tables, you’ll have a comprehensive Glossary for all your essential assets. This task is typically labor-intensive for Data Stewards or SMEs, especially when dealing with hundreds of production tables, where the workload increases exponentially. Further, you can easily visualize the results in Streamlit, creating a personalized UI by calling the same procedure.

Let’s visualize this data on Streamlit below.

Glossary of Assets

Speaking of Data SMEs or Analysts, let’s get a trend summary of the patient table. This gives the user a good understanding of the data before they start analyzing it further.
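An illustrative call (the exact prompt wording is an assumption) might look like this:

call SP_INVOKE_OPENAI('PATIENT',
'Summarize the notable trends and distributions in this patient data');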

Trend Summary

In the realm of data analysis, there are roles that focus solely on analysis without requiring SQL expertise. These analysts can still tap into the insights within Snowflake tables, thanks to AI assistance. By posing questions in plain English, they can leverage AI to translate their queries into SQL, empowering them to explore data without needing to write code. Below is a scenario where an analyst wants to extract a list of all patients who reside in the state of Massachusetts. AI responds back with the SQL query.
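An illustrative call, along with the kind of SQL the model tends to return (the STATE column name is an assumption about the mock table):

call SP_INVOKE_OPENAI('PATIENT',
'Write a SQL query that lists all patients who reside in the state of Massachusetts');

--A response along these lines can be expected:
--SELECT * FROM PATIENT WHERE STATE = 'MA';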

Convert English to SQL

Is this getting interesting? Let’s dive deeper and explore just how fascinating it can get!

Let’s now assume the role of a Data Quality Analyst, responsible for ensuring data conformity to the DG committee’s standards. They must detect and address errors or data quality issues. Typically, they begin by profiling the data to identify patterns and ensure quality compliance. Let’s leverage AI to automate the initial step of identifying DQ challenges in a table.

To execute this in your environment, call the stored procedure SP_INVOKE_OPENAI with the appropriate prompt as an input, and the AI will take care of the rest. See below.

call SP_INVOKE_OPENAI('YOUR TABLE','Summarize the DQ challenges in the name field')

On Streamlit, this would render as below.

The output effectively identified most of the data quality issues embedded in the dataset, providing a comprehensive starting point for further investigation, profiling, and assessment by a DQ Analyst.

As a follow-up, let’s ask AI to generate a custom regex that addresses the data issues. This produces tailored regular expressions that standardize the table’s data, a valuable tool for the Data Quality Analyst to ensure data consistency and accuracy. See below.
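An illustrative call (prompt wording is an assumption):

call SP_INVOKE_OPENAI('PATIENT',
'Generate regular expressions that standardize and validate the name, email and phone fields in this table');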

Regex generated by AI

Now that we’ve explored AI-driven capabilities, let’s shift focus to data-driven approaches. Let’s enhance our Data Governance strategy by incorporating the features of Snowflake Horizon.

As discussed before, Snowflake Horizon’s capabilities fall under Compliance, Security, Privacy, Interoperability, and Access. Let’s start with Compliance.

In fact, recall that our initial Security persona example leveraged data from Snowflake’s Compliance pillar, specifically the Access_History view. This view offers granular insights into data access patterns within Snowflake, serving as a valuable tool for auditing and monitoring. It empowers administrators and authorized users to monitor and analyze data access, ensuring comprehensive oversight.
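As a small sketch of the kind of auditing this enables (not part of the original demo; the table name and time window are illustrative), here is a query showing who accessed the PATIENT table over the last seven days:

select
ah.query_start_time,
ah.user_name,
obj.value:"objectName"::string as object_name
from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah,
lateral flatten(input => ah.direct_objects_accessed) obj
where obj.value:"objectName"::string = 'DG_DB.DATA.PATIENT'
and ah.query_start_time > dateadd(day, -7, current_timestamp());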

Horizon’s Compliance pillar offers Data Quality Monitoring (PrPr), which helps evaluate the quality of the data in a Snowflake table. Data Metric Functions (DMFs) are used to measure quality, serving Data Engineers, Data Quality Analysts, and Data Stewards. Snowflake provides pre-built system DMFs in SNOWFLAKE.CORE for common metrics, and users can define custom DMFs for precise measurements. Custom DMFs are stored in the database and schema of your choice, enabling tailored data quality assessments.

Regardless of whether you utilize system DMFs, custom DMFs, or a combination of both, once assigned to a table or view, Snowflake stores the DMF results in the event table. You can schedule how often the DMF is executed, such as running DMFs on the PATIENT table three times daily. You can adjust the schedule as needed to meet your internal data quality requirements. All DMFs assigned to a table follow the same scheduled cadence.

Let’s begin by utilizing the pre-built System DMF for evaluating the quality of the Email field. Currently, the following System DMFs are available for use, as noted in this blog post.

Let’s utilize a few of the available System DMFs to demonstrate their functionality such as NULL count and Unique data count for Emails.

--Associate the System DMF to the Patient Table.
alter table PATIENT add data metric function snowflake.core.null_count on (email);
alter table PATIENT add data metric function snowflake.core.unique_count on (email);

--Run the DMF to obtain the metrics
select snowflake.core.null_count(select email from PATIENT);
select snowflake.core.unique_count(select email from PATIENT);

This indicates that all email fields contain a value, and there are 126 unique email addresses present in the Email field.

Next, let’s create a custom Data Metric Function (DMF) to assess the validity of the Email field. We’ll need to define the DMF, link it to a table, and schedule its execution based on our desired frequency to measure the metric.

-- Create a custom DMF that checks for validity of email with a regex
CREATE DATA METRIC FUNCTION IF NOT EXISTS
invalid_email_count (ARG_T table(ARG_C1 STRING))
returns number as
'SELECT COUNT_IF(FALSE = (ARG_C1 regexp ''^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'')) FROM ARG_T'
;

-- Associate the DMF with the table
alter table PATIENT add data metric function invalid_email_count on (email);

-- Set the schedule for the DMF on the table
alter table PATIENT set data_metric_schedule = '5 minute'; -- for demo purpose, use 5 minute

Let’s query to find the number of invalid email records as calculated by the DMF.

select invalid_email_count (select email from PATIENT);

So, there are 176 email records that did not pass the regex for email.

Alternatively, you can retrieve historical metrics from the event table (filter on records where snow.data_metric.name is not null). In the example below, DATA_METRIC_EVALUATION_HISTORY is a view built on the event table.
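A sketch of the kind of query such a view can wrap (the event table name and the attribute path below are assumptions; check your account’s event table setup):

select
timestamp,
record_attributes:"snow.data_metric.name"::string as metric_name,
value as metric_value
from DG_DB.DATA.DG_EVENTS   --assumed event table name; substitute your own
where record_attributes:"snow.data_metric.name" is not null
order by timestamp desc;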

Horizon’s Compliance feature also includes the Data Lineage Visualization UI (PrPr), a valuable tool for Data Stewards. This user interface visualizes the table and column lineage, tracing the flow of data from a source table to a target table, making it easier to understand data relationships and provenance.

Let’s explore a couple of features from Horizon’s Security pillar. If you have been using Snowflake for some time, you have surely come across features like Row Access Policies, Masking Policies, RBAC, Dynamic Data Masking, and Conditional Masking. All of these (and more) fall within the Security pillar of Horizon and are critical for securing patients’ PII and sensitive information. If you would like to test-drive these features, use the instructions here.
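For a quick flavor of what these look like in practice, here is a minimal sketch (not part of the original demo; the PII_ADMIN role is an assumption) that masks the PATIENT email column for everyone except a privileged role:

--Minimal masking policy sketch: only the assumed PII_ADMIN role sees clear-text emails
create or replace masking policy email_mask as (val string) returns string ->
case when current_role() = 'PII_ADMIN' then val else '***MASKED***' end;

alter table PATIENT modify column email set masking policy email_mask;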

Another key feature of Horizon’s Security pillar is the Trust Center (PrPr), which allows the Security persona to evaluate and monitor an account for security risks. When the Trust Center detects violations of the Center for Internet Security (CIS) Snowflake Benchmarks, it creates statistics of these security risks for you to analyze. The Trust Center also reports individual violations and provides statistics about when and how many login failures occurred on specific dates, allowing you to find patterns that might indicate security attacks. The statistics also provide details about the authentication types used during login attempts, helping you determine whether weak credentials are being used.

Moving on to Snowflake Horizon’s Interoperability pillar: it supports Data Stewards and Data Governance initiatives by facilitating integrations with Data Catalogs and a growing network of Data Governance partners. These partners enable the consolidation of metadata from various enterprise data stores into a single, unified view, promoting comprehensive governance and oversight.

Interoperability also lets customers choose how they integrate Snowflake into their architecture by allowing them to store data in Apache Iceberg, an open table format. This gives Data Engineers the flexibility to select the compute engine they prefer. Iceberg Tables can now be configured to use either Snowflake or an external service like AWS Glue as the catalog that tracks metadata. This is made possible by a new account-level object called a Catalog Integration (in public preview), which allows Snowflake to securely read metadata from AWS Glue and object stores.
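As a rough sketch, a Glue-backed Catalog Integration can be declared along these lines (the role ARN, catalog ID, region and namespace are placeholders):

--Sketch of a Glue-backed Catalog Integration (all values below are placeholders)
create catalog integration glue_catalog_int
catalog_source = glue
catalog_namespace = 'my_glue_database'
table_format = iceberg
glue_aws_role_arn = 'arn:aws:iam::123456789012:role/my-glue-role'
glue_catalog_id = '123456789012'
glue_region = 'us-east-1'
enabled = true;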

Next, Horizon’s Privacy pillar enables you to harness the potential of your sensitive data while maintaining advanced privacy standards and utilizing secure data clean rooms.

Data Clean Rooms (DCRs) are secure environments that enable multiple organizations (or divisions of an organization) to bring data together for joint analysis under the governed and defined guidelines and restrictions that keep the data secure. These guidelines control what data comes into the clean room, how the data within the clean room can be joined with other data in the clean room, the kinds of analytics that can be performed on the clean room data, and what data — if any — can leave the clean room environment. Click here to get your hands dirty with DCR.

As part of Privacy, Horizon enables users to identify and tag PII and other sensitive data with Data Classification. Classification is a multi-step process that associates Snowflake-defined tags (i.e., system tags) with columns by analyzing the cells and metadata for personal data. This data can then be tracked by a Data Engineer, who can protect the column containing personal or sensitive data with a masking policy, or the table containing that column with a row access policy.

System tags are tags that Snowflake creates, maintains, and makes available in the shared SNOWFLAKE database. There are two Classification system tags, both of which exist in the SNOWFLAKE.CORE schema:

  • SNOWFLAKE.CORE.SEMANTIC_CATEGORY
  • SNOWFLAKE.CORE.PRIVACY_CATEGORY

The following table summarizes the relationship between each classification category and system tag, and the string values for each classification system tag.

Below is sample code to classify the columns in the table PATIENT.

--Classify the Data.  This function analyzes columns in a table and 
--outputs the possible categories and associated metadata.
select extract_semantic_categories('DG_DB.DATA.PATIENT');

call associate_semantic_category_tags('DG_DB.DATA.PATIENT',
extract_semantic_categories('DG_DB.DATA.PATIENT'));

When the table definition is viewed in Snowsight, the applied tags are displayed.

Data Engineers can further extend the data classification capabilities with regular expressions using the CUSTOM_CLASSIFIER class (PrPr) in the SNOWFLAKE.DATA_PRIVACY schema. These include classification categories for:

  • Bank account numbers
  • Passport identifiers
  • Diagnostic codes
  • Internal tracking identifiers that are relevant to your business

Below is sample code to use the custom classifier.

--Create a Custom classifier class
create or replace snowflake.data_privacy.custom_classifier demo_instance();

--Add regexes for custom classification
--Expression for Master Card
Call demo_instance!add_regex('MC_PAYMENT_CARD','identifier',
'^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$');

--Expression for Visa
Call demo_instance!add_regex('VS_PAYMENT_CARD','identifier',
'\b([4]\d{3}[\s]\d{4}[\s]\d{4}[\s]\d{4}|[4]\d{3}[-]\d{4}[-]\d{4}[-]\d{4}|[4]\d{3}[.]\d{4}[.]\d{4}[.]\d{4}|[4]\d{3}\d{4}\d{4}\d{4})\\');

--Expression for Amex card
Call demo_instance!add_regex('AMX_PAYMENT_CARD','identifier','^3[47][0-9]{13}$');

--Apply the custom classifier
CALL SYSTEM$CLASSIFY(
'DG_DB.DATA.PATIENT',
{'auto_tag': true, 'custom_classifiers': ['DG_DB.DATA.demo_instance']}
);

The last pillar of Horizon is Access, which includes the ability to share, discover, and immediately act on data, apps, and more.

With data sharing and collaboration, you can efficiently collaborate across teams and business units without ETL or integrations by sharing direct access to live data and Snowflake Native Apps. You can also discover the rich set of content within your organization, as well as data and apps available on the Snowflake Marketplace, with a single Universal Search (PrPr). Further, you can use natural language powered by LLMs to easily interact with and query data using Snowflake Copilot (PrPr).

Pushing The Envelope

We’ve explored how Horizon’s capabilities can enhance a Data Governance program and accelerate time to value with AI-driven insights for various Enterprise Data Management personas. Now, let’s make it even more exciting — leveraging AI to generate custom code tailored to your specific data, making data processing and transformation more efficient and effortless.

When prompted, AI generated a Python stored procedure tailored to the Snowflake table’s data, specifically designed to cleanse the “Name” field. Additionally, it provided step-by-step instructions for execution, making it a game-changer for Data Quality Analysts, who can now tackle data cleansing tasks with ease.
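A call along these lines produces such a procedure (the exact prompt is an assumption):

call SP_INVOKE_OPENAI('PATIENT',
'Generate a Python stored procedure for Snowflake that cleanses the Name field (trim whitespace, fix casing, remove special characters) and include instructions to deploy and run it');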

Name Cleansing Python Code

For Master Data Management, it is essential to de-duplicate the master data to ensure there is no ambiguity. Below is a Python stored procedure code snippet, generated by AI and tailored to the patient table data, which de-duplicates patient data using fuzzy matching. It also provided clear instructions for implementation, making it an invaluable tool for Data Quality Analysts and Data Engineers looking to streamline their data cleansing processes.
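As before, a prompt along these lines can be used (wording is illustrative):

call SP_INVOKE_OPENAI('PATIENT',
'Generate a Python stored procedure for Snowflake that de-duplicates patient records using fuzzy matching on name and address, and include deployment instructions');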

In Closing

In closing, our journey through the transformative capabilities for Data Governance using Snowflake Horizon, powered by the innovative integration of AI, is very promising. By reimagining traditional processes through the lens of AI-driven and data-driven methodologies, we’ve seen how the once tedious and error-prone tasks of data governance can evolve into a more streamlined, efficient, and accurate practice.

Horizon’s comprehensive suite, from compliance and security to privacy and interoperability, combined with AI’s analytical prowess, empowers data professionals across the spectrum. Whether you’re a Data Engineer, Quality Analyst, or Data Steward, the fusion of Snowflake Horizon and AI not only simplifies your workflow but also elevates the strategic value of data governance within your organization. As we step into the future, the potential for further innovations in this space is boundless, promising a more secure, compliant, and insightful data-driven decision-making environment.
