Empowering Data Analysts: Leveraging ai_mask with Unity Catalog in Databricks SQL for Enhanced Data Governance

Databricks SQL SME
DBSQL SME Engineering
8 min readJun 30, 2024
Output of AI mask if users have access

Author: Osama Mansour, Solutions Architect @ Databricks

Introduction

In the era of Big Data and AI, where information flows like water, securing sensitive data has become paramount for organizations across the globe. Data analysts are often at the forefront of this task, ensuring that data not only remains insightful but also secure and compliant with various data protection regulations. Enter Databricks’ ai_mask function, a powerful tool in the arsenal of data governance.

In this blog, we will utilize ai_mask to mask potentially sensitive data automatically, embed this function into a broader data protection frame with with MASK and Dynamic Views in Unity Catalog, and finally, we will leverage Unity Catalog’s robust data asset tagging framework to automatically tag tables and columns with any recognized PII data.

Understanding the ai_mask Function

At its core, the ai_mask( function is a feature of Databricks SQL that allows data analysts to mask sensitive information within text data automatically. By leveraging state-of-the-art generative AI models, the function can identify and mask specified entities, such as personal identification numbers, email addresses, and phone numbers, in a given text. This capability is crucial for maintaining privacy and adhering to data protection standards like GDPR and CCPA.

Key Features of the ai_mask() Function

  • Generative AI Model Integration: The ai_mask() function utilizes advanced generative AI models, such as Mixtral-8x7B Instruct, to accurately identify and mask specified entities in text data. This integration allows for high precision and flexibility in masking various types of sensitive information.
  • Support for Multiple Languages: While the function is tuned primarily for English, the underlying language model can handle several languages, making it versatile for global applications. This multi-language support ensures that organizations can apply data masking across diverse datasets.
  • Customizable Masking: Users can specify the types of information they wish to mask by providing an array of labels, such as ‘person’, ‘email’, ‘phone’, ‘address’, ‘Credit card number’, ‘Medical records numbers’, ‘SSN’, etc. This feature allows for targeted masking based on the sensitivity and relevance of the data.
  • Compliance with Data Privacy Regulations: By masking sensitive information, the ai_mask() function helps organizations comply with various data privacy laws and regulations. This compliance is crucial for businesses operating in jurisdictions with strict data protection standards.
  • Rate Limiting and Licensing: The function is subject to rate limiting to ensure fair usage of the underlying Foundation Model APIs. Additionally, the models used by the ai_mask() function are licensed under the Apache 2.0 license or Llama 2 community license, which users are recommended to review for compliance.

The Role of ai_mask in Data Governance

Data governance encompasses the processes, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. Here’s how the ai_mask function fits into the data governance framework:

  • Privacy Compliance: By automatically masking sensitive information, the ai_mask()function helps organizations comply with privacy laws and regulations, reducing the risk of data breaches and the associated penalties.
  • Data Minimization: A key principle of data privacy is collecting only the data necessary for its intended purpose. ai_mask()aids in data minimization by masking unnecessary sensitive details within larger datasets.
  • Access Control: While not a direct access control tool, ai_mask complements access control policies by ensuring that even if individuals have access to certain data, sensitive information within that data is not exposed unless explicitly allowed. ai_mask can be used as an input for context-specific tagging policies at a table and even column level. This function combined with the Unity Catalog asset tagging framework, serves as a powerful tool to find and proactively management PII access throughout your environment.

When you should or should not use ai_mask

The ai_mask() function leverages large language models (LLMs) in the backend, making it particularly effective when dealing with extensive text or strings. Its optimal application arises when you have a column in your table containing free-form text, such as patient notes, customer feedback, or survey results.

However, the ai_mask() function may not perform optimally for concise text inputs. For instance, if you have columns like ‘first_name’ or ‘last_name,’ the LLM model might occasionally struggle to accurately recognize whether a person’s name is an entity name (e.g., “Montreal” or “Key”) or if the entity name is not in English, even if it has been marked as ‘Name’ in the mask field.

More broadly, the ai_mask function should be leveraged and applied within a comphrensive governance and masking framework that it design specifically for each users business and legal requirements. It is a very powerful tool for speeding up, automating, and improving accuracy/completeness of masking, but it should never be the sole line of defense for protecting data.

Practical Applications of ai_mask

The versatility of the ai_mask function allows it to be applied across various industries and scenarios. For instance, in healthcare, analysts can use ai_mask to anonymize patient records before sharing them for research purposes. In finance, ai_mask can mask account numbers or transaction details in financial reports shared with regulatory bodies.

AI Mask in Action — Scenario: Enhancing Patient Data Privacy in Hospital Admission Records

Background: Hospitals collect sensitive personal information during patient admissions, including names, contact details, and health conditions. This data is crucial for treatment but must be handled with strict confidentiality to comply with regulations like HIPAA in the U.S.

Challenge: Data analysts in healthcare often need to analyze patient data to improve care delivery, predict resource needs, or conduct health research. However, they must do so without compromising patient privacy.

Solution Using Databricks SQL AI ai_mask Function. In healthcare, it is important to consider the access level of the end user, and combine ai_mask with other more stringent and deterministic rules-based / column based approach.

Step 1: Masking Sensitive Information

Data analysts can use the ai_mask function to automatically mask sensitive information in patient admission records. This function can be applied directly within SQL queries, making it easy to integrate into existing data workflows.

select
admission_id
,patient_first_name
,contact_number
,patient_notes
from patient_data;
Sample Healthcare Data without AI_mask

Step 2: Analyzing Masked Data

With the sensitive information masked, analysts can safely perform various analyses to improve hospital operations and patient care. For example, they can analyze the frequency of certain diagnoses, prepare resources based on treatment plans, or study admission trends without risking patient privacy.

We cannot just use ai_mask as the single point of governance for PHI data, so first, we need to create a more coarse governance function to provide access to certain columns with ANY potential to have PII to only admin-level groups with allowable access to the data. To be able to do this, we’ll create a SQL UDF to check if the user is part of the ‘admins’ group, then he’ll be able to see unmasked data, else, she’ll be able to see AI masked data:

CREATE OR REPLACE FUNCTION fn_mask_patient_notes(patient_notes STRING)
RETURNS STRING
RETURN
CASE
WHEN is_member('admins') THEN patient_notes
ELSE ai_mask(patient_notes, array('person', 'email', 'phone', 'address'))
END;

In this function, we first make sure the end user access the proper level of access to be able to query/see the data at all, then if they do, we will add another layer of PHI protection with ai_mask to mask any entities in the clinical notes such as person, emails, phone numbers, or address.

Next, we’ll create a View and call the UDF from the View to apply the masking logic:

CREATE OR REPLACE VIEW vw_secure_patient_data AS
SELECT
admission_id
,fn_mask_patient_notes(patient_notes) AS patient_notes
FROM patient_data;

Now let’s run the View to test, if the user is part of the ‘admin’ group, he’ll be able to see everything, if the user is part of ‘analysts’ group, she’ll be able to see masked data, otherwise they will get ‘access denied’ message.

select * from vw_secure_patient_data
Output of AI mask if users have access

Now let’s combine the ai_mask() function with MASK function to mask more columns in our dataset, in this scenario, we’ll use ai_mask() function to mask the patient_notes column, and use MASK to mask the patient_first_name and contact_number columns:

Step 1: Create the UDF for Masking:

CREATE OR REPLACE FUNCTION fn_mask_patient_data(
patient_notes STRING,
patient_first_name STRING,
contact_number STRING
)
RETURNS STRUCT<
patient_notes STRING,
patient_first_name STRING,
contact_number STRING
>
RETURN
CASE
WHEN is_member('admins') THEN STRUCT(patient_notes, patient_first_name, contact_number)
ELSE STRUCT(
ai_mask(patient_notes, array('person', 'email', 'phone', 'address')),
MASK(patient_first_name, '*', '*', '*', '*'),
MASK(contact_number, '*', '*', '*', '*')
)
END;

This UDF checks if the current user is a member of the ‘admins’ group. If they are, it returns the unmasked values for patient_notes, patient_first_name, and contact_number. Otherwise, it applies the ai_mask function to patient_notes and the MASK function to patient_first_name and contact_number.

Step 2: Create the View with Row-Level and Column-Level Security

CREATE OR REPLACE VIEW vw_secure_patient_data AS
SELECT
admission_id,
masked_data.patient_notes AS patient_notes,
masked_data.patient_first_name AS first_name,
masked_data.contact_number AS contact_number
FROM patient_data,
LATERAL (
SELECT mask_patient_data(patient_notes, patient_first_name, contact_number) AS masked_data
) masked;

Lets breakdown the steps happening in this function:

  • UDF Creation: The mask_patient_data function uses the is_member function to check if the current user is part of the ‘admins’ group.
  • If the user is an admin, they see the unmasked patient_notes, patient_first_name, and contact_number.
  • If the user is not an admin, the ai_mask() function is applied to the patient_notes with specified labels to mask entities like ‘person’, ‘email’, ‘phone’, and ‘address’. The MASK function is applied to the patient_first_name and contact_number columns.

View Creation: The view secure_patient_data applies the mask_patient_data UDF to the patient_notes, patient_first_name, and contact_number columns.

Step 3 — Adding Table and Column Tagging from the masked content

Finally, we should now use the information we are generating to properly tag our assets at the table and column level in Unity Catalog. We can tag data with PII by using the ALTER TABLE <tbl> ALTER COLUMN <col> SET TAGS command:

DECLARE OR REPLACE VARIABLE set_tags STRING;

-- Dynamic Variable to determine if column has PII
SET VARIABLE set_tags = (SELECT CASE WHEN SUM( CASE WHEN patient_notes ILIKE('%[MASKED]%') THEN 1 ELSE 0 END) > 0
THEN "ALTER TABLE vw_secure_patient_data ALTER COLUMN patient_notes SET TAGS ('PII' = 'yes')"
ELSE "ALTER TABLE vw_secure_patient_data ALTER COLUMN patient_notes SET TAGS ('PII' = 'no')"
END AS SetPIITAg
FROM vw_secure_patient_data);

EXECUTE IMMEDIATE set_tags;

This provides us with a robust framework for not just masking data, but for automatically tagging our tables, schemas, and columns across our environment to recognize and securely manage access to PII.

Conclusion

In conclusion, Databricks’ ai_mask function is a game-changer for data analysts focused on data governance and security. By integrating ai_mask into their data governance workflows, analysts can ensure that their organizations’ data handling practices are not only compliant with privacy regulations but also secure against unauthorized access. As data continues to be an invaluable asset for organizations, tools like ai_mask will play a critical role in safeguarding this asset, enabling analysts to focus on unlocking the value of data without compromising on privacy and security.

--

--

Databricks SQL SME
DBSQL SME Engineering

One stop shop for all technical how-tos, demos, and best practices for building on Databricks SQL