How to Redact PII in Snowflake with AWS Comprehend and External Network Access

Thousand Island Lake, Ansel Adams Wilderness

Introduction

Restricting access to personally identifiable information (PII) is a common organizational need. It is fairly straightforward to apply data masking policies to structured data, when the values neatly align with specific columns (name, address, ID, date of birth, etc.), but takes some effort when PII is dispersed inside unstructured text fields. The traditional approach is all or nothing — allow full access to the unstructured text fields to authorized users, and remove access completely from everyone else. But what if you could just redact PII in text dynamically, based on the user’s level of access, while still providing access to valuable business information for those without the elevated level of access?

In this article I demonstrate how to use AWS Comprehend — a natural language processing service — together with Snowflake’s External Network Access to dynamically redact PII from text fields in Snowflake. While there are many ways to identify PII entities in text, I found that AWS Comprehend does a pretty good job while being easy to use. External network access is a new feature of Snowflake that I personally can’t get enough of — it makes accessing external APIs a breeze, abstracting away the complexity of setting up API Gateways and remote services like AWS Lambda or Azure Functions. As of this writing, external network access is available in Snowflake on AWS and Azure, except for government regions.

Step by Step Guide

Step 1. Set up your AWS environment

  • Using AWS IAM, create a new user and associate it with a policy that provides access to the AWS Comprehend Service.
  • Then, using the Security Credentials tab in the user’s IAM screen, create a new Access Key with Access Secret and download them.

Step 2. Setup your Snowflake environment

In Snowflake, create a new NETWORK RULE pointing to the Comprehend endpoint (choose the endpoint aligned with your region), and store the access key and the access key secret in the SECRET objects in your database. Then, create an EXTERNAL ACCESS INTEGRATION object that combines the network rule and secrets.

--setup testing database and schema
USE ROLE SYSADMIN;
CREATE DATABASE IF NOT EXISTS TEST;
USE SCHEMA TEST.PUBLIC;

--create network rule and secrets
CREATE OR REPLACE NETWORK RULE TEST.PUBLIC.AWS_COMPREHEND_RULE
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('comprehend.us-west-2.amazonaws.com');

--In AWS Console, create a user and attach policies with full Comprehend access.
--Then use the Security Credentials tab to generate access keys
CREATE OR REPLACE SECRET TEST.PUBLIC.AWS_COMPREHEND_ACCESS_KEY
TYPE = GENERIC_STRING
SECRET_STRING = 'xxxxxxxxx';

CREATE OR REPLACE SECRET TEST.PUBLIC.AWS_COMPREHEND_SECRET_KEY
TYPE = GENERIC_STRING
SECRET_STRING = 'xxxxxxxxx';

--create external access integration
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION AWS_COMPREHEND_INTEGRATION
ALLOWED_NETWORK_RULES = (AWS_COMPREHEND_RULE)
ALLOWED_AUTHENTICATION_SECRETS = (AWS_COMPREHEND_ACCESS_KEY, AWS_COMPREHEND_SECRET_KEY)
ENABLED = true;

GRANT USAGE ON INTEGRATION AWS_COMPREHEND_INTEGRATION TO ROLE SYSADMIN;

Step 3. Create UDF

Create a User-Defined Function (UDF) that accepts a character string and uses the boto3 API to call AWS Comprehend service, identifies the location of all PII entities, and then replaces PII entities with a string of asterisks.

--create UDF
USE ROLE SYSADMIN;
USE WAREHOUSE ADHOC;
CREATE OR REPLACE FUNCTION TEST.PUBLIC.REDACT_PII(txt STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.10
HANDLER = 'redact_pii'
EXTERNAL_ACCESS_INTEGRATIONS = (AWS_COMPREHEND_INTEGRATION)
PACKAGES = ('boto3', 'snowflake-snowpark-python')
SECRETS = ('access_key' = AWS_COMPREHEND_ACCESS_KEY, 'secret_key' = AWS_COMPREHEND_SECRET_KEY)
AS
$$
import _snowflake
import boto3
from snowflake.snowpark.files import SnowflakeFile

access_key = _snowflake.get_generic_secret_string('access_key')
secret_key = _snowflake.get_generic_secret_string('secret_key')
region_name = 'us-west-2'

def redact_pii(txt):
# Initialize the AWS Comprehend client with access key and secret key
comprehend = boto3.client(
service_name='comprehend',
region_name=region_name,
aws_access_key_id=access_key,
aws_secret_access_key=secret_key
)

# Detect PII entities
response = comprehend.detect_pii_entities(Text=txt, LanguageCode='en')

# Initialize offset
offset = 0

# Sort entities by their starting index in ascending order
entities = sorted(response['Entities'], key=lambda x: x['BeginOffset'])

# Redact PII entities from the text
redacted_text = txt
for entity in entities:
start = entity['BeginOffset'] - offset
end = entity['EndOffset'] - offset
length = end - start
redacted_text = redacted_text[:start] + '*' * length + redacted_text[end:]
offset += length - len('*' * length)

return redacted_text
$$;

Step 4. Test the function!

Call the function to test it:

select redact_pii('my new cat\'s name is Fluffy and he was born around March 31, 2023') redacted;

Step 5. Create a dynamic data masking policy

Create a dynamic data masking policy that redacts text when the user executes a query with an unprivileged role. Pick a table for testing, or create some completely fictitious testing data, like I did below (thanks, ChatGPT!). Create the unprivileged role if it doesn’t already exist, and grant it the necessary permissions to view the data.

--create some fictitious PII data
create table test.public.sample_text_with_pii (txt varchar);

insert into test.public.sample_text_with_pii values
('Hello, my name is Alexander Hamilton and my email address is alex.hamilton@example.com. I was born on January 11th, 1755, in Nevis. My phone number is (212) 555-1234, and I am inquiring about my application.'),
('Good day, I am Elizabeth Bennet, with a passport number of A12345678 issued in the United Kingdom. My address is 1813 Pemberley Park, Derbyshire, England. I need assistance with my visa status, please.'),
('Hi, my name is James Bond, and my driver\'s license is 007112233 issued by the British government. My bank account number is 12345678 at the Royal Bank of London. I need to update my direct debit details.'),
('This is Jane Doe speaking. My credit card number is 4111 1111 1111 1111 with an expiration date of 12/25. I\'m calling to report a fraudulent transaction on July 4th, 2023.');

--create a masking policy to redact PII
create or replace masking policy test.public.mask_PII as
(val string) returns string ->
case
when current_role() ='NON_PII_ROLE' then test.public.redact_pii(val)
else val
end;

--apply the policy to the table
alter table test.public.sample_text_with_pii
modify column txt set masking policy test.public.mask_PII;

--create a non-pii user role and test the policy
use role securityadmin;
create role non_pii_role;

grant usage on database test to role non_pii_role;
grant usage on schema public to role non_pii_role;
grant select on table test.public.sample_text_with_pii to role non_pii_role;
grant usage on warehouse adhoc to role non_pii_role;

grant role non_pii_role to user <user_name>;

Step 6. Test the policy

Test the policy by running a query under an unprivileged role

use role non_pii_role;
select * from test.public.sample_text_with_pii;

Conclusion

AWS Comprehend and Snowflake external network access make redacting PII from unstructured text easy, so you can give the users access to the data they need, preserving the analytical value of the data while protecting sensitive information from those who are unauthorized to access it.

--

--

Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data