Using Realistic Values in Snowflake’s Dynamic Data Masking

TL;DR:

To make masked PII data look realistic in Snowflake, you can create and apply dynamic data masking policies that cross-walk raw PII values to randomly reshuffled values from the same list of distinct values.

Introduction

Dynamic data masking is one of the most widely used Snowflake data governance features. It enables organizations to mask sensitive data so only authorized users can access raw unmasked values. To implement dynamic data masking, you create a masking policy that applies a mask to the data — for example, a string of asterisks ‘*****’— based on the viewer attribute, such as their role. You then apply the masking policy to the columns in a table or view, or to a tag used to identify sensitive columns.

Dynamic Data Masking

One downside of dynamic data masking is that, when the data is masked, unauthorized users are no longer able to get a sense of what the unmasked data looks like, and distributions of masked values no longer reflect those of the raw data. This can be undesirable, for example, when you are developing a demonstration or a training system and the unauthorized users needs to see realistic-looking data, while making sure that sensitive data is protected — with minimal overhead.

In this article, I demonstrate an approach to implementing “realistic” data masking using a user-defined function (UDF) and reshuffled raw data.

Steps

1. Create testing data

First, let’s create some sample testing data that we will use to demonstrate this approach. Let’s pretend that we need to mask PII data for a table of student records in San Francisco. I use the Python faker library to generate student first and last names. I also generate a ZIP code column with a distribution mimicking that of the top 10 ZIP codes in San Francisco.

/*** Create testing data ***/

use role sysadmin;

create or replace database student_db;
use schema student_db.public;

--function to generate fake names - can be extended to generate other pii
create or replace function faker(a string)
returns string
language python
runtime_version = 3.8
packages = ('faker')
handler = 'udf'
AS $$
from faker import Faker
def udf(a):
fake = Faker()
if a=='first_name':
return fake.first_name()
if a=='last_name':
return fake.last_name()
else:
return 'other'
$$
;

--create a table of zip code weights (top 10 zip codes in San Francisco by population)
create or replace table weight_zipcode
(zipcode varchar, weight float, cum_weight float)
;

insert into weight_zipcode values
('94108', 15.23031224, 15.23031224),
('94109', 13.3505242, 28.58083644),
('94102',13.1968623, 41.77769874),
('94133',11.3248546, 53.10255334),
('94117',9.740997719, 62.84355106),
('94115',8.467799078, 71.31135014),
('94110',8.402777585, 79.71412773),
('94112',7.059555798, 86.77368352),
('94114',7.011762221, 93.78544574),
('94103',6.214554255, 100)
;

--generate testing data
create or replace table student_db.public.student_pii as
with zip_random as (
-- generate a 1000 records with random dimension values
select
(abs(mod(random(),10000))+1)/100 zip_r
from table(generator(rowcount=> 10000))
)
select
faker('first_name') first_name,
faker('last_name') last_name,
zipcode
from zip_random zip_r
inner join weight_zipcode wzip on
(zip_r between wzip.cum_weight-wzip.weight and wzip.cum_weight)
;

Let’s verify that the testing data looks as expected:

select * from student_pii;

select zipcode, count(*)
from student_pii
group by 1
order by 2 desc;

2. Randomly reshuffle PII

The next step is to create three cross-walks that take distinct raw PII values and randomly reshuffle them. Note that each cross-walk is deterministic — each real value maps to a single fake value that remains the same until the cross-walk is regenerated. Thus, when you run queries on masked data, you should expect to see the the same result across runs, unless the cross-walk is updated between the runs.

I concatenate reshuffled values with a “(fake)” suffix to make them easier to tell apart for testing. This suffix can be removed when testing is finished.

--Create a cross-walk from real names to randomly reshuffled names 
--from the same list of distinct names
create or replace table student_db.public.first_name_xwalk as
with real as
(select row_number() over (order by first_name) ID, first_name
from
(select distinct first_name
from student_pii
order by 1)),
fake as
(select row_number() over (order by rand) ID, first_name
from
(select distinct first_name, random() rand
from student_pii
order by 2))
select real.first_name, fake.first_name||' (fake)' fake_first_name
from real
inner join fake
on real.ID=fake.ID;

--confirm that the cross-walk looks as expected
select * from student_db.public.first_name_xwalk;

Repeat for last name and zip code columns:

--repeat for last name
create or replace table student_db.public.last_name_xwalk as
with real as
(select row_number() over (order by last_name) ID, last_name
from
(select distinct last_name
from student_pii
order by 1)),
fake as
(select row_number() over (order by rand) ID, last_name
from
(select distinct last_name, random() rand
from student_pii
order by 2))
select real.last_name, fake.last_name||' (fake)' fake_last_name
from real inner join fake
on real.ID=fake.ID;

select * from student_db.public.last_name_xwalk;

--repeat for zip code
create or replace table student_db.public.zip_xwalk as
with real as
(select row_number() over (order by zipcode) ID, zipcode
from
(select distinct zipcode
from student_pii
order by 1)),
fake as
(select row_number() over (order by rand) ID, zipcode
from
(select distinct zipcode, random() rand
from student_pii
order by 2))
select real.zipcode, fake.zipcode||' (fake)' fake_zipcode
from real inner join fake
on real.ID=fake.ID;

select * from student_db.public.zip_xwalk;

3. Create UDFs to use in masking functions

Next, I create three UDFs that will be used to mask the data. Each UDF simply cross-walks a real value to the corresponding reshuffled value. Notice the use of the any_value() function — it ensures that only one value is returned per row and thus prevents the unsupported query type error when the UDF is used in a data masking policy.

--create UDFs
create or replace function student_db.public.first_name_swap(name varchar)
returns varchar
language sql
as
$$
select any_value(fake_first_name) first_name
from first_name_xwalk
where first_name=name
$$;

create or replace function student_db.public.last_name_swap(name varchar)
returns varchar
language sql
as
$$
select any_value(fake_last_name) last_name
from last_name_xwalk
where last_name=name
$$;

create or replace function student_db.public.zip_swap(zip varchar)
returns varchar
language sql
as
$$
select any_value(fake_zipcode) zipcode
from zip_xwalk
where zipcode=zip
$$;

4. Create and apply masking policies

Lastly, I create and apply dynamic data masking policies leveraging the UDFs. If the user querying the data is in the PII_user role, they will see unmasked data. All other users will see masked data.

--create data masking policies
create or replace masking policy student_db.public.first_name_mask as
(val string) returns string ->
case
when current_role() in ('PII_USER') then val
else student_db.public.first_name_swap(val)
end;

create or replace masking policy student_db.public.last_name_mask as
(val string) returns string ->
case
when current_role() in ('PII_USER') then val
else student_db.public.last_name_swap(val)
end;

create or replace masking policy student_db.public.zipcode_mask as
(val string) returns string ->
case
when current_role() in ('PII_USER') then val
else student_db.public.zip_swap(val)
end;

show masking policies;

--apply masking policies
alter table student_db.public.student_pii
modify
column first_name set masking policy student_db.public.first_name_mask,
column last_name set masking policy student_db.public.last_name_mask,
column zipcode set masking policy student_db.public.zipcode_mask
;

Test masking with sysadmin role:

use role sysadmin;

select * from student_db.public.student_pii;

select zipcode, count(*)
from student_db.public.student_pii
group by 1
order by 2;

Since sysadmin is not a PII_user, this role sees masked data.

Conclusion

In this article, I demonstrated how to use dynamic data masking such that the data looks realistic even when masked. It’s accomplished through masking policies that apply a UDF replacing raw values with randomly reshuffled values from the same list.

To operationalize this, you would need to regularly update the cross-walks to add new values as they are added to the data — something you can easily implement through a Snowflake task.

--

--

Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data