Flaker 2.0 — Fake Snowflake data the easy way

The ability to generate convincing synthetic data is as important as ever, and is useful in a variety of situations:

  • Replacing Personally Identifiable Information with something that retains the same shape and structure as the original — you can use this to build non-production versions of your data pipelines. 🕵️
  • Taking your own dataset and extrapolating it to something equivalent but much larger, to test its future performance. 📈
  • Generating a dataset for demonstrating a data product or analytical technique. 🧑‍🏫
  • Generating a dataset as part of demonstrating a SQL scenario, like an example query to post to a public forum. 🧑‍💻

It’s been almost two years since I first blogged about fake data generation on Snowflake, which was later featured in an interview on Behind the Data Cloud.

The first version of “Flaker” (Faker for Snowflake) was created as a Snowflake External Functions wrapper, and required an AWS deployment.

While JavaScript UDFs have existed on Snowflake for some time, External Functions were the first step toward more general extensibility of the platform.

The challenge with External Functions is that you need more than just a Snowflake account — you need an account with a public cloud provider like AWS or Azure, and you need to deploy some infrastructure that Snowflake communicates with.

This is a powerful feature, and makes sense when the information you need is, well…external. But when you just want to invoke a code library, it’s not the usual level of simplicity and time-to-value that Snowflake customers are used to.

If you check out the original Flaker codebase, you can see that it involves a counter-productive detour into cloud infrastructure land before the faking can commence. Too much of this in a product (as others in the market are guilty of!), and a data analyst’s job starts to look more like infrastructure engineering than analysis.

Anyway, fast forward to a couple of months ago, and I was excited to be included on a Private Preview of Python for Snowflake. Naturally, one of the first things I did was to try to run Faker directly from within Snowflake.

How it works

The process of doing this, frankly, was so fast and simple that it barely warrants a blog post! So I did what any technical writer would do: add more features and examples until there’s enough to talk about 😅

Here’s everything you now need to generate fake data on Snowflake:
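Something like the following is all it takes. This is a minimal sketch: I’ve assumed Python runtime 3.8 and a simple None check for the NULL handling, so adjust as needed.

create or replace function FAKE(locale varchar, provider varchar, parameters variant)
returns variant
language python
volatile
runtime_version = '3.8'
packages = ('faker', 'simplejson')
handler = 'fake'
as
$$
import simplejson as json
from faker import Faker

def fake(locale, provider, parameters):
    # assumption: a SQL NULL passed for parameters arrives as Python None
    if parameters is None:
        parameters = {}
    f = Faker(locale=locale)
    # round-trip through JSON, defaulting non-JSON Python types (Decimals, dates etc.) to str,
    # so Snowflake can unambiguously convert the result back to a variant
    return json.loads(json.dumps(f.format(provider, **parameters), default=str))
$$;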

I’ll step through and explain what’s going on, and do some examples at the end.

We define the function and its parameters; this is the interface that SQL users see. We’ll return a variant, because sometimes Faker generates composite information, like a complete user profile.

We mark it as volatile so that Snowflake doesn’t cache results (we want each row to contain a uniquely generated value). We include the Anaconda packages ‘faker’ and ‘simplejson’ from the Snowflake channel, and we tell Snowflake to invoke the ‘fake’ function we provide as the entry point.

The NULL check at the top of the handler is really just a courtesy to the caller, so they can pass NULL as a parameter instead of an empty object.

Finally, in the return statement you can see that we convert the Faker output to JSON and back again before returning, and we tell the json.dumps call to default to str (a string). We do this because Faker uses all sorts of Python types in its output (Decimals, dates, etc.), so when Snowflake needs to convert the result back to a variant, there’s too much ambiguity to do it correctly. So we intervene and just force a conversion to string.
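For example, the ‘profile’ provider returns a whole dictionary of values, including dates and Decimals, and with that conversion in place it should come back cleanly as a single variant:

select FAKE('en_US','profile',null) as FAKE_PROFILE;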

Examples

Data Generation

50 names in the United States locale:

select FAKE('en_US','name',null)::varchar as FAKE_NAME
from table(generator(rowcount => 50));

100 dates between 180 days ago and today:

select FAKE('en_US','date_between',{'start_date':'-180d','end_date':'today'})::date as FAKE_DATE
from table(generator(rowcount => 100));

It doesn’t have to be people either. All sorts of random things like barcodes are supported:

select FAKE('en_US','ean',{'length':8})::varchar as FAKE_BARCODE
from table(generator(rowcount => 50));

Data Anonymization

Let’s finish with a more elaborate example. Instead of using a generator function, let’s take an existing table of data and anonymize it conditionally, based on who is querying it.

Create or choose a database and schema you can write to, and create a view over the standard Snowflake benchmark dataset like so:

create view CUSTOMER as (select * from SFC_SAMPLES_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER);

If we take a peek at this data, we see some customers.

select C_CUSTOMER_SK,C_SALUTATION,C_FIRST_NAME, C_LAST_NAME,C_BIRTH_COUNTRY, C_EMAIL_ADDRESS
from CUSTOMER
where C_CUSTOMER_SK <= 5;

Let’s pretend these are real people whose Personally Identifiable Information we want to limit access to.

We can create some data masking policies that generate fake information for anyone who isn’t a SYSADMIN.

create or replace masking policy first_name_mask as (val string) returns string ->
case
when current_role() in ('SYSADMIN') then val
else FAKE('en_US','first_name',null)::varchar
end;

create or replace masking policy last_name_mask as (val string) returns string ->
case
when current_role() in ('SYSADMIN') then val
else FAKE('en_US','last_name',null)::varchar
end;

create or replace masking policy country_mask as (val string) returns string ->
case
when current_role() in ('SYSADMIN') then val
else UPPER(FAKE('en_US','country',null)::varchar)
end;

create or replace masking policy email_mask as (val string) returns string ->
case
when current_role() in ('SYSADMIN') then val
else FAKE('en_US','ascii_safe_email',null)::varchar
end;

Now we assign the masking policies to the CUSTOMER view:

alter view CUSTOMER modify column C_FIRST_NAME set masking policy first_name_mask;
alter view CUSTOMER modify column C_LAST_NAME set masking policy last_name_mask;
alter view CUSTOMER modify column C_BIRTH_COUNTRY set masking policy country_mask;
alter view CUSTOMER modify column C_EMAIL_ADDRESS set masking policy email_mask;

Now if I query the view as SYSADMIN, I continue to see the same results as above.

I’ll grant the PUBLIC role the ability to query the view and invoke the function:

grant select on view customer to role public;
grant usage on function FAKE(VARCHAR, VARCHAR, VARIANT) to role public;

Now, if I switch to the PUBLIC role and run the same query as before, I see fake customers instead.
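In a worksheet, that looks something like this (assuming the PUBLIC role has usage on the relevant database, schema and warehouse):

use role PUBLIC;

select C_CUSTOMER_SK,C_SALUTATION,C_FIRST_NAME, C_LAST_NAME,C_BIRTH_COUNTRY, C_EMAIL_ADDRESS
from CUSTOMER
where C_CUSTOMER_SK <= 5;

The masked columns return freshly generated values on every run, while unmasked columns like C_CUSTOMER_SK are untouched.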

Can it support my use case?

For the full list of providers supported by Faker, see here: https://faker.readthedocs.io/en/master/providers.html

You’ll see that the examples tend to call a provider method directly, like fake.name(). We’re doing the same thing in the Snowflake function, just passing the provider name in as a string.

Most of the standard and community Providers should just work, and you can also write your own. If a provider has complex inputs, you may need to construct your own UDF to handle those parameters more easily.
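For example, ‘company’ is one of the standard providers, so where the documentation shows fake.company(), the equivalent call here should simply be:

select FAKE('en_US','company',null)::varchar as FAKE_COMPANY
from table(generator(rowcount => 10));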

I need help!

If you’re having any trouble getting your fake data to generate the way you like, feel free to ask a question in the Snowflake Community forums: https://community.snowflake.com/s/forum
