Using Snowflake as an Anonymization Engine: Lessons Learned

Julien Demangeon
Published in swile-engineering · Sep 13, 2022 · 10 min read

Exploiting private data has become a crucial issue for most startups. At Swile, we take this subject seriously. That’s why we have decided to anonymize our production data before using it for our internal usage.

In this article, I will explain why, and how, as a Feature Owner, I came to choose Snowflake as a low-cost anonymization engine.

Privacy Concerns: The Need for Anonymization

We live in an increasingly connected and digital world. Everyone leaves a digital footprint when they use the internet.

In truth, the numbers are overwhelming:

  • In the US, 70% of users agree that it has become more difficult to control who can access their personal information online. Only a third think that companies are generally doing a good job of keeping their users’ personal information safe. (source: Ipsos)
  • In the EU, close to half (46%) of citizens worry about the use of their personal data and information by companies or public administrations. (source: European Commission)

Within the DevX team, we have initiated a new epic called "AnonymizeAllDatabases" as part of the development of on-demand environments.

This epic was intended to generate anonymized data from our production databases to populate our on-demand environments (and our staging and local environments as well).

Some will say that seed data is sufficient for testing, but this is not the case.
Indeed, on microservice systems that handle gigabytes of data every day, the data obtained is often far from what was originally expected.

In this case, there is nothing better than a set of data from production to understand what is really going on.

After discussing with stakeholders, the following prerequisites have been identified:

  • For each project, we need to be able to configure anonymized columns
  • We need to export databases at the same time to keep them in sync
  • Transformation must be deterministic (some values are duplicated)
  • Once anonymized, we need to store the transformed data to be able to use it

Now that the context is set, let’s get going!

I’m not the kind of guy who reinvents the wheel, so I considered several solutions before starting this journey.

Tonic.ai (alias “the cool boy”)

This solution has a lot to offer: on-premise deployment, schema change notifications, automatic generators based on machine learning inference. Their most impressive war trophy: the anonymization of eBay's staging data.

> Although very attractive, it has several drawbacks, such as the expensive price (plans start at 5k/month) and its complexity given our rather basic needs (no need for calculated values, dynamic data truncation, …).

Dot Anonymizer (alias “cocorico”)

This French solution is competitively priced.

> It is not flexible enough for our needs. For example, we have to set up a destination database ourselves in order to anonymize the data. Moreover, the configuration is done through a complex software interface, available only on Windows.

PostgreSQL Anonymizer (alias “the right choice, at first glance”)

Unlike the two previous solutions, postgresql_anonymizer is a PostgreSQL extension that masks or replaces personal data directly in the database when it is queried. It is open source, well documented, and allows values to be replaced in a deterministic way. It was therefore a first-choice solution for us.
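
To give an idea of how it works, masking rules are declared directly on the columns as security labels. Here is a sketch adapted from its documentation (the table and column names are illustrative; anon.fake_email() and anon.fake_first_name() are among the faking functions the extension ships with):

-- Declare masking rules with the postgresql_anonymizer extension
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init(); -- loads the default fake data set

SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN users.first_name
  IS 'MASKED WITH FUNCTION anon.fake_first_name()';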

> We needed to be able to use the data on a local database as well. However, this solution forces us to transmit the raw data from production to the developers’ workstations. This is not an option. Furthermore, at the time of our research, the solution was not mature enough according to the documentation: “This project is at an early stage of development, and should be used carefully.

Developing an ETL: Not Such a Good Idea

After all these fruitless searches, my mind was left wandering, so I decided to build a "from-scratch" anonymization system.

In front of your astonished eyes, here is the solution I came up with, which was supposed to meet our needs.

Synchronized Scheduled Anonymization
Global Working Architecture Schema

Here is the idea, in a nutshell:

  • Each project repository has a configuration file in it (aka anonymize.yml)
  • Each project has a Kubernetes Cronjob defined that does the following through a special "anonymize script" (a rough sketch follows this list):
    - Parse the anonymize.yml file
    - Connect to the database and create a read stream from it
    - Transform stream data according to the configuration
    - Create a write stream (to a csv file) and write data in it
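
To make this more concrete, here is a rough sketch of what such a script could have looked like. It assumes Node.js with the js-yaml, pg and pg-query-stream packages and a simplified configuration layout (config.table, config.columns); the transformers are placeholders, not the real ones:

const fs = require('fs');
const { Transform } = require('stream');
const { pipeline } = require('stream/promises');
const yaml = require('js-yaml');
const { Client } = require('pg');
const QueryStream = require('pg-query-stream');

// Placeholder transformers: the real ones must be deterministic
const transformers = {
  firstname: (value) => `firstname-${value.length}`,
  email: (value) => `user-${value.length}@example.com`,
};

async function run() {
  // 1. Parse the anonymize.yml file
  const config = yaml.load(fs.readFileSync('anonymize.yml', 'utf8'));

  // 2. Connect to the database and create a read stream from it
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const rows = client.query(new QueryStream(`SELECT * FROM ${config.table}`));

  // 3. Transform stream data according to the configuration
  const anonymize = new Transform({
    objectMode: true,
    transform(row, _encoding, done) {
      for (const [column, generator] of Object.entries(config.columns || {})) {
        if (generator && row[column] != null) {
          row[column] = transformers[generator](row[column]);
        }
      }
      done(null, Object.values(row).join(',') + '\n'); // naive CSV output
    },
  });

  // 4. Create a write stream (to a csv file) and write data in it
  await pipeline(rows, anonymize, fs.createWriteStream(`${config.table}.csv`));
  await client.end();
}

run().catch(console.error);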

In parallel with designing this architecture, I had already developed a large part of the script. A mistake I would come to regret, as you will see later.

At Swile, we have a system of Guilds. These transversal groups enable the teams to communicate on shared topics. At the moment, there are two mobile Guilds (iOS & Android), a JavaScript Guild, a Ruby Guild and an Architecture Guild.

I was ready, my slides were done. Full of enthusiasm, it was time to present my concept in front of the Architecture Guild.

Although interesting, my proposal was not retained for several reasons:

  • Basically, the concept I had just designed is a kind of ETL (for Extract, Transform, Load). I was just reinventing the wheel!
  • Since it runs from inside our Kubernetes Cluster, it’ll impact our infrastructure cost (and production cluster performance as well)

However, an interesting piece of information was brought to me by the data team: "We already synchronize production data into our Snowflake data warehouse for data science tasks, maybe you can use it too?"

Without thinking twice, I started looking for information about Snowflake and the possibilities it offers. The conclusion was clear: it rocks!

Snowflake: The Developer Friendly Data Warehouse

Snowflake Inc. is a "data-warehouse-as-a-service" company founded in July 2012 by three data-warehouse experts. It allows users to store and analyze data using cloud-based hardware and software.

At Swile, we use Snowflake in place of Amazon Redshift, which was not suited to our usage (mainly because it uses shared computing) and doesn't provide Python or JavaScript integration.

Snowflake has several uses: it allows us to model data, build analyses, format datasets and build beautiful dashboards.

Snowflake's main advantage over the competition: it allows us to split computations and pay per use!
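
Concretely, each team or workload can get its own virtual warehouse that suspends itself when idle, so you only pay while it is actually computing. A minimal example (the name and size here are illustrative):

-- A dedicated warehouse that only bills while it is running
CREATE WAREHOUSE IF NOT EXISTS DEVX_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;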

Let's talk about practical stuff. While Snowflake is geared towards data professionals (data scientists, AI engineers, …), it is also a very powerful and accessible tool for developers.

Just like any decent database system, Snowflake allows you to perform queries, create stored procedures, benefit from transactions, create views… and certainly offers many other features that I don't know about as a Software Engineer.

As part of this epic, my first thought was to check whether Snowflake allowed us to execute functions to anonymize data.

How surprised I was to learn that Snowflake allows the creation of UDFs (User-Defined Functions)! UDFs allow us to expand Snowflake's native features with our own data transformation functions, in several languages (Java, Python, JavaScript or SQL)!

My usual language being JavaScript, this was a perfect fit! Here is what a JavaScript UDF declaration on Snowflake looks like.

CREATE OR REPLACE FUNCTION array_sort(a array)
RETURNS array
LANGUAGE JAVASCRIPT
AS
$$
// Argument names are referenced in uppercase inside the JavaScript body
return A.sort();
$$
;

And here’s its usage.

SELECT ARRAY_SORT(PARSE_JSON('[2,4,5,3,1]')); -- Gives [1, 2, 3, 4, 5]

As you can see, the input and output of the function are typed. But it is also possible to use a type called variant that accepts any type of data.
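
For instance, a UDF declared with a variant argument will accept numbers, strings, objects or arrays alike. A quick illustration (not taken from our codebase):

CREATE OR REPLACE FUNCTION type_of(v variant)
RETURNS varchar
LANGUAGE JAVASCRIPT
AS
$$
// VARIANT arguments arrive as native JavaScript values
return Array.isArray(V) ? 'array' : typeof V;
$$
;

SELECT TYPE_OF(PARSE_JSON('[1,2,3]'));   -- 'array'
SELECT TYPE_OF(PARSE_JSON('{"a": 1}'));  -- 'object'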

There are still 2 questions to be answered in order to select Snowflake as the final framework for our epic.

Is it possible to create a cron to regularly anonymize our data?
Is it possible to store the result of our transformations somewhere?

Well, I guess Snowflake has it all figured out, because yes, it’s possible!

Firstly, Snowflake allows you to create what is called a Task. A Task is used to execute a single SQL statement or a stored procedure on a schedule, just like a cron job.

For example (from the official documentation), here’s a task that inserts the current timestamp into a table every 5 minutes.

CREATE TASK mytask_minute
WAREHOUSE = mywh
SCHEDULE = '5 MINUTE'
AS
INSERT INTO mytable(ts) VALUES(CURRENT_TIMESTAMP);
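
One detail to keep in mind: a newly created task is suspended by default, so it has to be resumed before the schedule actually kicks in.

ALTER TASK mytask_minute RESUME;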

To answer the second question, Snowflake allows you to transfer data to external storage (this process is called unloading). Here's a simple example of unloading, where the external stage is called "s3_bucket".

COPY INTO '@s3_bucket/foo.csv.gz' FROM (select * from mytable)
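
The "@s3_bucket" part refers to an external stage, i.e. a named pointer to the external storage, which has to be declared beforehand. It might look roughly like this (the bucket URL and credentials are placeholders; a storage integration is the cleaner way to handle authentication):

CREATE STAGE IF NOT EXISTS s3_bucket
  URL = 's3://my-company-exports/anonymized/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');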

Well, we have everything we need! The production data is already at our disposal, we can process it and transfer it at regular intervals.

Let’s get practical!!

Practical Use Case

As a feature owner, my role is to frame an epic in order to respond to a problem, and then to create a set of tickets that can be handled by my team.

Before creating tickets, here is the architecture schema I imagined, which was validated during a second presentation to the Architecture Guild.

Final Epic Architecture Concept

The most important part of the diagram can be found at the bottom left. It is composed of 2 parts:

  • The first part is a library called snowflake-anonymize-function, which contains the JavaScript code of our UDF (compiled from TypeScript, of course). It is synchronized with Snowflake from our CI every time it changes.
  • The second part is a CLI tool called snowflake-anonymize-cli, which transforms a configuration file (aka "anonymize.yml") into procedures and tasks (one of each per table to anonymize) and synchronizes them with Snowflake, also from our CI.

The snowflake-anonymize-function is exposed from inside Snowflake through the anonymize function. Here’s an example of usage:

SELECT
  anonymize(FIRST_NAME_COLUMN, 'firstname') as FIRST_NAME,
  anonymize(LAST_NAME_COLUMN, 'lastname') as LAST_NAME
FROM USERS

The first argument is the value (referenced by its column name) and the second one is the "transformer" used to actually transform this value. In this example there are only "firstname" and "lastname", but in reality we've implemented a lot more ("iban", "phone", …).

As a result of this function, we get a value that is deterministic. That means that we always get the same output value for the same input value. This is an important feature, as we need to maintain consistency between data in different databases.
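
We won't detail our transformers here, but to give an idea, determinism can be achieved by hashing the input and using the hash to pick a fake value. Here is a simplified JavaScript sketch (the hash and the name list are illustrative, not our actual implementation):

// Deterministic by construction: the same input always maps to the same fake value.
const FAKE_FIRSTNAMES = ['Alice', 'Bruno', 'Chloe', 'David', 'Emma', 'Farid'];

// FNV-1a hash, chosen only because it is tiny and dependency-free
function fnv1a(input) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function anonymizeFirstname(value) {
  return FAKE_FIRSTNAMES[fnv1a(String(value)) % FAKE_FIRSTNAMES.length];
}

// anonymizeFirstname('Julien') returns the same fake name on every run,
// which keeps duplicated values consistent across tables and databases.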

The snowflake-anonymize-cli (developed using the excellent oclif library) is distributed as a private npm package in our GitHub registry. It is used from inside the CI workflow of every anonymized project repository.

Here's a real (truncated) example of a configuration file interpreted by this CLI tool.

project:
  name: directory
sources:
  directory:
    schedule: USING CRON 0 1 * * sun UTC
    prefix: fr.directory
    destination: "@FR.S3_PROD/directory/[DATE]"
resources:
  - name: users
    source: directory
    columns:
      id: ~ # "~" = we keep the column as it is, without transform
      birthday:
        generator: date
      email:
        generator: email
      first_name:
        generator: firstname
      last_name:
        generator: lastname

And the call that synchronizes it with Snowflake.

npx @themenu/snowflake-anonymize-cli@latest sync anonymize.yml

As command output, we obtain the following.

Dropping orphan tasks and procedures... 0 task(s) and 0 procedure(s) dropped.
Synchronizing tasks and procedures instructions... 2 instructions synchronized

Procedures
──────────────────────────────────────────────────────
EXTRACT_ANONYMIZED_DIRECTORY_USERS_PROCEDURE

Tasks
─────────────────────────────────────────────────
EXTRACT_ANONYMIZED_DIRECTORY_USERS_TASK

So, as you can see, the snowflake-anonymize-cli has created a new procedure and a new task. The generated code is the following:

CREATE OR REPLACE PROCEDURE
EXTRACT_ANONYMIZED_DIRECTORY_USERS_PROCEDURE()
RETURNS VARIANT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
const today = new Date().toISOString().slice(0, 10);
const st = snowflake.createStatement({
  sqlText: `
    COPY INTO '@FR.S3_PROD/directory/${today}/users.csv.gz'
    FROM (
      SELECT
        id,
        anonymize(birthday, 'date') as birthday,
        anonymize(email, 'email') as email,
        anonymize(first_name, 'firstname') as first_name,
        anonymize(last_name, 'lastname') as last_name
      FROM
        FR.DIRECTORY.USERS
    )
    FILE_FORMAT=(
      TYPE=CSV
      FIELD_OPTIONALLY_ENCLOSED_BY='"'
      COMPRESSION='gzip'
    )
    SINGLE=TRUE
    HEADER=TRUE
    OVERWRITE=TRUE;
  `
});
const result = st.execute();
result.next();
return result.getColumnValue(1);
$$;

CREATE OR REPLACE TASK EXTRACT_ANONYMIZED_DIRECTORY_USERS_TASK
  warehouse = 'DEVX_WH'
  schedule = 'USING CRON 0 1 * * sun UTC'
AS
  call EXTRACT_ANONYMIZED_DIRECTORY_USERS_PROCEDURE();

As a result, the whole users table from the directory database will be anonymized every Sunday at 01:00 UTC.

By using a Procedure alongside the Task, we're also able to run it ourselves by calling the following on Snowflake (which is useful for debugging!).

call EXTRACT_ANONYMIZED_DIRECTORY_USERS_PROCEDURE();

And here we are: the POC is functional and everything works as expected. I still have to create the tickets to extend the principle to all our projects, and everything will be ready, right?

It would be a lie to say that we didn’t have any problems during the implementation, but on the whole everything went well. Nevertheless, here is a small list of the minor issues we encountered:

  • When a table is deleted from anonymize.yml, we need to remove the associated procedure and task too (see the snippet after this list)
  • We need to handle arrays and JSON in our transform functions too
  • There are timeout and concurrency constraints on Snowflake, so it was necessary to stagger some Tasks in time
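
For the first point, dropping an orphan pair comes down to something like the following (the names follow the generated convention shown earlier):

DROP TASK IF EXISTS EXTRACT_ANONYMIZED_DIRECTORY_USERS_TASK;
DROP PROCEDURE IF EXISTS EXTRACT_ANONYMIZED_DIRECTORY_USERS_PROCEDURE();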

Conclusion

To conclude, I would say that using Snowflake to anonymize large volumes of data (several hundred million rows in total) is a very cost-effective solution.

As a result, we consume about 0.5 Snowflake credit (about 2.5€) per day to anonymize over 250 tables. That’s an unbeatable price compared to specialized solutions on the market.

However, this solution is not perfect. It is necessary to create configuration files from the database schema beforehand, which is time-consuming.

I also learned a personal lesson from this experience: not to write a single line of code until I have discussed my idea with my peers.

Applying this principle would have saved me several hours of wasted time and the sadness of deleting code that I was attached to.
