Snowflake Security: How to generate your real Anonymized and Synthetic Data (Part 1)

Generate your Snowflake data in a secure way for your company (image generated by AI at simplified.com)

The security of our data is as important as its management. It doesn’t matter how well you manage your data: if your information falls into the wrong hands, you have failed and lost. There are many Snowflake security features that help you protect your data platform and your data in many scenarios. You can follow my Snowflake Security series here for more information.

But what happens when you have to test something in a local or development environment, and you face the following?

  • You don’t have data, or you only have a few rows
  • The existing data doesn’t have enough quality or formatting to trust in your developments
  • There is no integrity in your data

Without good testing on reliable data, you sometimes have to rely on your team’s skills alone, and with luck you will deploy your project or new features to the next environments with a high chance of failing immediately, or of facing incidents later on. That generates insecurity and a lack of confidence in your deliverables, and in the end you miss your project milestones. Your project becomes a failure.

In this article, you will see how to generate synthetic data in a reliable manner to support your developments, adding the highest level of security to your data while keeping the appropriate quality.

Analyzing Your Scenario

There are two ways of generating synthetic data:

  • Ethereal: The data is generated with row generator functions. This method is usually used when you don’t have enough knowledge of your data, for example in the starting phases of a project. This approach will be covered in upcoming articles (a minimal preview is sketched below).
  • Real: The data is generated from existing data. This method is used frequently to anonymize the data that already exists in your production environment so you can provide it to lower environments. This article is based on this scenario.
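As a minimal, hedged preview of the ethereal approach (covered in detail in a later article), rows can be fabricated from scratch with Snowflake’s GENERATOR table function; the column names here are purely illustrative:

-- Sketch of the "ethereal" approach: fabricate rows from nothing.
-- Column names are illustrative and not part of any real model.
SELECT SEQ4() + 1                AS customer_id,   -- sequential surrogate key
       UNIFORM(18, 90, RANDOM()) AS age            -- random value within a range
FROM TABLE(GENERATOR(ROWCOUNT => 5));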

On one hand, you apply a soft anonymization between production-like environments, because your stakeholders need to test and run the UAT with enough visibility and confidence to give the go-live.

On the other hand, you apply a hard anonymization in lower environments, because a wide range of people (developers, testers, interns, etc.) will access them and you don’t want to share your business data openly.

Applying strategies for your synthetic data

You have decided you want to provide synthetic data in your environments, so what would be the best strategy for this?

Row Access & Masking Policies

It’s a good strategy because it is dynamic, but it comes with some drawbacks:

  • It requires you to manage a lot of policies, which in some cases have limitations.
  • Applying policies adds extra compute cost every time the information is consumed, regardless of whether you are actually allowed to see the data or not.
  • Applying the same transformation introduces vulnerabilities: depending on the values your policy produces, the same input can always map to the same output. This allows attackers to infer information and, in some cases, extract insights from your data (see the sketch after this list).
  • In some cases you can end up with poor quality data, which slows down your tests and UATs.
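To illustrate that deterministic-transformation risk, here is a hedged sketch of a masking policy (the policy, role and column names are made up, not part of any real model). Because the transformation is deterministic, identical inputs always produce identical masked outputs, so rows can still be grouped and correlated:

-- Hypothetical masking policy, for illustration only.
-- SHA2 is deterministic: identical emails always mask to the same hash,
-- so an attacker can still count, group and correlate rows by the masked value.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PROD_ADMIN') THEN val   -- privileged roles see the real value
    ELSE SHA2(val)                                   -- everyone else sees a stable hash
  END;

-- ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;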

Physical Synthetic Data

It’s another good strategy, but you need good skills to achieve a good set of data. In this article you will improve those skills :)

We have to take into consideration that:

  • We will have to apply a passphrase so that a different anonymization factor is used each time we generate our synthetic data. This process can be automated, but it has to be analyzed and defined in alignment with your business. This step is discussed in the next section.
  • It won’t incur extra charges each time you consume your data; remember that it is generated physically, so you are only charged based on how often you regenerate it.
  • The information is regenerated periodically to prevent inference and avoid vulnerabilities. We can decide when the information expires and regenerate our synthetic data (see the scheduling sketch after this list).
  • We keep good data quality, preserving the correct integrity across your data model.
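As a hedged sketch of that periodic regeneration (the warehouse and database names and the one-column transformation are assumptions for illustration, not the article’s model), a scheduled task can rebuild the synthetic table at whatever cadence you decide:

-- Hypothetical scheduled regeneration; object names and the rule are illustrative.
CREATE OR REPLACE TASK refresh_synthetic_customer
  WAREHOUSE = transform_wh                  -- assumed warehouse name
  SCHEDULE  = 'USING CRON 0 2 * * 1 UTC'    -- rebuild every Monday at 02:00 UTC
AS
  CREATE OR REPLACE TABLE dev_db.synthetic.customer AS
  SELECT C_CUSTKEY,
         UNIFORM(1, 5, RANDOM()) AS C_SEGMENT_CODE   -- placeholder anonymization rule
  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER;

-- ALTER TASK refresh_synthetic_customer RESUME;     -- tasks are created suspended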

Analyzing the generation of our Synthetic Data

This process is a bit complex, but understanding the steps below will help you succeed and get the benefits of your physical synthetic data.

In order to apply a transformation to each field, you will have to consider the type of the value (number, date, boolean, …) and its business meaning, if it really has one.

Once you know that, you have to think about the range of values you have for that field:

  • Specific: You have a small set of values you want to apply to your field.
  • Dynamic: You apply values generated dynamically based on a field (both cases are sketched just below).
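As a hedged sketch of both cases (the segment labels and the naming rule are invented for illustration): the specific rule picks one value from a short fixed list, while the dynamic rule derives a new value from the field itself.

SELECT C_MKTSEGMENT,
       -- Specific: choose from a small fixed list of values.
       DECODE(UNIFORM(1, 3, RANDOM()),
              1, 'RETAIL', 2, 'WHOLESALE', 3, 'ONLINE')  AS specific_segment,
       -- Dynamic: derive the value from the field itself.
       'CUST_' || (ABS(HASH(C_NAME)) % 1000000)::VARCHAR AS dynamic_name
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
LIMIT 5;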

Global consistency matters when deciding whether you always want to apply the same value to your data based on a specific field, or whether random values are enough. In the managed (consistent) case, you will have to compute an anonymization factor based on a passphrase, which is the basis for providing global consistency for your value. This is very useful when you want to preserve integrity between different tables, producing the same value in every field that contains this data.

Finally, you will obtain your values using random functions, applying the logic and inputs above.

Note that you can apply all this logic either in masking policies or in a secure function / secure procedure. Using a secure UDF is important because it keeps the logic of each rule private.
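A minimal sketch of that idea, assuming a made-up rule and function name: declaring the function as SECURE hides its definition from roles that can only call it, so the anonymization rule itself stays private.

-- Hypothetical secure UDF: the body is hidden from users who can only execute it.
-- The rule (hash the value with the passphrase and map it to a fake label) is illustrative.
CREATE OR REPLACE SECURE FUNCTION MASK_CUSTOMER_NAME(val VARCHAR, pass_phrase VARCHAR)
RETURNS VARCHAR
AS
$$
  'CUSTOMER_' || (ABS(HASH(val || pass_phrase)) % 1000000)::VARCHAR
$$;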

Theoretical concepts

Before starting to generate data, we have to know some basic concepts. In Snowflake, the function most used to generate information is UNIFORM.

UNIFORM( <min> , <max> , <gen> )

This function generates a uniformly-distributed pseudo-random number in the inclusive range between min and max, based on gen, an expression that serves as a raw source of random numbers. Typically this is the RANDOM function when the data does not need to be globally consistent, but if we want global consistency we will have to pass an anonymization factor based on our value logic, as mentioned earlier.
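As a quick, hedged illustration of that difference (the constant 42 is arbitrary): with RANDOM() as the generator the result changes on every evaluation, while with a constant generator UNIFORM returns the same value every time, which is exactly the behaviour we will exploit for global consistency.

SELECT UNIFORM(1, 5, RANDOM()) AS changes_every_time,   -- new value on each evaluation
       UNIFORM(1, 5, 42)       AS always_the_same;      -- constant gen => constant result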

Anonymization Factor

If you execute your data generation twice using the RANDOM function as the raw factor, you will see the inconsistency of the data:

SELECT TOP 5 C_CUSTKEY, RANDOM(), UNIFORM(1,5,RANDOM())
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER ;
First execution
Second execution

Each time you execute it, you will see the function returns a different value, for the following reasons:

  • The UNIFORM function is not receiving any logic; we only constrain the values to a range from 1 to 5.
  • The value of the raw random factor changes every time without any control, so the result changes randomly.

Now, if we need consistency for our field C_CUSTKEY, our first objective is to assign the same static raw factor based on the value of C_CUSTKEY. This way, we can use the same functionality in other tables and get the same anonymized value.

Take into consideration that this field could be a string, boolean, date, timestamp, etc., but in this case C_CUSTKEY is a number.

This is where our anonymization factor comes into action. This factor is calculated from a value in order to generate a number that will always be the same for that value.

The following function, named ANONYM_FACTOR, is able to transform any input value plus a passphrase into a number with global consistency:

-- Hash the value together with the passphrase (MD5), convert the hex digits to
-- decimals and fold them into a single deterministic number.
CREATE OR REPLACE FUNCTION ANONYM_FACTOR(var_value VARCHAR, var_pass_phrase VARCHAR)
RETURNS NUMBER
AS
$$
SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,1,18)::NUMERIC(18,0)+
SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,19,18)::NUMERIC(18,0)+
COALESCE(TRY_TO_NUMBER(SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,37,18)),0)::NUMERIC(18,0)
$$
;

Next, we test our anonymization factor. First, we set our passphrase for today:

SET PASSPHRASE ='our_current_temporary_consistency_day1';

We execute the query again, now using the anonymization factor:

SELECT TOP 5 C_CUSTKEY, 
ANONYM_FACTOR(C_CUSTKEY,$PASSPHRASE),
UNIFORM(1,5,ANONYM_FACTOR(C_CUSTKEY,$PASSPHRASE))
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER ;
First Execution

The second execution:

Second Execution

There is no change!! We have achieved global consistency for our new field value in the range from 1 to 5, based on a field and a passphrase.

Imagine that we are going to regenerate our data using the same method, still applying global consistency. We only need to change our passphrase!

SET PASSPHRASE ='our_NEW__temporary_consistency_day2';
First execution
Second execution
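As an extra hedged example (the join to ORDERS is only to illustrate the integrity point from earlier; it is not one of the article’s own queries), the same key anonymized in two different tables with the same passphrase produces the same value, so referential integrity survives the anonymization:

SELECT c.C_CUSTKEY,
       UNIFORM(1, 5, ANONYM_FACTOR(c.C_CUSTKEY, $PASSPHRASE)) AS anon_from_customer,
       UNIFORM(1, 5, ANONYM_FACTOR(o.O_CUSTKEY, $PASSPHRASE)) AS anon_from_orders   -- identical per key
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER c
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS   o
  ON o.O_CUSTKEY = c.C_CUSTKEY
LIMIT 5;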

Conclusion

Snowflake security is one of the most important topics you have to cover in your data architecture. As you can see in this article, the concepts for generating secure anonymized and synthetic data are fundamental to achieving the maximum level of security in your data platform, while taking into consideration one of the most valuable requirements in any project: global consistency.

In the next article, we will see how to physically generate synthetic data in a practical use case. An existing data model will be used, and we will anonymize all of its fields in order to generate consistent synthetic information that you can deploy to your lower environments. Please keep an eye out for my next articles!

About me

Subject Matter Expert in different data technologies, with 20+ years of experience in data adventures. I am a Snowflake Architect, Snowflake Squad Spotlight Member and part of the Snowflake Barcelona User Group Chapter.

As a Data Vault Certified Practitioner, I have been leading Data Vault architectures using metadata-driven methodologies.

If you want to know more about the aspects covered here, or others, you can follow me on Medium || LinkedIn here.

I hope you enjoyed this article and that it helps you!

Cesar Segura