Snowflake Security: How to generate your real Anonymized and Synthetic Data (Part 2)

Security is one of the most important aspects you have to take into consideration when you want to acquire or design something.

A couple of examples: if you have ever bought a plot of land, perhaps with a house already on it, regardless of the condition of the building, what would be the first thing you do? In that case, I highly recommend protecting and delimiting your property, or the house itself, with a wall or fence. The same happens when you buy a mobile phone: before using it, you buy a protective case.

Your most precious asset is your knowledge; you have to protect it above everything else. Over time, you can build on it and improve it.

Following our path through the Snowflake Security series, the main objective we want to achieve is to protect your Data. In the previous article, I explained the main strategies to adopt for getting Anonymized and Synthetic Data. I also highlighted the concepts you have to know in advance, providing you with the skills needed before starting to generate your data.

In this article, we will work through a practical use case in an emulated production environment, providing anonymized and synthetic data in a safe and protected way to a lower environment such as Development.

Providing different perspectives of your data to all users at any time allows you to accelerate your business.

Some of the concepts we will see in this article were explained in the previous article, Part 1.

Checking Prod Environment Model

Here we have a 3NF model, a sample that Snowflake provides us in the SNOWFLAKE_SAMPLE_DATA database. We will treat it as the relational model of our tables in the PROD environment. So, we will extract this information into our own database in order to work through our use case.

So that you can follow this article and practice along, I provide the SQL code below.

--WE CREATE OUR INITIAL PROD DATABASE
USE ROLE ACCOUNTADMIN;
CREATE DATABASE EDW;
CREATE SCHEMA DC_DATA_PSA;

--WE CREATE OUR MODEL PROD TABLES
CREATE OR REPLACE TABLE EDW.DC_DATA_PSA.ORDERS
AS
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;

CREATE OR REPLACE TABLE EDW.DC_DATA_PSA.CUSTOMER
AS
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER;

CREATE OR REPLACE TABLE EDW.DC_DATA_PSA.NATION
AS
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION;

If you check these tables, you will see that some of them contain different PII or sensitive data, such as the PRICE of every ORDER or the Address of the Customers, so we will need to apply some techniques to generate alternative data for the lower environments.

Hands On

Before starting to get our Anonymized Tables, we must consider the different steps shown below:

Synchronized Key Tables: These tables contain the key mapping between the root primary keys and the anonymized ones. For security reasons, these tables will be generated as temporary. We have to avoid any possibility of these tables being available to someone else, in any case, so they will be dropped once you close the session.

Anonymized Tables: These are the tables generated by applying anonymization techniques to the original tables together with the Synchronized Key tables.

First Step: Synchronized Key Tables

1.1 ORDERS table: Analyze / Understand

I start by checking the technical schema of the primary keys. As usual, and as for the rest of the cases, the business unit wants to keep the data as close to reality as possible. So we will gather some insights ourselves, since this knowledge cannot be transmitted by any BU right now:

So we would like to know the min, the max and how many rows there are in the table:
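
A quick profiling query could look like this (a minimal sketch against the ORDERS table created above):

--QUICK PROFILING SKETCH: MIN, MAX AND NUMBER OF ROWS OF THE ORDERS PK
SELECT MIN(O_ORDERKEY) MIN_O_ORDERKEY,
       MAX(O_ORDERKEY) MAX_O_ORDERKEY,
       COUNT(*) NBR_ROWS
FROM EDW.DC_DATA_PSA.ORDERS;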

With these inputs, we know that we will need an anonymized PK:

  • Schema Format: Numeric
  • Anonymization Technique: Fixed (it's a PK, we need global consistency)
  • Based on O_ORDERKEY
  • Range of Values: from 1 upwards. Whether or not we want a sequential order depends on applying an extra function for this logic (we will use ROW_NUMBER).

1.2 ORDERS Table: Generate

We will generate our synchronized list of ORDERS mapping keys:

--WE ESTABLISH OUR DAY 1 PASS PHRASE
SET PASSPHRASE = 'MYPASSDAY1';

--WE CREATE ROOT IDENTIFIER TABLES AS TEMPORARY
CREATE OR REPLACE TEMPORARY TABLE EDW.DC_DATA_PSA.SYN_LIST_ORDERS
AS
SELECT O_ORDERKEY, ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) O_ORDERKEY_ANONYNMIZED --> NUMBER FIXED, BETWEEN 1 AND so far, BASED ON O_ORDERKEY
FROM EDW.DC_DATA_PSA.ORDERS;

Note: You will need to create the ANONYM_FACTOR function. As you will see, this factor is the key to all our synthetic data generation, and it is applied in most of the logic. This function was covered in the previous article; for more details, please refer to it.

Here is the function, extracted from Part 1:

CREATE OR REPLACE FUNCTION ANONYM_FACTOR(var_value varchar, var_pass_phrase varchar)
RETURNS NUMBER
AS
$$
SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,1,18)::NUMERIC(18,0)+
SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,19,18)::NUMERIC(18,0)+
COALESCE(try_to_number(SUBSTRING((TO_DECIMAL(LEFT(MD5(var_value::VARCHAR||var_pass_phrase),30),'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
+TO_DECIMAL(RIGHT(MD5(var_value::VARCHAR||var_pass_phrase),2)||'a','XXX'))::VARCHAR,37,18)),0)::NUMERIC(18,0)
$$
;
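
As a quick sanity check (a minimal sketch, assuming the session variable set above), the factor is deterministic: calling it twice with the same value and passphrase returns the same number, which is exactly what gives us global consistency.

--SANITY CHECK SKETCH: THE FACTOR IS DETERMINISTIC FOR THE SAME VALUE + PASSPHRASE
SELECT ANONYM_FACTOR('12345',$PASSPHRASE) FACTOR_A,
       ANONYM_FACTOR('12345',$PASSPHRASE) FACTOR_B; --FACTOR_A = FACTOR_B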

Regarding how to build the logic of our Sync ORDERS table: the new ORDERKEY is generated based on the anonymization factor, which uses the O_ORDERKEY and the passphrase. Each factor value is unique across all the values we want to generate, and we want the result as a sequential number, so in this case we use a ROW_NUMBER function to apply the sequence. We could also use a SEQUENCE object, but we want to keep this practical use case simple.

So let's check how this table looks once it has been generated:
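
For example, a minimal inspection sketch of the temporary key map:

--INSPECT A FEW ROWS OF THE SYNCHRONIZED KEY MAP JUST GENERATED
SELECT *
FROM EDW.DC_DATA_PSA.SYN_LIST_ORDERS
ORDER BY O_ORDERKEY_ANONYNMIZED
LIMIT 10;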

2.1 CUSTOMER & NATION Tables: Analyze / Understand

In the ANALYZE process for the CUSTOMER and NATION tables, we see that they are very similar to the ORDERS table.

CUSTOMER table Analysis
NATION table Analysis

2.2 CUSTOMER & NATION Tables: Generate

So we will apply the same process as for ORDERS:

CREATE OR REPLACE TEMPORARY TABLE EDW.DC_DATA_PSA.SYN_LIST_CUSTOMER
AS
SELECT C_CUSTKEY,
       ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(C_CUSTKEY,$PASSPHRASE)) C_CUSTKEY_ANONYNMIZED --> NUMBER FIXED, BETWEEN 1 AND so far, BASED ON C_CUSTKEY
FROM EDW.DC_DATA_PSA.CUSTOMER;

CREATE OR REPLACE TEMPORARY TABLE EDW.DC_DATA_PSA.SYN_LIST_NATION
AS
SELECT N_NATIONKEY,
       ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(N_NATIONKEY,$PASSPHRASE)) N_NATIONKEY_ANONYNMIZED --> NUMBER FIXED, BETWEEN 1 AND so far, BASED ON N_NATIONKEY
FROM EDW.DC_DATA_PSA.NATION;

Note: Since we are still working in the same session as before, we keep the same PASSPHRASE. This is important in order to keep global consistency.

3.x Other tables: How to Analyze / Understand & Generate

We have seen that all the PKs in this model are numbers, but right now many of you will be asking yourselves: what would the approach be if the PK were text? How could we build that logic?

The answer is very easy: you can apply a function, taking the following into consideration (see the snippet right after this list):

  • What is the maximum length of your field?
  • Is the length variable?
  • We will use the RANDSTR function
SELECT O_ORDERKEY, 
ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) O_ORDERKEY_ANONYNMIZED_NUMERIC_SEQUENCE, --> NUMBER FIXED, BETWEEN 1 AND so far, BASED ON O_ORDERKEY
randstr(uniform(10,20,O_ORDERKEY), ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) O_ORDERKEY_ANONYNMIZED_VARIABLE_LENGTH_10_and_20, ---- PK TEXT WITH VARIABLE LENGTH 10 AND 20 WITH GLOBAL CONSISTENCY
randstr(15, ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) O_ORDERKEY_ANONYNMIZED_FIXED_LENGTH_15 -- PK TEXT WITH FIXED LENGTH 15 WITH GLOBAL CONSISTENCY
FROM EDW.DC_DATA_PSA.ORDERS;

The query above shows how to generate a text PK, with either variable or fixed length, using whatever range you need.

Here is an example:

It doesn't matter how many times you execute this: these are anonymized PK fields with global consistency applied, so they will always return the same values. When you change the PASSPHRASE, different values are returned, again with global consistency.
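
You can verify this with a small sketch: running the same seeded expression twice and subtracting one result set from the other should return zero rows, confirming global consistency under the same passphrase.

--VERIFICATION SKETCH: SAME PASSPHRASE => SAME ANONYMIZED VALUES (QUERY RETURNS ZERO ROWS)
SELECT O_ORDERKEY, randstr(15, ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) PK_TEXT
FROM EDW.DC_DATA_PSA.ORDERS
MINUS
SELECT O_ORDERKEY, randstr(15, ANONYM_FACTOR(O_ORDERKEY,$PASSPHRASE)) PK_TEXT
FROM EDW.DC_DATA_PSA.ORDERS;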

Second Step: Anonymized Tables

Some of the methodology used here, regarding how to approach the strategy of applying the anonymization logic to our tables, was described in detail in Part 1; please check it for details.

Once we arrive at this point, with all the Synchronized Key tables generated, we are able to generate the rest of our tables without any further dependency. This step can be done either directly in your target database or, by default, in temporary tables until you establish your own publishing strategy towards your lower environments.

Pay attention: for every table we generate, we will always need to join with our Synchronized Key tables. At a minimum, we will need the root Sync Key table, but if the table references other ones, we will add extra joins to get the anonymized PK fields.

1.1 NATION Table: Analyze / Understand

Let's see what the PROD NATION table looks like:

select * from EDW.DC_DATA_PSA.NATION order by 1;

We have to focus on the schema, the format, the business meaning, which fields need global consistency and which can be random, and the range of the values.

On the other hand, for this table, we will discard other FK fields such as N_REGIONKEY.

For the rest of the cases, you only need to apply the same strategy as previously.

1.2 NATION Table: Generate

Here we have an example for our synthetic table, with comments on every field:

--WE CREATE OUR TARGET DEVELOPMENT (DES) DATABASE
USE ROLE ACCOUNTADMIN;
CREATE DATABASE EDW_DES;
CREATE SCHEMA DC_DATA_PSA;

--WE CREATE OUR FIRST ANONYMIZED TABLE: NATION
CREATE OR REPLACE TABLE EDW_DES.DC_DATA_PSA.NATION
AS
SELECT
SYN.N_NATIONKEY_ANONYNMIZED N_NATIONKEY, --> GETTING PK ANONYMIZED FROM SYN_LIST_NATION TABLE
--> N_NAME --> VARCHAR variable LENGTH between 10 and 20, BASED ON N_NATIONKEY
RANDSTR(uniform(10,20,ROOT.N_NATIONKEY), ANONYM_FACTOR(ROOT.N_NATIONKEY,$PASSPHRASE)) N_NAME,
RANDSTR(uniform(50,100,ROOT.N_NATIONKEY),random()) N_COMMENT --> RANDOM VARCHAR variable LENGTH between 50 and 100
FROM EDW.DC_DATA_PSA.NATION ROOT
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_NATION SYN
ON ROOT.N_NATIONKEY=SYN.N_NATIONKEY;

Let's check what our synthetic data looks like:

Nation table anonymized

Notice that there is no correspondence between the original N_NATIONKEY and the anonymized one. You can verify this in the SYNC Key Map temporary table every time you change your passphrase and generate it again.
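
For example (a minimal sketch, valid only in the session where the temporary table lives):

--INSPECT THE MAPPING BETWEEN ORIGINAL AND ANONYMIZED NATION KEYS
SELECT N_NATIONKEY, N_NATIONKEY_ANONYNMIZED
FROM EDW.DC_DATA_PSA.SYN_LIST_NATION
ORDER BY N_NATIONKEY;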

2.1. CUSTOMER Table: Analyze / Understand

Here we will take into account that there are other FK fields, so we will have to look up the appropriate Sync Key Map tables, in this case the NATION one.

Let's see what the PROD CUSTOMER table looks like:

select top 10000 * from EDW.DC_DATA_PSA.CUSTOMER order by 1;

SNOWSIGHT profiling — CUSTOMER table

Based on the prior information, we take some notes:

C_NAME: based on the combination of C_CUSTKEY and a string literal (I know that in a real case it would not be like this). We will manage it as a VARCHAR.

C_ADDRESS: we imagine this is a real address (in this case it is not), so it will be a string with variable length. We will manage it as a VARCHAR.

C_NATIONKEY: a foreign key; we will get it from the Sync Key Map table.

C_PHONE: a number within a value range, but always with the same length. We will manage it as a NUMBER.

C_ACCTBAL: another sensitive field, which contains different amounts, including negative ones. We will specify a NUMBER range between -10000 and 10000.

C_MKTSEGMENT: an attribute field, so there is no reference to another table. The possible values are HOUSEHOLD, FURNITURE, MACHINERY, BUILDING and AUTOMOBILE. In this case, we will manage the assignment using an array with the possible values: based on a number from 1 to 5 (the total number of possible values), we will assign one category value or another.

C_COMMENT: This will be a STRING random value, with a length between 50 and 100 characters.

2.2. CUSTOMER Table: Generate

Here is an example implementing the prior analysis:

CREATE OR REPLACE TABLE EDW_DES.DC_DATA_PSA.CUSTOMER
AS
SELECT
SYN.C_CUSTKEY_ANONYNMIZED C_CUSTKEY, --> GETTING C_CUSTKEY PK ANONYMIZED FROM SYN_LIST_CUSTOMER TABLE
SYN_NAT.N_NATIONKEY_ANONYNMIZED C_NATIONKEY, --> GETTING C_NATIONKEY FK ANONYMIZED FROM SYN_LIST_NATION TABLE
'Customer#'||SYN.C_CUSTKEY_ANONYNMIZED C_NAME,
--> C_ADDRESS --> VARCHAR variable LENGTH between 30 and 60, BASED ON C_CUSTKEY
randstr(uniform(30,60,ROOT.C_CUSTKEY), ANONYM_FACTOR(ROOT.C_CUSTKEY,$PASSPHRASE)) C_ADDRESS,
uniform(100000000000, 1000000000000-1, ANONYM_FACTOR(ROOT.C_CUSTKEY,$PASSPHRASE)) C_PHONE, --> NUMBER FIXED, BETWEEN 1*10^11 AND 1*10^12-1, BASED ON C_CUSTKEY
uniform(-10000::number(10,2), 10000.00::number(10,2), ANONYM_FACTOR(ROOT.C_CUSTKEY,$PASSPHRASE)) C_ACCTBAL,--> NUMBER FIXED, BETWEEN -10000.00 AND 10000.00, BASED ON C_CUSTKEY
--> VARCHAR RANGE 'HOUSEHOLD','FURNITURE','MACHINERY','BUILDING','AUTOMOBILE' WITH 5 VALUES, BASED ON C_CUSTKEY
GET(ARRAY_CONSTRUCT('HOUSEHOLD','FURNITURE','MACHINERY','BUILDING','AUTOMOBILE'),uniform(1, 5, ANONYM_FACTOR(ROOT.C_CUSTKEY,$PASSPHRASE))-1)::VARCHAR C_MKTSEGMENT,
RANDSTR(uniform(50,100,ROOT.C_CUSTKEY),random()) C_COMMENT --> RANDOM VARCHAR variable LENGTH between 50 and 100
FROM EDW.DC_DATA_PSA.CUSTOMER ROOT
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_CUSTOMER SYN --JOIN WITH SYNCHRONIZED CUSTOMER TABLE
ON ROOT.C_CUSTKEY=SYN.C_CUSTKEY
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_NATION SYN_NAT --JOIN WITH SYNCHRONIZED NATION TABLE
ON ROOT.C_NATIONKEY=SYN_NAT.N_NATIONKEY;

We execute it and check again, to see what the synthetic data looks like:

CUSTOMER Anonymized DDL Creation
CUSTOMER Anonymized table

Above you can check that we have all the rows that the Production table has. But pay attention: if we want to reduce the quantity of CUSTOMER data, we only need to apply the appropriate logic to the SYNC Key Map CUSTOMER table, as shown in the sketch below.
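
For instance, here is a minimal sketch (the sampling percentage is illustrative): sampling the customers directly in the key map means that every table joining SYN_LIST_CUSTOMER is reduced consistently by the downstream INNER JOINs.

--SKETCH: REDUCE THE VOLUME AT THE KEY-MAP LEVEL (ILLUSTRATIVE 10% SAMPLE OF CUSTOMERS)
CREATE OR REPLACE TEMPORARY TABLE EDW.DC_DATA_PSA.SYN_LIST_CUSTOMER
AS
SELECT C_CUSTKEY,
       ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(C_CUSTKEY,$PASSPHRASE)) C_CUSTKEY_ANONYNMIZED
FROM EDW.DC_DATA_PSA.CUSTOMER SAMPLE (10);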

3.1. ORDERS Table: Analyze / Understand

Here we will take into account that there are other FK fields, so we will have to look up the appropriate Sync Key Map tables, in this case the CUSTOMER one. In addition, we will get the anonymized ORDERS PK as well.

Let's see what the PROD ORDERS table looks like, in order to check the different business logic we will have to apply in our case:

SELECT top 10000 * FROM EDW.DC_DATA_PSA.ORDERS order by 1;

Remember that you can use the Snowsight UI profiling to check the statistics:

SNOWSIGHT profiling — ORDERS table

Based on the prior information, we take some notes:

O_ORDERKEY and O_CUSTKEY: we will get these by joining with the Sync Key Map tables.

O_ORDERSTATUS: generated from an array of 3 values: 'F', 'O' and 'P'.

O_ORDERDATE: We see that the dates range from a first date of 1992-01-01 to a last date of 1998-08-02, so we will keep this range. We could also change it if we wanted, but in this case we will keep it. In order to generate dates whose values are fixed based on specific fields, we need:

  • Specific fields: O_ORDERDATE & O_ORDERKEY
  • Last date: 1998-08-02
  • Days back to the first date we want to generate: in this case, to reach 1992-01-01, our value is 2405 days, as used in the expression below.
dateadd(dd,- uniform(1, 2405, ANONYM_FACTOR(O_ORDERDATE||ROOT.O_ORDERKEY,$PASSPHRASE)) ,'1998-08-02'::DATE) O_ORDERDATE, -->FORMAT AS DESIRED

O_ORDERPRIORITY: generated from an array of 5 values: '1-URGENT', '2-HIGH', '3-MEDIUM', '4-NOT SPECIFIED' and '5-LOW'.

O_CLERK: a fixed-length VARCHAR of 15 characters, based on the O_ORDERKEY field.

O_SHIPPRIORITY: generated from an array of 2 values: 1 or 0.

O_COMMENT: a random VARCHAR of length 100.

3.2. ORDERS Table: Generate

Here is an example implementing the prior analysis:

CREATE OR REPLACE TABLE EDW_DES.DC_DATA_PSA.ORDERS
AS
SELECT
SYN.O_ORDERKEY_ANONYNMIZED O_ORDERKEY, --> GETTING O_ORDERKEY PK ANONYMIZED FROM SYN_LIST_ORDERS TABLE
SYN_CUST.C_CUSTKEY_ANONYNMIZED O_CUSTKEY, --> GETTING O_CUSTKEY FK ANONYMIZED FROM SYN_LIST_CUSTOMER TABLE
--> VARCHAR RANGE 'F','O','P' WITH 3 VALUES, BASED ON O_ORDERKEY
GET(ARRAY_CONSTRUCT('F','O','P'),uniform(1, 3, ANONYM_FACTOR(ROOT.O_ORDERKEY,$PASSPHRASE))-1)::VARCHAR O_ORDERSTATUS,
uniform(100::number(10,2), 5000.00::number(10,2), ANONYM_FACTOR(ROOT.O_ORDERKEY,$PASSPHRASE)) O_TOTALPRICE, --> NUMBER FIXED, BETWEEN 100.00 AND 5000.00, BASED ON O_ORDERKEY
--> O_ORDERDATE: DATES FIXED BY VALUE, BASED ON O_ORDERKEY AND O_ORDERDATE
--2405 DAYS is the difference between the last date 1998-08-02 and the first date 1992-01-01,
--so the function below will provide dates within that range.
--We specify the last date and the number of days back to the first date.
dateadd(dd,- uniform(1, 2405, ANONYM_FACTOR(O_ORDERDATE||ROOT.O_ORDERKEY,$PASSPHRASE)) ,'1998-08-02'::DATE) O_ORDERDATE, -->FORMAT AS DESIRED
--> VARCHAR RANGE '1-URGENT','2-HIGH','3-MEDIUM','4-NOT SPECIFIED','5-LOW' WITH 5 VALUES, BASED ON O_ORDERKEY
GET(ARRAY_CONSTRUCT('1-URGENT','2-HIGH','3-MEDIUM','4-NOT SPECIFIED','5-LOW'),uniform(1, 5, ANONYM_FACTOR(ROOT.O_ORDERKEY,$PASSPHRASE))-1)::VARCHAR O_ORDERPRIORITY,
--> O_CLERK --> VARCHAR FIXED LENGTH IN 15, BASED ON O_ORDERKEY
randstr(15, ANONYM_FACTOR(ROOT.O_ORDERKEY,$PASSPHRASE)) O_CLERK,
--> NUMBER RANGE 1,0 WITH 2 VALUES, BASED ON O_ORDERKEY
GET(ARRAY_CONSTRUCT(1,0),uniform(1, 2, ANONYM_FACTOR(ROOT.O_ORDERKEY,$PASSPHRASE))-1)::VARCHAR O_SHIPPRIORITY,
RANDSTR(100,random()) O_COMMENT --> VARCHAR RANDOM
FROM EDW.DC_DATA_PSA.ORDERS ROOT
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_ORDERS SYN --JOIN WITH SYNCHRONIZED ORDERS TABLE
ON ROOT.O_ORDERKEY=SYN.O_ORDERKEY
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_CUSTOMER SYN_CUST --JOIN WITH SYNCHRONIZED CUSTOMER TABLE
ON ROOT.O_CUSTKEY=SYN_CUST.C_CUSTKEY;
ORDERS Anonymized DDL Creation
ORDERS Anonymized table

Wow!! We are done: the whole model generation has been coded, so we can automate our synthetic data generation.

Orchestration

Once we have every table of our anonymized model, we should think about the best way to execute each step. I always recommend encapsulating all the code of each phase inside a secure procedure.

For example, based on the prior diagram about the strategy for generating the anonymized data, we will have:

  • Sync_Key_Map Secure procedure: This will build all the Synchronized Key Mapping tables, which contain the relationship between the original PKs and the anonymized ones.
  • Gen_Anonym_Data Secure procedure: This will execute every anonymized data generation process, in charge of loading the different tables.

We won't deep dive further, but keep in mind that there are multiple approaches: reusing the passphrase inside the procedure, or passing it in through an IN argument.

You could also split the anonymized data generation into one procedure per table. This would allow you to parallelize this last step, which will usually be the heaviest in terms of performance.

The most important thing, before executing these SQL statements, is to encapsulate all your code inside a secure procedure. This gives you:

  • The appropriate privacy, so that nobody in your data platform can see which passphrase is used.
  • Nobody can monitor or inspect the business logic transformations inside the Anonymized Data Generator procedures.

A secure procedure can be created simply by specifying the SECURE keyword. For more information, you can refer to the Snowflake documentation.
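
As a minimal sketch of the Gen_Anonym_Data idea above (the body is simplified to a single table, and the passphrase is kept as a literal inside the hidden body rather than in a session variable), the wrapping could look like this:

--SKETCH OF A SECURE PROCEDURE: THE BODY (LOGIC AND PASSPHRASE) IS HIDDEN FROM NON-OWNER ROLES
CREATE OR REPLACE SECURE PROCEDURE EDW.DC_DATA_PSA.GEN_ANONYM_DATA()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
BEGIN
  --PASSPHRASE KEPT AS A LITERAL INSIDE THE SECURE BODY (IT COULD ALSO BE PASSED AS AN IN ARGUMENT)
  CREATE OR REPLACE TABLE EDW_DES.DC_DATA_PSA.NATION
  AS
  SELECT SYN.N_NATIONKEY_ANONYNMIZED N_NATIONKEY,
         RANDSTR(uniform(10,20,ROOT.N_NATIONKEY), ANONYM_FACTOR(ROOT.N_NATIONKEY,'MYPASSDAY1')) N_NAME,
         RANDSTR(uniform(50,100,ROOT.N_NATIONKEY),random()) N_COMMENT
  FROM EDW.DC_DATA_PSA.NATION ROOT
  INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_NATION SYN
    ON ROOT.N_NATIONKEY=SYN.N_NATIONKEY;
  RETURN 'ANONYMIZED NATION GENERATED';
END;
$$
;

Granting USAGE on this procedure to other roles lets them execute it without being able to read its body.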

Knowing the passphrase and the logic of the transformations would allow a malicious person to decipher your information. For more details, check the first part.

Monitoring

Now that we have our synthetic data, let's see whether it makes sense and can be used. You can check whether your synthetic model keeps the referential integrity between the different tables.

So we are going to build a dashboard, similar to the production one, in order to check that the data makes sense.

First tab: DES MONTHLY MKT SEGMENTATION

Query:

SELECT COUNT(1) NBR_OF_ORDERS, COUNT(DISTINCT O.O_CUSTKEY) NBR_OF_CUSTOMERS, SUM(O_TOTALPRICE) TOTAL_PRICE,
C.C_MKTSEGMENT COD_MKT_SEGMENT, DATE_TRUNC('month', O_ORDERDATE) ORDER_MONTH_DATE,
O_ORDERSTATUS STATUS_OF_ORDERS
FROM EDW_DES.DC_DATA_PSA.ORDERS O
INNER JOIN EDW_DES.DC_DATA_PSA.CUSTOMER C
ON O.O_CUSTKEY = C.C_CUSTKEY
GROUP BY ALL
ORDER BY ORDER_MONTH_DATE DESC,COD_MKT_SEGMENT, STATUS_OF_ORDERS;

Snowsight UI Configuration:

Snowsight UI — First Tab DEVELOPMENT MONTHLY MKT SEGMENTATION

Second tab: DES MONTHLY CUSTOMER ORDER STATUS

Query:

SELECT COUNT(1) NBR_OF_ORDERS, COUNT(DISTINCT O.O_CUSTKEY) NBR_OF_CUSTOMERS, 
C.C_MKTSEGMENT COD_MKT_SEGMENT, DATE_TRUNC('month', O_ORDERDATE) ORDER_MONTH_DATE,
O_ORDERSTATUS STATUS_OF_ORDERS
FROM EDW_DES.DC_DATA_PSA.ORDERS O
INNER JOIN EDW_DES.DC_DATA_PSA.CUSTOMER C
ON O.O_CUSTKEY = C.C_CUSTKEY
GROUP BY ALL
ORDER BY ORDER_MONTH_DATE DESC,COD_MKT_SEGMENT, STATUS_OF_ORDERS;

Snowsight UI Configuration:

So, you could replicate both previous charts for PROD (just change the database in your queries to refer to EDW instead of EDW_DES). And finally, you could implement a dashboard like this in order to check and compare that your data makes sense from a technical point of view. For business strategy, only the real PROD data should be considered correct.

Providing different perspectives of your data to all users at any time allows you to accelerate your business.

MONTHLY MKT SEGMENTATION: The amounts are not realistic in the DEVELOPMENT environment. We did not want to apply similar prices, although you could.

MONTHLY CUSTOMER ORDER STATUS: The overall quantities per order status are similar in total. But, looking at each status independently, DEV does not show the same behaviour as PROD.

Renew your synthetic data

If you re-execute your generation code, you will see that the data does not change; this is one of the features of global consistency. But if you change your passphrase and execute it again, you will see that new data has been generated. Your anonymized / synthetic information will still make sense, so it can keep being used for technical purposes only.

SET PASSPHRASE = 'MYPASSDAY2';
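
After setting the new passphrase, the model just needs to be regenerated in the same way; as a sketch, repeating the NATION statements from the previous sections:

--REGENERATE THE KEY MAP AND THE ANONYMIZED TABLE WITH THE NEW PASSPHRASE
CREATE OR REPLACE TEMPORARY TABLE EDW.DC_DATA_PSA.SYN_LIST_NATION
AS
SELECT N_NATIONKEY,
       ROW_NUMBER() OVER (ORDER BY ANONYM_FACTOR(N_NATIONKEY,$PASSPHRASE)) N_NATIONKEY_ANONYNMIZED
FROM EDW.DC_DATA_PSA.NATION;

CREATE OR REPLACE TABLE EDW_DES.DC_DATA_PSA.NATION
AS
SELECT SYN.N_NATIONKEY_ANONYNMIZED N_NATIONKEY,
       RANDSTR(uniform(10,20,ROOT.N_NATIONKEY), ANONYM_FACTOR(ROOT.N_NATIONKEY,$PASSPHRASE)) N_NAME,
       RANDSTR(uniform(50,100,ROOT.N_NATIONKEY),random()) N_COMMENT
FROM EDW.DC_DATA_PSA.NATION ROOT
INNER JOIN EDW.DC_DATA_PSA.SYN_LIST_NATION SYN
  ON ROOT.N_NATIONKEY=SYN.N_NATIONKEY;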

Snowsight UI — DEVELOPMENT information after regenerating the synthetic model with a new Passphrase

Take into consideration that we are only checking, at a general level, that the data makes sense and is correctly anonymized. As we saw in the previous steps, every time you regenerate your synthetic data, all of it is refreshed (attributes, PK fields, etc.), so there is no way to tell which previously generated PK corresponds to the current one. This is one of the main security reasons: it prevents inferring the information by brute force, trying to correlate prior and current data.

Conclusion

The Snowflake anonymization process is very useful to boost your business, but you have to take into consideration some security aspects when generating your data, along with a good strategy to automate and provide this data in the safest way.

As you can see, it has been easy to generate your data and give it a useful meaning. We have shown that this is not a taboo aspect of your Data Business. We only have to check, analyze and apply the correct business transformations, and then you will be able to code your Synthetic Data.

After that, you will have to consider how to orchestrate your Data Generation and Provisioning, using Tasks or external third-party tools, but always use secure procedures to execute the anonymization business logic.
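
As an illustration only (the warehouse name, the schedule and the wrapper procedure name are assumptions, not objects defined in this article), a scheduled Task could drive the whole generation; note that the Sync Key Map and the generation phases must run inside the same session, since the key maps are temporary tables:

--SKETCH: SCHEDULING THE ANONYMIZATION VIA A TASK (WAREHOUSE AND CRON ARE ILLUSTRATIVE)
--SP_RUN_ANONYMIZATION WOULD BE A SECURE WRAPPER RUNNING BOTH PHASES IN ONE SESSION
CREATE OR REPLACE TASK EDW.DC_DATA_PSA.TSK_GEN_ANONYM_DATA
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL EDW.DC_DATA_PSA.SP_RUN_ANONYMIZATION();

--TASKS ARE CREATED SUSPENDED; RESUME TO ACTIVATE THE SCHEDULE
ALTER TASK EDW.DC_DATA_PSA.TSK_GEN_ANONYM_DATA RESUME;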

Please pay attention to the previous article; it will be useful for understanding most of the aspects taken into consideration here.

About me

Subject Matter Expert on different Data Technologies, with 20+ years of experience in Data Adventures. I am a Snowflake Architect, Snowflake Spotlight Squad Team Member and Snowflake Barcelona User Group — Chapter.

As a Data Vault Certified Practitioner, I have been leading Data Vault Architectures using Metadata Driven methodologies.

If you want to know more in detail about the aspects seen here, or other ones, you can follow me on Medium || LinkedIn here.

I hope you have enjoyed this and that it helps you!

Cesar Segura