Data Masking with Tokenization using Google Cloud DLP and Google Cloud Dataflow

Ardhanyoga
11 min read · May 25, 2022


What

Cloud DLP is one of Google Cloud’s tools for inspecting and de-identifying any data suspected to contain sensitive information. In most cases, it is used to prevent PII data from being accessed by external parties.

Cloud Dataflow is one of Google Cloud’s data pipeline tools, providing ETL pipelines for data transformation. It supports autoscaling and both batch and streaming processing. In most cases, it is used to ingest live, streamed data and transform it into useful information.

In this post, we will try to automate the data masking process with Cloud DLP by leveraging Cloud Dataflow as the transformation tool.

Why

Most companies feel the need to protect or hide sensitive information contained in their data. This comes from the need to protect their customers’ information and/or to comply with regulations (e.g. HIPAA, the ISO 27000 series, etc).

Cloud DLP provides a solution to this problem. Basically, Cloud DLP provides 3 main features: data inspection, to identify sensitive information within documents or images; data de-identification, to mask sensitive information within a document so that it stays hidden; and re-identification risk measurement, to analyze sensitive data for properties that might increase the risk of subjects being identified, or of sensitive information about individuals being revealed. To address the problem stated earlier, we will focus on the first two features: data inspection and de-identification.

Cloud DLP uses infoTypes to recognize sensitive data that might be found within documents. An infoType is a set of characteristics that matches the format of a specific type of data. For example, there is an infoType to recognize data that is most likely a passport ID number based on its unique format. Currently, Google Cloud DLP provides hundreds of built-in infoTypes. We can also create our own custom infoType using a regex if the data we want to identify has a format that is not covered by the built-in list.

Example of Google Cloud DLP InfoType
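To make that concrete, below is a minimal sketch of an inspection configuration combining the built-in NIK infoType with a hypothetical custom regex infoType. The custom infoType name and the regex pattern are made up for illustration and are not part of the demo that follows.

# Hypothetical inspectConfig: built-in INDONESIA_NIK_NUMBER plus a custom
# regex-based infoType (CUSTOM_NIK_REGEX and its pattern are illustrative only).
cat > inspect_config.json <<'EOF'
{
  "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
  "customInfoTypes": [
    {
      "infoType": {"name": "CUSTOM_NIK_REGEX"},
      "regex": {"pattern": "317508[0-9]{10}"},
      "likelihood": "POSSIBLE"
    }
  ],
  "minLikelihood": "POSSIBLE"
}
EOF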

The process of inspecting and de-identifying data in Cloud DLP is basically composed of these steps:
1. Provide the document to inspect
2. Run a Cloud DLP job to inspect it
3. Cloud DLP will then identify and mask any sensitive data based on the infoTypes included in the job configuration
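For a one-off document, these three steps map to a single call to the DLP content:inspect API. Below is a rough sketch; the project ID and sample value are placeholders, and the response fields may differ from what you see in the console.

# Inspect a small piece of text for Indonesian NIK numbers (sketch only).
curl "https://dlp.googleapis.com/v2/projects/[project-id]/content:inspect" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "item": {"value": "NIK: 3175082109050005"},
    "inspectConfig": {
      "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
      "minLikelihood": "POSSIBLE"
    }
  }'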

While the steps above are simple, the problem arises when we have to continuously inspect and mask incoming live data. We need some automation to continuously ingest incoming data, send it to Cloud DLP, and then publish the de-identified data somewhere else. Google Cloud Dataflow can address this problem.

How

Below is the architecture that we will try in this post:

Architecture Diagram

At a high level, this is what will happen:

  1. Data is uploaded to Google Cloud Storage. We will prepare multiple .csv files containing sensitive and non-sensitive data.
  2. These .csv files are then ingested into the Cloud Dataflow pipeline.
  3. In Cloud Dataflow, we create a pipeline that ingests the .csv files from Google Cloud Storage, sends them to Cloud DLP for de-identification, then publishes them as tables containing de-identified data in BigQuery.
  4. Dataflow sends the files to Cloud DLP.
  5. Cloud DLP then performs inspection and de-identification on those files. In Cloud DLP, we specify inspection and de-identification templates describing what kind of data needs to be inspected and de-identified, and how to de-identify it. We will use tokenization, with Google Cloud KMS providing the encryption key used to de-identify the data.
  6. Cloud DLP then returns the de-identified data to Cloud Dataflow.
  7. Cloud Dataflow sends it to BigQuery. Each .csv file should be exported as one table in BigQuery.

And this is how we do it:

I have created 3 sample .csv files, containing sensitive and non-sensitive information. In this example, I will use DLP to inspect data containing Indonesia’s Single Identity Number (Nomor Induk Kependudukan / NIK). A NIK contains 16 numeric digits; in these samples it starts with the region code 317508, followed by the person’s date-month-year of birth and a sequence number. For example: 3175082109050005. You can download these 3 csv files here: https://github.com/ardhnyg/medium_post_week4_may.
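As a quick local sanity check of that 16-digit format (purely illustrative; this is not how DLP matches the infoType), a simple regex is enough:

# Print anything in the sample that looks like a NIK starting with 317508
# (the 6-digit region code followed by 10 more digits).
grep -Eo '317508[0-9]{10}' valid_sample.csv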

The first csv file is valid_sample.csv. It contains a list of valid NIKs only.

valid_sample.csv

The second csv file is invalid_sample.csv. All NIKs within this file are in the wrong format.

invalid_sample.csv

The third csv file is mixed_sample.csv. It contains only one valid NIK, while the rest are in the wrong format.

mixed_sample.csv

Then I uploaded all these files to Google Cloud Storage. I have a bucket called bucketdemoardhan, which I will use as the source for Dataflow ingestion.

Sample files in bucket
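If you prefer the CLI to the console, the upload could look something like this (using my bucket name):

# Copy the three sample files into the Cloud Storage bucket.
gsutil cp valid_sample.csv invalid_sample.csv mixed_sample.csv gs://bucketdemoardhan/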

After the source is ready, I need to create templates in Cloud DLP to inspect and de-identify the data. Since I want to de-identify the data using tokenization, I first have to create an encryption key in Google Cloud KMS and then wrap it. I have created a key ring in Google Cloud KMS named dlp-key-ring and a key in that key ring called dlp-key. We can create a key ring in Cloud KMS by going to Google Cloud Console -> Navigation Menu -> Security -> Key Management -> Create Key Ring. After you create a key ring, click it, then click Create Key.

Key Ring
Key
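The same key ring and key can also be created from the CLI; a short sketch, assuming the global location used throughout this demo:

# Create the key ring and a symmetric encryption key in Cloud KMS.
gcloud kms keyrings create dlp-key-ring --location global
gcloud kms keys create dlp-key \
  --keyring dlp-key-ring \
  --location global \
  --purpose encryption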

We can’t hand this key to DLP directly: de-identification using tokenization expects a wrapped data key, i.e. an AES key that has been encrypted (wrapped) by the Cloud KMS key. In this example, I generate the data key with openssl in Google Cloud Shell and then wrap it with the KMS key, but you can use any other method.

In Cloud Shell, I do this:

Create an AES-256 key:

openssl rand -out "./aes_key.bin" 32

Encode it in base64:

base64 -i ./aes_key.bin

You will get a random base64 string as output. Save this value, since this is the base64-encoded key that will be wrapped. For example, what I get is kvnzjh56=+1ca/lgx+

Then, use the Cloud KMS key that we created to wrap the base64-encoded key:

curl "https://cloudkms.googleapis.com/v1/projects/service-project-ardhan-1/locations/global/keyRings/dlp-key-ring/cryptoKeys/dlp-key:encrypt" \
--request "POST" \
--header "Authorization:Bearer $(gcloud auth application-default print-access-token)" \
--header "content-type: application/json" \
--data "{\"plaintext\": \"kvnzjh56=+1ca/lgx+\"}"

Remember to replace the project ID, key ring name, key name, and base64-encoded key in the request above with your own values.

You will get a result like this:

{
"name": "projects/service-project-ardhan-1/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key/cryptoKeyVersions/1",
"ciphertext": "S/wD/lyyi0kLN8W+iXG5WkgpD9TV7mCoaEASwaquupm1bUZ3UQK20kXCvnD8UY1WTKG5Z0BSA5JuLbHXZ3uOHh8OMo=jj3sadqEhq7aOPfPBoQCGYWkZHx2sF63oYl8BH9IE2w8YGHyrs8bJZddDCJmD",
"ciphertextCrc32c": "901327763",
"protectionLevel": "SOFTWARE"
}

Save the ciphertext string, as we will use it as the Wrapped key in our DLP de-identification template.

After we have the wrapped key, it’s time to configure Cloud DLP. Basically, we have to create two things in DLP for this demo: an inspection template, which defines the infoTypes to match against sensitive information in the samples, and a de-identification template, which defines how to transform the data so that the sensitive information cannot be extracted.

To create the inspection template, go to Navigation Menu -> Security -> Data Loss Prevention -> Configuration -> Templates -> Inspect -> Create Template. Type in your inspection template ID and name, and make sure that for InfoType you choose INDONESIA_NIK_NUMBER. Choose the minimum likelihood required for data to match this infoType; you can pick one of several values: Very Unlikely, Unlikely, Possible, Likely, Very Likely, ordered from most likely to produce false positives to least likely. I choose Possible in this example. If you want to identify other kinds of data, you can choose other infoTypes, and you can put multiple infoTypes in the same template to identify multiple kinds of sensitive information within the same document. In this example, I create an inspection template called inspect-nik.

Inspection Template Definition
Detection configuration
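For reference, roughly the same inspection template could be created through the DLP REST API instead of the console. This is a sketch using my project ID and template ID; the field names follow the v2 API, but double-check them against the current documentation.

# Create an inspection template that looks for Indonesian NIK numbers.
curl "https://dlp.googleapis.com/v2/projects/service-project-ardhan-1/inspectTemplates" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "templateId": "inspect-nik",
    "inspectTemplate": {
      "displayName": "inspect-nik",
      "inspectConfig": {
        "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
        "minLikelihood": "POSSIBLE"
      }
    }
  }'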

For the de-identification template, go to Navigation Menu -> Security -> Data Loss Prevention -> Configuration -> Templates -> De-identify -> Create Template.

Define the template name and the data transformation type: InfoType means your data will be treated as free-form text, while Record treats your data as structured data. In this example, I choose InfoType and create a template called deid-template.

Deidentification Template Definition

Still in the same wizard, in the Configure de-identification section, choose the transformation method. Since we want to tokenize the sensitive data using the encryption key from Google Cloud KMS, choose Pseudonymize (cryptographic deterministic token). In Key options, choose KMS wrapped crypto key and specify the KMS key and wrapped key that you created in the previous steps. For Surrogate infoType, type in the characters that will prefix your tokenized data. For example, I type in “NIK”, so I can recognize any tokenized value by its “NIK” prefix. In InfoTypes to transform, choose Specify infoTypes and select INDONESIA_NIK_NUMBER.

Transformation method, KMS key, wrapped key and surrogate character(s)
InfoTypes to transform

As you can see in the example above, the transformed sample will always start with “NIK”, since I put “NIK” in the Surrogate field.
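Under the hood, the template built by this wizard corresponds roughly to the API payload below. This is only a sketch: the wrapped key placeholder stands for the ciphertext from the KMS step, and the field names should be verified against the v2 API reference.

# Create a de-identification template that tokenizes NIK values with a
# deterministic crypto transformation backed by the KMS-wrapped key.
curl "https://dlp.googleapis.com/v2/projects/service-project-ardhan-1/deidentifyTemplates" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "templateId": "deid-template",
    "deidentifyTemplate": {
      "deidentifyConfig": {
        "infoTypeTransformations": {
          "transformations": [{
            "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
            "primitiveTransformation": {
              "cryptoDeterministicConfig": {
                "cryptoKey": {
                  "kmsWrapped": {
                    "wrappedKey": "[your-wrapped-key-ciphertext]",
                    "cryptoKeyName": "projects/service-project-ardhan-1/locations/global/keyRings/dlp-key-ring/cryptoKeys/dlp-key"
                  }
                },
                "surrogateInfoType": {"name": "NIK"}
              }
            }
          }]
        }
      }
    }
  }'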

At this point, the Cloud DLP configuration is finished. Before wiring it into Dataflow, we have to create a BigQuery dataset as the destination of the transformed data pushed by Dataflow. Go to Navigation Menu -> BigQuery (in the Analytics section) -> SQL Workspace. Click your project ID in BigQuery, click the three-dots (actions) icon, and choose Create dataset. In this example, I have created a dataset called demoardhan.

Create BigQuery Dataset
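The CLI equivalent is a one-liner (dataset name as in my example; the dataset is created in the default location unless you specify one):

# Create the BigQuery dataset that Dataflow will write the tables into.
bq mk --dataset service-project-ardhan-1:demoardhan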

Now it’s time to put them all together in Cloud Dataflow. Go to Navigation Menu -> Dataflow (in Analytics section).

Thankfully, Dataflow has many built-in templates that we can use, so we don’t have to create the pipeline manually. One of these templates is called Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP).

In the Dataflow window, click Create Job From Template. In the job creation window, specify the job name (in this case, my job name is dataflow-dlp-deid-demo), and choose the Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) template.

Dataflow JobTemplate Selection

Still in the same configuration window, in the Required parameters section, type in:
1. gs://[yourbucketname]/*.csv in the Input Cloud Storage File(s) field, since we want to process any csv files (hence *.csv) contained in the bucket. In this example, I put gs://bucketdemoardhan/*.csv.
2. [Your-BigQuery-dataset-name] in the BigQuery dataset field. In this example, I put demoardhan, since that is the name of my dataset.
3. Your project ID in the Cloud DLP Project ID field.
4. Your de-identification template full name (go back to the DLP menu, click your de-identification template, and copy the full name shown under the template name). It has a format like this: projects/[project-id]/locations/[location]/deidentifyTemplates/[template-name]

Template full name

  5. A bucket location and filename prefix to save temporary files that have been masked before Dataflow sends them to BigQuery. These files are only temporary, so you won’t see them after the transformed data is pushed to BigQuery.

This is my configuration example:

Dataflow required parameters

Then, click Show Optional Parameters. Under Cloud DLP Inspect Template Name, type in your inspection template full name (same format as the de-identification template full name). Below is an example of my inspection template full name.

Inspection template full name
Dataflow optional parameters

Then, click Run Job to start the job.
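If you would rather launch the same job from the CLI, it would look roughly like the sketch below. The template path, region (us-central1 here) and parameter names (including batchSize, the number of rows sent to DLP per request) are assumptions to verify against the current template documentation.

# Launch the Data Masking/Tokenization classic template from the CLI (sketch).
gcloud dataflow jobs run dataflow-dlp-deid-demo \
  --gcs-location=gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
  --region=us-central1 \
  --parameters="inputFilePattern=gs://bucketdemoardhan/*.csv,datasetName=demoardhan,dlpProjectId=service-project-ardhan-1,deidentifyTemplateName=projects/service-project-ardhan-1/locations/global/deidentifyTemplates/deid-template,inspectTemplateName=projects/service-project-ardhan-1/locations/global/inspectTemplates/inspect-nik,batchSize=100"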

Result

The Dataflow job will take some time to push data to BigQuery. In my case, it took around 5 minutes until I saw the transformed data in the BigQuery dataset. We can see the job progress by going to the Dataflow job that we just created and clicking Show in the Logs section. There are two kinds of logs, Job Logs and Worker Logs. Follow the progress in these logs.

Dataflow jobs progression

For example, in the Worker Logs I see the log below, meaning that Dataflow has found my 3 sample .csv files in the bucket and will keep polling for new data every 30000 ms (30 seconds) as long as this Dataflow job is running. This is how Dataflow processes live, streamed data.

Dataflow logs

In the BigQuery dataset, I get 3 new tables, called valid_sample, invalid_sample, and mixed_sample, since each processed .csv file is written as one table.

BigQuery Table

We can query each table in BigQuery using standard SQL. Click the Action button on the table that you want to query, and type in a basic SQL query to retrieve the table content. For example, this is my query for valid_sample: SELECT * FROM `service-project-ardhan-1.demoardhan.valid_sample` LIMIT 1000.

Below is the content of each table:

valid_sample query result
mixed_sample query result
invalid_sample query result

As you can see from the pictures above, all of them show the correct results:

DLP encrypted every NIK in the valid_sample table, since all of them are valid. DLP encrypted only one NIK in the mixed_sample table and left the others as-is, since only one NIK in mixed_sample is valid. DLP didn’t encrypt any NIK in invalid_sample, since all of them are invalid.

To test that this solution de-identifies data live, we can create a new, fourth sample called live_data.csv containing 2 entries, one with a valid NIK and one with an invalid NIK.

live_data.csv

Then we upload it to the GCS bucket. This csv should be ingested by Dataflow right away, transformed into its tokenized version, and pushed into the BigQuery dataset as a new table (live_data) without us having to recreate the Dataflow job.

New .csv file (live_data.csv) has been uploaded into bucket.
Dataflow log, stating it detected 1 new .csv file (returned 4 results, of which 1 were new).
New BigQuery table — live_data
live_data table content

Note: If you run this demo in your own lab/demo account, make sure to stop the Dataflow job after you finish, so it stops charging you.

Stop the job
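From the CLI, the streaming job can be stopped with something like this (the job ID is shown on the job details page; use the region you launched in):

# Cancel the running Dataflow job (use "drain" instead of "cancel" if you
# want in-flight data to finish processing first).
gcloud dataflow jobs cancel [JOB_ID] --region=us-central1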

For more info on Dataflow pricing, see https://cloud.google.com/dataflow/pricing.

TL;DR

By combining Cloud DLP and Cloud Dataflow, we get continuous, real-time inspection and de-identification of our sensitive data, providing a more secure environment with less manual work.


Ardhanyoga

Any posts and articles published on this platform are my own and do not necessarily reflect the views or positions of any entities I represent.