Data Masking with Tokenization using Google Cloud DLP and Google Cloud Dataflow

Ardhanyoga
11 min read · May 25, 2022


What

Cloud DLP is one of Google Cloud’s tools for inspecting and de-identifying any data suspected to contain sensitive information. In most cases, it is used to prevent PII data from being accessed by external parties.

Cloud Dataflow is one of Google Cloud’s data pipeline tools, providing ETL pipelines for data transformation. It supports autoscaling and both batch and streaming processing. In most cases, it is used to ingest live, streamed data and transform it into useful information.

In this post, we will try to automate the data masking process with Cloud DLP by leveraging Cloud Dataflow as the transformation tool.

Why

Most companies feel the need to protect or hide sensitive information contained in their data. This comes from the need to protect their customers’ information and/or to comply with regulations (e.g. HIPAA, the ISO 27000 series, etc).

Cloud DLP provides a solution to this problem. Basically, Cloud DLP provides 3 main features: data inspection, to identify sensitive information within documents or images; data de-identification, to mask sensitive information within a document so that it stays hidden; and re-identification risk measurement, to analyze sensitive data for properties that might increase the risk of subjects being identified, or of sensitive information about individuals being revealed. To address the problem stated earlier, we will focus on the first two features: data inspection and de-identification.

Cloud DLP uses infoTypes to recognize sensitive data that might be found within documents. An infoType is a set of characteristics that matches the format of a specific type of data. For example, there is an infoType to recognize data that is most likely a passport ID number based on its unique format. Currently, Google Cloud DLP provides hundreds of built-in infoTypes. We can also create our own custom infoType using a regex if the data we want to identify has a format that is not covered by the built-in list.

Example of Google Cloud DLP InfoType
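To make that concrete, below is a minimal sketch of an inspection configuration combining the built-in NIK infoType with a hypothetical custom regex infoType. The custom infoType name and the regex pattern are made up for illustration and are not part of the demo that follows.

# Hypothetical inspectConfig: built-in INDONESIA_NIK_NUMBER plus a custom
# regex-based infoType (CUSTOM_NIK_REGEX and its pattern are illustrative only).
cat > inspect_config.json <<'EOF'
{
  "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
  "customInfoTypes": [
    {
      "infoType": {"name": "CUSTOM_NIK_REGEX"},
      "regex": {"pattern": "317508[0-9]{10}"},
      "likelihood": "POSSIBLE"
    }
  ],
  "minLikelihood": "POSSIBLE"
}
EOF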

The process of inspecting and de-identifying data in Cloud DLP is basically composed of these steps:
1. Provide the document to inspect
2. Run a Cloud DLP job to inspect it
3. Cloud DLP will then identify and mask any sensitive data based on the infoTypes included in the job configuration
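For a one-off document, these three steps map to a single call to the DLP content:inspect API. Below is a rough sketch; the project ID and sample value are placeholders, and the response fields may differ from what you see in the console.

# Inspect a small piece of text for Indonesian NIK numbers (sketch only).
curl "https://dlp.googleapis.com/v2/projects/[project-id]/content:inspect" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "item": {"value": "NIK: 3175082109050005"},
    "inspectConfig": {
      "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
      "minLikelihood": "POSSIBLE"
    }
  }'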

While the steps above are simple, the problem arises when we have to continuously inspect and mask incoming live data. We need some automation to continuously ingest incoming data, send it to Cloud DLP, and then publish the de-identified data somewhere else. Google Cloud Dataflow can address this problem.

How

Below is the architecture that we will try in this post:

Architecture Diagram

At a high level, this is what will happen:

  1. Data is uploaded to Google Cloud Storage. We will prepare multiple .csv files containing sensitive and non-sensitive data.
  2. These .csv files are then ingested into the Cloud Dataflow pipeline.
  3. In Cloud Dataflow, we create a pipeline that ingests the .csv files from Google Cloud Storage, sends them to Cloud DLP for de-identification, then publishes them as tables containing de-identified data in BigQuery.
  4. Dataflow sends the files to Cloud DLP.
  5. Cloud DLP then performs inspection and de-identification on those files. In Cloud DLP, we specify inspection and de-identification templates describing what kind of data needs to be inspected and de-identified, and how to de-identify it. We will use tokenization, with Google Cloud KMS providing the encryption key used to de-identify the data.
  6. Cloud DLP then returns the de-identified data to Cloud Dataflow.
  7. Cloud Dataflow sends it to BigQuery. Each .csv file should be exported as one table in BigQuery.

And this is how we do it:

I have created 3 sample .csv files, containing sensitive and non-sensitive information. In this example, I will use DLP to inspect data containing Indonesia’s Single Identity Number (Nomor Induk Kependudukan / NIK). A NIK contains 16 numeric digits; in these samples it starts with the region code 317508, followed by the person’s date-month-year of birth and a sequence number. For example: 3175082109050005. You can download these 3 csv files here: https://github.com/ardhnyg/medium_post_week4_may.
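As a quick local sanity check of that 16-digit format (purely illustrative; this is not how DLP matches the infoType), a simple regex is enough:

# Print anything in the sample that looks like a NIK starting with 317508
# (the 6-digit region code followed by 10 more digits).
grep -Eo '317508[0-9]{10}' valid_sample.csv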

The first csv file is valid_sample.csv. It contains a list of valid NIKs only.

valid_sample.csv

The second csv file is invalid_sample.csv. All NIKs within this file are in the wrong format.

invalid_sample.csv

The third csv file is mixed_sample.csv. It contains only one valid NIK, while the rest are in the wrong format.

mixed_sample.csv

Then I uploaded all these files to Google Cloud Storage. I have a bucket called bucketdemoardhan, which I will use as the source for Dataflow ingestion.

Sample files in bucket
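If you prefer the CLI to the console, the upload could look something like this (using my bucket name):

# Copy the three sample files into the Cloud Storage bucket.
gsutil cp valid_sample.csv invalid_sample.csv mixed_sample.csv gs://bucketdemoardhan/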

After the source is ready, I need to create templates in Cloud DLP to inspect and de-identify the data. Since I want to de-identify the data using tokenization, I first have to create an encryption key in Google Cloud KMS and then wrap it. I have created a key ring in Google Cloud KMS named dlp-key-ring and a key in that key ring called dlp-key. We can create a key ring in Cloud KMS by going to Google Cloud Console -> Navigation Menu -> Security -> Key Management -> Create Key Ring. After you create a key ring, click it, then click Create Key.

Key Ring
Key
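The same key ring and key can also be created from the CLI; a short sketch, assuming the global location used throughout this demo:

# Create the key ring and a symmetric encryption key in Cloud KMS.
gcloud kms keyrings create dlp-key-ring --location global
gcloud kms keys create dlp-key \
  --keyring dlp-key-ring \
  --location global \
  --purpose encryption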

We can’t hand this key to DLP directly: de-identification using tokenization expects a wrapped data key, i.e. an AES key that has been encrypted (wrapped) by the Cloud KMS key. In this example, I generate the data key with openssl in Google Cloud Shell and then wrap it with the KMS key, but you can use any other method.

In Cloud Shell, I do this:

Create an AES-256 key:

openssl rand -out "./aes_key.bin" 32

Encode it in base64:

base64 -i ./aes_key.bin

You will get a random base64 string as output. Save this value, since this is the base64-encoded key that will be wrapped. For example, what I get is kvnzjh56=+1ca/lgx+

Then, use the Cloud KMS key that we created to wrap the base64-encoded key:

curl "https://cloudkms.googleapis.com/v1/projects/service-project-ardhan-1/locations/global/keyRings/dlp-key-ring/cryptoKeys/dlp-key:encrypt" \
--request "POST" \
--header "Authorization:Bearer $(gcloud auth application-default print-access-token)" \
--header "content-type: application/json" \
--data "{\"plaintext\": \"kvnzjh56=+1ca/lgx+\"}"

Remember to replace the project ID, key ring name, key name, and base64-encoded key in the request above with your own values.

You will get a result like this:

{
"name": "projects/service-project-ardhan-1/locations/global/keyRings/dlp-keyring/cryptoKeys/dlp-key/cryptoKeyVersions/1",
"ciphertext": "S/wD/lyyi0kLN8W+iXG5WkgpD9TV7mCoaEASwaquupm1bUZ3UQK20kXCvnD8UY1WTKG5Z0BSA5JuLbHXZ3uOHh8OMo=jj3sadqEhq7aOPfPBoQCGYWkZHx2sF63oYl8BH9IE2w8YGHyrs8bJZddDCJmD",
"ciphertextCrc32c": "901327763",
"protectionLevel": "SOFTWARE"
}

Save the ciphertext string, as we will use it as the Wrapped key in our DLP de-identification template.

After we have the wrapped key, it’s time to configure Cloud DLP. Basically, we have to create two things in DLP for this demo: an inspection template, which defines the infoTypes to match against sensitive information in the samples, and a de-identification template, which defines how to transform the data so that the sensitive information cannot be extracted.

To create the inspection template, go to Navigation Menu -> Security -> Data Loss Prevention -> Configuration -> Templates -> Inspect -> Create Template. Type in your inspection template ID and name, and make sure that for InfoType you choose INDONESIA_NIK_NUMBER. Choose the minimum likelihood required for data to match this infoType; you can pick one of several values: Very Unlikely, Unlikely, Possible, Likely, Very Likely, ordered from most likely to produce false positives to least likely. I choose Possible in this example. If you want to identify other kinds of data, you can choose other infoTypes, and you can put multiple infoTypes in the same template to identify multiple kinds of sensitive information within the same document. In this example, I create an inspection template called inspect-nik.

Inspection Template Definition
Detection configuration
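For reference, roughly the same inspection template could be created through the DLP REST API instead of the console. This is a sketch using my project ID and template ID; the field names follow the v2 API, but double-check them against the current documentation.

# Create an inspection template that looks for Indonesian NIK numbers.
curl "https://dlp.googleapis.com/v2/projects/service-project-ardhan-1/inspectTemplates" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "templateId": "inspect-nik",
    "inspectTemplate": {
      "displayName": "inspect-nik",
      "inspectConfig": {
        "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
        "minLikelihood": "POSSIBLE"
      }
    }
  }'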

For the de-identification template, go to Navigation Menu -> Security -> Data Loss Prevention -> Configuration -> Templates -> De-identify -> Create Template.

Define the template name and the data transformation type: InfoType means your data will be treated as free-form text, while Record treats your data as structured data. In this example, I choose InfoType and create a template called deid-template.

Deidentification Template Definition

Still in the same wizard, in the Configure de-identification section, choose the transformation method. Since we want to tokenize the sensitive data using the encryption key from Google Cloud KMS, choose Pseudonymize (cryptographic deterministic token). In Key options, choose KMS wrapped crypto key and specify the KMS key and wrapped key that you created in the previous steps. For Surrogate infoType, type in the characters that will prefix your tokenized data. For example, I type in “NIK”, so I can recognize any tokenized value by its “NIK” prefix. In InfoTypes to transform, choose Specify infoTypes and select INDONESIA_NIK_NUMBER.

Transformation method, KMS key, wrapped key and surrogate character(s)
InfoTypes to transform

As you can see in the example above, the transformed sample will always start with “NIK”, since I put “NIK” in the Surrogate field.
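Under the hood, the template built by this wizard corresponds roughly to the API payload below. This is only a sketch: the wrapped key placeholder stands for the ciphertext from the KMS step, and the field names should be verified against the v2 API reference.

# Create a de-identification template that tokenizes NIK values with a
# deterministic crypto transformation backed by the KMS-wrapped key.
curl "https://dlp.googleapis.com/v2/projects/service-project-ardhan-1/deidentifyTemplates" \
  --request "POST" \
  --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  --header "Content-Type: application/json" \
  --data '{
    "templateId": "deid-template",
    "deidentifyTemplate": {
      "deidentifyConfig": {
        "infoTypeTransformations": {
          "transformations": [{
            "infoTypes": [{"name": "INDONESIA_NIK_NUMBER"}],
            "primitiveTransformation": {
              "cryptoDeterministicConfig": {
                "cryptoKey": {
                  "kmsWrapped": {
                    "wrappedKey": "[your-wrapped-key-ciphertext]",
                    "cryptoKeyName": "projects/service-project-ardhan-1/locations/global/keyRings/dlp-key-ring/cryptoKeys/dlp-key"
                  }
                },
                "surrogateInfoType": {"name": "NIK"}
              }
            }
          }]
        }
      }
    }
  }'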

At this point, the Cloud DLP configuration is finished. Before wiring it into Dataflow, we have to create a BigQuery dataset as the destination of the transformed data pushed by Dataflow. Go to Navigation Menu -> BigQuery (in the Analytics section) -> SQL Workspace. Click your project ID in BigQuery, click the three-dots (actions) icon, and choose Create dataset. In this example, I have created a dataset called demoardhan.

Create BigQuery Dataset
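The CLI equivalent is a one-liner (dataset name as in my example; the dataset is created in the default location unless you specify one):

# Create the BigQuery dataset that Dataflow will write the tables into.
bq mk --dataset service-project-ardhan-1:demoardhan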

Now it’s time to put them all together in Cloud Dataflow. Go to Navigation Menu -> Dataflow (in Analytics section).

Thankfully, Dataflow has many built-in templates that we can use, so we don’t have to create the pipeline manually. One of these templates is called Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP).

In the Dataflow window, click Create Job From Template. In the job creation window, specify the job name (in this case, my job name is dataflow-dlp-deid-demo), and choose the Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) template.

Dataflow JobTemplate Selection

Still in the same configuration window, in the Required parameters section, type in:
1. gs://[yourbucketname]/*.csv in the Input Cloud Storage File(s) field, since we want to process any csv files (hence *.csv) contained in the bucket. In this example, I put gs://bucketdemoardhan/*.csv.
2. [Your-BigQuery-dataset-name] in the BigQuery dataset field. In this example, I put demoardhan, since that is the name of my dataset.
3. Your project ID in the Cloud DLP Project ID field.
4. Your de-identification template full name (go back to the DLP menu, click your de-identification template, and copy the full name shown under the template name). It has a format like this: projects/[project-id]/locations/[location]/deidentifyTemplates/[template-name]

Template full name

  5. A bucket location and filename prefix to save temporary files that have been masked before Dataflow sends them to BigQuery. These files are only temporary, so you won’t see them after the transformed data is pushed to BigQuery.

This is my configuration example:

Dataflow required parameters

Then, click Show Optional Parameters. Under Cloud DLP Inspect Template Name, type in your inspection template full name (same format as the de-identification template full name). Below is an example of my inspection template full name.

Inspection template full name
Dataflow optional parameters

Then, click Run Job to start the job.
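If you would rather launch the same job from the CLI, it would look roughly like the sketch below. The template path, region (us-central1 here) and parameter names (including batchSize, the number of rows sent to DLP per request) are assumptions to verify against the current template documentation.

# Launch the Data Masking/Tokenization classic template from the CLI (sketch).
gcloud dataflow jobs run dataflow-dlp-deid-demo \
  --gcs-location=gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
  --region=us-central1 \
  --parameters="inputFilePattern=gs://bucketdemoardhan/*.csv,datasetName=demoardhan,dlpProjectId=service-project-ardhan-1,deidentifyTemplateName=projects/service-project-ardhan-1/locations/global/deidentifyTemplates/deid-template,inspectTemplateName=projects/service-project-ardhan-1/locations/global/inspectTemplates/inspect-nik,batchSize=100"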

Result

The Dataflow job will take some time to push data to BigQuery. In my case, it took around 5 minutes until I saw the transformed data in the BigQuery dataset. We can see the job progress by going to the Dataflow job that we just created and clicking Show in the Logs section. There are two kinds of logs, Job Logs and Worker Logs. Follow the progress in these logs.

Dataflow jobs progression

For example, in the Worker Logs I see the log below, meaning that Dataflow has found my 3 sample .csv files in the bucket and will keep polling for new data every 30000 ms (30 seconds) as long as this Dataflow job is running. This is how Dataflow processes live, streamed data.

Dataflow logs

In the BigQuery dataset, I get 3 new tables, called valid_sample, invalid_sample, and mixed_sample, since each processed .csv file is written as one table.

BigQuery Table

We can query each table in BigQuery using standard SQL. Click the Action button on the table that you want to query, and type in a basic SQL query to retrieve the table content. For example, this is my query for valid_sample: SELECT * FROM `service-project-ardhan-1.demoardhan.valid_sample` LIMIT 1000.

Below is the content of each table:

valid_sample query result
mixed_sample query result
invalid_sample query result

As you can see from the pictures above, all of them show the correct results:

DLP encrypted every NIK in the valid_sample table, since all of them are valid. DLP encrypted only one NIK in the mixed_sample table and left the others as-is, since only one NIK in mixed_sample is valid. DLP didn’t encrypt any NIK in invalid_sample, since all of them are invalid.

To test that this solution de-identifies data live, we can create a new, fourth sample called live_data.csv containing 2 entries, one with a valid NIK and one with an invalid NIK.

live_data.csv

Then we upload it to the GCS bucket. This csv should be ingested by Dataflow right away, transformed into its tokenized version, and pushed into the BigQuery dataset as a new table (live_data) without us having to recreate the Dataflow job.

New .csv file (live_data.csv) has been uploaded into bucket.
Dataflow log, stating it detected 1 new .csv file (returned 4 results, of which 1 were new).
New BigQuery table — live_data
live_data table content

Note: If you run this demo in your own lab/demo account, make sure to stop the Dataflow job after you finish, so it stops charging you.

Stop the job
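From the CLI, the streaming job can be stopped with something like this (the job ID is shown on the job details page; use the region you launched in):

# Cancel the running Dataflow job (use "drain" instead of "cancel" if you
# want in-flight data to finish processing first).
gcloud dataflow jobs cancel [JOB_ID] --region=us-central1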

For more info on Dataflow pricing, see https://cloud.google.com/dataflow/pricing.

TL;DR

By combining Cloud DLP and Cloud Dataflow, we get continuous, real-time inspection and de-identification of our sensitive data, providing a more secure environment with less manual work.


Ardhanyoga

Any posts and articles published on this platform are my own and do not necessarily reflect the views or positions of any entities I represent.