Data Security in Google Cloud series — Part 1: Physical data encryption of sensitive data using Cloud DLP and KMS

Daniel Villegas
Google Cloud - Community
17 min read · Jun 7, 2023

Many companies moving their data to the cloud are required to do so securely and to ensure that any data containing sensitive information, such as Personally Identifiable Information (PII), is properly handled and de-identified in the cloud using techniques like tokenization, masking, or encryption. Encryption is becoming increasingly important as businesses and individuals collect and store more sensitive data, since it helps protect that data from unauthorized access, use, or disclosure. As more governments enforce these rules through legislation, inspecting data for PII and properly handling and securing it in the cloud becomes of paramount importance. This article shows how to use Google Cloud’s Data Loss Prevention and Key Management Service within a data pipeline, built with either Cloud Data Fusion or Dataflow, to inspect a field in a file in Cloud Storage for a National ID (in this case a Chilean National ID or “Cédula de Identidad”) and encrypt those IDs before inserting them into a BigQuery table. The data is physically encrypted in the table, so the sensitive information remains protected even in the unlikely event of a data exfiltration, and it can later be reversed to its original value using the same KMS key.

Overview

Data Loss Prevention (DLP) is Google Cloud’s suite of services that helps you find, classify, and protect sensitive data across your Google Cloud resources: on streams of data, on structured text data stored in files or in BigQuery tables, and even on unstructured data such as images. DLP can help you comply with data protection regulations, prevent data breaches, and ensure the security of your data.

DLP offers a variety of features to help you protect your data, including:

  • Inspection: DLP can scan your data for sensitive data using a variety of methods, including dictionaries, regular expressions, and contextual elements.
  • Classification: DLP can classify sensitive data into categories, such as financial data, PII data, and intellectual property.
  • De-identification: DLP can remove sensitive data from your data without impacting its utility. This can be done by redacting, masking, or encrypting the data.

DLP uses a variety of methods to identify sensitive data, including:

  • Pattern matching
  • Machine learning
  • Custom dictionaries

Google Cloud DLP also provides a predefined set of categories of sensitive information, called infoTypes, that can be used to create DLP policies. Google provides over 150 built-in infoTypes out of the box, covering things like credit card numbers, Social Security numbers, National IDs, and email addresses.
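For reference, the catalogue of built-in infoTypes can be listed from the DLP REST API; this is a quick sketch from Cloud Shell, filtering for the Chilean detectors used in this article:

# List built-in infoTypes and keep only the Chilean ones (e.g. CHILE_CDI_NUMBER)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dlp.googleapis.com/v2/infoTypes" \
  | grep -o '"name": "CHILE[A-Z_]*"'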

Once sensitive data is identified, DLP can take a variety of actions to protect it, such as data redaction, format-preserving and other types of encryption, data masking, security policies, etc.

Cloud Key Management Service (KMS) is Google’s cloud-based key management service that makes it easy to manage encryption keys for your cloud resources. KMS provides a single place to store, rotate, and audit your encryption keys, and it integrates with other Google Cloud services to make it easy to use encryption in your applications. Cloud KMS supports software-based keys that Google can manage automatically for you, keys managed by you (Customer-Managed Encryption Keys, or CMEK), hardware-backed keys (with HSM as a service), and external key managers that let you import keys created and managed in a third-party key management system and use them in GCP.

Cloud Data Fusion is Google Cloud’s native low/no-code solution for creating ETL/ELT pipelines at any scale. It makes it easy to create and run data wrangling and transformation pipelines by abstracting away the complexity of writing code to run on top of a Cloud Dataproc cluster, providing a point-and-click visual user interface instead.

Dataflow is a fully managed service that provides a unified experience for both batch and streaming data processing. It uses Apache Beam, a unified model for large-scale data processing, to run your pipelines on Google Cloud Platform. Dataflow abstracts away the complexity of managing and scaling infrastructure, so you can focus on writing your pipelines.

Both Data Fusion and Dataflow services described above can leverage the KMS and DLP APIs to create cloud-native, secure data pipelines, with built-in data scanning and data de-identification of sensitive data.

What we will cover

  1. Learn how to use Cloud Data Loss Prevention to Inspect and Redact sensitive data by leveraging Inspection and De-Identification templates.
  2. Learn how to use a KMS-generated symmetric encryption key to encrypt the sensitive data detected by DLP.
  3. Create a pipeline that reads data containing sensitive information from a file in a Cloud Storage bucket, inspects that data for sensitive information defined in a DLP Inspection template, encrypts the findings with an encryption key referenced by a DLP De-identification template, and loads the data, with the encrypted column, into a BigQuery table, using both Cloud Data Fusion and Dataflow. For this exercise we will create two pipelines that perform the same task, a batch pipeline in Data Fusion and a streaming job in Dataflow, just to show how to perform this task with either tool; for your particular use case, choose whichever data transformation tool fits best.

This is what the architecture for this solution looks like:

GCP Services used for the solution:

  • Google Cloud Data Loss Prevention.
  • Google Cloud Key Management Service.
  • Google Cloud Data Fusion.
  • Google Cloud Dataflow.
  • Google Cloud Storage.
  • Google BigQuery

Setup and requirements

Note: A working instance of Cloud Data Fusion is required for the later part of this exercise. You may find an example of how to create a Private instance in a previous article I published here: https://medium.com/google-cloud/connect-private-data-fusion-instance-with-a-private-cloud-sql-instance-using-cloudsql-proxy-caddf4795ac6

You only need the Data Fusion instance, no need to create the Cloud SQL instance or the SQL proxy mentioned in that article for this exercise.

Enable the required APIs by running the following command in Cloud Shell:

gcloud services enable cloudkms.googleapis.com \
cloudscheduler.googleapis.com \
datafusion.googleapis.com \
dataflow.googleapis.com \
compute.googleapis.com \
datapipelines.googleapis.com \
dlp.googleapis.com

Set the initial environment variable by running the following command in Cloud Shell:

export PROJECT_ID=$(gcloud config get-value project)

Steps:

  1. Create a bucket in GCS and upload a file with a column containing sensitive PII data (in this case Chilean National ID numbers, covered by the “CHILE_CDI_NUMBER” infoType in DLP).
  2. Create a KMS key ring and key (using CMEK).
  3. Create the Inspection template in DLP, referencing the infoTypes you want DLP to detect.
  4. Create a wrapped key with the KMS key and create the De-identification template in DLP referencing that KMS key.
  5. Use the Dataflow template to de-identify sensitive data in a file within a GCS bucket: inspect it using the Inspection template, redact the detections by encrypting them with the KMS encryption key, and upload the results to a BigQuery table.
  6. Perform the same tasks as step 5, but using the DLP plugin in Data Fusion.

1) Bucket in GCS and file to be de-identified:

We just create a storage bucket named “datos_dlp” (remember that bucket names must be unique across GCP) to hold the CSV file we will use in the example:

gsutil mb gs://datos_dlp/

The sample file is a simple .csv file containing a list of Chilean public, government-owned companies, and has 3 columns:

  • CDI: Chilean National ID (in Chile, both citizens and private and public companies have a National ID called “Cédula Nacional de Identidad”).
  • NOMBRE: Name of the public company
  • VIGENTE: Boolean field indicating whether the company is still valid.

The file is public and can be found here:

Bucket creation in Google Cloud Storage and sample file upload
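If you want to follow along without the original dataset, a minimal placeholder file with the same three columns can be created and uploaded from Cloud Shell; the rows below are purely illustrative and not real company data:

# Create a tiny CSV with the same schema (placeholder values only)
cat > empresas_sample.csv <<'EOF'
CDI,NOMBRE,VIGENTE
11111111-1,EMPRESA EJEMPLO UNO,true
22222222-2,EMPRESA EJEMPLO DOS,false
EOF

# Upload it to the bucket created above
gsutil cp empresas_sample.csv gs://datos_dlp/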

2) Key ring and encryption key creation in KMS:

Next, we navigate to the Cloud KMS section in GCP, under the Security menu:

KMS navigation

Enable the API (if you haven’t done so already):

Create the key ring; we will name it “dlp_demo_keyring”:

Create the key within the newly created key ring. For this exercise we will create a key with the Software protection level and the Symmetric encrypt/decrypt purpose; leave the rest of the options at their defaults:

Make sure to store the key resource name; you will need it later when creating the De-identification template in DLP. It usually has the form projects/[YOUR_PROJECT_ID]/locations/[region]/keyRings/[YOUR_KEYRING_NAME]/cryptoKeys/[YOUR_KEY_NAME]:

In my case it’s:

projects/dlp-demo-338019/locations/global/keyRings/dlp_demo_keyring/cryptoKeys/dlp_key
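If you prefer the command line, the same key ring and key can be created with gcloud (the names and the global location match the console steps above), and the full resource name can be printed afterwards:

# Create the key ring in the global location
gcloud kms keyrings create dlp_demo_keyring --location global

# Create a software-protected symmetric encrypt/decrypt key in that key ring
gcloud kms keys create dlp_key \
  --keyring dlp_demo_keyring \
  --location global \
  --purpose encryption

# Print the key resource name needed later by the De-identification template
gcloud kms keys describe dlp_key \
  --keyring dlp_demo_keyring \
  --location global \
  --format="value(name)"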

3) Create the Inspection Template:

Enable the DLP API (if you haven’t done so already):

In DLP, under the Security menu, head to CONFIGURATION→ TEMPLATES→ INSPECT:

Click CREATE TEMPLATE:

Provide a Template ID and Display name, and hit CONTINUE:

Click on MANAGE INFOTYPES and select the built-in or custom infoTypes you want to use for the inspections. In this example we’re using the built-in “CHILE_CDI_NUMBER” infoType. To make sure DLP picks up any value that might be flagged as a Chilean CDI, we set the Likelihood to “Unlikely” so that DLP’s sensitivity is high enough to detect any Chilean ID regardless of context; if we set it to anything above “Likely”, we would have to provide context to DLP when providing an unformatted Chilean ID. This setting is particular to this infoType:

Search for “CHILE_CDI_NUMBER” in the list of built-in infotypes, then click DONE:

Click DONE (remember to set the likelihood correctly; this is how it should look), and then click CREATE:

Your Inspection template has been created; now we can use this template to look for Chilean IDs anywhere in this project:
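You can also exercise the template from Cloud Shell by calling the DLP content:inspect REST method; this is just a sketch, with the inspect template ID and the sample value as placeholders:

# Inspect a sample value against the Inspection template (replace the template ID placeholder)
curl -s -X POST \
  "https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/content:inspect" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{
        \"inspectTemplateName\": \"projects/${PROJECT_ID}/inspectTemplates/[YOUR_INSPECT_TEMPLATE_ID]\",
        \"item\": { \"value\": \"11111111-1\" }
      }"

If the template and likelihood are configured as described above, the response should include a finding with the CHILE_CDI_NUMBER infoType.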

4) Create the wrapped key and the De-Identify template:

Now we need to create the De-identification template in DLP, to tell it what we want it to do with the findings we defined in the inspection template.

Head to CONFIGURATION→ TEMPLATES→ DE-IDENTIFY and click on CREATE TEMPLATE:

Select the De-identify template type from the drop-down:

Set the Template ID and Display name (in this case rut_deid) and hit CONTINUE:

Select Pseudonymize (cryptographic deterministic token) in the Transformation drop-down:

Now you’re going to need the resource name of the key created earlier, paste it in the Crypto key resource name field under Key Options:

In the Wrapped key field you need to provide a key that has been encrypted with the KMS encryption key created earlier; we will generate that in Cloud Shell now:

export string=`openssl rand -base64 24`

Encrypt the generated base64 key with the dlp_key created and stored in KMS, and write the output to a file, by running the following command in Cloud Shell:

echo -n $string | gcloud kms encrypt --location global --keyring dlp_demo_keyring --key dlp_key --plaintext-file - --ciphertext-file ./ciphertext

Now encode it to base64 and write the wrapped key to a file named “wrappedKey”, from which you can copy it later:

base64 -w 0 ciphertext > wrappedKey


Open the wrappedKey file (with vim or cat), copy its contents, and paste it into the Wrapped key field under Key options in the DLP De-ID template creation wizard. In the Surrogate infoType field you may put a value that will be prepended to the encrypted field; we’ll use RUN:

Now we have to select the infoTypes we want to encrypt: select the Specify infoTypes radio button, then click on MANAGE INFOTYPES:

Select the proper infoType (same as the previous step, “CHILE_CDI_NUMBER”) and hit DONE:

Review your configuration and click CREATE:

You have successfully created the De-identification template. You can test it by going to the TEST tab in the template details and providing a valid Chilean ID in clear text; if everything works properly, you should see it being encrypted on the fly:
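The same test can be scripted against the content:deidentify REST method, referencing both templates (rut_deid is the De-identification template created above; the inspect template ID and the sample value are placeholders):

# De-identify a sample value using both the inspect and de-identify templates
curl -s -X POST \
  "https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/content:deidentify" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{
        \"deidentifyTemplateName\": \"projects/${PROJECT_ID}/deidentifyTemplates/rut_deid\",
        \"inspectTemplateName\": \"projects/${PROJECT_ID}/inspectTemplates/[YOUR_INSPECT_TEMPLATE_ID]\",
        \"item\": { \"value\": \"11111111-1\" }
      }"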

5) Create a Dataflow job to run the de-identification on the sample file stored in GCS and store the output in BigQuery:

Note on IAM roles required for the pipelines to run successfully:

For the Data Fusion pipelines, make sure the service account used by Data Fusion and Dataproc (the default Compute Engine service account is used here for simplicity, which is not a recommended practice for production environments) has the required DLP roles (the DLP Administrator role):

https://cloud.google.com/data-fusion/docs/how-to/using-dlp

These are the roles that the service account used by Dataproc should have in order to execute both the Dataflow job and the Data Fusion pipelines correctly (again, in this case we used the default Compute Engine service account for both Dataflow and Dataproc, which is not necessarily recommended for production deployments):

Pre-requisites:

Create the Dataset in BigQuery (don’t worry about the Table, Dataflow will create it for you):
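If you prefer the CLI, the dataset can also be created with bq; the dataset name dlp_dataset below is just an assumption for this walkthrough, so use whatever name you will pass to the Dataflow template:

# Create the output dataset (name and location are assumptions)
bq mk --location=US --dataset "${PROJECT_ID}:dlp_dataset"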

Head to the Dataflow section and enable the required APIs in Dataflow if you have not done so already:

Create a Dataflow job from a Dataflow template:

Name it and select the template Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP):

In the template parameters, set:

  • the Cloud Storage location of the sample data,
  • the BigQuery dataset name,
  • the project ID where the Cloud DLP de-identification template was created (in this case, the same project we’re working on),
  • a Cloud Storage path for Dataflow temporary processing, and
  • the Cloud DLP de-identify template resource name, which you can find in the De-ID template definition in the DLP section:

Scroll down and expand the OPTIONAL PARAMETERS section; you also need to provide the Cloud DLP inspect template name we created in step 3:

Click RUN JOB.
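Alternatively, the same classic template can be launched from the command line. This is only a sketch: the region, temp location, dataset name, and inspect template ID are assumptions for this walkthrough, and the parameter names follow the template’s documentation at the time of writing:

# Launch the DLP Cloud Storage-to-BigQuery masking/tokenization template (classic, streaming)
gcloud dataflow jobs run dlp-deid-to-bq \
  --region us-central1 \
  --gcs-location gs://dataflow-templates-us-central1/latest/Stream_DLP_GCS_Text_to_BigQuery \
  --staging-location gs://datos_dlp/temp \
  --parameters "inputFilePattern=gs://datos_dlp/*.csv,datasetName=dlp_dataset,batchSize=100,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=projects/${PROJECT_ID}/deidentifyTemplates/rut_deid,inspectTemplateName=projects/${PROJECT_ID}/inspectTemplates/[YOUR_INSPECT_TEMPLATE_ID]"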

After a few minutes you should see that a table has been created in the dataset you provided earlier, with the same schema as the input file, in which the values in the first field (the one containing Chilean IDs) have been encrypted:

Note that this Dataflow template creates a streaming job that is constantly listening to new files in the input bucket location. Now you can just drop new files in that bucket, and Dataflow will scan the files with DLP and encrypt any Chilean IDs it finds prior to uploading them to BigQuery.

Permissions NOTE:

Workflow failed. Causes: Permissions verification for controller service account failed. All permissions in IAM role roles/dataflow.worker should be granted to controller service account 966479050221-compute@developer.gserviceaccount.com.

If you get an error like the one above, make sure the Compute Engine service account has been granted the proper roles (in this case roles/dataflow.worker) so that it can act as the job’s controller service account.

Permissions NOTE:

If you see the warning below in the Dataflow job logs, you may find that the job hangs and does not process any data:

The network default doesn’t have rules that open TCP ports 12345–12346 for 
internal connection with other VMs.
Only rules with a target tag ‘dataflow’ or empty target tags set apply.
If you don’t specify such a rule, any pipeline with more than one worker
that shuffles data will hang.
Causes: No firewall rules associated with your network.

To resolve this, create a firewall rule to allow internal connectivity between the worker instances in your VPC:
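Following Google’s documented firewall guidance for Dataflow, a rule like the one below (the rule name is an assumption) opens the internal ports between workers on the default network:

# Allow Dataflow workers (tagged "dataflow") to talk to each other on the shuffle ports
gcloud compute firewall-rules create allow-dataflow-internal \
  --network default \
  --direction INGRESS \
  --action allow \
  --rules tcp:12345-12346 \
  --source-tags dataflow \
  --target-tags dataflow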

6) Creating a Cloud Data Fusion pipeline to find and encrypt RUTs.

Install the DLP plugin in Cloud Data Fusion:

In the Data Fusion UI, click on the HUB link in the top-right menu, then go to Plugins and search for Data Loss Prevention in the search bar:

Click the Deploy button and then the Finish button; there is no need to upload any .jar file.

Once finished, create a new pipeline:

From the source section, add GCS to the canvas, point to the raw file in the GCS landing bucket and rename the output columns:

If you installed the DLP plugin correctly, you will see 3 new components available under the Transform section: Google DLP Decrypt, Google DLP Redact, and Google DLP PII Filter. For this pipeline, add the DLP Redact component to the canvas and connect the GCS component to it:

Set the DLP Redact properties:

Enable the Use custom template flag, and in the Template ID field paste the name of the DLP Inspection template created earlier.

In the Matching section, set the Fields to Transform sub-section to the following:

Apply: Deterministic Encryption → On: Custom Template → within: RUN

Once we select Custom Template from the “on” drop-down, it changes to “None”. RUN is the field in the input file that we will tell DLP to inspect with our custom inspection template (in this case we know this is the field that contains the Chilean National IDs, so we don’t need to inspect additional fields for this infoType).

In the next section, select KMS Wrapped Key as Crypto Key type, and set the corresponding Wrapped key, KMS resource ID and Surrogate type name per our previous configuration in the DEID template creation:

Note: You can copy the KMS resource ID of the key by opening the key ring and copying the resource name from the Actions menu on the key:

Finally, add a BigQuery component from the Sink menu to the canvas and connect the DLP Redact component to it:

Select the dataset and table name for the output data to be written to:

Provide a name for the pipeline, save it, deploy it, and run it (you may choose to test it first with Data Fusion’s preview feature). In this exercise we name the output table “DataFusionRedactedRUNs”.

If you run a preview of the pipeline, you can confirm that the records are being encrypted as expected by using the Preview Data functionality of the DLP Redact component:

Once confirmed, you may deploy the pipeline and run it, and confirm that the output table has been created in BigQuery and loaded with the expected encrypted field:
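A quick way to spot-check the result from Cloud Shell (assuming the dlp_dataset dataset name used earlier in this walkthrough):

# Peek at a few rows of the output table written by the Data Fusion pipeline
bq query --use_legacy_sql=false \
  "SELECT * FROM \`${PROJECT_ID}.dlp_dataset.DataFusionRedactedRUNs\` LIMIT 5"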

Congratulations!

You have securely loaded a file containing sensitive information into a BigQuery table by creating a batch pipeline in Cloud Data Fusion and a streaming job in Dataflow, both of which used Cloud DLP and KMS to inspect the input file for Chilean National IDs (the CHILE_CDI_NUMBER infoType) and encrypt the sensitive data with the encryption key you created in KMS.

What we covered

  • Learn how to create an Inspection template in Cloud Data Loss Prevention to Inspect an input for sensitive information using a pre-defined infotype.
  • Learn how to create a keyring and a symmetric encryption key in Cloud KMS.
  • Learn how to create a De-identification template in Cloud Data Loss Prevention that uses the KMS-generated key to encrypt the sensitive data detected by the DLP inspection template.
  • Create a pipeline that reads data containing sensitive information from a file in a Cloud Storage bucket, inspects the file with the Cloud DLP inspection template looking for Chilean National IDs, encrypts the findings using the De-Identification template, and uploads the data with the encrypted field to a BigQuery table using both Cloud Data Fusion and Dataflow.

Useful Documentation

Learn about Cloud DLP: https://cloud.google.com/dlp/docs

Cloud KMS: https://cloud.google.com/kms/docs

Encrypt and Decrypt Data with Cloud KMS: https://codelabs.developers.google.com/codelabs/encrypt-and-decrypt-data-with-cloud-kms#0

Using the Cloud Dataflow DLP template:

https://cloud.google.com/dataflow/docs/guides/templates/provided/dlp-text-to-bigquery

Use DLP to redact data in Cloud Data Fusion:

https://cloud.google.com/data-fusion/docs/how-to/using-dlp


Daniel Villegas
Google Cloud - Community

Data & ML Customer Engineer at Google, enabling the onboarding of customers to Google Cloud.