Data Loss Prevention: De-identification with Dataflow Templates and Cloud Composer

Suhita Goswami
2 min read · Jul 10, 2020

The Data Loss Prevention (DLP) API is a multi-faceted tool on Google Cloud Platform that allows users to inspect and mask sensitive data. Given that identifiable data points such as names, emails, addresses, etc. must be handled with the utmost sensitivity, the DLP API makes the safe management of PII/PHI data cohesive with your Cloud environment.

The de-identification job can be executed using a Dataflow template that reads raw files from GCS and writes them to BigQuery. The template takes the DLP de-identification template, the GCS file location, and the BigQuery dataset to write to as parameters.

We will be walking through setting up a de-id template and running the de-identification job using Cloud Composer as an orchestration tool.

De-identification templates can perform data redaction, masking, and/or bucketing. They can also perform hashing/encryption using Cloud KMS keys to generate data encryption keys. Depending on the method of implementation, the hashed values can maintain their uniqueness, preserving referential integrity where it is required for joining tables.

The sample data I’ve generated consists of three tables: person_key, person_data, and condition_data.

Person_key has FirstName, LastName, TelephoneNumber, and person_id.

Person_data has the customer details except person_id.

Condition_data has diagnosis codes and related information recorded against a specific person_id.

De-identification Template:

We created a template which tokenizes person_id, redacts FirstName and LastName, and masks the last 7 digits of TelephoneNumber.
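As a sketch of what such a template might look like when created with the google-cloud-dlp Python client, the snippet below tokenizes person_id with a deterministic crypto transformation, redacts the name columns, and masks the last 7 digits of TelephoneNumber. The project placeholder, template ID, surrogate info type name, and the use of a transient key (rather than the KMS-wrapped key mentioned above) are illustrative assumptions, not values taken from this walkthrough.

from google.cloud import dlp_v2

project_id = "<project-demo-sandbox>"  # placeholder project, as used elsewhere in this post

client = dlp_v2.DlpServiceClient()

deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                # Tokenize person_id deterministically so the token stays unique
                # and tables can still be joined on it. A transient key is used
                # here for brevity; a KMS-wrapped key is the production option
                # mentioned above.
                "fields": [{"name": "person_id"}],
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        "crypto_key": {"transient": {"name": "deid-demo-key"}},
                        "surrogate_info_type": {"name": "PERSON_ID_TOKEN"},
                    }
                },
            },
            {
                # Redact the name columns entirely.
                "fields": [{"name": "FirstName"}, {"name": "LastName"}],
                "primitive_transformation": {"redact_config": {}},
            },
            {
                # Mask the last 7 digits of the telephone number with '#'.
                "fields": [{"name": "TelephoneNumber"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "#",
                        "number_to_mask": 7,
                        "reverse_order": True,
                    }
                },
            },
        ]
    }
}

template = client.create_deidentify_template(
    request={
        "parent": f"projects/{project_id}",
        "deidentify_template": {"deidentify_config": deidentify_config},
        "template_id": "deid-demo",
    }
)
print(template.name)  # projects/<project-id>/deidentifyTemplates/deid-demo

The resulting template name, projects/<project-id>/deidentifyTemplates/deid-demo, is the value the Dataflow job expects later on.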

Dataflow De-identification Template

The de-identification Dataflow template takes the de-identification template generated in the project as an input parameter. It also takes the GCS path for the input raw files and the batch size for the DLP API call as required parameters for the job.

Cloud Composer

The DAG in your Composer environment should have the following Dataflow template operator. Keep in mind that the de-identification template must be passed in the format: projects/<project-id>/deidentifyTemplates/deid-demo, without ‘/locations/’. The dataset in BigQuery must exist prior to DAG execution; a sketch of creating it from within the same DAG follows the operator below.

# Import path for the Airflow 2 Google provider; on Airflow 1.10-era Composer
# environments the operator lives in airflow.contrib.operators.dataflow_operator.
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplateOperator

start_template_job = DataflowTemplateOperator(
    # The task id of your job
    task_id="dataflow_operator_deid-transform_csv_to_bq",
    # The GCS path of the template that you're using
    # For versions in non-production environments, use the subfolder 'latest'
    # https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#gcstexttobigquery
    template="gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery",
    # Use the link above to specify the correct parameters for your template.
    parameters={
        "inputFilePattern": "gs://<project>-sample-landing/person_data.csv",
        "dlpProjectId": "<project-demo-sandbox>",
        "deidentifyTemplateName": "projects/<project-demo-sandbox>/deidentifyTemplates/deid-demo",
        "datasetName": "deid_composer",
        "batchSize": "500",
    },
)
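Since the dataset must exist before the job runs, one option is to create it idempotently from the same DAG with an upstream task. This is a minimal sketch assuming the Airflow 2 Google provider; the task id and project placeholder are illustrative.

from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)

# Creates the target dataset if it does not exist yet. exists_ok is available in
# recent versions of the Google provider; older versions may error if the dataset
# already exists.
create_deid_dataset = BigQueryCreateEmptyDatasetOperator(
    task_id="create_deid_composer_dataset",
    project_id="<project-demo-sandbox>",  # same placeholder as above
    dataset_id="deid_composer",
    exists_ok=True,
)

# Run the dataset creation before launching the de-identification job.
create_deid_dataset >> start_template_job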

The DLP de-identification job is a streaming Dataflow job, which can be found in the Dataflow jobs pane. The Console UI for the job shows its parameters, including the de-identification template and tokenization settings, in the right-hand pane.
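Once rows start landing, you can sanity-check the output by querying the de-identified table directly; a minimal sketch with the BigQuery Python client is below. The table name is an assumption here (the streaming template typically names output tables after the input files), as is the project placeholder.

from google.cloud import bigquery

client = bigquery.Client(project="<project-demo-sandbox>")  # assumed placeholder
query = """
    SELECT FirstName, LastName, TelephoneNumber, person_id
    FROM `<project-demo-sandbox>.deid_composer.person_data`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))  # names should be redacted, phone masked, person_id tokenized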

Note: This example does not take private IP into account. Using private IP requires reconfiguring and recompiling the DLP template code to include a private IP flag.
