Discovering and Classifying Your Data with GCP’s Sensitive Data Protection (DLP and Terraform!)

Jada Young
Google Cloud - Community
Aug 6, 2024

In conversations with a retail media client, we realized that advertisers in the digital marketing space face a complex challenge: using data to benefit clients and engage consumers, while respecting users’ privacy. This balancing act raises a key question: How can data be leveraged to craft compelling content that adds value for consumers and successfully promotes products, all while safeguarding users’ privacy and the brand’s reputation?

And it’s a fair question to ask, considering the unique combination of challenges surrounding data collection and reconciliation faced by those in the digital advertising space:

- Disparate Sources: Gathering data from diverse sources with varying practices makes it difficult to assess data quality and compatibility.
- Personalization Demands: Balancing consumers' desire for personalized content with their privacy concerns.
- Growing Compliance Landscape: Navigating a complex and evolving regulatory landscape across multiple jurisdictions poses resource challenges.

For businesses invested in digital ad campaigns, addressing these considerations can feel overwhelming at first glance — but it doesn’t have to be. GCP offers a robust suite of tools designed to simplify data governance, privacy, and risk mitigation. In this article, we’ll explore how GCP can be leveraged to understand and control your data, freeing you to focus on what matters most: delivering personalized customer experiences that drive results.

While these solutions are particularly relevant for retail media, the insights and strategies shared here can be valuable for any organization striving to balance personalization with data protection and regulatory compliance.

Cloud Data Loss Prevention (DLP)

Let’s focus on a critical first step in any data governance strategy: understanding what sensitive data you must protect. We’ll dive into a key component of GCP’s Sensitive Data Protection (SDP) suite, Data Loss Prevention (DLP), and explore how it enables you to automatically discover, classify, and redact sensitive information. Plus, we’ll demonstrate how to seamlessly integrate DLP into your workflows using Terraform, making data protection a natural part of your infrastructure management.

DLP automatically discovers, classifies, and de-identifies a wide array of sensitive data (ranging from personally identifiable information (PII) to financial information and intellectual property) across your entire data landscape. With the flexibility to scan for any combination of up to 150 built-in or custom infoTypes, you can ensure comprehensive coverage for all of your sensitive data concerns.
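To make that concrete, here is a minimal sketch (written in Terraform, which we'll lean on later in this article) of an inspection template that pairs a built-in infoType with a custom one. The CUSTOMER_LOYALTY_ID name, its regex, and the project ID are hypothetical placeholders, not built-in detectors:

# Sketch only: one built-in infoType plus one custom, regex-based infoType.
# "CUSTOMER_LOYALTY_ID" and its pattern are illustrative placeholders.
resource "google_data_loss_prevention_inspect_template" "custom_example" {
  parent      = "projects/your-project-id"
  description = "Detects email addresses plus a hypothetical loyalty ID format"

  inspect_config {
    # Built-in detector
    info_types {
      name = "EMAIL_ADDRESS"
    }

    # Custom detector defined by a regular expression
    custom_info_types {
      info_type {
        name = "CUSTOMER_LOYALTY_ID"
      }
      regex {
        pattern = "LYL-[0-9]{8}"
      }
    }
  }
}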

Whether data resides inside or outside of Google Cloud, DLP offers easy-to-configure scans for structured text, unstructured text, and images. New users can start with a basic scan configuration to get a sense of their sensitive information, then further refine the scans over time based on identified risks and compliance needs.

For a quick overview of how to get started with DLP inspection, check out the video below:

Automating with Terraform

The GCP console offers a convenient and intuitive way to configure DLP scans, but as your environment grows, managing multiple projects may require a more scalable approach. Terraform provides a seamless way to create and manage configuration scans that can be directly incorporated into your Infrastructure as Code (IaC) practices, automating and further solidifying your security posture.

Let’s explore a practical example using a BigQuery table filled with typical e-commerce user data.

Sample of BigQuery table with user data (ID, first name, last name, email, age, gender, and state)
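The Terraform resources in the rest of this article reference a few input variables for the project, dataset, and table. A minimal variables.tf along these lines is assumed (the names match the references below; the descriptions are illustrative):

# variables.tf -- placeholder definitions for the variables referenced below
variable "project" {
  description = "GCP project ID that hosts the BigQuery table and DLP resources"
  type        = string
}

variable "dataset_id" {
  description = "BigQuery dataset containing the e-commerce user table"
  type        = string
}

variable "table_id" {
  description = "BigQuery table to scan for sensitive data"
  type        = string
}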

We’ll use a Terraform script to automate the setup of a DLP discovery scan that analyzes this table for sensitive information like PII (names, email addresses, phone numbers) and financial details. Here’s a breakdown of each resource block:

1. DLP Discovery Config: Sets the target for our scan (the BigQuery table containing the user data) and references the inspection template used to identify sensitive data.

# Create the DLP discovery configuration
resource "google_data_loss_prevention_discovery_config" "basic" {
  parent   = "projects/${var.project}/locations/us"
  location = "us"
  status   = "RUNNING" # Start immediately after the config is created

  # Define what data to scan
  targets {
    big_query_target {
      # Limit discovery to the specific target table
      filter {
        table_reference {
          dataset_id = var.dataset_id
          table_id   = var.table_id
        }
        # To scan all BigQuery tables instead, replace table_reference
        # with an empty other_tables {} block
      }
    }
  }

  # Specify the inspection template
  inspect_templates = ["projects/${var.project}/inspectTemplates/${google_data_loss_prevention_inspect_template.basic.name}"]
}

2. DLP Inspection Template: Specifies what types of sensitive information the scan should look for. In this case, we are targeting common PII like email addresses, names, phone numbers, and credit card information.

# Define the inspection template
resource "google_data_loss_prevention_inspect_template" "basic" {
  parent      = "projects/${var.project}"
  description = <<-EOT
    This template scans e-commerce user data for
    Personally Identifiable Information (PII),
    including names, emails, addresses, etc.,
    and financial information (credit card numbers, etc.)
    to ensure data security and compliance.
  EOT

  inspect_config {
    info_types {
      name = "EMAIL_ADDRESS"
    }
    info_types {
      name = "PERSON_NAME"
    }
    info_types {
      name = "PHONE_NUMBER"
    }
    info_types {
      name = "CREDIT_CARD_NUMBER"
    }
  }
}

3. Pub/Sub Topic: Creates a communication channel for scan results. When the job finishes, it will send a message to this topic, allowing other systems or processes to act on the findings.

4. DLP Job Trigger: Schedules our DLP scan to run on a daily basis and specifies the actions to take upon completion. In this script, it’s configured to send a notification to the Pub/Sub topic when the scan is finished.

# Create a Pub/Sub topic to receive DLP job completion notifications
resource "google_pubsub_topic" "dlp_job_notification_topic" {
  name = "dlp-job-notification-topic"
}

# Create a DLP job trigger
resource "google_data_loss_prevention_job_trigger" "trigger" {
  parent      = "projects/${var.project}"
  description = "Trigger for scanning table ${var.dataset_id}.${var.table_id}"

  inspect_job {
    inspect_template_name = google_data_loss_prevention_inspect_template.basic.id

    # Define what data to scan
    storage_config {
      big_query_options {
        table_reference {
          project_id = var.project
          dataset_id = var.dataset_id
          table_id   = var.table_id
        }
        rows_limit    = 1000
        sample_method = "RANDOM_START"
      }
    }

    # Trigger actions
    actions {
      pub_sub {
        topic = google_pubsub_topic.dlp_job_notification_topic.id
      }
    }
  }

  # Run the scan on a daily schedule
  triggers {
    schedule {
      recurrence_period_duration = "86400s"
    }
  }
}

To view the full script, click here.

In summary, this script automates daily scans of the specified BigQuery table for personally identifiable information (PII). If PII is found, an alert is sent to the designated Pub/Sub topic. Automating this process reduces manual effort for our data security team while ensuring ongoing protection of customer data. We could extend this further by creating a Cloud Function that subscribes to the Pub/Sub topic and triggers webhooks, service integrations, or even custom remediation workflows when sensitive data is discovered. Terraform also natively supports exporting scan results to BigQuery, Security Command Center, Cloud Storage, and Data Catalog, offering even greater flexibility in how you manage and analyze your DLP findings.
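For example, exporting findings to BigQuery takes just one more action block inside the inspect_job of the trigger above. This is a sketch; the dlp_findings dataset and scan_results table names are placeholders you would swap for your own:

# Sketch: additional action for the job trigger, saving findings to BigQuery.
# The destination dataset and table names are illustrative placeholders.
actions {
  save_findings {
    output_config {
      table {
        project_id = var.project
        dataset_id = "dlp_findings"
        table_id   = "scan_results"
      }
    }
  }
}

With an action like this in place, each scheduled run would land its findings in that table, ready for querying or dashboarding.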

After running the script, we’re provided with an overview of our scan results:

DLP job results: 3,896 findings across 1,000 rows of data, primarily person names (74.33%) and email addresses (25.67%), along with phone numbers (moderate sensitivity) and a few credit card numbers (high sensitivity).

As shown, our scan identified 3,896 instances of sensitive information within the dataset. These findings consist primarily of full or partial names and email addresses. The results are classified by the type of information (infoType) discovered and the corresponding sensitivity level, providing a clear overview of the data’s potential risk. Now that we’ve successfully discovered and classified our data, we can consider how to de-identify, mask, or redact the sensitive information, which can also be defined using Terraform.
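As a preview of that next step, here is a hedged sketch of a de-identification template that masks any detected email addresses and credit card numbers (the masking character and the infoTypes covered are illustrative choices):

# Sketch: a de-identification template that masks detected values.
resource "google_data_loss_prevention_deidentify_template" "mask_pii" {
  parent      = "projects/${var.project}"
  description = "Masks emails and credit card numbers found in user data"

  deidentify_config {
    info_type_transformations {
      transformations {
        info_types {
          name = "EMAIL_ADDRESS"
        }
        info_types {
          name = "CREDIT_CARD_NUMBER"
        }
        primitive_transformation {
          character_mask_config {
            masking_character = "*" # by default, the entire value is masked
          }
        }
      }
    }
  }
}

A template like this could then be referenced by de-identification requests or jobs so that matched values are masked before the data is shared downstream.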

By automating our sensitive data protection pipeline, we not only save time and resources but also improve accuracy, reduce risk, and strengthen our overall security posture. This empowers teams to focus on core business goals: using data-driven insights to foster innovation, craft impactful campaigns, and deliver the best personalized customer experiences.

Ready to dive deeper into Sensitive Data Protection and DLP? Check out this assortment of tutorials, quickstarts, and labs covering common use cases.

Next Steps & References:

If you found this content helpful, connect with me on LinkedIn to continue the conversation. Be sure to follow for more upcoming content on data protection and other GCP security topics!
