Inspecting and Reporting Sensitive Data with Google Cloud’s DLP API

Zach Sais
Google Cloud - Community
6 min read · Apr 1, 2019

tl;dr: I’m Zach. I’m new to blogging. Let’s find and report PII in your GCP environment with the Google Cloud Data Loss Prevention API.

Howdy folks and welcome to my first blog post EVER! In the short time I’ve been at Google Cloud, I’ve learned at least a hundred new things every single day. From new open source technologies to connecting Google Cloud services in interesting ways, it’s like drinking from a fire hose sometimes. This helps me help customers innovate their business, but there are still so many people out there curious about what cloud services could do for them. I’m hoping to tap into my previous life as a calculus tutor and distribute some of the knowledge I pick up along the way. Thanks for being a part of the journey!

Every organization has sensitive data they must protect: from addresses and credit card numbers to medical patient records and intellectual property, the list goes on. These types of info are typically referred to as personally identifiable information (PII) or protected health information (PHI). For businesses with this data, securing the network is just half the fun. What happens if a bad actor is already inside your network and starts poking around the data? You need some way of disguising, or obfuscating, the data in such a way that it’s unidentifiable to unauthorized eyes.

Enter Data Loss Prevention API

Google Cloud DLP API is a service that helps users identify and manage their sensitive data. The API has over 90 pre-trained data classifiers and natively integrates with other Google Cloud services like BigQuery and Cloud Storage. Using obfuscation techniques like masking, we can transform sensitive values so they’re unreadable to unauthorized eyes.
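
For a quick taste, here’s a minimal sketch of masking email addresses in a string with the google-cloud-dlp Python client library. The project ID and sample text are placeholders, and this snippet is separate from the classification solution we’ll build on below:

from google.cloud import dlp_v2

PROJECT_ID = '[YOUR_PROJECT_ID]'  # placeholder

dlp = dlp_v2.DlpServiceClient()

response = dlp.deidentify_content(
    request={
        'parent': f'projects/{PROJECT_ID}',
        'inspect_config': {'info_types': [{'name': 'EMAIL_ADDRESS'}]},
        'deidentify_config': {
            'info_type_transformations': {
                'transformations': [{
                    'primitive_transformation': {
                        'character_mask_config': {'masking_character': '#'}
                    }
                }]
            }
        },
        'item': {'value': 'Contact me at jane.doe@example.com'},
    }
)

print(response.item.value)  # Contact me at ####################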

There’s already a great solution by my colleagues for setting up a classification pipeline using Cloud Functions. For this post, I’d like to take that solution a step further and show how we can report on sensitive data in our environment in Cloud Security Command Center (CSCC) and Data Studio once we find it.

Objectives

  • Configure Cloud Security Command Center to view findings and assets
  • Create BigQuery Dataset to store all DLP findings
  • Update the Cloud Functions you created in the classification solution to push to CSCC and BigQuery
  • Create Data Studio dashboard to surface DLP findings from BigQuery

What is Cloud Security Command Center?

CSCC is an all-in-one security and risk platform on Google Cloud that helps security teams gather data across their entire GCP organization to identify threats and take action before they end up on the front page of the New York Times. CSCC serves up an asset inventory of everything that exists in your org, shows you where PII resides in your GCP environment, and integrates with 3rd party tools like Redlock, Forseti, and Chef to detect instance vulnerabilities and compliance policy violations.

What is Data Studio?

Data Studio provides the ability to easily build interactive dashboards utilizing data from many different sources. Although the DLP API reports findings to Cloud Security Command Center, you most likely aren’t going to give every VP or director access to the GCP Console. Data Studio allows you to build informational dashboards you can surface to a broader audience that keeps them informed on the current state of PII existing in their cloud environment.

Inspection and Reporting Pipeline

Inspect/Report Pipeline Example
  1. You upload data to solutions like Cloud Storage, BigQuery, or Cloud Datastore. For now, we’ll focus on Cloud Storage
  2. We invoke the Data Loss Prevention API to inspect the data on a schedule via a Cloud Scheduler job, or whenever new data is pushed to GCP via a Cloud Function (see the sketch after this list)
  3. The DLP API sends its findings to the Cloud Security Command Center and BigQuery
  4. We build a dashboard in Data Studio to surface our findings from BigQuery
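
To make step 2 concrete, here’s a rough sketch of a storage-triggered Cloud Function that kicks off a DLP inspection job. The function name and infoType are illustrative, not the classification solution’s exact code:

from google.cloud import dlp_v2

PROJECT_ID = '[YOUR_PROJECT_ID]'  # placeholder

dlp = dlp_v2.DlpServiceClient()

def inspect_uploaded_file(event, context):
    """Background Cloud Function: starts a DLP job whenever a new
    object lands in the bucket this function is bound to."""
    inspect_job = {
        'storage_config': {
            'cloud_storage_options': {
                'file_set': {'url': f"gs://{event['bucket']}/{event['name']}"}
            }
        },
        'inspect_config': {'info_types': [{'name': 'EMAIL_ADDRESS'}]},
        # the 'actions' that push findings to BigQuery and CSCC are
        # covered in the updates we walk through below
    }
    dlp.create_dlp_job(
        request={'parent': f'projects/{PROJECT_ID}', 'inspect_job': inspect_job}
    )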

Configure Cloud Security Command Center to view findings and assets

  • Enable Cloud Security Command Center for your organization following the instructions here

Note: in order to provision CSCC, you’ll need to be in a GCP organization. You can find more info for creating an org resource here, but feel free to skip to the next section if you’re not interested in setting one up.

CSCC DLP Overview Example
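
Once CSCC is enabled, you can also pull findings programmatically. Here’s a minimal sketch using the google-cloud-securitycenter Python client, with a placeholder organization ID; DLP findings will show up here once the pipeline below is running:

from google.cloud import securitycenter

client = securitycenter.SecurityCenterClient()

# '-' is a wildcard that lists findings across all sources in the org
all_sources = 'organizations/[YOUR_ORG_ID]/sources/-'

for result in client.list_findings(request={'parent': all_sources}):
    print(result.finding.category, result.finding.resource_name)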

Create BigQuery Dataset to store all DLP findings

  • Open the BigQuery UI
  • Under Resources, select your project ID, then select Create Dataset
  • Set Dataset ID to dlp_findings, leave the other default settings, and hit Create Dataset

Note: by default, BigQuery will create the dataset in the United States.

The DLP API will send all findings to tables within this dataset. For this example, we will use a BigQuery table per GCP project, but it could also make sense to use a table per folder.
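
If you’d rather script this step than click through the console, here’s a minimal sketch using the google-cloud-bigquery client library; the project ID is a placeholder:

from google.cloud import bigquery

PROJECT_ID = '[YOUR_PROJECT_ID]'  # placeholder

client = bigquery.Client(project=PROJECT_ID)

dataset = bigquery.Dataset(f'{PROJECT_ID}.dlp_findings')
dataset.location = 'US'  # BigQuery's default location

client.create_dataset(dataset, exists_ok=True)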

Update the Cloud Functions you created in the classification solution to push to CSCC and BigQuery

Once you have walked through the classification solution, revisit the Cloud Functions you created and replace the code with this updated version. Let’s walk through my updates:

PROJECT_ID = '[PROJECT_ID_FOR_DLP_FINDINGS]'
DATASET_ID = '[DATASET_ID_FOR_DLP_FINDINGS]'
TABLE_ID = '[TABLE_ID_FOR_DLP_FINDINGS]'

This is where we’ll set the location of the BigQuery dataset and table we created in the previous section. All our findings will go here.

{
    'save_findings': {
        'output_config': {
            'table': {
                'project_id': PROJECT_ID,
                'dataset_id': DATASET_ID,
                'table_id': TABLE_ID
            }
        }
    }
}, {
    'publish_summary_to_cscc': {}
}

This is where we add the new actions that push our DLP findings to CSCC and BigQuery.

Another update you may choose to make yourself is to expand the search scope by adding the type 'ALL_BASIC' within INFO_TYPES.
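
That change would look something like this, assuming the INFO_TYPES list from the classification solution’s code:

# 'ALL_BASIC' broadens the scan beyond a curated list of detectors
INFO_TYPES = ['ALL_BASIC']

inspect_config = {
    'info_types': [{'name': info_type} for info_type in INFO_TYPES],
}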

After saving the updates, let’s upload the same files to the quarantine bucket to kick off the Cloud Function trigger.
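
The upload can be done from the console, gsutil, or a few lines of Python; here’s a sketch with the google-cloud-storage client, using placeholder bucket and file names:

from google.cloud import storage

client = storage.Client()

# placeholders: use the quarantine bucket and sample files from
# the classification solution
bucket = client.bucket('[YOUR_QUARANTINE_BUCKET]')
blob = bucket.blob('sample_pii.txt')
blob.upload_from_filename('sample_pii.txt')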

Create Data Studio dashboard to surface DLP findings from BigQuery

Now that we’ve scanned our data and stored our findings in BigQuery, it would be nice to have a visualization of what exists in our environment. We can create a dashboard in Data Studio to do just that.
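
Before wiring up the dashboard, you can sanity-check the findings with a quick query. This sketch assumes the schema the DLP API writes for saved findings (an info_type record with a name field) and uses placeholder table names:

from google.cloud import bigquery

client = bigquery.Client()

# placeholder table path: point this at your dlp_findings table
query = '''
    SELECT info_type.name AS info_type, COUNT(*) AS findings
    FROM `[PROJECT_ID].dlp_findings.[TABLE_ID]`
    GROUP BY info_type
    ORDER BY findings DESC
'''

for row in client.query(query).result():
    print(row.info_type, row.findings)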

  • Navigate to this link to see my example dashboard. Select the copy icon from the toolbar, shown below.

Note: if this is your first time using Data Studio, you’ll be asked to click through an agreement before continuing.

Next, we’ll select the new data source, i.e., the newly created BigQuery table with DLP findings.

Hit the drop-down and select Create a New Data Source.

Search for BigQuery and authorize the connector.

Select the BigQuery Project, Dataset, and Table storing the DLP findings. Hit Connect, Add to Report, and Copy Report to finish creating the dashboard.

We should now have a nice graphical representation of the DLP findings that we can edit however we like.

Who knew finding PII could be so colorful?

To recap, we discussed why discovering PII/PHI in an environment is so important, compared the differences between CSCC and Data Studio dashboards, and modified the solution guide code for automating the classification of data to send DLP findings to BigQuery and Cloud Security Command Center. Keep in mind this is just a taste of what businesses must do to secure their data in the cloud. To build this functionality out at scale, we’d have to toss in things like pipeline automation (via Terraform, Ansible, etc.) and IAM/resource policy management (via Forseti, Redlock, etc.) to ensure there is a safe and repeatable process across the organization, plus obfuscation techniques to disguise the PII we might store in the cloud. Happy inspecting!
