BigQuery, PII, and DLP: The Perfect Match

How to generate random PII in BigQuery, run DLP on it and quickly see the results on DataStudio

Marcelo Costa
Google Cloud - Community
5 min read · Jan 27, 2020


Background by Kelli Tungay on Unsplash

There are so many information protection regulations nowadays, such as CCPA, GDPR and HIPAA, with many new ones to come, that it's hard to stay up to date with everything they require. We can't ignore them anymore, so finding a way to deal with them is a must.

Google Cloud Data Loss Prevention helps you start addressing what those regulations are about, by automatically discovering sensitive data wherever it lives.

Data Loss Prevention (DLP) in general is a broad field with a lot of great features. We won't be covering the details of DLP here, since there is plenty of excellent literature on it elsewhere. What we will look at is the integration between Google Cloud DLP and BigQuery.

Disclaimer: All opinions expressed are my own, and represent no one but myself…

Load PII data into BigQuery

To begin with, we need some Personally Identifiable Information (PII) loaded into BigQuery.

To help with that, I've developed a Python script that relies on the Faker library to generate random PII.

The script is licensed under the MIT license, so feel free to grab it and run it according to the instructions. Contributions are welcome, by the way!
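For reference, generating a single fake record with Faker looks roughly like this (a minimal sketch, not the actual script):

from faker import Faker

fake = Faker()

# Each call returns a new random value for that PII type
row = {
    "name": fake.name(),
    "email": fake.email(),
    "credit_card_number": fake.credit_card_number(),
    "phone_number": fake.phone_number(),
}
print(row)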

Command-line args

It’s a command-line tool that allows users to choose the number of columns and rows to be created, as well as the BigQuery dataset and table names.
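As a rough idea of the interface, the arguments could be declared with argparse along these lines (an illustrative sketch; defaults are placeholders, the real definitions live in the repository):

import argparse

parser = argparse.ArgumentParser(description="Create a BigQuery table filled with fake PII")
parser.add_argument("--project-id", required=True, help="GCP project to use")
parser.add_argument("--bq-dataset-name", required=True, help="Target BigQuery dataset")
parser.add_argument("--bq-table-name", help="Target table name (random name if omitted)")
parser.add_argument("--num-cols", type=int, default=10, help="Number of columns to generate")
parser.add_argument("--num-rows", type=int, default=100, help="Number of rows to generate")
parser.add_argument("--obfuscate-col-names", action="store_true",
                    help="Use non-human-readable column names")
args = parser.parse_args()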

Let’s look at the main code:

Some lines were suppressed for better readability
  • line 8: creates a Pandas DataFrame containing the PII with the provided args;
  • line 12: creates a CSV file from the DataFrame representation;
  • line 17: uploads the CSV file to a temporary Storage Bucket;
  • line 22: creates the BigQuery Dataset if it does not exist;
  • line 23: creates a BigQuery load job, to send the contents from the uploaded file to a new Table and waits for it to finish;
  • line 24: cleans up the temporary Storage Bucket;
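For illustration, here is a minimal sketch of that flow, written directly against pandas, google-cloud-storage and google-cloud-bigquery; names and details are simplified and differ from the actual script:

import pandas as pd
from faker import Faker
from google.cloud import bigquery, storage


def create_fake_pii_table(project_id, dataset_name, table_name, num_rows):
    fake = Faker()

    # Build a Pandas DataFrame containing the fake PII
    df = pd.DataFrame({
        "name": [fake.name() for _ in range(num_rows)],
        "email": [fake.email() for _ in range(num_rows)],
        "credit_card_number": [fake.credit_card_number() for _ in range(num_rows)],
    })

    # Create a CSV file from the DataFrame representation
    df.to_csv("fake_pii.csv", index=False)

    # Upload the CSV file to a temporary Storage Bucket
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.create_bucket(f"{project_id}-fake-pii-temp")
    bucket.blob("fake_pii.csv").upload_from_filename("fake_pii.csv")

    # Create the BigQuery Dataset if it does not exist
    bq_client = bigquery.Client(project=project_id)
    bq_client.create_dataset(dataset_name, exists_ok=True)

    # Load the uploaded file into a new Table and wait for the job to finish
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = bq_client.load_table_from_uri(
        f"gs://{bucket.name}/fake_pii.csv",
        f"{project_id}.{dataset_name}.{table_name}",
        job_config=job_config,
    )
    load_job.result()

    # Clean up the temporary Storage Bucket
    bucket.delete(force=True)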

Executing the script

# Command line execution
python main.py \
--project-id your_project \
--bq-dataset-name medium \
--num-cols 8 \
--num-rows 10000

We are going to run it 4 times, creating 4 BigQuery tables in the same Dataset. If the bq-table-name arg is not provided, a randomly generated one is used.

Script execution log

You can also run the bq-fake-pii-table-creator script passing the obfuscate-col-names arg; we will see later how this flag shows why DLP is so valuable.

Created BigQuery Tables

We can see that we have 4 new Tables inside the medium Dataset.

Executing DLP

Next, we go to the DLP UI and create jobs to inspect those Tables.

DLP UI — Input data panel

On the first panel, we specify the source for the inspection, which will be the newly created Tables.

DLP UI — Add actions panel

Then we specify one action, saving the findings to BigQuery, so we can easily see the results on DataStudio later on.

The other steps were suppressed because the default values were kept.
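By the way, the same kind of job can also be created programmatically with the DLP client library. Here is a rough sketch in Python, where the output dataset and table names (dlp_results.findings) are placeholders you'd adjust to your project:

from google.cloud import dlp_v2

project_id = "your_project"
dlp = dlp_v2.DlpServiceClient()

inspect_job = {
    # Source: the BigQuery table to inspect
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": project_id,
                "dataset_id": "medium",
                "table_id": "your_table",
            }
        }
    },
    # Which info types to look for
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "CREDIT_CARD_NUMBER"}],
    },
    # Action: save findings to BigQuery so DataStudio can read them
    "actions": [{
        "save_findings": {
            "output_config": {
                "table": {
                    "project_id": project_id,
                    "dataset_id": "dlp_results",
                    "table_id": "findings",
                }
            }
        }
    }],
}

job = dlp.create_dlp_job(
    request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
)
print(job.name)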

In the end we have 4 jobs, one for each Table, and we can follow the results on the Jobs page. We will wait for all jobs to go to the done state.

DLP UI — Jobs page

Findings

Once all jobs are complete, we can take a look at our DataStudio dashboard, and visualize the findings:

DataStudio DLP Findings Dashboard

If you want to clone this DLP dashboard, I used the template from this blog post; it's as simple as cloning it and pointing it at your BigQuery Dataset/Table. You can believe me when I say it literally takes 30 seconds to set up! =)

Do you remember the obfuscate-col-names flag?

BigQuery Table, Schema and Preview tabs with fake data

That flag creates non-human-readable column names, making it really hard to inspect that data by eye. Even so, DLP automatically found the EMAIL_ADDRESS and CREDIT_CARD_NUMBER types on col_1 and col_3, and you can also see that not all values are actually credit card numbers. Can you imagine doing that manually?

In large organizations, especially ones that rely on data dictionaries, it's not hard to face scenarios with non-human-readable column names.

DLP findings for org_incentivize_granular_web_readiness_a0cb Table
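If you prefer SQL over a dashboard, the findings saved to BigQuery can be queried directly as well. A sketch along these lines (assuming the default findings schema and the placeholder dlp_results.findings output table) counts findings per info type and column:

from google.cloud import bigquery

client = bigquery.Client(project="your_project")

# Count DLP findings grouped by info type and the column they were found in
query = """
    SELECT
      info_type.name AS detected_info_type,
      content_location.record_location.field_id.name AS column_name,
      COUNT(*) AS findings
    FROM `your_project.dlp_results.findings`,
      UNNEST(location.content_locations) AS content_location
    GROUP BY 1, 2
    ORDER BY findings DESC
"""
for row in client.query(query).result():
    print(row.detected_info_type, row.column_name, row.findings)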

Keep in mind that when using DLP you pay for the bytes inspected, so when running it on large volumes, adopt strategies such as sampling small amounts of data at first.
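One way to do that with the programmatic job shown earlier is to limit the rows scanned in the big_query_options; a minimal sketch (field values are just examples):

# Only inspect up to 1000 rows of the table, starting at a random row
storage_config = {
    "big_query_options": {
        "table_reference": {
            "project_id": "your_project",
            "dataset_id": "medium",
            "table_id": "your_table",
        },
        "rows_limit": 1000,
        "sample_method": "RANDOM_START",
    }
}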

Closing thoughts

In this article, we have covered how to generate random PII in BigQuery and integrate it with DLP, enabling users to quickly analyze their data with Data Studio. Stay tuned for new posts showing how to integrate other systems with DLP! Thanks for reading. Cheers!
