BigQuery, PII, and DLP: The Perfect Match
How to generate random PII in BigQuery, run DLP on it and quickly see the results on DataStudio
There are so many regulations nowadays — CCPA, GDPR, HIPAA — regarding information protection, and many new ones to come, that it’s hard to be up to date with everything that comes with it. We can’t ignore it anymore, hence finding a way to deal with it is a must.
Google Cloud Data Loss Prevention helps you scratch the surface on what those regulations are about, by automatically discovering sensitive data everywhere.
Data Loss Prevention (DLP) in general is a broad field and has a lot of great features. We won’t be covering the details of DLP here but there is plenty of excellent literature upon it found elsewhere. What we will look at is the Google Cloud DLP
and BigQuery
integration.
Disclaimer: All opinions expressed are my own, and represent no one but myself…
Load PII data into BigQuery
To begin with, we need some Personally Identifiable Information (PII) loaded into BigQuery.
And to help with that, I’ve developed a Python script that relies on the Faker
library, to generate random PII.
The script is licensed under the MIT license, so feel free to get it and run according to the instructions. Contributions are welcome, by the way!
It’s a command-line tool that allow users to choose the number of columns and rows to be created, the BigQuery dataset and table names.
Let’s look at the main code:
- line 8: creates a Pandas DataFrame containing the PII with the provided args;
- line 12: creates a CSV file from the DataFrame representation;
- line 17: uploads the CSV file to a temporary Storage Bucket;
- line 22: creates the BigQuery Dataset if it does not exist;
- line 23: creates a BigQuery load job, to send the contents from the uploaded file to a new Table and waits for it to finish;
- line 24: cleans up the temporary Storage Bucket;
Executing the script
# Command line execution
python main.py \
--project-id your_project \
--bq-dataset-name medium \
--num-cols 8 \
--num-rows 10000
We are going to run it 4 times, creating 4 BigQuery tables in the same Dataset. If the bq-table-name
arg is not provided, a randomly generated one is used.
You are also able to run the
bq-fake-pii-table-creator
script passing theobfuscate-col-names
arg, by leveraging this flag, we will see later how DLP is important.
We can see that we have 4 new Tables inside the medium
Dataset.
Executing DLP
By going to the DLP UI, we are going to create jobs to inspect those Tables.
On the first panel, we specify the source for the inspection, which will be the newly created Tables.
Then we specify one action, to save to BigQuery, so we can easily see the results on DataStudio later on.
The other steps were suppressed because the default values were kept.
In the end we have 4 jobs, one for each Table, and we can follow the results on the Jobs page. We will wait for all jobs to go to the done
state.
Findings
Once all jobs are complete, we can take a look at our DataStudio dashboard, and visualize the findings:
If you want to clone this DLP dashboard, I used the template from this blog post, it’s as simple as cloning and pointing to your BigQuery Dataset/Table. You can believe me when I say it literally takes 30 seconds to set it up! =)
Do you remember the obfuscate-col-names
flag?
That flag creates non-human readable column names, making it really hard to inspect that data, and DLP automatically found EMAIL_ADDRESS
and CREDIT_CARD_NUMBER
types on col_1
and col_3
, also you can see that not all values are actually Credit Card Numbers, can you imagine doing it manually?
On large organizations, specially when having data dictionaries, it’s not hard to face scenarios with non-human readable column names.
Keep in mind that when using DLP, you pay for the bytes inspected, so adopt strategies of sampling small amounts of data at first, when using it on large volumes.
Closing thoughts
In this article, we have covered how to generate random PII in BigQuery and integrate it with DLP, enabling users to quickly analyze their data with Data Studio. Stay tuned for new posts showing how to integrate other systems with DLP! Thanks for reading. Cheers!
References
- BigQuery Fake PII Creator: https://github.com/mesmacosta/bq-fake-pii-table-creator
- DLP official docs: https://cloud.google.com/dlp