Stop Worrying About BigQuery PII: How to Automate Data Governance at Scale

PII Discovery, Classification and Access Control for BigQuery

Karim Wadie
Google Cloud - Community
12 min read · May 9, 2022


TL;DR: There is no doubt that managing and securing PII data is a top priority for many organizations, especially since failure to comply with PII mandates may result in heavy material and non-material penalties for the business. Without an automated framework for governing PII, this sensitive process tends to be manual, non-scalable and error-prone (e.g. creating PII and non-PII copies of a dataset).

To address that, we introduce an open-source solution to automatically discover, classify and restrict access to PII data fields in BigQuery according to your organization’s sensitivity-level hierarchy.

Many organizations are migrating their data to BigQuery (or have already done so) and tackling the question of how to “handle” their Personally Identifiable Information (PII). The term “handle” might refer to different tasks depending on the organization’s business needs, compliance requirements or level of maturity. For example, common tasks could be:

  • Discovering which columns and tables contain PII.
  • Classifying PII types based on a hierarchy (e.g. High Sensitivity, Low Sensitivity, etc)
  • Adding metadata tags to columns based on their PII type/class as a way of cataloging the data
  • Managing who can read which PII type/class across your projects (for both users and service accounts)

While there are more layers and directions for handling PII data (e.g. encryption, anonymization, de-identification, etc.), in this article we will focus on the steps listed above as the minimum requirements to govern and secure PII data in place, without creating data copies. For that, we assume that you already have data loaded into BigQuery, in raw format (no masking) and with potentially exposed PII (e.g. a landing zone). Sounds familiar? Keep reading!

In the next sections we are going to address each of those steps and the GCP products and features that enable them. Finally, we’ll talk about the BQ PII Classifier open-source solution to automate the entire process.

Step 1: Discovering PII

The short answer to “How do I discover PII data on GCP?” is usually the Data Loss Prevention (DLP) API. It’s the native GCP offering that you can use directly from the cloud console (UI) to create an “Inspection Job” that scans one data resource (e.g. a BigQuery table) based on an “Inspection Template” (which lists certain out-of-the-box or custom PII InfoTypes) and reports the “findings” to a variety of destinations, at different levels of granularity (called Actions).

Sample Inspection Template
Sample Inspection Job for a BigQuery table

The most relevant “Action” for us is Save findings to BigQuery, which exports the most granular job findings. A “finding” details one PII detection in a certain row and column, along with the likelihood of that match. Note that we might detect more than one PII type in the same column, each with a different matching likelihood. For example, a row with “215 Nelson Mandela Street” in a customer_address column might generate two findings: {InfoType: STREET_ADDRESS, Likelihood: VERY_LIKELY} and {InfoType: PERSON_NAME, Likelihood: LIKELY}. Nevertheless, we can control such behavior, if required, via “RuleSets”.

How could we use this feature to reach our goal of automatically detecting PII across our BigQuery tables?

  • Define and customize what PII InfoTypes we are searching for and list them in an Inspection Template
  • Have an application layer that programmatically creates an Inspection Job per table via one of the DLP client libraries (we’ll talk about a native feature, Auto DLP, later in this post and how to use it to achieve our goal)
  • Configure the Inspection Job to store findings in a BigQuery table for further processing in the next steps (a sketch of such a per-table job follows this list)
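
For illustration, creating one such inspection job with the Python DLP client library might look like the following minimal sketch. All project, dataset, table and template names are hypothetical placeholders.

```python
# Minimal sketch: one DLP Inspection Job for one BigQuery table,
# saving granular findings to a BigQuery results table.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project

inspect_job = {
    # Reuse a pre-defined Inspection Template listing the PII InfoTypes.
    "inspect_template_name": f"{parent}/inspectTemplates/pii-template",
    # Point the job at exactly one BigQuery table.
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "landing_zone",
                "table_id": "customers",
            }
        }
    },
    # Action: export the granular findings to a BigQuery table.
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Created DLP job: {job.name}")
```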

Step 2: Defining a PII Classification Hierarchy

A simple and common example of a PII classification hierarchy is one built around sensitivity levels (e.g. High, Medium, Low), where each PII type (i.e. DLP InfoType) belongs under exactly one classification. Reasons to classify PII types vary, but quite often it is done for data governance: granting a certain level of security clearance to users/groups.

GCP covers exactly this use case via Data Catalog and Policy Tag Taxonomies. A “Policy Tag Taxonomy” is a resource that one defines in Data Catalog as an arbitrary, user-defined hierarchy of nodes that is meaningful to the business. Each parent or leaf node in this hierarchy can be used as a “Policy Tag” that can be attached to BigQuery columns (i.e. tagging columns with policy tags). Policy tags have two main uses:

  • Acting as a metadata field that is discoverable within Data Catalog (e.g. searching for all tables that have columns tagged with the “high-sensitivity” policy tag)
  • Enabling column-level access control in BigQuery. This way we define access control once (on the policy tag) and apply it to multiple columns in BigQuery via tagging.

To conclude: in order to define a PII classification hierarchy we simply need to create a Policy Tag Taxonomy in Data Catalog that includes nodes for all the PII InfoTypes we already defined in the DLP Inspection Template.

Sample Policy Tags Taxonomy for PII
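
Creating such a taxonomy programmatically with the Python Data Catalog client could look like the following sketch. The project, location and display names are hypothetical.

```python
# Minimal sketch: a sensitivity-level taxonomy with one parent node
# ("High Sensitivity") and one leaf node per DLP InfoType under it.
from google.cloud import datacatalog_v1

ptm = datacatalog_v1.PolicyTagManagerClient()
parent = "projects/my-project/locations/us"  # hypothetical

taxonomy = ptm.create_taxonomy(
    parent=parent,
    taxonomy={
        "display_name": "PII Classification",
        # Required to use the tags for column-level access control.
        "activated_policy_types": [
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    },
)

# Parent node for one sensitivity level.
high = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag={"display_name": "High Sensitivity"},
)

# Leaf node for one InfoType, nested under the sensitivity level.
ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag={
        "display_name": "STREET_ADDRESS",
        "parent_policy_tag": high.name,
    },
)
```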

Step 3: Tagging BigQuery PII Columns

Manual column tagging is straightforward from the cloud console (UI): edit the schema of a table, pick the column and select the policy tag from a taxonomy.

BigQuery Column Tagging (Console)

However, we would have to repeat this process many times to cover all PII columns in all our tables. Not to mention, the human doing the tagging might not know whether a column contains PII in the first place.

Remember the DLP Inspection Jobs, and how we configured them to save detailed findings to BigQuery? Now we need to act on those findings.

For each table whose columns we want to tag, we could do the following (a sketch of the final tagging step follows this list):

  • Read DLP findings from the detailed results BigQuery table
  • For each table column, find the InfoTypes that DLP detected. If DLP found more than one InfoType for that column, we promote exactly one (taking into account the number of findings and their likelihoods, or by reporting a special placeholder “MIXED”)
  • Lookup that InfoType in the Data Catalog Taxonomy and return the Policy Tag ID that corresponds to this InfoType
  • Apply the Policy Tag ID to the BigQuery column
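
Applying a policy tag to a column boils down to patching the table schema. A minimal sketch with the Python BigQuery client follows; the table, column and policy tag names are hypothetical.

```python
# Minimal sketch: attach one policy tag to one column by rebuilding
# the table schema and patching the table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project
table = client.get_table("my-project.landing_zone.customers")

# Policy tag resource name looked up from the Data Catalog taxonomy.
policy_tag = "projects/my-project/locations/us/taxonomies/1234/policyTags/5678"

new_schema = []
for field in table.schema:
    if field.name == "customer_email":
        # Rebuild the field with the policy tag attached.
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            description=field.description,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # patch only the schema
```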

Step 4: Managing Access Control to PII

Now we end up with a setup where certain columns in BigQuery are tagged with Policy Tags defined in a hierarchy/taxonomy in Data Catalog. All we need to do is go to the Policy Tag Taxonomy and assign the “Fine-Grained Reader” role to users and groups. This assignment can be made at any level of the hierarchy. For example, assigning user1@ the Fine-Grained Reader role on the “Email” tag (a leaf node) means that user1 will have access to columns tagged with this exact policy tag. Alternatively, assigning them the same role on the “Medium Sensitivity” tag (a parent node) means that they will have access to columns tagged with any leaf node under that parent (i.e. Email and Phone Number).

To follow IAM best practices, we recommend doing this step with Groups instead of Users. For example, one could have 3 groups to govern access to PII [pii-high-sensitivity-readers@, pii-medium-sensitivity-readers@, pii-low-sensitivity-readers@], each of which is granted the Fine-Grained Reader role on the respective level of the taxonomy. In that case, onboarding a new user or service account to any of these levels is just a matter of adding them to the desired group.

Granting Access to Tagged PII Fields
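
Such a grant can also be made programmatically. A minimal sketch using the Data Catalog client and the IAM policy protos follows; the policy tag name and group address are hypothetical.

```python
# Minimal sketch: grant the Fine-Grained Reader role on one policy tag
# to a readers group.
from google.cloud import datacatalog_v1
from google.iam.v1 import policy_pb2

ptm = datacatalog_v1.PolicyTagManagerClient()
tag_name = "projects/my-project/locations/us/taxonomies/1234/policyTags/5678"

# Read-modify-write the IAM policy attached to the policy tag.
policy = ptm.get_iam_policy(request={"resource": tag_name})
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/datacatalog.categoryFineGrainedReader",
        members=["group:pii-medium-sensitivity-readers@example.com"],
    )
)
ptm.set_iam_policy(request={"resource": tag_name, "policy": policy})
```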

Let’s take a hypothetical “orders” table as an example. This table contains PII columns [Customer Email, Customer Address], to which we apply policy tags, and other non-PII columns [Item, Order Date, Price, etc]. There are two layers of IAM required for a user to access this table:

  • Standard BigQuery Data Viewer role on the table, dataset or project level. This grants general access to the table, i.e. to read all columns that are not restricted by policy tags.
  • Additional Data Catalog Fine-Grained Reader role to grant access to the tagged PII columns.

It’s worth mentioning a handy feature here: the “Enforce Access Control” toggle for a given taxonomy. When this toggle is switched off we can experiment with column tagging without affecting user access to the tagged columns.

It goes without saying that the IAM bindings explained in this section can also be created programmatically, in line with our goal of automating all these steps together, as we will see in the next section.

Automating it All Together

Now that we understand the GCP concepts behind each step towards our goal of governing PII on BigQuery (i.e. PII discovery, classification, tagging and access control), it’s time to “glue” everything together and apply it to N tables at scale. For that, we’ve developed the BQ PII Classifier open-source solution.

BQ PII Classifier Solution Steps

Architecture Overview

BQ PII Classifier Solution Architecture (Standard Mode)

Inspection Dispatcher Service
A Cloud Run service that acts as an entry point for the solution. It expects a BigQuery scanning scope expressed as inclusion and exclusion lists of projects, datasets and tables. This scope could be one or more tables.

The Inspection Dispatcher calls the BigQuery API to list all tables included in the scan scope and submits one DLP inspection request per table to the Inspector Tasks topic.

Inspector Tasks Topic
This PubSub topic decouples the Inspection Dispatcher from the Inspector in order to handle the rate limits of the DLP API and to apply auto-retries with backoffs.
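
For illustration only, publishing one task per table to such a topic might look like the following sketch (the topic name and message payload are hypothetical; the actual solution’s message format may differ).

```python
# Hypothetical sketch: one PubSub message per table for downstream workers.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "inspector-tasks")  # hypothetical

for table in ["landing_zone.customers", "landing_zone.orders"]:
    payload = json.dumps({"table": table}).encode("utf-8")
    # .result() blocks until the message is acknowledged by PubSub.
    publisher.publish(topic, payload).result()
```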

Inspector Service
A Cloud Run service that expects a request to scan one table. It will submit an Inspection job to DLP for that table according to the configured Inspection template and other parameters such as scan limits, results table, notifications topic, etc.

For cost control, the service limits the number of rows to be scanned based on the table size and a user-defined configuration of limit intervals (e.g. 0–1000 rows → sample 100, 1001–10000 rows → sample 500, etc). This sample is taken randomly from the table.
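
A hypothetical sketch of this interval lookup (the boundaries below are illustrative, not the solution’s defaults):

```python
# Hypothetical sampling intervals: (row-count upper bound, rows to sample).
SAMPLING_INTERVALS = [
    (1_000, 100),           # tables up to 1,000 rows -> sample 100 rows
    (10_000, 500),          # tables up to 10,000 rows -> sample 500 rows
    (float("inf"), 1_000),  # anything larger -> sample 1,000 rows
]

def rows_to_sample(table_row_count: int) -> int:
    """Return the DLP sample size for a table of the given row count."""
    for upper_bound, sample_size in SAMPLING_INTERVALS:
        if table_row_count <= upper_bound:
            return sample_size
    return 0  # unreachable thanks to the catch-all interval
```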

Tagger Tasks Topic
This PubSub topic decouples the DLP API notifications from the Tagger service in order to handle the rate limits of BigQuery column tagging operations and to apply auto-retries with backoffs.

Tagger Service
A Cloud Run service that expects the information of one BigQuery table. It determines the InfoType of each column based on the latest DLP findings and applies the appropriate policy tag.

Tagging Dispatcher Service
A Cloud Run service that acts as a second entry point for the solution. It expects a BigQuery scope expressed as inclusion and exclusion lists of projects, datasets and tables. This scope could be one or more tables.

The Tagging Dispatcher lists all tables included in the scan scope that have existing DLP scan results and submits one tagging request per table to the Tagger Tasks topic.

This service is used to trigger a re-tagging run without re-inspecting the table. This is helpful in cases where the classification taxonomy has changed, but the underlying data is the same.

BigQuery Config Views
Terraform generates config views based on the defined mappings and configurations. These config views are mainly used by the Tagger service to determine the policy tagging logic.

Logs
All Cloud Run services write structured logs that are exported by a log sink to BigQuery. On top of that, a number of BigQuery views help in monitoring and debugging call chains and tagging actions on columns.

Deployment

To deploy the solution you will have to configure the following main sections (via Terraform), which map to the GCP concepts we discussed earlier:

  • BigQuery Scope: The list of projects, datasets or tables to include in or exclude from your runs (i.e. to scan and tag). This also comes with a CRON schedule to determine the frequency of such runs.
BigQuery Scan Scope Config
  • Data Classification Taxonomy: A JSON list of standard and custom InfoTypes defined as a hierarchy (e.g. High, Medium, Low). This list is converted into two GCP resources: 1) a Data Catalog Policy Tag Taxonomy and 2) a DLP Inspection Template (an illustrative sketch of this config follows the list)
Classification Taxonomy Config
  • Domain Mapping: If you are required to differentiate between PII data access across business units (e.g. Marketing PII readers can only access Marketing PII but not Finance PII), you will need to tell the solution which data belongs to which “domain” (i.e. business unit), on either project or dataset level. This results in creating N Policy Tag Taxonomies in Data Catalog, one per domain (all replicas of the same hierarchy), where each one is used to tag tables belonging to that domain.
Domain Mapping Config
  • IAM Mapping: The list of IAM users and groups who should have access to PII data for a certain domain plus confidentiality level combination (defined in the classification taxonomy earlier)
IAM Mapping Config
  • DLP Scan Limits: In order to keep cost under control, the solution provides a configuration that determines the number (or percentage) of rows to be randomly sampled by DLP Inspection Jobs according to the table size. (PS: this config is only applicable in the Standard DLP Mode deployment that we will talk about in the “Standard vs Auto DLP” section)
DLP Scan Limit Config
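
As promised, here is an illustrative sketch of what a classification taxonomy config might contain, expressed as a Python structure for readability. The field names are hypothetical; consult the solution README for the actual Terraform variable schema.

```python
# Hypothetical classification taxonomy config: each entry maps one DLP
# InfoType to a policy tag under a sensitivity-level node.
classification_taxonomy = [
    {
        "info_type": "EMAIL_ADDRESS",
        "info_type_category": "standard",
        "policy_tag": "email",
        "classification": "Medium Sensitivity",
    },
    {
        "info_type": "STREET_ADDRESS",
        "info_type_category": "standard",
        "policy_tag": "street_address",
        "classification": "High Sensitivity",
    },
]
```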

While these are the main highlights of the configuration, one will need to follow the solution README for the full configuration and deployment procedures.

Running the Solution

Once the solution is deployed to GCP, one could easily trigger it via the entry point inspection-scheduler on Cloud Scheduler:

Triggering an Inspection Run via Cloud Scheduler
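
Besides the console’s force-run option, one could also trigger the scheduler job programmatically. A minimal sketch with the Cloud Scheduler Python client (the job name is hypothetical):

```python
# Minimal sketch: force-run the solution's entry-point scheduler job
# outside its CRON schedule.
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
job_name = "projects/my-project/locations/europe-west1/jobs/inspection-scheduler"
client.run_job(name=job_name)
```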

This triggers the Dispatcher service, which lists all tables in the provided BigQuery scan scope (i.e. inclusion and exclusion lists of projects, datasets and tables) and submits one request per table to the Inspector service.

Dispatcher Service Logs showing dispatched messages to Inspector Service

Once the Inspector service receives an inspection request for a given table, it creates and submits a DLP inspection job for that table, setting all the required parameters (e.g. inspection template, results table, etc.).

Inspector Service Logs showing processing steps for one table inspection request
DLP Inspection Jobs Created by the Solution per Table

And finally, once a DLP job has completed, it sends a PubSub notification message that triggers the Tagger service to interpret the DLP findings and assign the corresponding policy tags to columns.

Tagger Service Logs Showing Tagging Actions for Table Fields
BigQuery View Showing Tagging Actions for a Given Run/Table
BigQuery Table with Policy Tags Assigned to it for Access Control

Standard vs Auto DLP

With Auto DLP going GA in April 2022, the BQ PII Classifier extended its functionality to build on top of it. For that, the solution comes in two deployment flavors, “Standard DLP Mode” and “Auto DLP Mode” (Standard DLP Mode being the one we’ve been talking about so far).

In “Auto DLP Mode” the solution will skip the Inspector Service (i.e. generating DLP inspection jobs per table) and instead rely on the data profiles that are generated externally by Auto DLP; everything else remains the same.

BQ PII Classifier Solution Architecture (Auto DLP Mode)

It’s recommended to use “Auto DLP Mode” where possible, to utilize the reliability and scalability of the native product feature. The only reason you might lean towards “Standard DLP Mode” is if you are looking for one or more of the following:

  • Tight DLP cost guarantees: “Standard DLP Mode” provides a function to define the sampling size of BigQuery tables, which enables you to estimate the DLP cost generated by the solution. At the time of writing (April 2022) Auto DLP doesn’t provide cost estimation; nevertheless, it’s a feature in progress.
  • Granular BigQuery Scope: “Standard DLP Mode” lets you configure the scanning and tagging scope on Project, Dataset and Table levels via inclusion and exclusion lists, while Auto DLP lets you configure the scan scope on Organization, Folder or Project levels (no exclusion lists).
  • Granular Control of Run Frequency: With “Standard DLP Mode” you can configure N Cloud Scheduler jobs with different BigQuery scopes and CRON frequencies. With Auto DLP you can choose to re-scan a table daily or monthly on certain triggers (e.g. schema changes and added rows), but you can’t explicitly trigger a run.

Conclusion

In this article we explained a number of steps to govern PII data in BigQuery at scale. We demonstrated the major GCP capability behind each step and showcased an open-source solution that you can deploy to automate the entire process end-to-end.
