De-identifying BigQuery data using SDP

Neil Kolban
Google Cloud - Community
5 min read · Jul 7, 2024

One of the many Google Cloud services is Sensitive Data Protection (SDP). Its purpose is to find sensitive data (PII) and, optionally, de-identify it. Our focus here is on using the SDP de-identification features to de-identify data found in BigQuery tables. BigQuery natively offers column-level security: we can apply policy tags to columns so that a query over those columns returns de-identified data at query time. While this is more than sufficient for many use cases, we want to examine SDP de-identification as an alternative. Some of the potential benefits are:

  • After de-identification, the data can be written to a table such that the only copy is de-identified.
  • Templates can be defined that describe how we wish to de-identify data (a sketch of such a template follows this list).
  • We can de-identify data contained within text strings.
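
To make the template idea concrete, here is a minimal sketch of what a de-identification template resource can look like in the SDP (DLP) API. The column names and transformations below are illustrative only; they mirror the ssn and notes columns used later in this article, and the actual template used in this story is not shown here. The exact encoding follows the DeidentifyTemplate resource, whose deidentifyConfig holds record transformations keyed by field name:

{
  "deidentifyConfig": {
    "recordTransformations": {
      "fieldTransformations": [
        {
          "fields": [ { "name": "ssn" } ],
          "primitiveTransformation": {
            "characterMaskConfig": { "maskingCharacter": "#", "numberToMask": 5 }
          }
        },
        {
          "fields": [ { "name": "notes" } ],
          "infoTypeTransformations": {
            "transformations": [
              {
                "infoTypes": [ { "name": "PHONE_NUMBER" } ],
                "primitiveTransformation": { "replaceWithInfoTypeConfig": {} }
              }
            ]
          }
        }
      ]
    }
  }
}

The first rule masks part of a named column's value; the second scans the free text of the notes column for phone numbers and replaces them with the info type name.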

Let’s look at the high-level architecture of our story. We will work from the outside in.

First, we will write a SQL Function. This function will take as input a set of name/value pairs corresponding to the column names and values that we wish to de-identify. It will return the same name/value pairs, but with the values replaced by their de-identified equivalents.

The SQL Function will be implemented as a “remote” function. This means that when we invoke the SQL Function, it will actually invoke an external service (a microservice) that will perform the work. We choose to host this microservice as a Cloud Run service.

The Cloud Run service takes as input a set of name/value pairs and must return a de-identified set of name/value pairs; its interface matches that of the SQL Function. In our story, we will implement the Cloud Run service in Java, but any of the other supported languages (Python, Node, C#, etc.) could just as easily have been used. The heart of the Cloud Run service is the invocation of the SDP-provided de-identification service. This de-identification service is exposed by SDP as a REST service but has language-specific client bindings to make it easier to consume. The SDP de-identification service expects:

  • A Table containing:
      • Headers: a list of the column names contained in the table
      • Rows: a list of rows, where each row is a list of values corresponding to the columns
  • A de-identification template reference: a reference to a de-identification template keyed by column name. This contains a set of rules that describe how to transform the value of a named column into its corresponding de-identified value.
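
Here is a minimal sketch of that invocation using the Java client library (google-cloud-dlp). The project ID, template name, and row values are taken from the examples in this article; the surrounding class is illustrative and is not the actual Cloud Run service code:

import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.LocationName;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Value;

public class DeidentifySketch {
  public static Table deidentify() throws Exception {
    // The table to de-identify: headers are the column names, each row is a list of values.
    Table table = Table.newBuilder()
        .addHeaders(FieldId.newBuilder().setName("name"))
        .addHeaders(FieldId.newBuilder().setName("ssn"))
        .addHeaders(FieldId.newBuilder().setName("notes"))
        .addRows(Table.Row.newBuilder()
            .addValues(Value.newBuilder().setStringValue("Bob Smith"))
            .addValues(Value.newBuilder().setStringValue("123-45-6789"))
            .addValues(Value.newBuilder().setStringValue("Bob's home phone is 817-567-8900.")))
        .build();

    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      // Ask SDP to de-identify the table using the rules in the de-identification template.
      DeidentifyContentRequest request = DeidentifyContentRequest.newBuilder()
          .setParent(LocationName.of("test1-305123", "global").toString())
          .setDeidentifyTemplateName(
              "projects/test1-305123/locations/global/deidentifyTemplates/my_deidentify_1")
          .setItem(ContentItem.newBuilder().setTable(table).build())
          .build();
      DeidentifyContentResponse response = dlp.deidentifyContent(request);
      // The response contains a table of the same shape, with de-identified values.
      return response.getItem().getTable();
    }
  }
}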

With this design in mind, let’s see how it might look.

First, an example of the function being used:

WITH data AS (
  SELECT
    "Bob Smith" AS name,
    50000 AS salary,
    "123-45-6789" AS ssn,
    CAST("2024-07-04" AS DATE) AS purchaseDate,
    "Bob's home phone is 817-567-8900." AS notes
  UNION ALL
  SELECT "Jane Doe", 70000, "987-65-4321", "2024-07-09", "No Notes"
),
deid AS (
  SELECT remote_functions.deidentify(
    TO_JSON(STRUCT(name, salary, ssn, purchaseDate, notes))
  ) AS deid_json
  FROM data
)
SELECT
  STRING(deid_json.name) AS name,
  INT64(deid_json.salary) AS salary,
  STRING(deid_json.ssn) AS ssn,
  CAST(STRING(deid_json.purchaseDate) AS DATE) AS purchaseDate,
  STRING(deid_json.notes) AS notes
FROM deid

In the above example, we are calling a function called deidentify() that is contained in the remote_functions dataset. This function takes a JSON object as input where each field in the JSON object corresponds to a column name/value pair. The return from the function is itself a new JSON object containing the same name/value pairs, with the values replaced by their de-identified equivalents.

Since we are passing in a JSON object, we need to construct that object. The JSON object will have the form:

{
  "columnName": "columnValue",
  ...
}

In BigQuery, we can construct such an object in a few ways:

  • Using JSON_OBJECT, we can supply explicit name/value pairs. For example:

JSON_OBJECT(
  "name", name,
  "salary", salary,
  "ssn", ssn,
  "purchaseDate", purchaseDate,
  "notes", notes
)

  • Using TO_JSON, we can supply a STRUCT. For example:

TO_JSON(STRUCT(name, salary, ssn, purchaseDate, notes))

Next, let us look at the definition of the deidentify function:

CREATE OR REPLACE FUNCTION remote_functions.deidentify(values JSON)
RETURNS JSON
REMOTE WITH CONNECTION us.remote_conn1
OPTIONS (
  endpoint = 'https://us-central1-test1-305123.cloudfunctions.net/relay',
  user_defined_context = [
    ("project", "test1-305123"),
    ("template", "projects/test1-305123/locations/global/deidentifyTemplates/my_deidentify_1")
  ]
)

We declare the function as taking a single JSON object as input and returning a JSON object as output. We specify the BigQuery connection to use, and then the endpoint that the remote function should invoke when called. Finally, we specify a couple of user_defined_context properties; these are passed to the Cloud Run service with every call. An alternative would have been to pass them as additional arguments on every invocation, but that felt repetitive.
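
For context on what the Cloud Run service actually receives: BigQuery invokes a remote function's endpoint with an HTTP POST whose JSON body contains (among other fields) the userDefinedContext and a calls array, with one inner array of arguments per row in the batch; the service must respond with a replies array of the same length. Schematically (the exact encoding of each argument and reply follows the remote function data-encoding rules, and the de-identified values depend entirely on the template):

{
  "userDefinedContext": {
    "project": "test1-305123",
    "template": "projects/test1-305123/locations/global/deidentifyTemplates/my_deidentify_1"
  },
  "calls": [
    [ ...the JSON argument for the first row... ],
    [ ...the JSON argument for the second row... ]
  ]
}

and the response:

{
  "replies": [
    ...one de-identified JSON value per call...
  ]
}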

Of course, we need the code of the Cloud Run service. This can be found here.

Deploying to Cloud Run

To deploy our Cloud Run code, we can execute the build.sh script. This executes a gcloud run deploy command which submits the source to Cloud Build to construct a container and then registers it with Cloud Run.

#!/bin/bash
gcloud run deploy deidentify \
  --source=. \
  --platform=managed \
  --region=us-central1 \
  --no-allow-unauthenticated \
  --description="Invoke SDP De-identify for BigQuery" \
  --labels=app=deidentify \
  --max-instances=2 \
  --min-instances=0

Don’t forget to grant the service account that was created for your BigQuery connection permission to invoke the Cloud Run service (the Cloud Run Invoker role, roles/run.invoker).
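
As a sketch, the grant might look like the following; the service account address is a placeholder, and you can find the connection's actual service account on the connection details page in the BigQuery console:

gcloud run services add-iam-policy-binding deidentify \
  --region=us-central1 \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/run.invoker"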

De-identifying a parquet file

One of my original goals was to transform a parquet file found in Google Cloud Storage into a new, de-identified parquet file also written to Google Cloud Storage. Here is a recipe for using BigQuery as the engine to perform that transformation:

  1. Create an external table declaration pointing to the parquet source data.
  2. Export the external table data to a Google Cloud Storage bucket using the de-identification transform to de-identify the data.

# Create an external table pointing to the parquet data
CREATE OR REPLACE EXTERNAL TABLE `sdp_tests.data1_ext`
OPTIONS (
  format = "parquet",
  uris = ['gs://kolban-sdp-tests/data1/*']
);

# Export the table to parquet using the de-identified content
EXPORT DATA OPTIONS(
  uri = 'gs://kolban-sdp-tests/data1_deid/*',
  format = 'PARQUET',
  overwrite = true
) AS
WITH deid AS (
  SELECT remote_functions.deidentify(
    TO_JSON(STRUCT(name, salary, ssn, purchaseDate, notes))
  ) AS deid_json
  FROM `test1-305123.sdp_tests.data1_ext`
)
SELECT
  STRING(deid_json.name) AS name,
  INT64(deid_json.salary) AS salary,
  STRING(deid_json.ssn) AS ssn,
  CAST(STRING(deid_json.purchaseDate) AS DATE) AS purchaseDate,
  STRING(deid_json.notes) AS notes
FROM deid;

# Drop the external table
DROP EXTERNAL TABLE `sdp_tests.data1_ext`
