BigQuery Encryption Functions — Part I: Data deletion/retention with Crypto Shredding

Parviz Deyhim
Google Cloud - Community
4 min readMay 23, 2019
Image result for bigquery

There are a number of features that set BigQuery apart from other data warehouses — large-scale streaming ingest, automatic data archival without performance penalties, and integrated machine learning functions are just a few. Most recently, we released BigQuery encryption functions which enable a broad and important set of abilities, including what we’ll discuss today: data deletion and retention using crypto-shredding.

Why might you want to delete data from BigQuery? Many businesses operate within regulations around data retention — for example, GDPR’s “right to be forgotten” clause which stipulates that a user’s data should be deleted when requested. More specifically, what we’d like to accomplish is this scenario:

  1. User data is stored in multiple BigQuery tables
  2. A compliance requirement states that a specific user’s data should be deleted from all tables upon request

So let’s quickly review two common patterns that let you delete data from BigQuery:

DML Delete

Traditionally, this simple and well-known pattern enables you to delete rows of data. The following DELETE query example comes from a very useful blog my colleague wrote on performing large scale mutations in BigQuery:

The idea here is very simple: use a delete DML statement to remove a range of rows that matches a specific condition.

Create or replace a table

This pattern is also very simple: we’re just recreating the table from the result of the same table. In this case, we’re replacing the table by filtering the rows we are required to delete. And since the operation is atomic, we’ll never have a case where the replace operation causes inconsistent results.

For the majority of use cases, the two approaches above should suffice. However, given the initial scenario in which the user’s data spans multiple tables, we need to apply the above delete patterns to multiple tables. This can be inconvenient, error prone, or cause security concerns, particularly if our delete operation fails to delete from some tables but not from all. It’s also potentially less cost-effective if the delete operation is running on a large dataset.

One approach that allows us to delete data from all tables at once is crypto-shredding. This technique helps us delete specific data from all tables via the following steps:

  1. Your encryption and decryption keys are created using BigQuery encryption functions.
  2. Your data is encrypted using that encryption key.
  3. Your decryption key is stored safely and is leveraged for decryption.
  4. Once the data in question is deleted, its decryption key is also deleted.
  5. The following example demonstrates the steps above using BigQuery encryption functions.

First, we create a table that has a unique key per user.

Next, we’ll use the keys created in the previous step to encrypt and store the user’s email address.

The critical step is to ensure that all user data, in whichever table, is encrypted using that user’s individual key. In order to query the encrypted table, we’ll use the AEAD.DECRYPT_STRING() function.

Now if we want to delete a user from our system, we simply delete the encryption key for that specific user.

Let’s summarize what happened here:

Note: Although they’re out of scope for the purposes of this post, you’ll need to address some considerations when applying this pattern: garbage collection of unusable encrypted data after key deletion, key rotation, and the impact to read queries, for example.

Encrypting data in BigQuery allowed us to quickly delete all of a user’s data by deleting the encryption key (since we can never decrypt the user’s data without it). But let’s take this a step further: what if you want to ensure data is deleted not only in BigQuery but anywhere where data is stored in your data processing pipeline?

In the next blog post, we’ll walk you through an example where you can apply crypto-shredding across all of your data processing components by encrypting data outside of BigQuery, and yet be able to decrypt data after it lands in BigQuery. For the time being I would love to hear from you: pdeyhim@ on Twitter. Also check out this clever trick using BigQuery encryption functions from our developer advocate Felipe Hoffa: does my account have access to BigQuery

--

--

Parviz Deyhim
Google Cloud - Community

Data lover and cloud architect @databricks (ex-google, ex-aws)