End-to-End Crypto Shredding (Part II): Data Deletion/Retention with Crypto Shredding

Parviz Deyhim
Sep 9 · 7 min read

In my previous post, I demonstrated how to leverage BigQuery’s AEAD encryption functions to achieve data deletion, also referred to as crypto-deletion. However, I limited that demonstration to data in BigQuery, and data rarely exists in only one system. What if we have to delete data from every system in our pipeline? How can we apply the same crypto-deletion strategy beyond BigQuery?

To set the stage, let’s make some assumptions about the use-case, requirements, and outcomes. Let’s assume that we’re ingesting data from external data sources and eventually storing that data in multiple places for consumption, such as:

- BigQuery for analytics purposes

- A fast lookup database, such as Cloud Datastore, for consumption by applications

In terms of business requirements, let’s assume that we have been asked to guarantee that, for a given record (identified by a user ID or any other unique ID), our pipeline deletes the data from every place where the record is stored.

To make this more concrete, let’s visualize our initial pipeline:

In the pipeline above, Pub/Sub ingests data in the form of an ID and a PAYLOAD. For simplicity, the payload is a string, but in reality the payload can consist of multiple fields, e.g. location, browser type, etc. Dataflow acts as a simple router that stores the data in Cloud Datastore and BigQuery. I realize that a real-world architecture and data model might be more complex than this example, but I kept it simple to convey the logic.

Given the architecture above, one way to satisfy the requirement of removing data from all systems is an automated workflow that deletes the data from each storage system individually. While this approach works, it faces a few challenges:

  • It’s common for data to be stored and replicated to a handful of systems. Deleting data from multiple heterogeneous systems can be complicated to implement, since each system needs its own logic and often its own code base.
  • Deletion can be inconsistent in the face of errors caused by systems that we do not control. In other words, a system can fail to delete our record or, worse, report success asynchronously while failing to delete in the background.
  • It can be inefficient in terms of performance and cost, depending on the data storage model. For example, one may have to scan the entire dataset to find the records that need to be deleted.

An ideal deletion process would be one that we fully control and that is simple to implement, efficient (in both cost and performance), and consistent. The rest of this post demonstrates how to achieve a process that meets those requirements.

In my previous post, we deleted data by leveraging encryption functions with the following pseudo logic:

  1. Encrypt a given record with a unique encryption key.
  2. Store the mapping of the record and its encryption key.
  3. To delete your record, simply delete the encryption key from the mapping table. This makes decryption impossible and the record useless (deleted).

I demonstrated the steps above in BigQuery using the AEAD functions. The great thing about BigQuery’s AEAD encryption functions is that their implementation is based on Tink, Google’s open-source cryptography library. That gives us the ability to encrypt and decrypt data both inside and outside of BigQuery, which is crucial for achieving end-to-end crypto-shredding across heterogeneous systems. Given that, here’s the revised pseudo logic for the end-to-end case:

  1. At the data ingest point, encrypt a given record with a unique encryption key.
  2. Store the mapping of the record and its encryption key in an external encryption-key mapping storage accessible by all consumers of the data.
  3. To consume a given record, the consumer reads the decryption key from the external encryption-key mapping storage.
  4. To delete a given record, the encryption key gets deleted from the external encryption-key mapping storage. Once the key is gone from the shared storage, no consumer can decrypt the record (provided consumers do not keep their own copies of keys or decrypted data).
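To make the four steps concrete, here is a minimal in-memory sketch. It is not the pipeline from this post: the class name `CryptoShredSketch` and its methods are hypothetical, a `ConcurrentHashMap` stands in for the shared key-mapping storage, and the JDK's built-in AES-GCM is swapped in for Tink so that the example runs with no dependencies. It also assumes each record ID is ingested only once.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical in-memory model of the four steps (JDK AES-GCM, not Tink). */
final class CryptoShredSketch {
    private static final int NONCE_BYTES = 12;
    private static final int TAG_BITS = 128;

    // Stand-in for the shared key-mapping storage.
    private final Map<String, SecretKey> keyStore = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Steps 1 and 2: encrypt under a fresh per-record key and store the mapping. */
    byte[] ingest(String recordId, String payload) {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            SecretKey key = generator.generateKey();
            keyStore.put(recordId, key); // assumption: each ID is ingested once

            byte[] nonce = new byte[NONCE_BYTES];
            random.nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            byte[] ciphertext = cipher.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            // Prepend the nonce so that the key alone is enough to decrypt later.
            return ByteBuffer.allocate(NONCE_BYTES + ciphertext.length)
                    .put(nonce).put(ciphertext).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Step 3: a consumer looks up the key; empty means the record was shredded. */
    Optional<String> consume(String recordId, byte[] stored) {
        SecretKey key = keyStore.get(recordId);
        if (key == null) {
            return Optional.empty();
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, stored, 0, NONCE_BYTES));
            byte[] plaintext = cipher.doFinal(stored, NONCE_BYTES, stored.length - NONCE_BYTES);
            return Optional.of(new String(plaintext, StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    /** Step 4: deleting the key is the deletion; the ciphertext stays where it is. */
    void shred(String recordId) {
        keyStore.remove(recordId);
    }

    public static void main(String[] args) {
        CryptoShredSketch pipeline = new CryptoShredSketch();
        byte[] stored = pipeline.ingest("user-42", "location=NYC");
        System.out.println(pipeline.consume("user-42", stored)); // Optional[location=NYC]
        pipeline.shred("user-42");
        System.out.println(pipeline.consume("user-42", stored)); // Optional.empty
    }
}
```

Note that `shred` never touches the stored ciphertext. That is the point of the pattern: every copy of the encrypted record, wherever it was replicated, becomes unreadable at the same moment.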

Let’s apply our pseudo logic above to our initial architecture:

The new architecture has a number of new components and integration points:

Tink Encryptor is a Cloud Dataflow step that reads data from Pub/Sub and then either generates a new key for a record that it has not seen before or retrieves the existing encryption key if the record is a repeat. Cloud Bigtable acts as the shared storage for storing new keys and retrieving existing ones. Key generation, encryption, and decryption are handled by the Tink library embedded in the Cloud Dataflow logic. Here’s a simple example:

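    import com.google.crypto.tink.Aead;
    import com.google.crypto.tink.CleartextKeysetHandle;
    import com.google.crypto.tink.JsonKeysetReader;
    import com.google.crypto.tink.JsonKeysetWriter;
    import com.google.crypto.tink.KeysetHandle;
    import com.google.crypto.tink.aead.AeadFactory;
    import com.google.crypto.tink.aead.AeadKeyTemplates;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.security.GeneralSecurityException;

    // Assumes AeadConfig.register() has been called once at startup.
    // Tink treats "password" as associated data: authenticated, but not secret.
    private static byte[] encrypt(String payload, String key, String password)
        throws GeneralSecurityException, IOException {
      KeysetHandle keysetHandle = CleartextKeysetHandle.read(JsonKeysetReader.withString(key));
      Aead aead = AeadFactory.getPrimitive(keysetHandle);
      return aead.encrypt(
          payload.getBytes(StandardCharsets.UTF_8), password.getBytes(StandardCharsets.UTF_8));
    }

    // Generates a new AES128-GCM keyset and returns it as cleartext JSON.
    private String generateKey() throws GeneralSecurityException, IOException {
      ByteArrayOutputStream stream = new ByteArrayOutputStream();
      KeysetHandle keysetHandle = KeysetHandle.generateNew(AeadKeyTemplates.AES128_GCM);
      CleartextKeysetHandle.write(keysetHandle, JsonKeysetWriter.withOutputStream(stream));
      return new String(stream.toByteArray(), StandardCharsets.UTF_8);
    }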
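The "new key for an unseen record, existing key for a repeated record" behavior can be sketched as a get-or-create lookup. This is only an illustration under stated assumptions: `KeyRegistrySketch` and `keyFor` are hypothetical names, an in-memory map replaces Bigtable, and a raw JDK AES key replaces the Tink keyset.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

/** Hypothetical get-or-create key lookup; an in-memory map stands in for Bigtable. */
final class KeyRegistrySketch {
    private final Map<String, SecretKey> keysByRecordId = new ConcurrentHashMap<>();

    /** Returns the existing key for recordId, or atomically creates and stores one. */
    SecretKey keyFor(String recordId) {
        // computeIfAbsent runs the generator at most once per ID, even under
        // concurrent calls, so a repeated record always gets the same key.
        return keysByRecordId.computeIfAbsent(recordId, id -> newAesKey());
    }

    private static SecretKey newAesKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (NoSuchAlgorithmException e) {
            // AES key generation is a required algorithm on every Java platform.
            throw new IllegalStateException("AES key generation unavailable", e);
        }
    }
}
```

The atomicity matters. If two workers each generated a key for the same ID and both wrote it, records encrypted under the losing key would become undecryptable. With a real shared store, the same guarantee would need a conditional write (Bigtable offers check-and-mutate operations for this kind of case); how the pipeline in this post handles that race is not something the example above shows.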

The Dataflow router from our initial diagram stays the same, but it now routes the encrypted data instead of the original record.

To decrypt a record for processing, the consumers of the records, namely our application and BigQuery, must retrieve the encryption key from Bigtable. Our application does this by calling the Bigtable API directly.

For BigQuery to retrieve the decryption key, we’re leveraging BigQuery’s support for federated access to Bigtable. The decryption process in BigQuery involves joining the encrypted records with the mapping table in Bigtable. We’ll look at an example shortly.

Finally, to delete one or more records from all of our systems (BigQuery and Cloud Datastore in this example), we simply delete the corresponding key(s) from the mapping in Bigtable, which makes decrypting those records impossible.

Decryption in BigQuery

Let’s walk through an example to make the BigQuery decryption process more concrete.

In BigQuery, we’ll have two tables:

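One (dataset.encrypted_table in the query below) holds the encrypted data routed to BigQuery by our Dataflow router, with an id column and an encrypted_payload column.

The other (dataset.bigtable_table in the query below) is an external table that points to our key-mapping table in Bigtable, where the row key is the record ID and the key column in column family cf1 holds the keyset.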

To decrypt data in BigQuery, we join these two tables. The join lets us pass both the decryption key and the encrypted_payload to BigQuery’s AEAD.DECRYPT_STRING function. Because it is an inner join, a record whose key has been deleted from Bigtable has no matching row and simply drops out of the result:

    SELECT
      data.id AS ID,
      SAFE.AEAD.DECRYPT_STRING(
        KEYS.KEYSET_FROM_JSON(mapping.cf1.key.cell.value),
        FROM_BASE64(data.encrypted_payload),
        "some_password") AS payload
    FROM
      dataset.encrypted_table data
    JOIN
      dataset.bigtable_table mapping
    ON
      mapping.rowKey = data.id
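Two details of the query are easy to miss. The payload is Base64-decoded before decryption, and the third argument ("some_password") is AEAD associated data that must match the value used at encryption time. The sketch below illustrates both. It uses the JDK's AES-GCM with a hypothetical class `AssociatedDataSketch`, and its ciphertext layout is not Tink's wire format, so BigQuery could not read its output. It only mirrors the shape of the operation.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical illustration of Base64 transport plus AEAD associated data. */
final class AssociatedDataSketch {
    private static final int NONCE_BYTES = 12;
    private static final int TAG_BITS = 128;

    /** Encrypts the payload, binds it to associatedData, and Base64-encodes the result. */
    static String encryptToBase64(SecretKey key, String payload, String associatedData) {
        try {
            byte[] nonce = new byte[NONCE_BYTES];
            new SecureRandom().nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            cipher.updateAAD(associatedData.getBytes(StandardCharsets.UTF_8));
            byte[] ciphertext = cipher.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            // Layout: nonce followed by ciphertext and authentication tag.
            byte[] stored = new byte[NONCE_BYTES + ciphertext.length];
            System.arraycopy(nonce, 0, stored, 0, NONCE_BYTES);
            System.arraycopy(ciphertext, 0, stored, NONCE_BYTES, ciphertext.length);
            return Base64.getEncoder().encodeToString(stored);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Returns the payload, or null if the key or associated data does not match. */
    static String decryptFromBase64(SecretKey key, String encoded, String associatedData) {
        byte[] stored = Base64.getDecoder().decode(encoded);
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, stored, 0, NONCE_BYTES));
            cipher.updateAAD(associatedData.getBytes(StandardCharsets.UTF_8));
            byte[] plaintext = cipher.doFinal(stored, NONCE_BYTES, stored.length - NONCE_BYTES);
            return new String(plaintext, StandardCharsets.UTF_8);
        } catch (AEADBadTagException e) {
            return null; // authentication failed: wrong key or wrong associated data
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

Returning null on an authentication failure is meant to resemble what the SAFE. prefix does in the query, which, as far as I understand it, yields NULL for a row that fails to decrypt instead of failing the whole query. Despite the name "some_password", the associated data is not a secret and adds no confidentiality; it only ties the ciphertext to a context.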

Important Considerations

Having demonstrated, at a high level, how to achieve end-to-end crypto-shredding, I’d like to point out a few important considerations:

Use this pattern for crypto-shredding only: While this logic uses encryption functions to achieve crypto-shredding, it will NOT satisfy use-cases that require encryption for privacy and security, because of the limitations I describe here. BigQuery’s AEAD functions only accept encryption/decryption keys passed in as STRING or BYTES values. That means the key stored in the external mapping storage must eventually be handed to BigQuery in one of those forms, so we CANNOT encrypt our encryption keys in Bigtable. In other words, in our example, Bigtable stores the AEAD encryption/decryption keys as plain-text JSON. And while one can restrict access to Bigtable with GCP IAM policies, privacy and security use-cases call for the key itself to be stored encrypted. An ideal solution that satisfies both crypto-shredding and encryption for privacy/security would store the encryption key in a key management service such as Cloud KMS, and BigQuery would retrieve the key from Cloud KMS directly instead of receiving it as a STRING or BYTES value. I would like to see something similar to this implemented in BigQuery:

    SELECT
      data.id AS ID,
      SAFE.AEAD.DECRYPT_STRING(
        KEYS.KEYSET_FROM_KMS(mapping.kms_key_id, kms_instance_name),
        FROM_BASE64(data.encrypted_payload),
        "some_password") AS payload
    FROM
      dataset.encrypted_table data
    JOIN
      dataset.bigtable_table mapping
    ON
      mapping.rowKey = data.id

Note: KEYS.KEYSET_FROM_KMS is just an illustration of what I would like to see; it is not an actual BigQuery function.

Fine-tune your architecture: The architecture and example logic in this post are for demonstration purposes. Think carefully about how to architect your own solution given your technical needs and requirements. Examples include the data model and the choice of external storage for the record-to-key mapping, as well as how to implement the Tink encrypt/decrypt/key-generation logic in both Dataflow and the consuming applications.

Crypto-shredding is an efficient way to delete data permanently and consistently without sacrificing performance, and it avoids more complicated approaches such as DML statements in BigQuery or high-frequency delete API calls to other parts of our data pipeline. Simply deleting the encryption key makes every record encrypted with that key impossible to decrypt anywhere in our pipeline. In addition, by using the Tink library, one can apply similar logic not only to GCP-native tools (BigQuery, Datastore, etc.), but also to many other tools and frameworks such as Apache Kafka, Apache Spark, and others.

I would love to hear from you if, after reading this article, you have comments on the applicability of this pattern. Does this pattern make sense? Please feel free to contact me at: @pdeyhim

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Parviz Deyhim

Data lover and cloud architect. Ex-AWS, ex-Databricks, and now a Googler.
