Ensuring GDPR Compliance on GCP BigQuery: Efficiently Managing the Right to Be Forgotten

Florian Trehaut
13 min read · May 10, 2023


Are you currently facing challenges implementing GDPR on your GCP or AWS data platform? I specialize in helping businesses achieve compliance, so feel free to connect with me on LinkedIn for a free consultation.

Imagine a world where, to permanently remove all personal data of a customer across your entire GCP platform, you only had to delete ONE row in ONE table, and the anonymization propagated, free of charge, to all other tables. It would be great, wouldn’t it? Well, it’s actually easier than you think. Let’s walk through it together.

Illustration image. DALL-E generated

Introduction: What are we talking about?

In today’s data-driven world, GDPR compliance is crucial for businesses handling personal data. In this article, we’ll explore an efficient solution for implementing the Right to Be Forgotten (RTBF) on Google Cloud Platform’s BigQuery, an analytics database.

If you’re reading this article, it’s likely that you already have an anonymization process in place to comply, and it’s also likely that it consists of going through every table and every column containing Personally Identifiable Information (PII), then deleting or hashing the values when needed. That definitely does the job, but it has some flaws, and there are far more efficient methods.

But wait, first, WTF is RTBF?

The General Data Protection Regulation (GDPR) is a comprehensive data privacy regulation implemented by the European Union (EU) in 2018. It aims to protect the personal data of EU citizens and provide them with more control over their information. Companies that handle the personal data of EU residents, regardless of their location, must comply with GDPR or face significant penalties.

One of the key principles of GDPR is the Right to Be Forgotten (RTBF), also known as the right to erasure. This principle allows individuals to request the deletion of their personal data from a company’s records under certain conditions, such as when the data is no longer necessary for the original purpose, or when the individual withdraws their consent. The RTBF is important because it empowers individuals to maintain their privacy and control over their data in the digital age.

Fulfilling the Right to Be Forgotten can be a complex process for organizations, especially those dealing with vast amounts of data stored in various formats and locations. This challenge is particularly relevant for companies that use analytics databases, such as Google Cloud Platform’s (GCP) BigQuery, where personal data may be scattered across multiple tables and datasets. Efficiently managing the RTBF in such environments is crucial not only for ensuring GDPR compliance but also for maintaining the trust and confidence of customers and stakeholders.

Big picture: Implementing RTBF on BigQuery

We will focus on implementing the Right to Be Forgotten (RTBF) in the context of Google Cloud Platform’s (GCP) BigQuery, a powerful and scalable analytics database. As businesses increasingly rely on BigQuery to process and analyze large volumes of data, ensuring GDPR compliance becomes an essential task. Other parts of GDPR won’t be covered here yet.

Our primary goal is to present an efficient solution for managing the RTBF on GCP BigQuery by utilizing encryption keys for each individual’s data. We will explore the architecture and rationale behind this approach, as well as the processes involved in encrypting personal data and handling RTBF requests. Don’t worry, it’s not that cumbersome, and most importantly, everything works natively in BigQuery, so you will never have to code in any language other than SQL!

By the end of this article, you will gain valuable insights into an effective method for complying with GDPR’s RTBF requirements while using GCP BigQuery for your data-driven projects. This knowledge will help your organization not only maintain compliance with privacy regulations but also enhance the trust and confidence of your customers and stakeholders.

But why not just hash or delete PII in each table when requested?

I don’t recommend going through each table and column containing personally identifiable information (PII) to delete or hash the data when there is a better alternative, for several reasons:

  1. Time-consuming and complex process: Navigating through all tables and columns containing PII in a large-scale analytics database like BigQuery can be time-consuming and complex. The process often involves identifying the relevant tables, locating the specific columns, and carefully handling the data, which can be prone to human error and oversight.
  2. Impact on data integrity: Deleting or hashing PII data directly in the database can lead to data integrity issues. For example, if a user is associated with various actions or events in the database, removing or altering their PII might disrupt the relationships between data points, resulting in inaccurate analyses or reports.
  3. Difficulty in maintaining data consistency: If data is spread across multiple tables and columns, deleting or hashing PII can be challenging, as you need to ensure consistency across all instances of the PII in the database. This process can become particularly complicated when dealing with frequently updated or large datasets.
  4. Doesn’t implement access restriction: OK, this one is not really a downside of the method itself, but note that if you want to restrict access to PII, you will have to develop a completely different process. If you don’t, in case of a data breach, customer data is exposed and can easily be extracted.

Given these challenges, seeking a more efficient and reliable alternative for managing the RTBF on BigQuery is advisable. Implementing a solution that utilizes encryption keys for each individual’s data can help address these concerns and ensure GDPR compliance without compromising data integrity, scalability, or performance.

Overview of the proposed solution

In this section, we will describe the architecture of my proposed solution for implementing RTBF on BigQuery. This approach involves the use of separate GCP projects and BigQuery tables, which enhances both security and manageability.

Note that for this example, my platform is structured in a data mesh way, which implies using separate projects for data lakes (raw data extracted from sources), the data warehouse (processed and deduplicated data, with use-case-agnostic rules) and data marts (specific business KPIs, also serving as an exposition layer), but you can adapt it to your own platform. I strongly advise isolating the GDPR project.

Also note that for this example, the whole process is implemented as BigQuery SQL queries, using the dbt framework, but it would be the same process with any SQL framework, and of course it’s also doable in a programming language like Python, Java, Go…

Oh, and to make this fun, let’s say that your way of identifying a customer is itself a PII that also needs to be anonymized.

But first, what does it look like?

Technical architecture

Looks scary, right? Don’t worry, we will go through it step by step.

The architecture can be broken down into the following components:

  1. Data GCP DLK Project (in blue): This is the project where data is first collected and stored, unencrypted, in a first layer of partitioned tables with data expiration (so we don’t have to deal with data anonymization there; the expiration must be less than one month). If access restriction is critical for your business, there is a way to never store unencrypted data in the first place, using a stateful Dataflow job, but it won’t be covered in this article (maybe in a future one?). It’s also in this project that encryption takes place, in a second layer of tables where the data is stored encrypted. Every PII, or part of a PII, is encrypted: first name, last name, birth date, address… I strongly advise encrypting most of them, even if you think a column is not “really” a PII because it doesn’t identify a customer on its own; combined with others, it does. Whatever you choose not to encrypt must be communicated to and validated with your DPO for business purposes (for example: birth date, country, city…). A DDL sketch of this landing layer and of the keyset table from point 2 follows this list.
  2. GDPR GCP Project (in grey): This is the project where encryption keys are generated and stored, one encryption key per individual. Because we assume that the only thing identifying an individual is a PII, we also generate a UUID for each individual, which is used from now on as the only way to identify them. This article will not cover how to restrict access for unauthorized users, since we focus on the RTBF part, but it will be covered in a future article, stay tuned! It’s also where we store anonymization requests (I assume you already receive those, including retention-period expirations).
  3. Data GCP DWH Project (in green): The easiest part; you just process your data as usual (adding a decryption step) and make sure to never store the decrypted result.
  4. Data GCP UC Project (in red): Your exposition layer (if you have one). Here you allow your authorized users / dashboards to transparently read PII data just by running a normal SELECT, and let the decryption happen under the hood (yes, I mean it, your users will just run a normal SELECT statement!).
  5. Example data tables at each step (on the right): At each step you can see an example of the data stored in a BigQuery table. When a field is shown in red, it means a value is still stored but can no longer be decrypted.
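To make points 1 and 2 concrete, here is a minimal DDL sketch, assuming illustrative project, dataset, table and column names (adapt them to your own platform): a clear landing table whose partition expiration guarantees that unencrypted PII disappears on its own in under a month, and the per-individual keyset table living in the isolated GDPR project.

-- Clear landing table in the DLK project (illustrative names).
-- The partition expiration deletes unencrypted PII automatically, here after 25 days.
CREATE TABLE IF NOT EXISTS `data-dlk-project.raw.contacts_clear`
(
  email STRING,
  lastname STRING,
  ingestion_timestamp TIMESTAMP
)
PARTITION BY DATE(ingestion_timestamp)
OPTIONS (partition_expiration_days = 25);

-- Keyset table ("VAULT") in the isolated GDPR project: one row per individual.
CREATE TABLE IF NOT EXISTS `gdpr-project.vault.keysets`
(
  uuid STRING,
  email STRING,
  aead_key BYTES,
  is_anonymized BOOL,
  anonymization_reason STRING,
  anonymization_timestamp TIMESTAMP,
  ingestion_timestamp TIMESTAMP
);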

The benefits of using this architecture include:

  1. Enhanced security through the segregation of encryption keys and personal data, reducing the risk of unauthorized access.
  2. Simplified management of encryption keys in a separate GCP project, allowing for better control and organization.
  3. Streamlined compliance with GDPR’s Right to Be Forgotten by enabling the deletion of an individual’s encryption key, effectively anonymizing their data across your entire platform, as long as you only store encrypted data.

Data flow: What does my data look like at each step?

This should be less scary, because the actual data process is pretty simple. Let’s dive in, again step by step.

I. First integration of an individual into the platform:

  1. A row arrives in our platform, for example: ftrehaut@pm.me, Tréhaut
  2. The row is stored in clear text in a DLK table, with partition-level expiration (rows in a partition are automatically deleted after a defined amount of time, less than one month for RTBF compliance)
  3. Daily, the row is processed: we first check in the keyset table whether the email ftrehaut@pm.me exists. In our case it doesn’t, so we generate two things and store them in the key vault table:
    a. A UUID (identifies the individual)
    b. A keyset
  4. With the newly generated keyset, we encrypt all PII (including the email) and store the row in a DLK table along with the UUID. From now on, individuals are identified by their UUID, and keys are looked up by UUID as well.

II. Second integration of the individual into the platform:

  1. A second row arrives: ftrehaut@pm.me, Tréhaut
  2. The row is stored in clear text in a DLK table, with partition-level expiration (rows in a partition are automatically deleted after a defined amount of time)
  3. Daily, the row is processed: we first check in the keyset table whether the email ftrehaut@pm.me exists. In our case it does exist, so we retrieve the UUID and keyset (a MERGE sketch of this lookup-or-create step follows this list)
  4. With the retrieved keyset, we encrypt all PII (including the email) and store the row in a DLK table along with the UUID. From now on, individuals are identified by their UUID, and keys are looked up by UUID as well.
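As a concrete illustration of step 3 in both scenarios, here is a hedged sketch of a single MERGE that only creates a UUID and keyset when the email is not yet in the vault, reusing the illustrative table names from the DDL sketch above:

-- Look up or create one keyset per individual (illustrative names, adapt to your platform).
MERGE `gdpr-project.vault.keysets` AS vault
USING (
  SELECT DISTINCT email
  FROM `data-dlk-project.raw.contacts_clear`
  WHERE DATE(ingestion_timestamp) = CURRENT_DATE()
) AS incoming
ON vault.email = incoming.email
WHEN NOT MATCHED THEN
  INSERT (uuid, email, aead_key, is_anonymized, ingestion_timestamp)
  VALUES (
    GENERATE_UUID(),
    incoming.email,
    KEYS.NEW_KEYSET('DETERMINISTIC_AEAD_AES_SIV_CMAC_256'),
    FALSE,
    CURRENT_TIMESTAMP()
  )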

III. Anonymization:

  1. An RTBF request is received from ftrehaut@pm.me
  2. The corresponding row in the key vault table is updated:
    a. The UUID is kept
    b. The email is removed
    c. The keyset is removed
  3. After the defined delay, the DLK table with partition expiration loses the row containing unencrypted data
  4. All encrypted PII is now unrecoverable. The individual is successfully anonymized in all tables

IV. The individual interacts with the brand again:

  1. Because we can no longer find their email in the keyset table, we are back to scenario I

So, not that hard, right? There are also concerns about potential backups, but hey, it’s a “quick” start; I will cover that in a later article.

I’ve talked about encryption a lot, but I haven’t yet explained how, with which algorithm, and why. Again, it’s simpler than you think.

Encrypt all the things!

First, I promised you three things:

  1. It’s natively handled by BigQuery (meaning you can run the entire process with SQL queries alone if you don’t want to rely on external modules)
  2. It can also be handled in Python, Go, Java…
  3. It’s really simple

Let’s go! But first, let’s talk about the algorithm. We will keep it simple, and you can skip this part if you like.

Algorithm chosen: strong and deterministic

DETERMINISTIC_AEAD_AES_SIV_CMAC_256 is a deterministic authenticated encryption with additional data (AEAD) algorithm that provides strong encryption and authentication. It combines the Advanced Encryption Standard (AES) with the Synthetic Initialization Vector (SIV) mode and uses a Cipher-based Message Authentication Code (CMAC) with a 256-bit key. Let’s break down the components and benefits of this encryption method:

  1. Deterministic AEAD: Deterministic AEAD encryption algorithms produce the same ciphertext for the same input plaintext and associated data, as long as the same key is used. This property is useful for scenarios where duplicate data detection is essential, such as database indexing or deduplication, without revealing the actual content of the data.
  2. AES: The Advanced Encryption Standard is a widely-used symmetric encryption algorithm that has been extensively analyzed and is considered highly secure. It supports key lengths of 128, 192, or 256 bits, with the 256-bit version offering the highest level of security.
  3. SIV Mode: The Synthetic Initialization Vector mode is a nonce-misuse resistant mode of operation that mitigates the risks associated with using a non-unique nonce (number used once). SIV mode generates a deterministic initialization vector based on the input data, eliminating the need for a unique nonce and reducing the risk of nonce reuse.
  4. CMAC: Cipher-based Message Authentication Code is a cryptographic primitive used for message authentication. It ensures data integrity and verifies the authenticity of the encrypted data by generating a unique tag based on the input data and the encryption key.

We will not take full advantage of this method because it would be overkill. You don’t need to retain what I wrote above (I probably won’t either), but you do need to retain what follows: why this one?

  • Strong security: The use of AES with a 256-bit key provides a high level of security, making it suitable for protecting sensitive personal data.
  • Duplicate data detection: The deterministic nature of the algorithm allows for database indexing and deduplication while maintaining data privacy.
  • Nonce-misuse resistance: The SIV mode mitigates the risks associated with non-unique nonces, reducing the chances of encryption vulnerabilities.
  • Data integrity and authenticity: The CMAC component ensures the encrypted data remains intact and can be verified as authentic.

But to be fair, the first two points are the most relevant. I insist on the benefit of deterministic encryption, as it allows you to efficiently implement incremental SQL queries (merges), which can save you a lot of time and money.
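To make that last point concrete, here is a hedged sketch of an incremental load (table names are illustrative assumptions): because the same plaintext, key and additional data always produce the same ciphertext, you can match existing rows directly on the encrypted column, without ever decrypting anything.

-- Hypothetical incremental load into the encrypted layer: matching on the ciphertext
-- itself is only possible because the encryption is deterministic.
MERGE `data-dlk-project.encrypted.contacts` AS target
USING todays_encrypted_batch AS source
ON target.encrypted_email = source.encrypted_email
WHEN MATCHED THEN
  UPDATE SET encrypted_lastname = source.encrypted_lastname
WHEN NOT MATCHED THEN
  INSERT ROW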

How does it work on BigQuery?

I’m happy to tell you that the complicated and boring stuff is over. The implementation is way simpler than you think. It takes three steps:

  1. Generate a keyset for each individual
  2. Encrypt personal data with the generated keyset
  3. Decrypt personal data with the generated keyset

We will use BigQuery’s native AEAD functions; the documentation can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/aead_encryption_functions

OK, let’s generate our keysets, which are stored in our GDPR project:

WITH contacts AS (
  SELECT
    "ftrehaut@pm.me" AS email,
    "trehaut" AS lastname
)

SELECT -- Let's name this table VAULT
  GENERATE_UUID() AS uuid,
  contacts.email,
  KEYS.NEW_KEYSET('DETERMINISTIC_AEAD_AES_SIV_CMAC_256') AS aead_key,
  FALSE AS is_anonymized,
  CAST(NULL AS STRING) AS anonymization_reason,
  CAST(NULL AS TIMESTAMP) AS anonymization_timestamp,
  CURRENT_TIMESTAMP() AS ingestion_timestamp
FROM contacts

Yep, it’s that easy. Let’s keep going by encrypting our PII. The results will be stored in a Data project:

WITH contacts AS (
  SELECT
    "ftrehaut@pm.me" AS email,
    "trehaut" AS lastname
)

SELECT -- Let's name this table ENCRYPTED_DATA
  VAULT.uuid,
  SAFE.DETERMINISTIC_ENCRYPT(VAULT.aead_key, email, VAULT.uuid) AS encrypted_email,
  SAFE.DETERMINISTIC_ENCRYPT(VAULT.aead_key, lastname, VAULT.uuid) AS encrypted_lastname
FROM contacts
LEFT JOIN VAULT USING (email)

And now, the decryption part:

WITH encrypted_data AS (
  SELECT
    "some-uuid" AS uuid,
    b"ENCRYPTED_DATA" AS encrypted_email, -- Of course your encrypted data won't look like this
    b"ENCRYPTED_DATA" AS encrypted_lastname
)

SELECT -- This is the decryption query; the exposition layer typically runs it (or wraps it in a view, see below)
  uuid,
  SAFE.DETERMINISTIC_DECRYPT_STRING(VAULT.aead_key, encrypted_email, uuid) AS email,
  SAFE.DETERMINISTIC_DECRYPT_STRING(VAULT.aead_key, encrypted_lastname, uuid) AS lastname
FROM encrypted_data
LEFT JOIN VAULT USING (uuid)
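To deliver on the “your users just run a normal SELECT” promise from the architecture overview, that decryption query can be wrapped in a view in the exposition project. Here is a hedged sketch, with illustrative project, dataset and table names:

-- Authorized users / dashboards query this view like any other table; decryption
-- happens under the hood, and anonymized individuals simply come back as NULL.
CREATE OR REPLACE VIEW `data-uc-project.exposition.contacts` AS
SELECT
  uuid,
  SAFE.DETERMINISTIC_DECRYPT_STRING(vault.aead_key, encrypted_email, uuid) AS email,
  SAFE.DETERMINISTIC_DECRYPT_STRING(vault.aead_key, encrypted_lastname, uuid) AS lastname
FROM `data-dlk-project.encrypted.contacts`
LEFT JOIN `gdpr-project.vault.keysets` AS vault USING (uuid);

-- Then, for your users, reading PII is just:
-- SELECT email, lastname FROM `data-uc-project.exposition.contacts`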

There are a few questions you may be asking yourself:

  1. Why the uuid as the third parameter of the encryption/decryption functions? Honestly? No particular reason; you have to put a fixed value there as additional data, and this was good enough for me.
  2. Why add “SAFE.” before the functions? Because you don’t want your process to fail when a value cannot be decrypted, and it WILL happen, since you will erase decryption keys when requested to (see the illustration below).
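A minimal illustration of point 2, with placeholder values: without the SAFE. prefix, a ciphertext that cannot be decrypted raises an error and fails the whole query; with SAFE., it just yields NULL.

SELECT
  SAFE.DETERMINISTIC_DECRYPT_STRING(
    KEYS.NEW_KEYSET('DETERMINISTIC_AEAD_AES_SIV_CMAC_256'), -- a key that never encrypted this value
    b"not-a-real-ciphertext",                               -- placeholder bytes
    "some-uuid"
  ) AS email -- returns NULL instead of raising an error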

How does it work in Python?

It’s the same process. I will assume that you are able to retrieve your existing keysets yourself and generate one only when the individual is new to your database.

We will use the Tink library from Google. Tink is currently production-ready for the following languages: Java, C++, Obj-C, Go and Python. Let’s focus on Python, as this is most likely the language you are currently using. I will use Python 3.9, but Tink is also available for Python 3.7 and 3.8.

pip3 install tink

The steps are the same as in SQL: generate a keyset for each individual, encrypt personal data with it, and decrypt with it when reading. The Python/Go/Java implementation won’t be covered in detail here, but you can find the documentation at this link: https://github.com/google/tink

Anonymization process

Now that we have successfully generated and stored keysets, and encrypted all our data, anonymization is blazingly simple: it only consists of a daily/weekly/monthly update of your keyset table. Here’s an example:

UPDATE VAULT
SET
  email = NULL,
  aead_key = NULL,
  is_anonymized = TRUE,
  anonymization_reason = "Your reason", -- Is this a request or a retention expiration?
  anonymization_timestamp = CURRENT_TIMESTAMP()
FROM ANONYMIZATION_REQUESTS -- I made up this one, I assume you have a way to get those requests
WHERE VAULT.email = ANONYMIZATION_REQUESTS.email
AND VAULT.ingestion_timestamp <= ANONYMIZATION_REQUESTS.requested_date -- We don't want to erase a customer who came back
AND NOT VAULT.is_anonymized
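The ANONYMIZATION_REQUESTS table is left up to you; as a hedged sketch (illustrative names), its minimal shape only needs the email, the request date and a reason:

CREATE TABLE IF NOT EXISTS `gdpr-project.vault.anonymization_requests`
(
  email STRING,
  requested_date TIMESTAMP,
  reason STRING -- e.g. user request or retention-period expiration
);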

Conclusion and next steps

That’s it! You now have a solid foundation to adapt this process to your use case. Of course, you need to handle your service account rights to allow your automated processing to run, but because architectures differ a lot between companies, that will not be covered here.

We only covered the RTBF here, but what about fine-grained access restriction? What about governance? What about tools that let you verify that you didn’t store unencrypted PII? What about other GDPR rights? See this article as a foundation, the first layer of what you need to do ASAP if you don’t yet comply with the current rules. A complete solution covering all these aspects exists and will be covered in future articles, so stay tuned! In the meantime, do not hesitate to contact me with questions, corrections and, of course, any helpful criticism. This is my first Medium article (not my last) and I’m willing to improve!

You can find my LinkedIn here : https://www.linkedin.com/in/florian-trehaut/
