The Right to Be Forgotten (GDPR) framework on an AWS data lake

Pritam Pan
Sep 19, 2022


Background

In 2006, British mathematician Clive Humby said, “Data is the new oil”. Fast-forward to the present day, and it is clear that data really is the new oil, and that we need to handle it responsibly.

To protect user/customer privacy, the European Union adopted the General Data Protection Regulation (GDPR) in 2016, and many other jurisdictions have taken a similar stand on user data security and privacy (CCPA, LGPD, POPI, etc.). Since then, the safety of user data has never been more important.

Every company nowadays is a data-driven company (or at least trying to be one) and processes huge amounts of data to support business decisions. User data privacy and security should be their utmost priority.

Introduction

Becoming compliant with all these privacy rules is never easy and requires a huge amount of effort from a company. In this blog post, I will focus only on the “Right to be Forgotten” (RTF) framework concept for a next-generation data lake on the AWS cloud.

For the implementation, I used Apache Spark’s Python API (PySpark) running on AWS EMR to develop the configuration-driven RTF data pipeline, and Apache Airflow to schedule it. Delta Lake was used to ensure ACID transactions on AWS S3, and Python was used to glue it all together.

Design & Implementation

The design goal was to build a framework with the following features:

  • Easily extendable to support any type of sink (storage) in the future.
  • The framework should be configuration-driven, so that onboarding a new table/data entity (e.g. a Snowflake table or an AWS S3 path) can be done very easily without touching the code.
  • Execution should be controllable per storage system and per entity; in other words, on a given day we should be able to execute the RTF process for certain entities against one or multiple sinks/storage systems.
  • It should be idempotent and deterministic in nature, producing the same result every time it runs or re-runs.
  • If required, it should be able to run the RTF process for all the RTF user IDs received to date.
  • It should maintain an audit log for future traceability, so that we can easily pinpoint each change made in our data lake for any user’s RTF request.
Overview of RTF Process on AWS

As you can see in the above diagram, the framework comprises many components. Here we will focus on the most important components, as described in the sections below.

New RTF Requests

Once an individual user submits an RTF request for their data to be deleted from our system, it is first validated and then approved for further processing.

We used a messaging queue system (named MBUS) to receive new RTF requests into the data platform. One example of such a message is shown below:

{
  "accountId": "xxxxxxxx-dca1-11xx-b399-xxxxxxxxxxxx",
  "erasedAt": "2022-05-26T04:40:42.736Z",
  "publishedAt": "2022-05-26T04:40:44.089Z",
  "_meta": { <redacted> }
}

Here, accountId uniquely identifies a user across all the systems, and we use it to anonymise any PII (personally identifiable information) data for that user.
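
To make this concrete, here is a minimal sketch (not the production code) of how PII columns could be anonymised on S3 with Delta Lake for a batch of RTF accountIds; the table path, the column names and the “forgotten” placeholder value are illustrative assumptions.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes a Spark session already configured with Delta Lake (as on EMR in this post).
spark = SparkSession.builder.getOrCreate()

# Hypothetical batch of accountIds taken from today's RTF requests.
rtf_account_ids = ["xxxxxxxx-dca1-11xx-b399-xxxxxxxxxxxx"]

# Hypothetical Delta table holding user PII on S3.
user_history = DeltaTable.forPath(spark, "s3://my-bucket/target/c1/prod_dwh/user_history")

# Overwrite the PII columns in place for the matching accounts;
# Delta Lake makes this an ACID transaction on S3.
user_history.update(
    condition=F.col("account_id").isin(rtf_account_ids),
    set={"email": F.lit("forgotten"), "full_name": F.lit("forgotten")},
)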

Configurations

There are 2 types of configuration used in the whole framework.

  • RTF Metadata
  • Global Configuration

RTF Metadata

This metadata file holds the information for all the entities where a user’s PII (personally identifiable information) data is stored and describes how we can forget it in case of an RTF request. The metadata is written in YAML format for ease of maintainability. Each table definition starts with --- !RTFMetadata, which is used as the YAML tag so that we can convert it into a Python object very easily. Below is the dataclass doing all the magic without any extra parsing logic.

Python DataClass representing RTF Metadata
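
The dataclass itself is shown as an image in the original post; below is a minimal sketch of what it could look like, assuming PyYAML’s YAMLObject mechanism is used to bind the !RTFMetadata tag. The field names follow the list below; the defaults and the exact wiring are assumptions.

from dataclasses import dataclass
import yaml

@dataclass
class RTFMetadata(yaml.YAMLObject):
    """One table/entity definition, constructed directly from a '--- !RTFMetadata' document."""
    yaml_tag = "!RTFMetadata"
    yaml_loader = yaml.SafeLoader  # lets yaml.safe_load() build RTFMetadata objects

    entity_name_product: str = ""
    snowflake_table: str = ""
    snowflake_select_condition: str = ""
    snowflake_update_condition: str = ""
    s3_location: str = ""
    s3_select_condition: str = ""
    s3_update_condition: str = ""
    run_order: int = 1
    update_count_limit: int = 500
    process_flag_s3: str = "Y"
    process_flag_snowflake: str = "Y"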

There are 11 fields which should be populated, based on the current state of the framework:

  1. entity_name_product: An easy-to-understand name for the table/entity, used for audit logging.
  2. snowflake_table: The name of the table in Snowflake, e.g. PROD.DWH.USER_HISTORY.
  3. snowflake_select_condition: The Snowflake SQL statement to select the count of rows to be updated for the current batch of RTF users in Snowflake.
  4. snowflake_update_condition: The Snowflake SQL statement to update the rows for the current batch of RTF users in Snowflake.
  5. s3_location: The location of the data in AWS S3, e.g. $s3_bucket/target/c1/prod_dwh/user_history.
  6. s3_select_condition: The PySpark transformation to select the count of rows to be updated for the current batch of RTF users in S3.
  7. s3_update_condition: The PySpark code to update the rows for the current batch of RTF users in S3.
  8. run_order: Sets the order in which tables run, with ‘1’ running first. Multiple tables can share the same number if it doesn’t matter in what order their PII data is forgotten. This is necessary when a table holds PII data that is needed to join to another table in the RTF process (e.g. the users table holds emails which are needed to join to identify users in the PayPal tables).
  9. update_count_limit: For each day of RTF user requests, the maximum number of rows that can be updated for this table. This threshold prevents the process from forgetting too many users (e.g. due to a data source error impacting joins). The default value is 500.
  10. process_flag_s3: Should the process run for this table in S3? The default value is “Y”.
  11. process_flag_snowflake: Should the process run for this table/entity in Snowflake? The default value is “Y”.

A simple RTF metadata config file looks like the following, whereas it could be as complicated as this one.

Sample RTF configuration
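
Since the sample configuration is shown as an image, here is a small illustrative stand-in: a single hypothetical entity and how it could be loaded, assuming the RTFMetadata dataclass sketched earlier is defined in scope. All values here are placeholders, not the real configuration.

import yaml  # assumes the RTFMetadata sketch above has already been imported/defined

sample_yaml = """\
--- !RTFMetadata
entity_name_product: "User History"
snowflake_table: "PROD.DWH.USER_HISTORY"
snowflake_select_condition: "SELECT COUNT(*) FROM PROD.DWH.USER_HISTORY WHERE ACCOUNT_ID IN ({rtf_ids})"
snowflake_update_condition: "UPDATE PROD.DWH.USER_HISTORY SET EMAIL = 'forgotten' WHERE ACCOUNT_ID IN ({rtf_ids})"
s3_location: "$s3_bucket/target/c1/prod_dwh/user_history"
s3_select_condition: "account_id in ({rtf_ids})"
s3_update_condition: "email = 'forgotten'"
run_order: 1
update_count_limit: 500
process_flag_s3: "Y"
process_flag_snowflake: "Y"
"""

# Each '--- !RTFMetadata' document becomes one RTFMetadata object.
entities = list(yaml.safe_load_all(sample_yaml))
print(entities[0].snowflake_table)  # PROD.DWH.USER_HISTORY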

Global Configuration

The RTF-on-cloud framework supports various run configurations to meet the design goals. These can be set as flags, in any combination, in an INI file; one such example is sketched after the list below.

  • dry_run_flag (Y/N): If this is set to ‘Y’, the RTF process will report the number of RTF users to be forgotten and log that information into the audit_log table, but it doesn’t actually update the target table to forget sensitive data and doesn’t send an MBUS acknowledgement. Very useful for testing. The default value is ‘Y’. Set it to ‘N’ to forget sensitive data and send the MBUS ack.
  • rtf_full_run (Y/N): If this is set to ‘Y’, the RTF process will process all RTF accountIds received to date and send MBUS acknowledgements by default. The default value is ‘N’. This is very useful for replaying RTF user requests.
  • rtf_full_run_from_date (1970-01-01): This variable is only used when rtf_full_run=Y, so that we can process accountIds from this specific date onwards in the historical RTF request data. The value should be in YYYY-MM-DD format. The default value is 1970-01-01.
  • send_mbus_ack (Y/N): An additional flag that can be used to disable the MBUS acknowledgement for a particular RTF process run. This is useful if we are doing a full run and don’t want to send MBUS acks, as they might create duplicates downstream. The default value is ‘Y’.
  • include_entity_name_product (None): Only process these entities. Comma-separated values can be passed. Useful for testing. The default value is None.
  • exclude_entity_name_product (None): Don’t process these entities. Comma-separated values can be passed. Useful for testing. The default value is None.
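
For illustration, here is a minimal sketch of such an INI file and how it could be read with Python’s configparser; the file content, the [rtf] section name and the parsing code are assumptions, while the flag names come from the list above.

import configparser

# Hypothetical global configuration file content; the section name is assumed.
sample_ini = """
[rtf]
dry_run_flag = Y
rtf_full_run = N
rtf_full_run_from_date = 1970-01-01
send_mbus_ack = Y
include_entity_name_product = None
exclude_entity_name_product = None
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)
dry_run = config.get("rtf", "dry_run_flag") == "Y"  # True by default, so nothing is forgotten accidentally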

Audit Log

For traceability purposes, we maintain an audit log that records every transaction (user data deletion) in our data lake, so that we can track it in case of any future issues.
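
As an illustration, one audit-log record could look like the sketch below, appended to a Delta table on S3 after each entity is processed; the column set, the values and the path are assumptions rather than the production schema.

from datetime import datetime, timezone
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured, as on EMR

# Hypothetical audit record for one processed entity.
audit_record = Row(
    entity_name_product="User History",
    storage="s3",
    rtf_batch_date="2022-05-26",
    rows_updated=42,
    dry_run_flag="N",
    processed_at=datetime.now(timezone.utc).isoformat(),
)

spark.createDataFrame([audit_record]) \
    .write.format("delta").mode("append") \
    .save("s3://my-bucket/audit/rtf_audit_log")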

The Code Structure

The heart of the framework lives in the dependencies package of the codebase.

Below is part of the codebase to get an idea of the structure.

dependencies/
├── __init__.py
├── core_process.py
├── core_service.py
├── messagebus
│   ├── __init__.py
│   ├── mbus_producer.py
│   └── mbus_thrift
│       ├── __init__.py
│       ├── constants.py
│       └── ttypes.py
├── rtf_helper_functions.py
├── s3_process.py
├── s3_service.py
├── schema_types.py
├── snowflake_process.py
└── snowflake_service.py
job/
└── run_rtf_process.py

The code was developed following the “factory method” design pattern so that we can easily extend the framework in future to support new storage systems, e.g. GCS, Google BigQuery, Azure Blob Storage, etc. In this post, I will focus on the factory class (core_process.py) and the interface class (core_service.py) without going into the implementation details specific to AWS S3 or Snowflake DB.

core_process.py

The CoreProcess class declares the factory method for the RTF process, abstracting the RTF logic away from the service classes (S3/Snowflake). All the common methods are defined in this class; check out the function def run_rtf_process(self), which holds all the logic related to the RTF process. More or less, the algorithm looks like the following. Here is the redacted version of the code for easy understanding.

RTF Process Flow
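
Since the actual code is redacted, here is a hedged, simplified sketch of what the factory side could look like; only run_rtf_process is named in the post, so every other method name here is an assumption chosen to match the design goals and configuration described above.

from abc import ABC, abstractmethod

class CoreProcess(ABC):
    def __init__(self, entities, config, rtf_account_ids):
        self.entities = entities              # RTFMetadata objects for one storage type
        self.config = config                  # parsed global INI configuration
        self.rtf_account_ids = rtf_account_ids
        self.service = self.create_service()  # factory method, implemented by subclasses

    @abstractmethod
    def create_service(self):
        """Return the storage-specific service (e.g. an S3 or Snowflake service)."""

    def run_rtf_process(self):
        dry_run = self.config.get("rtf", "dry_run_flag") == "Y"
        for entity in sorted(self.entities, key=lambda e: e.run_order):
            count = self.service.select_count(entity, self.rtf_account_ids)
            if count > entity.update_count_limit:
                raise RuntimeError(
                    f"{entity.entity_name_product}: {count} rows exceed "
                    f"update_count_limit={entity.update_count_limit}"
                )
            if not dry_run:
                self.service.update_rows(entity, self.rtf_account_ids)
            self.service.write_audit_log(entity, count, dry_run=dry_run)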

core_service.py

The CoreService class declares the interface that is common to all the storage services supported by the framework and implemented by its subclasses (such as s3_service or snowflake_service). Here is the redacted version of the code for easy understanding.
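
Again, only as a hedged sketch: the interface could be as small as the three operations below, mirroring the process sketch above; the method names are assumptions, not the author’s redacted code.

from abc import ABC, abstractmethod

class CoreService(ABC):
    @abstractmethod
    def select_count(self, entity, rtf_account_ids) -> int:
        """Count the rows that would be forgotten for this batch of accountIds."""

    @abstractmethod
    def update_rows(self, entity, rtf_account_ids) -> None:
        """Anonymise the PII columns for the matching rows."""

    @abstractmethod
    def write_audit_log(self, entity, row_count, dry_run) -> None:
        """Record what was (or would have been) forgotten, for traceability."""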

run_rtf_process.py

This is the entry point that executes the RTF process based on all the supplied configurations. Based on the storage type, it decides which service (S3 or Snowflake) to call. Here is the redacted version of the code for easy understanding.
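
A rough sketch of the dispatch logic is shown below; the command-line flag, the helper loaders and the concrete process classes are all hypothetical placeholders standing in for the redacted code and the modules listed in the code structure above.

import argparse

def main():
    parser = argparse.ArgumentParser(description="Run the RTF process for one storage type")
    parser.add_argument("--storage", choices=["s3", "snowflake"], required=True)
    args = parser.parse_args()

    entities = load_rtf_metadata()         # hypothetical helper: parse the YAML metadata
    config = load_global_config()          # hypothetical helper: parse the global INI file
    account_ids = load_new_rtf_requests()  # hypothetical helper: collect new accountIds from MBUS

    # S3Process and SnowflakeProcess are assumed subclasses of the CoreProcess sketch above.
    process_cls = {"s3": S3Process, "snowflake": SnowflakeProcess}[args.storage]
    process_cls(entities, config, account_ids).run_rtf_process()

if __name__ == "__main__":
    main()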

RTF Requests Acknowledgement

After processing the whole batch of RTF requests for a particular execution, we publish an acknowledgement back to MBUS so that it can be collected, audited and reported back to the users, confirming that we have successfully removed all their PII data from the data platform.

One example of such an acknowledgement for AWS S3 looks like the one below, whereas serviceId is set to “snowflake” to distinguish the two storage services when publishing the messages.

{
  "serviceId": "datalake",
  "accountId": "xxxxxxxx-dca1-11xx-b399-xxxxxxxxxxxx",
  "erasedAt": "2022-05-26T05:07:39Z",
  "publishedAt": "2022-05-26T05:50:31.754Z"
}
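
For illustration, building such an acknowledgement payload could look like the sketch below; the actual publishing goes through the internal MBUS producer (not shown here), and the function name is a placeholder.

import json
from datetime import datetime, timezone

def build_rtf_ack(account_id: str, service_id: str, erased_at: str) -> str:
    """Assemble the acknowledgement message for one forgotten accountId."""
    return json.dumps({
        "serviceId": service_id,   # "datalake" for S3, "snowflake" for Snowflake
        "accountId": account_id,
        "erasedAt": erased_at,
        "publishedAt": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
    })

print(build_rtf_ack("xxxxxxxx-dca1-11xx-b399-xxxxxxxxxxxx", "datalake", "2022-05-26T05:07:39Z"))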

Conclusion

In this blog post, I have explained the concept and design of an RTF framework built from scratch to tackle privacy laws at petabyte scale.

There are many potential improvements I have in mind which I will publish in future posts. So stay tuned!

Any suggestions, comments or questions are welcome!
