Streamlining GDPR Compliance: Automating Subject Access Requests with BigQuery and DBT

Sai Kumar Devulapelli
Odicis-Data-Engineering
7 min read · Nov 9, 2023

Introduction:

In the era of data-driven decision-making, data engineering has become the backbone of enterprise operations. However, with great power comes great responsibility: specifically, the responsibility of complying with data protection regulations like the General Data Protection Regulation (GDPR). A crucial component of GDPR compliance is managing Subject Access Requests (SARs). As data landscapes grow more complex, automating the SAR process with tools like BigQuery and DBT (data build tool) has become not just advantageous but essential.

The Imperative of SAR in Data Engineering

Subject Access Requests are a cornerstone of GDPR, granting individuals the right to access their personal data held by an organization. In the labyrinth of modern data architecture, fulfilling these requests can be daunting. This is where data engineering steps in. The capability to swiftly navigate through tables, columns, and databases to compile a comprehensive record of an individual’s data is a non-trivial task that sits at the heart of data engineering.

Why DBT is Pivotal for Data Engineering

DBT stands as a transformative tool in the realm of data engineering. It enables engineers to transform data in-warehouse through simple SQL scripts while adhering to software engineering best practices like modularity, version control, and testing.

For GDPR compliance, DBT’s power to build, test, and deploy complex data models becomes invaluable. It makes the process of locating and compiling data spread across myriad sources both efficient and reliable.
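
Because dbt models are plain SQL under version control, the extraction logic behind a SAR can itself be tested like any other code. As a minimal sketch (using the customer_info model built later in this article), a schema test can assert that the customer identifier used to fulfil requests is always present and unique:

version: 2

models:
  - name: customer_info
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique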

The Journey from SAR Request to Compliance

The SAR process begins when a request is made, perhaps via email, a call center, or a form on the company’s website. The compliance team then steps in to verify its authenticity. Once verified, the challenge of collating all related data arises. The data in question is often scattered across multiple tables and databases, and sorting through it manually would be a Herculean task. Fortunately, if all the data is centralized in a platform like BigQuery or Snowflake, the path to compliance is far less obstructed.

Data engineers are tasked with the technical execution of these requests. They must create queries that can traverse across the complex schema of their data warehouse and pull out all pieces of data associated with the individual. This is where automation comes into play.

Automating SAR with BigQuery and DBT

With BigQuery serving as a highly scalable and serverless data warehouse, and DBT assisting in transforming data with ease, automating SAR is a logical progression. Here’s how data engineers can streamline the SAR response process.

1. Receiving and Verifying the SAR:

A SAR may arrive via email, call centers, or a custom form on the company’s website. The first step is always to verify the request’s legitimacy.

2. DBT Modeling:

Using DBT, engineers can create models that define the logic to extract the relevant data. These models act as templates and can be reused whenever a SAR is received, thereby saving significant time and reducing the risk of human error.
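
For example, an extraction model can be parameterized with dbt variables so that the same SQL serves every request. A minimal sketch, assuming a staging model named stg_customers (used later in this article) and a requested_customer_id variable supplied at run time:

-- models/sar/customer_sar.sql (illustrative path)
select *
from {{ ref("stg_customers") }}
where customer_id = '{{ var("requested_customer_id") }}'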

3. Selecting Relevant Data:

Once the DBT models are run, they collate all the necessary data into a cohesive format. BigQuery’s processing power allows this to happen at remarkable speed, even with large and complex datasets.

4. Outputting Data:

The final step in the automation process is to export the compiled data into a Google Cloud Storage (GCS) bucket. From here, the data can be collected by a downstream system that will format it and securely transfer it to the individual who made the request.

Implementation Steps for Automating SAR with BigQuery and DBT

Setting Up the DBT Project:

  1. Installation and Initialization:
  • Begin by installing dbt. Once installed, create a new project by running the command dbt init [project-name] in your terminal.
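
As a sketch of the commands involved, assuming the BigQuery adapter and an illustrative project name (sar_automation):

pip install dbt-bigquery
dbt init sar_automation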

2. Configure Connection:

  • Open the profiles.yml file (created in your ~/.dbt/ directory by default) to configure your BigQuery connection details, ensuring the dataset field points to the dataset where your transformed data will reside.
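
A minimal profiles.yml sketch for BigQuery; the project, dataset, and keyfile values are placeholders for your own environment:

sar_automation:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project        # placeholder GCP project ID
      dataset: sar_reporting           # placeholder target dataset
      keyfile: /path/to/service-account.json
      threads: 4
      location: EU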

3. DBT Project Configuration:

  • The dbt_project.yml is your roadmap for how dbt will interact with BigQuery. Within this file, specify your model-paths, which direct dbt to your SQL models that define your data transformations.
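
A minimal dbt_project.yml sketch; the gcs_bucket variable is an assumption used later by the export macro, and all names are illustrative:

name: sar_automation
version: "1.0.0"
config-version: 2
profile: sar_automation
model-paths: ["models"]
vars:
  gcs_bucket: "gs://your-sar-exports-bucket"   # placeholder export bucket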

Building Data Marts with Diverse Data Sources:

Leverage SQL to merge different staging models (like stg_customers, stg_orders, stg_order_payments) into a comprehensive customer data overview. Utilize dbt’s powerful macros to craft modular code segments for efficient data handling. Implement a systematic approach for labeling sensitive data within your models, ensuring adherence to GDPR standards.

In the marts folder, develop a specific SQL file for each of your data mart models, such as models/marts/customer_info.sql. Employ dbt’s configuration tools to strategically determine how these models are materialized.


{{ config(materialized="view") }}

with stg_customers as (
    select * from {{ ref("stg_customers") }}
),

stg_orders as (
    select * from {{ ref("stg_orders") }}
),

stg_order_payments as (
    select * from {{ ref("stg_order_payments") }}
),

joined_data as (
    select
        t1.customer_id,
        t1.customer_first_name,
        t1.customer_last_name,
        t1.customer_dob,
        t1.customer_address,
        t2.customer_phone_number,
        t2.order_delivered_customer_info,
        t3.payment_details
    from stg_customers t1
    left join stg_orders t2
        on t1.customer_id = t2.customer_id
    left join stg_order_payments t3
        on t2.order_id = t3.order_id
)

select * from joined_data
[Lineage graph: stg_customers, stg_orders, and stg_order_payments feed the customer_info model]

Configuring which columns should be exported

We decided to add a custom tag called gdpr_sar to the DBT model configuration to define which columns contain the customer data that will be exported.
In the model’s YAML file, use the meta property to tag each personal-data column with gdpr_sar: true.

version: 2

models:
  - name: customer_info
    description: "Consolidated view of processed customer data"
    columns:
      - name: customer_first_name
        description: "The customer's first name"
        meta:
          gdpr_sar: true
      - name: customer_last_name
        description: "The customer's last name"
        meta:
          gdpr_sar: true
      - name: customer_dob
        description: "The customer's date of birth"
        meta:
          gdpr_sar: true
      - name: customer_address
        description: "The customer's address"
        meta:
          gdpr_sar: true
      - name: customer_phone_number
        description: "The customer's phone number"
        meta:
          gdpr_sar: true
      - name: order_delivered_customer_info
        description: "Delivery details associated with the customer's orders"
        meta:
          gdpr_sar: true
      - name: payment_details
        description: "Payment details associated with the customer's orders"
        meta:
          gdpr_sar: true

Creating the GDPR Macros

Develop custom dbt macros that interpret these tags and collect the tagged columns. These macros are pivotal for locating and aggregating every column flagged for GDPR export.

{% macro lookup_column_meta(tag_name=None) %}
    {% set selected_columns = [] %}

    {# Scan every source and model in the project graph for columns carrying the given meta key #}
    {% for node in graph.sources.values() | list + graph.nodes.values() | list %}
        {% for col in node.columns.values() %}
            {% if tag_name in col.meta and col.meta[tag_name] %}
                {% do selected_columns.append((col.name, node)) %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {% do log('Columns tagged with ' ~ tag_name ~ ': ' ~ selected_columns, info=True) %}

    {% if selected_columns | length == 0 %}
        {{ exceptions.raise_compiler_error("No columns with '" ~ tag_name ~ "' metadata found.") }}
    {% endif %}

    {{ return(selected_columns) }}
{% endmacro %}

Craft a macro named export_gdpr_sar that handles exporting the data to a GCS bucket for SAR compliance. This macro constructs a BigQuery EXPORT DATA statement targeting the tagged columns for a specific customer ID, and then executes the query.

{% macro export_gdpr_sar() %}
    {% set sar_columns = lookup_column_meta(tag_name='gdpr_sar') %}
    {# All tagged columns live in the customer_info view, so take its node from the first match #}
    {% set node = sar_columns[0][1] %}

    {% set sql %}
        EXPORT DATA OPTIONS(
            uri="{{ var('gcs_bucket') }}/{{ var('requested_customer_id') }}_*.csv",
            format="CSV",
            overwrite=true,
            header=true,
            field_delimiter=";")
        AS
        WITH selected_data AS (
            {% if sar_columns %}
            SELECT {{ sar_columns | map(attribute=0) | list | join(', ') }}
            FROM {{ node.database }}.{{ node.schema }}.{{ node.name }}
            WHERE customer_id = '{{ var("requested_customer_id") }}'
            {% else %}
            SELECT 'No gdpr_sar columns found' AS error_message
            {% endif %}
        )
        SELECT * FROM selected_data;
    {% endset %}

    {% do log('Generated SQL: ' ~ sql, info=True) %}
    {% do run_query(sql) %}
{% endmacro %}

With the macros and models in place, use dbt’s run-operation command to trigger the SAR export. This operation uses the macros above to compile the tagged columns and export the data to the configured location. Note that the gcs_bucket variable must be defined in dbt_project.yml or passed via --vars.

dbt run-operation export_gdpr_sar --vars '{"requested_customer_id": "8611feeeaa6d278ec4b4a5e4ae947fca"}'

The export lands in the GCS bucket as a CSV file named after the requested customer ID, containing the tagged columns of the customer_info view for that individual.

By following these implementation steps, you can leverage dbt’s capabilities alongside BigQuery to automate the SAR process, ensuring that GDPR requirements are not only met but seamlessly integrated into the data engineering workflow.

Conclusion

In the modern data ecosystem, where agility and compliance are paramount, automating SAR with BigQuery and DBT represents a significant stride forward. It exemplifies the kind of smart, scalable solutions that are necessary to navigate the complexities of GDPR compliance. Data engineers equipped with these tools can ensure that their organizations respond to SARs quickly, accurately, and efficiently — upholding the trust of their customers and the integrity of their operations.

By embracing the automation of SAR, businesses can address one of the many challenges presented by GDPR, turning what could be a compliance headache into a seamless part of their data operations. The integration of these advanced technologies ultimately signifies a proactive step towards responsible and effective data governance.

We hope this journey has equipped you with valuable insights into GDPR compliance automation. Your engagement is our motivation!

Thank you for being a part of our community! Before you go:

  • Be sure to clap and follow the writer! 👏
  • You can find even more content at Odicis 🚀
  • Follow us on LinkedIn.
