Protecting Sensitive Data with CDC in Databricks
Databricks is becoming the preferred choice for companies of all kinds that want to exploit their big data reserves for business intelligence and decision-making. Founded by the creators of Spark, this enterprise-level platform integrates a wide range of data engineering and ML resources for developing your data applications.
Describing all the features that Databricks offers would result in this article being too long to read, as they are constantly releasing new functionalities and refactoring current services to improve user experience. For the sake of brevity, we will mainly focus on the description of some of Databricks' capabilities that were employed for our Secure CDC application. But before that, let us first explain what CDC stands for and why it is important.
Change Data Capture
Traditional database engines are designed to optimally and accurately register all incoming transactions in a system. Especially critical are the operations that introduce changes in the database, which must be applied accurately and in sequence to maintain data consistency. Analytical systems, on the other hand, focus on executing complex queries and transformations over existing data, prioritizing read performance over write performance, which may lead to unsynchronized data updates. Data consistency controls often have to be implemented as part of the ingestion processes for unstructured or semi-structured data sources. This mechanism of correctly capturing and applying data modifications is commonly known as Change Data Capture (CDC).
Two key Databricks features aided us in deploying our CDC pipeline: Delta Live Tables and Autoloader.
Delta Live Tables (DLT) is a powerful declarative framework built on top of Spark for creating real-time data processing pipelines. Although Spark supports four different languages (Python, SQL, R, and Scala), at the time of this writing DLT pipelines can only be described using either Python or SQL.
DLT employs Autoloader to automatically ingest new data from cloud locations into the pipeline. Autoloader can load many types of files (JSON, CSV, XML, PARQUET, AVRO, ORC, TEXT, and BINARYFILE formats) from varied cloud providers: S3, GCP, Azure, or DBFS. It also supports customized options, depending on the file type and cloud location, easing the process of properly reading and formatting the input data. We used Autoloader to capture the changes that needed to be applied to our pipeline.
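As a rough sketch, ingesting a folder of customer files with Autoloader from a Python DLT notebook could look like this. The S3 path, view name, and options below are illustrative assumptions, not our exact code, and the snippet only runs inside a Databricks DLT pipeline (where `dlt` and `spark` are available):

```python
import dlt

# Hypothetical landing path; adjust to your own bucket layout.
CUSTOMERS_PATH = "s3://my-bucket/customers/"

@dlt.view(name="customers")
def customers():
    # "cloudFiles" is Autoloader: it incrementally picks up new files
    # as they land in the cloud location, inferring the schema here.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(CUSTOMERS_PATH)
    )
```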
Now, let us dive into our actual use case.
Protecting Sensitive Information
We used Change Data Capture within our DLT pipeline to create a generalizable GDPR-compliant data application. By GDPR-compliant we mean that two specific conditions had to be met:
- Only authorized users can query sensitive data.
- Sensitive data must be removed from our database on demand.
To accomplish these conditions, we first needed to identify the sensitive data and process it to prevent unauthorized users from accessing it. This concept is known as pseudonymization. To pseudonymize sensitive data we used Spark's built-in AES functions to encrypt it with a unique encryption key per entity. For the second requirement, we created a process that hashes the encryption keys of the entities that should be removed from the database, so they can't be accessed anymore, not even by authorized users. Let us examine the process further in the following sections.
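The key-management idea can be illustrated outside Spark with plain Python: each customer gets a random encryption key, and a delete request replaces that key with its one-way hash, so the data it encrypted can never be decrypted again. This is a simplified sketch of the concept only; the names are ours, not the actual pipeline code:

```python
import hashlib
import secrets

def new_encryption_key() -> str:
    """Generate a random 256-bit key (hex-encoded) for one customer."""
    return secrets.token_hex(32)

def revoke_key(key: str) -> str:
    """Irreversibly hash a key so the data it encrypted can no longer
    be decrypted, even by authorized users."""
    return hashlib.sha256(key.encode()).hexdigest()

# One key per customer ID.
keys = {cid: new_encryption_key() for cid in ["c1", "c2", "c3"]}

# A delete request for customer "c2" replaces their key with its hash.
keys["c2"] = revoke_key(keys["c2"])
```

In the real pipeline the per-customer keys live in the `encryption_keys` table and the hashing happens inside a DLT view, but the lifecycle is the same.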
Detailed Description of the Workflow
To test our application, we wrote a preparatory notebook to create and store dummy customer information in S3 for the pipeline to process. This notebook also created another folder in the same cloud location for the delete requests (essentially a list of IDs of customers who requested the deletion of their sensitive information from the database).
The DLT pipeline's main function is updating the encrypted customer table using CDC for the encryption keys table. You can view a representation of the pipeline in the following picture:
As shown in the image, the pipeline contains the following elements:
- The `customers` and `delete_requests` views use Autoloader to load the respective information into the pipeline.
- The `encryption_keys_v` view queries the current state of the `encryption_keys` table, making it easier to identify the changes to apply.
- The `new_keys` view compares the `customers` and `encryption_keys_v` views to identify new customers entering the database and generates encryption keys for these new customers.
- The `delete_keys` view identifies the customers whose encryption keys should be hashed by reading the `delete_requests` view. It then hashes these encryption keys, taken from `encryption_keys_v` and `new_keys`.
- The `keys_to_upsert` view merges the `delete_keys` and `new_keys` views.
- For the `encryption_keys` table, DLT's CDC `apply_changes` function is used to upsert the records from the `keys_to_upsert` view.
- The `quarantine` view then encrypts the sensitive fields from `customers` using the corresponding encryption key from the `encryption_keys` table.
- Lastly, `encrypted_customers` is the final output table of the pipeline. It materializes the `quarantine` view.
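As an illustration, the `new_keys` step could look roughly like this in a Python DLT notebook. The view names follow the pipeline diagram, but the column names and the key-generation expression are our own assumptions (a production pipeline should derive keys from a proper secret source), and the snippet only runs inside Databricks:

```python
import dlt
from pyspark.sql import functions as F

@dlt.view(name="new_keys")
def new_keys():
    customers = dlt.read("customers")
    existing = dlt.read("encryption_keys_v")
    # Customers absent from encryption_keys_v are new: anti-join keeps
    # only those, and each one gets a freshly generated key.
    return (
        customers.join(existing, on="customer_id", how="left_anti")
        .select(
            "customer_id",
            # Stand-in for real key generation: a pseudo-random 256-bit
            # hex string per row.
            F.expr("sha2(uuid(), 256)").alias("encryption_key"),
            F.current_timestamp().alias("event_ts"),
        )
    )
```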
The `apply_changes` function is the core component of our pipeline for handling CDC. It receives the following mandatory arguments:
- `target`: Table that will be updated; in our case, the `encryption_keys` table.
- `source`: Table or view containing the changes to apply. This is our `keys_to_upsert` view.
- `keys`: Primary key used to match the rows to update or insert. We are using the customer IDs here.
- `sequence_by`: Column that orders the incoming changes to be processed. We are using the timestamp at which the events were registered.
This function supports additional optional arguments, such as `column_list` with the output columns to keep, or `apply_as_deletes`, which receives a condition for deleting records (we did not use it in our case).
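Putting those arguments together, the CDC step of the pipeline can be sketched as follows. Column names are illustrative, and depending on the runtime version the target table is declared with `dlt.create_streaming_table` (older runtimes use `create_streaming_live_table`); this only runs inside a Databricks DLT pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Declare the target table that apply_changes will keep up to date.
dlt.create_streaming_table("encryption_keys")

dlt.apply_changes(
    target="encryption_keys",     # table that will be updated
    source="keys_to_upsert",      # view containing the captured changes
    keys=["customer_id"],         # match rows by customer ID
    sequence_by=col("event_ts"),  # order changes by event timestamp
)
```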
Postprocessing Job
So far we have covered the encryption of sensitive data and its deletion from the database, but how are we granting access to this data to authorized users? In the following manner: we created a postprocessing job that creates a dynamic view returning the encrypted or the decrypted data depending on the user who is querying. This postprocessing job also vacuums the `encryption_keys` table after hashing its keys. This last step was needed to restrict Delta's time travel feature, which would otherwise allow the recovery of encryption keys after they are hashed by querying previous table versions.
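A dynamic view of this kind can be sketched from the postprocessing notebook along these lines. The group name, table layout, and column names are illustrative assumptions, and the snippet relies on Databricks-specific SQL (`is_account_group_member`, `aes_decrypt`), so it only runs on that platform:

```python
# Authorized group members see decrypted values; everyone else sees
# the ciphertext. Names here are illustrative.
spark.sql("""
CREATE OR REPLACE VIEW customers_secure AS
SELECT
  c.customer_id,
  CASE
    WHEN is_account_group_member('gdpr_readers')
      THEN CAST(aes_decrypt(unbase64(c.name_enc), k.encryption_key) AS STRING)
    ELSE c.name_enc
  END AS name
FROM encrypted_customers c
JOIN encryption_keys k USING (customer_id)
""")

# Vacuum old table versions so hashed keys cannot be recovered via
# time travel. Retaining 0 hours requires disabling the retention
# duration check (spark.databricks.delta.retentionDurationCheck.enabled).
spark.sql("VACUUM encryption_keys RETAIN 0 HOURS")
```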
Having described the main components of our pipeline, let's take a look at our final result.
Architecture Overview
To ensure the sequential execution of the DLT pipeline followed by the postprocessing job, we created an overarching workflow (named GDPR Job) that encompasses both of these components and defines their dependencies. As a final ingredient of our application, we developed an API using AWS Lambda Functions and Databricks' Statement Execution API to serve the output data to our users. Lastly, the deployment of all these Databricks and AWS resources was scripted and parameterized using Terraform and Terragrunt, respectively.
The following image will give you a better understanding of our final application:
Final Remarks
The employment of Terraform and Terragrunt allowed us to create a portable GDPR-compliant ETL pipeline template with CDC support. For integration and QA testing, we also created a test suite for our application, writing a notebook similar to the preparatory one. This notebook works by storing data in the corresponding paths for testing different scenarios, then triggers the GDPR Job, and checks that the outputs correspond to what is expected.
In summary, we have demonstrated that Databricks provides all the necessary components for developing a secure and robust data processing application. Nevertheless, the execution of this project was not trivial: we faced some compatibility and integration issues that had to be worked around by modifying our original architecture. Fortunately, many of these issues have already been addressed, or are being actively addressed, by Databricks for future deployments.