Protecting Sensitive Data with CDC in Databricks
Databricks is becoming the preferred choice for companies of all kinds that want to exploit their big data reserves for business intelligence and decision-making. Founded by the creators of Spark, this enterprise-level platform integrates a wide range of data engineering and ML resources for developing your data applications.
Describing all the features that Databricks offers would result in this article being too long to read, as they are constantly releasing new functionalities and refactoring current services to improve user experience. For the sake of brevity, we will mainly focus on the description of some of Databricks' capabilities that were employed for our Secure CDC application. But before that, let us first explain what CDC stands for and why it is important.
Change Data Capture
Traditional database engines are designed to optimally and accurately register all incoming transactions in a system. Especially critical are the operations that introduce changes in the database, which must be applied accurately and in sequence to maintain data consistency. Analytical systems, on the other hand, focus on executing complex queries and transformations over existing data, prioritizing read performance over write performance, which may lead to unsynchronized data updates. Data consistency controls often have to be implemented as part of the ingestion processes for unstructured or semi-structured data sources. This mechanism of correctly capturing and applying data modifications is commonly known as Change Data Capture (CDC).
Two key Databricks features aided us in deploying our CDC pipeline: Delta Live Tables and Autoloader.
Delta Live Tables (DLT) is a powerful declarative framework built on top of Spark for creating real-time data processing pipelines. Although Spark supports four different languages (Python, SQL, R, and Scala), at the time of this writing DLT pipelines can only be described using either Python or SQL.
DLT employs Autoloader to automatically ingest new data from cloud locations into the pipeline. Autoloader can load many types of files (JSON, CSV, XML, PARQUET, AVRO, ORC, TEXT, and BINARYFILE formats) from varied cloud providers: S3, GCP, Azure, or DBFS. It also supports customized options, depending on the file type and cloud location, easing the process of properly reading and formatting the input data. We used Autoloader to capture the changes that needed to be applied to our pipeline.
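As a rough sketch, ingesting a folder of customer files with Autoloader from a Python DLT notebook could look like this. The S3 path, view name, and options below are illustrative assumptions, not our exact code, and the snippet only runs inside a Databricks DLT pipeline (where `dlt` and `spark` are available):

```python
import dlt

# Hypothetical landing path; adjust to your own bucket layout.
CUSTOMERS_PATH = "s3://my-bucket/customers/"

@dlt.view(name="customers")
def customers():
    # "cloudFiles" is Autoloader: it incrementally picks up new files
    # as they land in the cloud location, inferring the schema here.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(CUSTOMERS_PATH)
    )
```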
Now, let us dive into our actual use case.
Protecting Sensitive Information
We used Change Data Capture within our DLT pipeline to create a generalizable GDPR-compliant data application. By GDPR-compliant we mean that two specific conditions had to be met:
- Only authorized users can query sensitive data.
- Sensitive data must be removed from our database on demand.
To accomplish these conditions, we first needed to identify the sensitive data and process it to prevent unauthorized users from accessing it. This concept is known as pseudonymization. To pseudonymize sensitive data we used Spark's built-in AES functions to encrypt it with a unique encryption key per entity. For the second requirement, we created a process that hashes the encryption keys of the entities that should be removed from the database, so they can't be accessed anymore, not even by authorized users. Let us examine the process further in the following sections.
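The key-management idea can be illustrated outside Spark with plain Python: each customer gets a random encryption key, and a delete request replaces that key with its one-way hash, so the data it encrypted can never be decrypted again. This is a simplified sketch of the concept only; the names are ours, not the actual pipeline code:

```python
import hashlib
import secrets

def new_encryption_key() -> str:
    """Generate a random 256-bit key (hex-encoded) for one customer."""
    return secrets.token_hex(32)

def revoke_key(key: str) -> str:
    """Irreversibly hash a key so the data it encrypted can no longer
    be decrypted, even by authorized users."""
    return hashlib.sha256(key.encode()).hexdigest()

# One key per customer ID.
keys = {cid: new_encryption_key() for cid in ["c1", "c2", "c3"]}

# A delete request for customer "c2" replaces their key with its hash.
keys["c2"] = revoke_key(keys["c2"])
```

In the real pipeline the per-customer keys live in the `encryption_keys` table and the hashing happens inside a DLT view, but the lifecycle is the same.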
Detailed Description of the Workflow
To test our application, we wrote a preparatory notebook to create and store dummy customer information in S3 for the pipeline to process. This notebook also created another folder in the same cloud location for the delete requests (essentially a list of IDs of customers who requested the deletion of their sensitive information from the database).
The DLT pipeline's main function is updating the encrypted customer table using CDC for the encryption keys table. You can view a representation of the pipeline in the following picture:
As shown in the image, the pipeline contains the following elements:
- The `customers` and `delete_requests` views use Autoloader to load the respective information into the pipeline.
- The `encryption_keys_v` view queries the current state of the `encryption_keys` table, making it easier to identify the changes to apply.
- The `new_keys` view compares the `customers` and `encryption_keys_v` views to identify new customers entering the database and generates encryption keys for these new customers.
- The `delete_keys` view identifies the customers whose encryption keys should be hashed by reading the `delete_requests` view. It then hashes these encryption keys, taken from `encryption_keys_v` and `new_keys`.
- The `keys_to_upsert` view merges the `delete_keys` and `new_keys` views.
- For the `encryption_keys` table, DLT's CDC `apply_changes` function is used to upsert the records from the `keys_to_upsert` view.
- The `quarantine` view then encrypts the sensitive fields from `customers` using the corresponding encryption key from the `encryption_keys` table.
- Lastly, `encrypted_customers` is the final output table of the pipeline. It materializes the `quarantine` view.
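As an illustration, the `new_keys` step could look roughly like this in a Python DLT notebook. The view names follow the pipeline diagram, but the column names and the key-generation expression are our own assumptions (a production pipeline should derive keys from a proper secret source), and the snippet only runs inside Databricks:

```python
import dlt
from pyspark.sql import functions as F

@dlt.view(name="new_keys")
def new_keys():
    customers = dlt.read("customers")
    existing = dlt.read("encryption_keys_v")
    # Customers absent from encryption_keys_v are new: anti-join keeps
    # only those, and each one gets a freshly generated key.
    return (
        customers.join(existing, on="customer_id", how="left_anti")
        .select(
            "customer_id",
            # Stand-in for real key generation: a pseudo-random 256-bit
            # hex string per row.
            F.expr("sha2(uuid(), 256)").alias("encryption_key"),
            F.current_timestamp().alias("event_ts"),
        )
    )
```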
The `apply_changes` function is the core component of our pipeline for handling CDC. It receives the following mandatory arguments:
- `target`: Table that will be updated; in our case, the `encryption_keys` table.
- `source`: Table or view containing the changes to apply. This is our `keys_to_upsert` view.
- `keys`: Primary key used to match the rows to update or insert. We are using the customer IDs here.
- `sequence_by`: Column that orders the incoming changes to be processed. We are using the timestamp at which the events were registered.
This function supports additional optional arguments, such as `column_list` with the output columns to keep, or `apply_as_deletes`, which receives a condition for deleting records (we did not use it in our case).
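Putting those arguments together, the CDC step of the pipeline can be sketched as follows. Column names are illustrative, and depending on the runtime version the target table is declared with `dlt.create_streaming_table` (older runtimes use `create_streaming_live_table`); this only runs inside a Databricks DLT pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Declare the target table that apply_changes will keep up to date.
dlt.create_streaming_table("encryption_keys")

dlt.apply_changes(
    target="encryption_keys",     # table that will be updated
    source="keys_to_upsert",      # view containing the captured changes
    keys=["customer_id"],         # match rows by customer ID
    sequence_by=col("event_ts"),  # order changes by event timestamp
)
```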
Postprocessing Job
So far we have covered the encryption of sensitive data and its deletion from the database, but how are we granting access to this data to authorized users? In the following manner: we created a postprocessing job that creates a dynamic view returning the encrypted or the decrypted data depending on the user who is querying. This postprocessing job also vacuums the `encryption_keys` table after hashing its keys. This last step was needed to restrict Delta's time travel feature, which would otherwise allow the recovery of encryption keys after they are hashed by querying previous table versions.
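A dynamic view of this kind can be sketched from the postprocessing notebook along these lines. The group name, table layout, and column names are illustrative assumptions, and the snippet relies on Databricks-specific SQL (`is_account_group_member`, `aes_decrypt`), so it only runs on that platform:

```python
# Authorized group members see decrypted values; everyone else sees
# the ciphertext. Names here are illustrative.
spark.sql("""
CREATE OR REPLACE VIEW customers_secure AS
SELECT
  c.customer_id,
  CASE
    WHEN is_account_group_member('gdpr_readers')
      THEN CAST(aes_decrypt(unbase64(c.name_enc), k.encryption_key) AS STRING)
    ELSE c.name_enc
  END AS name
FROM encrypted_customers c
JOIN encryption_keys k USING (customer_id)
""")

# Vacuum old table versions so hashed keys cannot be recovered via
# time travel. Retaining 0 hours requires disabling the retention
# duration check (spark.databricks.delta.retentionDurationCheck.enabled).
spark.sql("VACUUM encryption_keys RETAIN 0 HOURS")
```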
Having described the main components of our pipeline, let's take a look at our final result.
Architecture Overview
To ensure the sequential execution of the DLT pipeline followed by the postprocessing job, we created an overarching workflow (named GDPR Job) that encompasses both of these components and defines their dependencies. As a final ingredient of our application, we developed an API using AWS Lambda Functions and Databricks' Statement Execution API to serve the output data to our users. Lastly, the deployment of all these Databricks and AWS resources was scripted and parameterized using Terraform and Terragrunt, respectively.
The following image will give you a better understanding of our final application:
Final Remarks
The employment of Terraform and Terragrunt allowed us to create a portable GDPR-compliant ETL pipeline template with CDC support. For integration and QA testing, we also created a test suite for our application, writing a notebook similar to the preparatory one. This notebook works by storing data in the corresponding paths for testing different scenarios, then triggers the GDPR Job, and checks that the outputs correspond to what is expected.
In summary, we have demonstrated that Databricks provides all the necessary components for developing a secure and robust data processing application. Nevertheless, the execution of this project was not trivial: we faced some compatibility and integration issues that had to be worked around by modifying our original architecture. Fortunately, many of these issues have already been addressed, or are being actively addressed, by Databricks for future deployments.