Building a PII Firewall with Delta Live Tables

Andrew Weaver
13 min read · Jul 15, 2022


GDPR — 4 small letters to type, 1 huge problem for data teams.

A vast and evolving problem space

From solutions for climate change, to life-saving COVID treatments and finding cures for cancer, data teams today are solving the world’s toughest problems. But in order to solve these problems, they need data, and most of the time getting their hands on that data means that they have to contend with another huge problem: how do they stay on the right side of data privacy laws like the GDPR and CCPA?

There’s no easy answer to this — if any vendor tells you that their product provides GDPR-as-a-service then I would ask them if they’re selling magic beans too.

Why not just use Vendor A’s magic GDPR beans?

Now, I should quickly follow that up by acknowledging that my colleagues at Databricks have blogged in the past about how Delta Lake can help you to Avoid Drowning in GDPR Data Subject Requests by implementing ‘Right to be Forgotten’ requests, and that’s definitely true. But these blogs have largely focussed on the final step — how to apply a data subject request — zooming past all of the complexity that comes with finding out which tables you need to apply it to in the first place.

In reality, most data platforms are made up of many workspaces, each with their own metastore, and those metastores typically contain thousands of databases and hundreds of thousands of tables. And it’s not just the number of tables that makes solving for data privacy such a vast problem space. Volume, velocity, veracity and variety are all important factors too. It’s easier to search for and apply a data subject request to a 1TB table that’s changing infrequently than it is to a 170PB table that’s changing every second. Equally, it’s easier to search for and delete or redact PII from structured datasets than it is for unstructured data like images (or, harder still, video and audio).

And that’s all without considering the different classifications of PII, or the different subject types involved:

  • Natural living persons
  • Children
  • Populations

All of which come with different regulations and data management requirements.

Finally, the data privacy laws themselves are an ever-changing landscape, often a matter of interpretation, and as such our understanding of them is constantly evolving too.

Simplifying the problem

Ok, so we’ve established that staying on the right side of data privacy laws is a complex data problem in its own right, but let’s try and simplify it by unpacking the 3 foundational capabilities you’re going to need in order to stay compliant:

  1. Identify — you need to be able to identify whether you have PII in the first place, and if so, where it is stored.
  2. Protect — you need to determine what actions to take to protect that PII. The decisions you make here will have huge implications for how easy or hard the downstream management of that data is going to be.
  3. Manage — at some point in the future you may need to find this PII and apply specific actions to it (such as responding to Data Subject Requests, or DSRs).
The 3 foundational capabilities you need to safely process PII data

Sounds straightforward, right? Well, yes and no. At every stage, the decisions you make are going to make life much easier or harder for you or someone else. Let’s take a look at the 3 main options for protecting PII and what they mean for downstream processing of that data:

1. Do nothing — which is likely to make downstream management much harder because in future you may need to be able to find all instances of that PII across all of your datasets and apply the appropriate action to it.

2. Pseudo-anonymise — which is likely to make downstream management easier, because throughout the majority of your data the PII is replaced with a pseudo-anonymous identifier. DSRs like right-to-be-forgotten requests can then be satisfied by removing the entry from a far smaller subset of tables (often called token vaults), which are the only place the identifier can be matched back to its associated item of PII (see the sketch after this list).

3. Anonymise — if done correctly this is likely to make downstream management much easier, since fully anonymised data cannot be attributed back to a natural living person, and therefore is not typically subject to data privacy laws like GDPR. It’s important to note, however:

  • It’s hard to have absolute certainty that a dataset has been fully anonymised, especially considering that there have been several high profile examples in which datasets that were thought to have been “anonymised” were able to be de-anonymised when joined to other seemingly unrelated (or at least semi-related) datasets.
  • You need to try to make sure that during the process of anonymisation valuable signals in the data have not been lost.
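To make option 2 a little more concrete, here’s a minimal sketch of the token vault pattern in PySpark. All table, column and salt names here are hypothetical, and in practice you’d fetch the salt from a secret scope and lock the vault down tightly:

```python
from pyspark.sql.functions import col, concat, lit, sha2

SALT = "some_secret_salt"              # hypothetical - store this in a secret scope
events = spark.table("raw.events")     # hypothetical table with an `email` column

# Token vault: the ONLY place where a token can be matched back to the PII
token_vault = (events
               .select("email").distinct()
               .withColumn("token", sha2(concat(col("email"), lit(SALT)), 256)))
token_vault.write.mode("overwrite").saveAsTable("restricted.token_vault")

# Pseudo-anonymised copy: the identifier replaces the PII everywhere downstream
pseudonymised = events.withColumn("email", sha2(concat(col("email"), lit(SALT)), 256))
pseudonymised.write.mode("overwrite").saveAsTable("silver.events")

# A right-to-be-forgotten request now only needs to touch the token vault, e.g.:
# DELETE FROM restricted.token_vault WHERE email = 'person@example.com'
```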

If you want to find out more about the various options available, I’d recommend checking out the summary given by my colleague Serge Smertin at Data and AI Summit Europe 2020.

Now that we’ve established the three core capabilities that we need, let’s consider each in turn, looking at some easy-to-implement patterns that we can follow now, as well as introducing what we mean by a “PII Firewall”.

1. Identify

The first step is knowing what PII you have in the first place. Realistically you have two options here:

  1. You scan your data for PII on entry
  2. You scan or crawl your data for PII regularly after it has been ingested

For the purposes of this blog, we’re largely going to focus on option 1, but the solutions we propose could easily be adapted to run as a scheduled job that scans existing datasets too.

Building a PII Firewall with Delta Live Tables

Delta Live Tables (DLT) is, in the words of my colleague Jason Pohl, “the last upgrade to your data pipeline you’ll ever need to make”. In addition to making it easy to build and manage reliable data pipelines, DLT allows you to define and enforce data quality expectations on your data. These expectations prevent bad data from flowing into your tables by evaluating that data against predefined rules. You can choose how to handle any rows that fail by specifying an appropriate action (fail, drop, alert or quarantine), and what’s more, those expectations are as flexible as the language they’re written in — SQL — making them great for handling a huge number of different types of data quality or detection scenarios.

In this example we’re going to use a combination of regular expressions and data expectations to build a “PII Firewall.” Regular expressions are a common pattern for identifying PII in structured or semi-structured data, and in my experience they’re still the main method of PII identification used by most enterprise tools in this space. Essentially think of the regular expressions as our firewall rules, and DLT as the engine that evaluates and enforces these rules against incoming data.
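As a flavour of what a single “firewall rule” looks like, here’s a minimal, hypothetical DLT expectation built from a deliberately simplified email regex. The dataset and column names are illustrative, and the real rules live in the repo’s configuration file:

```python
import dlt

# Deliberately simplified email regex - the repo ships a more carefully tested set of rules
EMAIL_REGEX = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}"

@dlt.table(comment="Rows in which no email address was detected")
@dlt.expect_all_or_drop({"no_email_detected": f"NOT (contact RLIKE '{EMAIL_REGEX}')"})
def clean_contacts():
    return dlt.read("raw_contacts")   # hypothetical upstream dataset with a `contact` column
```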

Regular expressions are a common pattern for identifying PII in structured or semi-structured data

The code that we’re using can be checked out from this repo, and you can follow the setup steps to get it running in your environment.

Once it’s run, you should see something like this:

A DLT pipeline for automatically detecting and redacting PII data

As you can see, the VIEWS used to load our data before it has been evaluated against our expectations are temporary, meaning that no data that may potentially contain PII is persisted without our consent after the pipeline has run. Data is then split out into clean and redacted tables before being unioned into a single output table that contains either data that has passed our expectations (and is therefore not expected to contain PII) or data that has failed our expectations (and has therefore been protected based on a set of configurable actions).
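A rough sketch of that layout in DLT’s Python API might look something like the following. The dataset, path and column names are all hypothetical, and the real pipeline in the repo generates these tables and expectations dynamically rather than hard-coding them:

```python
import dlt
from pyspark.sql.functions import lit

EMAIL_RULE = "NOT (email RLIKE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}')"

@dlt.view(comment="Temporary view over the raw input - nothing is persisted here")
def staging():
    return spark.read.format("delta").load("/tmp/pii_demo/raw")   # hypothetical path

@dlt.table(comment="Rows that passed every expectation, i.e. no PII was detected")
@dlt.expect_all_or_drop({"no_email_detected": EMAIL_RULE})
def clean():
    return dlt.read("staging")

@dlt.table(comment="Rows that failed an expectation, with the offending columns redacted")
def redacted():
    return (dlt.read("staging")
            .where(f"NOT ({EMAIL_RULE})")
            .withColumn("email", lit("[REDACTED]")))

@dlt.table(comment="Union of clean and redacted rows, safe for downstream consumption")
def output():
    return dlt.read("clean").unionByName(dlt.read("redacted"))
```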

The expectations and actions are defined in the configuration file provided. Each expectation (labeled constraint) contains a regex specifically designed to identify PII, and an action used to redact it. The default configuration file contains examples for the following types of PII:

  • Credit Card Number*
  • Card Expiry Date*
  • Card Security Code* (likely to generate false positives on integer fields)
  • IBAN
  • Email Address
  • IPV4 Address
  • IPV6 Address
  • MAC Address
  • Phone Number (based on a specific format)
  • Date of Birth (likely to generate false positives where the data contains dates prior to the millennium)
  • Age (likely to generate false positives on integer fields)
  • US Address (based on a specific format)
  • US SSN

*Largely used for illustrative purposes. If you’re processing cardholder data, you should evaluate your requirements carefully against the relevant compliance controls (i.e. PCI DSS).
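To give a sense of what one of those entries might contain, here’s a hypothetical example. The field names, the `{col}` placeholder convention and the regex are illustrative rather than a copy of the repo’s actual format, so check the repo for the real thing:

```python
# Hypothetical configuration entry; `{col}` is a placeholder the pipeline would
# substitute with each column name it scans.
EXPECTATIONS = [
    {
        "name": "email_address",
        # Constraint: a SQL boolean expression that passes when no email is found
        "constraint": "NOT ({col} RLIKE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}')",
        # Action: a SQL expression applied to the column when the constraint fails
        "action": "regexp_replace({col}, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}', '[REDACTED]')",
        "mode": "REDACT_AND_TAG",
        "tag": "Contains PII: email address",
    },
]
```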

The pipeline will dynamically generate a suite of expectations based on the schema of the data you point it at. If that schema contains structs of nested data, it will flatten the first layer of those so that the expectations can be applied uniformly across all of the fields contained within them. It will then dynamically generate SQL expressions that apply the specified action to any columns that fail an expectation.
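For illustration, the dynamic generation step could look roughly like this, assuming the hypothetical configuration format sketched above (the real implementation also handles flattening the first layer of nested structs):

```python
from pyspark.sql.types import StructType

# Rough sketch, assuming the hypothetical EXPECTATIONS format shown earlier
def generate_expectations(schema: StructType, rules: list) -> dict:
    """Build a {name: constraint} dict that can be passed to @dlt.expect_all_or_drop()."""
    expectations = {}
    for field in schema.fields:
        if field.dataType.typeName() == "string":        # only scan string-typed columns here
            for rule in rules:
                expectations[f"{rule['name']}_{field.name}"] = (
                    rule["constraint"].replace("{col}", field.name))
    return expectations

# The redaction side works similarly: for each flagged column, the configured action
# becomes a SELECT expression, e.g. "regexp_replace(email, ...) AS email".
```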

When we run this against some fake data generated by this notebook (with 10,000,000 records that contain PII and 4 that do not) we should see the following:

Our regex based expectations have successfully detected PII

As you can see, we’ve successfully identified 10 million rows of PII in about 15 minutes on a relatively modest cluster!

2. Protect

As part of our configuration file we also declare a “mode”. There are 3 possible modes:

  • REDACT
  • REDACT_AND_TAG
  • TAG

As mentioned above, if the mode associated with a particular expectation is REDACT or REDACT_AND_TAG and that expectation fails, we will apply the action specified in the configuration file. This can be any SQL expression, but in this context it’s normally one designed to redact or anonymise our data. When we apply the actions defined in our configuration file we should see the following results:
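For illustration, here are a few of the kinds of SQL expressions that could be used as actions. These are examples I’m adding for clarity rather than the defaults shipped in the repo, and the column and salt names are hypothetical:

```python
# Example redaction / anonymisation / pseudo-anonymisation actions (illustrative only)
EXAMPLE_ACTIONS = {
    # Blunt redaction: replace the whole value
    "redact":       "'[REDACTED]'",
    # Redact only the matching substring, keeping the rest of the field
    "redact_match": "regexp_replace(email, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}', '[REDACTED]')",
    # Pseudo-anonymise: a salted hash that can only be matched back via the token vault
    "pseudonymise": "sha2(concat(email, 'some_secret_salt'), 256)",
    # Generalise: keep the domain but drop the identifying local part
    "generalise":   "regexp_extract(email, '@(.*)$', 1)",
}
```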

We can use SQL based expressions to automatically anonymise or pseudo-anonymise the PII we have detected

As you can see, we’ve not only managed to identify PII, but also protect it based on the actions specified!

3. Manage

Management is by far the hardest part of staying on the right side of data privacy laws. Managing PII data entails:

  • Being able to find the tables which may contain PII
  • Being able to find the PII in question
  • Being able to apply the necessary actions to that PII

As we established earlier, there are lots of great examples that focus on how to apply DSRs in a reliable and performant way, but how do we figure out which databases, tables and columns we need to apply them to in the first place? Well, we have a solution for that too. After the DLT pipeline has completed, we can schedule an additional tagging notebook to run, orchestrating the two tasks to run in sequence via multi-task job workflows:

Databricks workflows lets us chain multiple tasks together

This tagging notebook adds PROPERTIES and COMMENTS to all databases and tables that have been scanned:

We can add properties to our database to say that they have been scanned for PII
We can add properties to our tables to say that they have been scanned for PII

It also adds a COMMENT (defined as a tag in the configuration file) to every column that has been flagged as potentially containing PII:

We can add comments to every column so that it is clear what PII has been discovered, and what action has been taken against it

Using these properties and comments, it’s possible to programmatically search across all databases, tables and columns, identifying those which may contain PII and, more importantly, whether it’s the type of PII that you’re interested in.
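The tagging itself, and the subsequent search, boil down to standard metadata operations. A minimal sketch is shown below; the database, table, property and comment values are all hypothetical:

```python
# Tag the database, table and column after the scan (hypothetical names and values)
spark.sql("ALTER DATABASE pii_demo SET DBPROPERTIES ('pii_scanned' = 'true')")
spark.sql("ALTER TABLE pii_demo.output SET TBLPROPERTIES ('pii_scanned' = 'true', 'pii_detected' = 'true')")
spark.sql("ALTER TABLE pii_demo.output ALTER COLUMN email COMMENT 'Contains PII: email address (REDACTED)'")

# Later, the same metadata can be queried programmatically, for example:
scanned = spark.sql("SHOW TBLPROPERTIES pii_demo.output").where("key = 'pii_detected'")
pii_columns = spark.sql("DESCRIBE TABLE pii_demo.output").where("comment LIKE 'Contains PII%'")
```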

Why good governance starts with Unity Catalog

Ok, what we’ve covered so far is pretty cool, but at the beginning you said “most data platforms are made up of many workspaces, each with their own metastore, and those metastores typically contain thousands of databases and hundreds of thousands of tables.”

That’s a lot of places I need to manage my data!

I hear you, and that’s one of the reasons Databricks has built Unity Catalog (UC). UC provides fine-grained governance for all of your data and AI assets. With UC, customers can link all of their workspaces to a central, unified governance layer, providing fine grained access and secure data sharing without having to duplicate the data.

Create a single source of truth for your data with Unity Catalog

And in the words of Lt. Col. Frank Slade in my favourite film — Scent of a Woman — the UC team are “just getting warmed up!” As per the announcements at Data and AI Summit 2022, some of the great things to come are:

  • General availability — in the coming weeks!
  • Advanced search features — so you can search across all of your data and AI assets
  • Attribute Based Access Control (ABAC) — Simplify governance at scale with a simple yet powerful tagging system to manage access to your data
  • End-to-end data lineage — automatically capture runtime data lineage, down to the column and row level, providing data teams with an end-to-end view of how data flows in the lakehouse
  • Primary and Foreign Key declarations — to help end users understand the relationships between tables
  • INFORMATION_SCHEMA — to provide privileged users with a DBA-level view over your Lakehouse
End-to-end data lineage with Unity Catalog

Let’s take a step back (or rather forwards) and allow ourselves to imagine how these features might help us to manage our data and AI assets in the near future:

  1. We scan our data for PII on entry, in much the same way as we have demonstrated in this blog
  2. Where the data is considered to contain PII, we redact some fields but also add a “pii” attribute along with the type of PII
  3. Only privileged users with the attribute “pii” can see these columns, for all other users who query the same data product the column is hidden, masked or redacted
  4. When we receive a DSR, we can easily search across all of our data assets to find which tables may contain PII, as well as the types of PII that we’re interested in
  5. We can use the end-to-end data lineage, PKEY and FKEY declarations, as well as our INFORMATION_SCHEMA, to easily build a picture of all of the downstream tables that might contain this type of PII
  6. Now we can easily apply the DSR to each of those tables in a reliable and performant way

If that sounds like a future you want to be a part of, sign up for the Public Preview of Unity Catalog now, or stay tuned for the General Availability announcement in the coming weeks!

Like Frank Slade in Scent of a Woman, the UC team are “just getting warmed up!”

Next Steps

In this blog we’ve demonstrated that you can use an approach similar to those used by many enterprise scanning and detection tools to identify PII in your data. But my ambition is to be able to get as close to GDPRaaS on the Lakehouse as possible. As such, my plans for the future are:

  • A firewall is only as good as its rules — the regular expressions provided are offered on an as-is basis and without any guarantees. That said, I’d love to test them against more data, refine their accuracy and add new regexes for PII types that aren’t covered yet. If you’re interested in collaborating on this, I’d love to hear from you!
  • 80–90% of the world’s data is unstructured — whilst the approaches outlined in this blog work well for structured and pretty well for semi-structured data, unstructured data comes with its own, entirely different challenges! Stay tuned for more blogs on how you can automatically handle unstructured data in a GDPR and CCPA compliant way!
  • Regular expressions will only get us so far — firstly I’d like to dispel the myth that regular expressions are an outdated method of identifying PII — in my experience most of the enterprise platforms that I encounter in this field still rely on regexes for most of their PII detection. That said, it’s important to acknowledge that this approach is not without its limitations. If you used the notebook provided to generate your input data you’ll notice that there are two fields that our detection has failed pretty miserably on: name and freetext. Let’s unpack each one to understand why…

name: short of having a dictionary of every person’s name on the planet — in every language and every alphabet — detecting a person’s name via regex is pretty much an impossible job.

freetext: the way that expectations work means that if an expectation fails, that row will not be evaluated against further expectations. It’s similar to error handling — if a row has failed, we mark it as failed and move on. The complication is that for fields which may contain multiple instances of PII (such as freetext fields), only the first type of PII found will be identified and redacted.

So what’s the answer here? I’d like to look at using more sophisticated approaches like Named Entity Recognition (NER) or Natural Language Processing (NLP) to improve the detection of specific PII types like names, addresses or businesses, and also to apply identification and redaction to the contents of an entire column at a time. Again, if you’re interested in collaborating on this, please let me know!
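To give a flavour of the NER direction, here’s a minimal sketch using spaCy — just one of many possible libraries, and not part of the current repo. The model name, entity labels and example sentence are illustrative assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def redact_entities(text: str, labels=("PERSON", "GPE", "ORG")) -> str:
    """Redact every named entity of the given types within a free-text field."""
    doc = nlp(text)
    redacted = text
    # Work backwards through the matches so character offsets remain valid
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            redacted = redacted[:ent.start_char] + "[REDACTED]" + redacted[ent.end_char:]
    return redacted

print(redact_entities("Frank Slade moved to New York last year."))
# e.g. "[REDACTED] moved to [REDACTED] last year."
```

Unlike a single regex expectation, an approach like this can pick up names and can redact multiple instances of PII within the same free-text value, which addresses both of the gaps described above.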

Credits

Huge thanks to Monzy Merza & Lipyeow Lim for their ideas, thought leadership and guidance throughout the development of this blog, and to Justin Olsson for all of the above and for doing so much to help me and many others understand the fundamentals of data privacy laws like the GDPR!
