Cape Python: Apply Privacy-Enhancing Techniques to Protect Sensitive Data in Pandas and Spark

Yann Dupis
Cape Privacy (Formerly Dropout Labs)
10 min read · Jul 31, 2020


We’re extremely excited to have recently released the Cape Python library. This library is one of the first building blocks to make your data science and machine learning pipelines privacy-preserving. As data scientists and data engineers, too often we unfortunately stumble on personally identifiable information (PII) that isn’t necessarily relevant to our project. Other times, we need access to a dataset, but it takes months to obtain. Compliance teams point out a lack of transparency regarding how the sensitive data will be protected, how it will be used, and who will have access. For obvious ethical reasons, and because of increasing regulation such as the GDPR, it’s becoming critical for data teams to make privacy a top priority when working on data science and machine learning projects. However, there is clearly a lack of accessible tools integrated with the current data science ecosystem to make this possible.

Cape Python offers several masking techniques to help you obfuscate sensitive data involved in internal data science and machine learning projects. We consider these techniques to be a first step in your privacy journey. We tried to make them as simple as possible so you can quickly start experimenting and thinking about how to make your projects more privacy-preserving. We built this library with the following objectives in mind:

  • Accessible: These techniques should be very easy to understand and apply at scale with popular data science libraries, such as Pandas and Spark.
  • Collaborative: Data science, machine learning, governance, and security teams should be able to collaborate effectively to define and edit data privacy policies through human-readable policy files.
  • Transparent: At any point, it should be possible to validate which privacy techniques have been applied to which dataset for which task.

In the rest of the blog post, we will introduce these masking techniques and give you some guidance on selecting the right technique for your use case. We will walk you through a concrete example where we experiment with several techniques in Pandas, then define a data privacy policy, and finally apply this policy in a Spark pipeline. You can find the code associated with this blog post here.

What are Cape Python’s masking techniques?

Cape Python offers several masking techniques to obfuscate identifiers (a process known as de-identification) and other sensitive information included in the dataset. The appropriate method to use depends on the data type (string, numeric, date, etc.), the accuracy you need to maintain, and the type of identifiers:

  • Direct identifiers: information that relates specifically to an individual. For example: name, address, social security number, etc.
  • Indirect identifiers / quasi-identifiers: information that can be combined with other information to potentially identify a specific individual. For example: city, zip code, income, etc. As a concrete example, Latanya Sweeney’s work demonstrated that “87% of the U.S. population are uniquely identified by date of birth, gender and zip code”.

Here is the preliminary list of techniques:

Tokenizer: this maps each value to a token. If a value is repeated several times across the field, it always gets mapped to the same token, so counts (for example, the number of unique clients) are preserved. Typically, you would apply the tokenizer to a string field such as name.

Perturbation: this adds noise to numeric and date fields. The transformation gives you the ability to tune the amount of noise you’d like to add. For example, add a random value between -5 and 5 to the age field. If it’s a date, you can specify the amount of noise for each frequency (e.g., day, hour, minute, etc.).

Rounding: this reduces the precision of a numeric field by rounding it to a given number of digits. For example, you could reduce the precision of a salary field or a latitude/longitude field.

Redaction: this deletes certain rows or columns. You could decide, for example, to drop the street name and zip code fields from the dataset because they are irrelevant to the task.
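
To give a feel for the API, here is a minimal sketch of how a single transformation is applied with the Pandas interface (the column name is illustrative, and parameter names follow the Cape Python documentation as we recall it, so double-check them against the version you have installed):

```python
import pandas as pd

from cape_privacy.pandas.transformations import Tokenizer

df = pd.DataFrame({"name": ["alice", "bob", "alice"]})

# The same input value always maps to the same token, so unique counts are preserved
tokenize = Tokenizer(max_token_len=10, key=b"my secret")
df["name"] = tokenize(df["name"])
```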

With these four simple techniques, you can already mask your sensitive data for data science tasks. However, we’d like to bring to your attention that although these techniques allow you to reduce individual privacy leakage, they don’t guarantee protection against all potential privacy attacks.

Pseudonymization is not enough to release data to an untrusted audience

In 2006, Netflix launched a machine learning competition that consisted of predicting users’ movie ratings based on a set of previous ratings. Before publishing the dataset publicly, they carefully removed all the user information and perturbed several records (by minimally modifying rating dates, for example). Even though they took these precautions, in 2008 two researchers from The University of Texas at Austin demonstrated that it was possible to re-identify individuals by leveraging public IMDb ratings as an external source of knowledge (this is known as a linkage attack).

The process of applying masking techniques to remove direct identifiers is called pseudonymization. With Cape Python you can obfuscate the indirect identifiers by perturbing the values or reducing their precision. However, as demonstrated by the Netflix example, these techniques will not prevent all privacy attacks, such as linkage attacks.

For this reason, it’s important to use these techniques only in an environment where the assumption of a trusted data user is satisfied. For example, when a data scientist is working internally on a credit risk modeling or fraud detection project. However, these techniques are not sufficient to release a dataset publicly or to share it with an untrusted external organization.

For data science use cases in public or untrusted settings, we plan to support differential privacy in a later release of Cape Python. Differential privacy allows us to quantify the privacy leakage and offers stronger privacy guarantees. Here is the definition by Cynthia Dwork, who co-invented differential privacy: “Differential privacy describes a promise, made by a data curator to a data subject: you will not be affected, adversely or otherwise, by allowing your data to be used in any study, no matter what other studies, data sets, or information from other sources is available.” In the context of the Netflix example, this guarantee means an attacker would not be able to re-identify individuals in the Netflix dataset even by performing a linkage attack using IMDb data.

Let’s dive into a concrete example of using Cape Python in a data science pipeline.

Easily prototype your privacy-preserving pipeline in Pandas

It’s extremely easy to start experimenting with these masking techniques using Pandas. Let’s say you are a team of data engineers and data scientists working on a credit risk modeling task for a bank. You know that there is a credit dataset you could use, but it includes some sensitive information about the clients. As a team, you’d like to experiment with these masking techniques on a mock dataset similar to the credit dataset to figure out how best to preserve privacy.

As an example, we will experiment with the public German credit dataset. We simply added some fake PII (such as name, address, etc.) and quasi-identifiers to make it more similar to a real dataset on which we would use these techniques.

Voilà, a sample of the dataset:
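
If you’d like to follow along, you can load the mock dataset into a Pandas DataFrame and take a quick look. The file and column names below are illustrative; use the ones from the repository linked above:

```python
import pandas as pd

# Illustrative file name for the mock credit dataset from the companion repository
df = pd.read_csv("mock_credit_data.csv", parse_dates=["application_date"])
df.head()
```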

Here are some questions you could ask yourself to figure out how to mask this dataset:

  • Who can view this dataset?
  • What are the direct and quasi-identifiers?
  • Is there sensitive data irrelevant to the task that I could redact right away?
  • Could I re-identify a person using the quasi-identifiers (city, salary, age, etc.) and additional sources of information?
  • How much utility from this sensitive data do I need to maintain for this task?

One potential approach to protect the credit dataset could be to say that ‘name’ is irrelevant to the task, and therefore we should remove it. However, there’s another approach. We might want to count the number of unique people or clients in the dataset, or check whether there is one loan or several per client, and then validate the aggregation level of the dataset. For these reasons, you can apply the tokenizer, which will map each name to a unique token. Each name will be obfuscated, but the dataset will still maintain the correct user count.

The dataset also contains an individual’s address and city. If this location information is too granular or not useful for the task, we could simply remove these columns. Or, if we think the credit risk could vary based on the city, we could tokenize these fields instead. We won’t know the city name; however, we can still assess the predictive power of this variable if it is included in the model.

Do we need to know the exact age of the client? Or would having a general idea of their age be sufficient to assess credit risk? If the latter, we can perturb age by adding a random value within a certain range (e.g., [-5, 5]). (Just keep in mind the amount of noise could impact the utility/accuracy of your model).

Even though the application date is not a direct identifier, if we know that only one person applied on a certain day, we could re-identify this person. We could perturb these dates by adding or subtracting days (e.g., within [-3, 3]) to help reduce the ability to link this column with other information or datasets.

You could also tokenize the sex field. You probably don’t want to include this variable in your credit risk model. However, for fairness considerations, you should validate that people receive similar loan approval rates independent of their gender.

Finally, you can reduce the precision of salary, which could be considered sensitive information, by rounding it to the nearest 1,000. If you decide to include the debt ratio in your model, this transformation shouldn’t significantly impact the outcome of your model.

Let’s see how it looks in code:
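
Below is a sketch of what this could look like with Cape Python’s Pandas transformations. The column names are illustrative, and the transformation and dtype names follow the documentation as closely as we can reproduce here, so verify them against your installed version:

```python
from cape_privacy.pandas import dtypes
from cape_privacy.pandas import transformations as tfms

# Tokenize the direct identifiers (and the fields we still want to count or group on)
tokenize = tfms.Tokenizer(max_token_len=10, key=b"my secret")
for column in ["name", "address", "city", "sex"]:
    df[column] = tokenize(df[column])

# Perturb age with random noise in [-5, 5]
perturb_age = tfms.NumericPerturbation(dtype=dtypes.Integer, min=-5, max=5)
df["age"] = perturb_age(df["age"])

# Shift the application date by up to +/- 3 days
perturb_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)
df["application_date"] = perturb_date(df["application_date"])

# Round salary to the nearest 1,000 (negative precision rounds left of the decimal point)
round_salary = tfms.NumericRounding(dtype=dtypes.Float, precision=-3)
df["salary"] = round_salary(df["salary"])
```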

Once you have applied these transformations, your dataset should look like this:

When applying privacy techniques, it’s always important to keep in mind the trade-off between privacy and utility. Let’s assume that age is a strong predictor for distinguishing good credit risk from bad credit risk. By increasing the amount of noise, you reduce the risk of privacy leakage; however, you also lose some utility. Comparing the age distributions, the distribution without perturbation and the one with perturbation within the interval [-5, 5] tend to be similar: higher-risk individuals tend to be younger, with a spike in the mid-20s. However, if you add more noise with perturbation within the interval [-20, 20], the distribution starts to look very different from the original. By adding too much noise, you might lose the predictive power of this variable.
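
If you want to reproduce this comparison yourself, one quick way is to plot the age histogram at different noise levels. Here is a sketch using matplotlib, starting from an unmasked copy of the mock dataset (the file name is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

from cape_privacy.pandas import dtypes
from cape_privacy.pandas.transformations import NumericPerturbation

raw_df = pd.read_csv("mock_credit_data.csv")  # unmasked copy of the mock dataset

settings = [("no perturbation", 0), ("noise in [-5, 5]", 5), ("noise in [-20, 20]", 20)]
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True, sharey=True)

for ax, (title, scale) in zip(axes, settings):
    ages = raw_df["age"]
    if scale > 0:
        perturb = NumericPerturbation(dtype=dtypes.Integer, min=-scale, max=scale)
        ages = perturb(ages)
    ax.hist(ages, bins=20)
    ax.set_title(title)
    ax.set_xlabel("age")

plt.tight_layout()
plt.show()
```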

In the future, we are planning to introduce differential privacy techniques which will allow you to quantify the privacy loss depending on the amount of noise, and also provide strong theoretical guarantees.

Bring transparency to your organization through a policy-based interface

As mentioned earlier, we believe transparency and collaboration are critical if we want to apply privacy techniques effectively. Once you have figured out which transformations you’d like to apply to the different fields, you can express the exact same transformations in a policy file. Here is a subset of the policy file (the entire policy can be found here):
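
As a sketch, a subset of such a policy file could look like the YAML below. The rule and transformation names mirror the programmatic API, but treat the exact schema (field names, type identifiers, dtype strings) as an assumption and check it against the Cape Python documentation:

```yaml
# credit_policy.yaml -- illustrative subset; the exact schema may differ by version
label: credit_dataset_policy
version: 1
rules:
  - match:
      name: name
    actions:
      - transform:
          type: tokenizer
          max_token_len: 10
          key: my secret
  - match:
      name: age
    actions:
      - transform:
          type: numeric-perturbation
          dtype: Integer
          min: -5
          max: 5
  - match:
      name: salary
    actions:
      - transform:
          type: numeric-rounding
          dtype: Float
          precision: -3
```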

Once you have written your policy, it can be applied to your Pandas DataFrame with only two lines of code:
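
Assuming the top-level helpers from the Cape Python README, that looks roughly like this (the policy file name is illustrative):

```python
import cape_privacy as cape

# The two lines: parse the policy file, then apply it to the DataFrame
policy = cape.parse_policy("credit_policy.yaml")
df = cape.apply_policy(policy, df)
```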

Make your Spark pipeline privacy-preserving

Finally, if you want to deploy these masking techniques at scale with Spark, you can apply this policy with the exact same two lines of code:
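
Here is a sketch of the Spark version, assuming apply_policy accepts a Spark DataFrame in the same way (the session setup and file name are illustrative):

```python
import cape_privacy as cape
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cape-credit-demo").getOrCreate()
spark_df = spark.read.csv("mock_credit_data.csv", header=True, inferSchema=True)

# The exact same two lines as in Pandas, now applied to a Spark DataFrame
policy = cape.parse_policy("credit_policy.yaml")
spark_df = cape.apply_policy(policy, spark_df)
```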

If needed, it’s also possible to apply the transformations to the Spark DataFrame with the programmatic approach we used earlier in Pandas. The transformation API is the same for Pandas and Spark.
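
For the programmatic route, a minimal sketch might look like the following. We assume here that the Spark transformations produce column expressions that can be used with withColumn, so verify this pattern against the Cape Python examples:

```python
from cape_privacy.spark.transformations import Tokenizer

# Assumed usage: the transformation returns a Spark column expression
tokenize = Tokenizer(max_token_len=10, key="my secret")
spark_df = spark_df.withColumn("name", tokenize(spark_df["name"]))
```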

Next Steps

Hopefully, this tutorial gave you a better idea of how and when you can apply privacy-preserving masking techniques, and how you can go from prototyping these techniques during exploratory analysis to applying them transparently at scale.

We are planning to expand these techniques, and even give you the ability to create and contribute your own. We would love to hear your feedback and suggestions through our Product roadmap or via GitHub issues for additional integrations (e.g., Beam, Dask, etc.) and privacy techniques that you would find useful. We also encourage contributions, so feel free to take a look at our open issues and chime in via our Slack community if you’d like to get more involved!

In the coming weeks, we are planning to open-source additional Cape Privacy features, so stay tuned for future releases!

A big thank you goes out to Katharine Jarmul, Jason Mancuso, Morten Dahl and Dragos Rotaru for help with this post!

About Cape Privacy

Cape Privacy is an enterprise SaaS privacy platform for collaborative machine learning and data science. It helps companies maximize the value of their data by providing an easy-to-use collaboration layer on top of advanced privacy and security technology, helping enterprises increase the breadth of data included in machine learning models. Cape Privacy’s platform is flexible, adaptable, and open source. It helps build trust over time by providing for seamless collaboration and compliance across an organization or multiple businesses. The company is based in New York City and is backed by boldstart ventures and Version One, with participation from Haystack, Radical, and Faktory Ventures.
