Protecting sensitive data throughout the data science workflow is becoming increasingly critical: data breaches are ever costlier, precious assets can be lost, and regulations are getting stricter. Projects fail or never get started because the data could not be made available safely.
A common way of mitigating this risk is to create a safer version of the dataset by applying a technique called data masking. It is the legacy tool set used to preserve privacy when giving data scientists access to sensitive data. Yet its limitations make it hard to meet both the data protection and the data science objectives:
- It has weak data protection properties leaving privacy risks unaddressed.
- It can be detrimental to the data value impacting model quality.
- It is costly to put in place and manage due to regulatory constraints.
We will explore how data masking works and explain its limitations. We will also propose an alternative approach that leverages privacy-preserving learning techniques to eliminate the need for data masking, offering both higher data protection and the potential for more advanced models.
How Data Masking Works
Data Masking refers to altering the fields in a way that makes it harder to link a record to an individual. There are several techniques but they all pursue one of two objectives: reducing linkability or reducing accuracy.
Reducing linkability
This aims to prevent linking an individual with an external database while preserving the linkability between records within the database. It is often called pseudonymization and is usually achieved by encryption or hashing.
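As a sketch, pseudonymizing a name field with a salted hash could look like the following (the helper name and salt handling are illustrative, not any specific product's implementation):

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    # Hash the identifier together with a secret salt: the same name
    # always maps to the same token, so records stay linkable within
    # the dataset, but the token cannot be matched against an external
    # database without knowing the salt.
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:8]  # short token in the style of "c38a81f7"
```

Note that if the salt leaks, or if the attacker can feed candidate names through the same function, the tokens become reversible by brute force.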
Reducing accuracy
This involves making a field less accurate to reduce the chance of revealing the individual. There are two main ways to achieve this:
- Remove precision using techniques such as truncation or generalization (e.g., replacing birth dates with birth years). Since many records may share the same value, this helps with further aggregation in a relational database.
- Add noise to a value (e.g., adding a random number of days to a birth date).
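Both accuracy-reduction techniques fit in a few lines; the function names below are illustrative:

```python
import random
from datetime import date, timedelta

def generalize_birthdate(d: date) -> int:
    # Remove precision: replace the full birth date with the birth year.
    return d.year

def add_noise_to_birthdate(d: date, max_days: int = 30) -> date:
    # Add noise: shift the date by a random number of days in
    # [-max_days, +max_days].
    return d + timedelta(days=random.randint(-max_days, max_days))
```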
Note that, from the perspective of the data receiver, removing precision and adding noise are very similar: information is deliberately degraded. Less information makes re-identification harder, but it also lowers data utility. Designing a data masking strategy involves solving this tradeoff.
Let’s use some sample data to dig in and see how it works in practice.
An Example Using Mobility Data
Let’s assume we are working on the master plan for our city and are planning to create new bus lines with a special focus on underprivileged areas. We want to use the trip database of a ride-hailing company to get a better sense of mobility demand. The ride-hailing company only agrees to share data if it is fully anonymous. They implement the following data masking strategy:
Field        | Original value              | New value               | Technique
-------------|-----------------------------|-------------------------|-----------------
Full Name    | Jane Smith                  | c38a81f7                | Pseudonymization
Pick-up time | 7/4/2020 10:21am            | 7/4/2020 10am           | Truncation
Pick-up      | 55 Water St, New York, NY   | Financial District, NYC | Generalization
Drop-off     | 130 Prince St, New York, NY | SoHo, NYC               | Generalization
Price paid   | $14.50                      | $14.50                  | Unchanged
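This strategy could be implemented roughly as follows (the district lookup is a stand-in for a real address-to-neighborhood geocoding step):

```python
import hashlib
from datetime import datetime

# Stand-in for a real geocoder mapping street addresses to neighborhoods.
DISTRICTS = {
    "55 Water St, New York, NY": "Financial District, NYC",
    "130 Prince St, New York, NY": "SoHo, NYC",
}

def mask_trip(trip: dict) -> dict:
    return {
        # Pseudonymization: replace the name with a short hash token.
        "rider": hashlib.sha256(trip["full_name"].encode()).hexdigest()[:8],
        # Truncation: drop the minutes from the pick-up time.
        "pickup_time": trip["pickup_time"].replace(minute=0),
        # Generalization: replace street addresses with neighborhoods.
        "pickup": DISTRICTS[trip["pickup_address"]],
        "dropoff": DISTRICTS[trip["dropoff_address"]],
        # The price is left unchanged.
        "price": trip["price"],
    }
```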
What we can achieve with our dataset
For high level studies, it is a valuable asset. We can study the market size, trends on daily routes, or the impact of congestion pricing.
Drawbacks of our masking strategy
Opportunity cost: Advanced models are now out of reach, for instance:
- Designing an optimal bus route or placing bus stops: the truncation of the locations prevents us from positioning anything precisely.
- Offering better options for people with special needs: pseudonymization prevents us from linking records with the city database.
Weak privacy: While we cannot find individuals by looking up their name or address, there are still ways to identify someone with additional knowledge about them. If we know the times of a few of an individual's trips home, we could probably single them out and learn all of their trips. A manager approving expenses could infer all of an employee's trips using just the fare, time, and drop-off from a single expense report. There are countless possible re-identification attacks.
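To make the expense-report attack concrete, here is a toy sketch on masked records (all data invented; field names match the hypothetical masked schema above):

```python
def reidentify(masked_trips, fare, hour, dropoff):
    # With one known expense report (fare, pick-up hour, drop-off
    # district), filter the masked dataset down to the matching
    # pseudonyms. If only one remains, the rider is singled out,
    # and every trip under that pseudonym is exposed.
    return {
        t["rider"]
        for t in masked_trips
        if t["price"] == fare
        and t["pickup_hour"] == hour
        and t["dropoff"] == dropoff
    }
```

On even a small dataset, three quasi-identifiers are often enough to narrow the candidates down to a single pseudonym.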
The Main Limitations of Data Masking
It is hard to know how much data protection it provides
Data masking implicitly assumes a simple attacker who looks up fields whose connection with an individual is obvious (e.g., name, social security number, address). It protects against these attacks effectively with simple heuristics, such as deleting the 18 identifiers listed in the HIPAA Safe Harbor method.
But any remaining piece of information may lead to re-identification when combined with external information. An attacker may use public or private information, including information that will be available in the future! In the infamous Netflix case, users were re-identified using the movies they had watched — not names or addresses.
Data masking provides no guidance on how to deal with this diversity of attacks, making it unlikely that a good utility/protection trade-off will be found. This is the cost of not having a robust data protection theory.
It is inadequate with unstructured or high-dimensional data
The problem becomes even more acute with high-dimensional data: the more information left in the dataset, the weaker the data protection. Small tabular datasets may be manageable (though even our simple example failed quickly), but in richer datasets, preserving privacy while releasing a useful dataset appears out of reach. Here are examples where the altered data could not be deemed anonymous with standard data masking techniques:
- DNA has billions of nucleotide pairs that may be matched to individuals using genealogy or expression in phenotypes (pictures do reveal DNA!)
- Long location histories are easily matched with public information such as posts on social media.
- Messages include many clues on the author, such as word frequency or syntax (who would mention Sarus Technologies in 2019?).
Also, data masking techniques do not easily apply to less structured data such as free text, audio recordings, or GPS trace histories. With these data types, filtering fields one by one does not make much sense. To deal with any sensitive data, the problem needs to be tackled at a higher level.
Managing data masking rules is costly
Data masking can be both time-consuming and risky. Depending on the objectives, masking a field may be inconsequential or prohibitive; the residual risk may range from acceptable to intolerable depending on who uses the data. Compliance teams have to work with science and engineering teams to find an appropriate balance between risk, compliance, and utility. And this is true for every new data project. A lot of time can be lost, and many projects never come to fruition when a balance cannot be found. This dampens innovation and learning.
A New Approach to Keeping Data Safe
Some improvements on data masking have been proposed to reinforce data protection. Techniques such as k-anonymity or l-diversity provide more privacy, but at the cost of growing complexity. And they still don't address the limitations on richer data types.
With Sarus, we have radically changed the approach to learning on sensitive data. Instead of trying to release a “safer” version of the data, we keep the original data intact and enable data scientists to work on it remotely. We no longer focus on making sure that individuals cannot be found in datasets — which becomes exponentially difficult as the data grows. Instead, we invest in hiding the individuals in the output of the learning process — which is both easier and more efficient. It eliminates the need for data masking entirely.
How Sarus Works
Companies install Sarus on their infrastructures and open a secure gateway for data scientists to train models remotely. Working on the unaltered dataset ensures data scientists can extract all data utility. Sarus implements Differential Privacy on all interactions so that data protection holds irrespective of data types and use cases. Once Sarus is installed, there is no need for project-based compliance assessments, saving precious time for innovation.
With Sarus, there is no need for data masking and its limitations are addressed as follows:
- Better data protection: Sarus implements Differential Privacy on all interactions which provides an objective and robust framework to measure privacy risk.
- Use all data: By focusing on making the learning process anonymous instead of making the source data anonymous, Sarus enables learning on all data structures. This way data protection guarantees hold irrespective of data types and use cases.
- Faster processes: Managing an ad hoc rule set for each project is unnecessary because a single approach provides ultimate data protection in all applications. Data teams save precious time for innovation.
In our mobility data example, if the ride-hailing company had installed Sarus, the master plan team would have been able to learn from the unmasked data without the risk of leaking personal information. They would have achieved all of their objectives without compromising the privacy of individuals.
At Sarus, we believe leveraging more data faster is critical for successful innovation — but it should not come at the cost of privacy.
For more information, please visit sarus.tech.
Sarus designs privacy preserving solutions for faster innovation and collaboration on data with stronger data protection. With Sarus, data practitioners can safely work on full datasets that were once out of reach, creating opportunities for new applications.