How to share tabular data in a privacy-preserving way

Add noise to existing rows, add noise only to the outcomes of tasks performed on that data, or generate synthetic data? An intuition.

Coussement Bruno
datamindedbe
8 min read · Nov 18, 2020


Source: Pixabay

As companies grow, regulations get stricter, and senior IT architects catch up with the latest trends, the need (or obligation) for data-processing entities to mitigate privacy and leakage risks gets stronger.

Data anonymization and data tokenization techniques are widely used in this context, even though they still allow private information to be divulged (see https://mostly.ai/why-synthetic-data/ for an accessible explanation of why this is).

Synthetic data generation

Synthetic data is fundamentally different. The goal is to build a data generator that reproduces the same global statistics as the original data: it should be hard for a model, or a person, to distinguish the generated data from the original.

Let’s illustrate this by generating synthetic data on the Covertype dataset using the TGAN model.

The Covertype dataset. Source screenshot: Author
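For reference, here is a minimal sketch of that workflow, assuming the open-source tgan package's documented TGANModel API and scikit-learn's Covertype loader (the exact hyperparameters used in this post aren't stated, so defaults are kept):

```python
from sklearn.datasets import fetch_covtype
from tgan.model import TGANModel

# Load Covertype as a DataFrame; the first 10 columns (Elevation, Aspect,
# Slope, distances, hillshade indices) are the continuous ones.
data = fetch_covtype(as_frame=True).frame
continuous_columns = list(range(10))

model = TGANModel(continuous_columns)  # default architecture and training settings
model.fit(data)                        # expect this to take a while (GAN training)

synthetic = model.sample(5000)         # draw 5000 synthetic rows
```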

After training the model on this table, I generated 5000 rows and plotted a histogram of the Elevation column for both the original and the generated set. Both lines seem to match visually.

Histogram of Elevation column of original and generated set. Source: Author
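The check itself is straightforward with matplotlib, continuing the snippet above (the column name follows scikit-learn's loader):

```python
import matplotlib.pyplot as plt

# Overlay normalized histograms of the same column from both sets.
plt.hist(data["Elevation"], bins=50, density=True, histtype="step", label="original")
plt.hist(synthetic["Elevation"], bins=50, density=True, histtype="step", label="generated")
plt.xlabel("Elevation")
plt.ylabel("density")
plt.legend()
plt.show()
```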

To check the relations between pairs of columns, a pairplot of all the continuous columns is shown. The shape traced by the blue-green dots (generated) should visually match that of the red dots (original). This is the case, nice!

Pairplots of continuous columns of original and generated set. Source: Author
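A sketch of the pairplot check, assuming seaborn and a subsample of each set for speed:

```python
import pandas as pd
import seaborn as sns

cont_cols = data.columns[:10]

# Stack a subsample of both sets, tagged by origin, so seaborn can color them.
combined = pd.concat([
    data[cont_cols].sample(2000, random_state=0).assign(source="original"),
    synthetic[cont_cols].sample(2000, random_state=0).assign(source="generated"),
])
sns.pairplot(combined, hue="source", plot_kws={"s": 5, "alpha": 0.4})
```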

If we now look at mutual information (a measure of dependence, loosely "correlation without sign") between columns, then columns that are correlated in the original set should also be correlated in the generated set. Conversely, columns that show no correlation in the original set should not be correlated in the generated set. A value close to 0 means no dependence, while a value close to 1 means perfect dependence. Great, this is the case!

Mutual information between columns of the original set. Source: Author
Mutual information between columns of the generated set. Source: Author
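The post doesn't pin down how exactly the matrices were computed; one plausible recipe, continuing the snippets above, is to bin the continuous columns and use scikit-learn's normalized mutual information, which lives on the 0-to-1 scale described above:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi_matrix(df, n_bins=20):
    """Pairwise normalized mutual information after binning each column."""
    binned = df.apply(lambda c: np.digitize(c, np.histogram_bin_edges(c, bins=n_bins)))
    cols = list(binned.columns)
    m = np.zeros((len(cols), len(cols)))
    for i, a in enumerate(cols):
        for j, b in enumerate(cols):
            m[i, j] = normalized_mutual_info_score(binned[a], binned[b])
    return m

nmi_orig = nmi_matrix(data[cont_cols])       # subsample first if this is too slow
nmi_gen = nmi_matrix(synthetic[cont_cols])
```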

As a last test, I trained a dimensionality-reduction technique (UMAP) on the original set and projected the original points to a 2D space. I then fed the generated set through the same projector. The orange X's (generated) should lie within the blue point clouds of the original dataset. This is indeed the case. Neat!

2D projection of generated and original sample using UMAP trained on original data. Source: Author
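With the umap-learn package, the key point is to fit the projector on the original set only, then push the generated set through the same transform:

```python
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(random_state=42)
emb_orig = reducer.fit_transform(data[cont_cols].sample(5000, random_state=0))
emb_gen = reducer.transform(synthetic[cont_cols])  # same projector, new data

plt.scatter(emb_orig[:, 0], emb_orig[:, 1], s=5, label="original")
plt.scatter(emb_gen[:, 0], emb_gen[:, 1], s=20, marker="x", label="generated")
plt.legend()
plt.show()
```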

Ok, that was fun to play around with. For more serious cases, there are two main approaches:

  • Rule-based stochastic data generation: the user specifies sampling distributions and explicit rules to sample from. For example:
    - column A: should contain female names,
    - column B: should be a country in Europe,
    - column C: should be an integer uniformly sampled between 1 and 100 if the country in column B is “France”, else a constant.
    Good frameworks are Faker and Trumania; a minimal Faker sketch implementing these example rules follows this list.
  • Deep generative models: these learn the statistical distribution the real data has supposedly been sampled from. Once you have a good approximation of this distribution, a synthetic dataset of arbitrary size can be sampled from it at will. This is what all the cool kids do these days.
    Initiatives worth looking at are Synthetic Data Vault, Gretel.AI, Mostly.ai, MDClone, and Hazy.
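As promised above, a minimal Faker sketch of these example rules (the list of European countries and the constant for non-French rows are illustrative choices, not part of the original rules):

```python
import random
from faker import Faker

fake = Faker()
EU_COUNTRIES = ["France", "Germany", "Spain", "Italy", "Belgium"]  # illustrative subset

def make_row():
    country = random.choice(EU_COUNTRIES)
    return {
        "A": fake.first_name_female(),                               # female name
        "B": country,                                                # European country
        "C": random.randint(1, 100) if country == "France" else 42,  # conditional rule
    }

rows = [make_row() for _ in range(1_000)]
```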

Today, you could already set up a proof of concept using synthetic data to solve one of the following common issues faced in IT organisations:

  • No useful data in development environment
    Let’s say you are working on a data product (it can be anything) where the interesting data resides in a production environment with very strict access policies. Unfortunately, you only have access to a development environment without interesting data.
  • God-like data access privileges of data scientists and engineers
    Let’s say you are a data scientist, and suddenly the security architect has restricted your much-needed privileges on production data. How can you still perform your work to a satisfying degree of quality under these restrictive conditions?
  • Sharing confidential data with an untrusted external partner
    You are part of company X. Organisation Y would like to showcase their latest and greatest data product (it can be anything). They ask you for a data extract so that they can demo it to you.

How does synthetic data fit into differential privacy?

The main promise of synthetically generated data is that no matter what post-processing is applied to it, or what third-party information is linked to it, no one will ever be able to tell whether a single entity is contained in the original set, or recover that entity’s properties. This promise is part of a larger concept called differential privacy (DP).

Global vs local differential privacy

When talking about DP, there are always two kinds to distinguish.

Often you’re only interested in the outcome of a specific task (for example, training a model on unsharable patient data from different hospitals, or computing the mean number of people who have ever committed a crime). In that case you should look at global differential privacy: an untrusted user never sees the sensitive data. Instead, he or she tells a trusted curator (equipped with global differential privacy mechanisms), which has access to the sensitive data, what operations should be performed, and only the outcome is shared with the untrusted user. Check out PySyft and OpenDP if you want more information on tools doing this; a toy illustration of the idea follows below.

Traditional global differential privacy. Source: Author
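This is not PySyft’s or OpenDP’s actual API, just a toy sketch of what the trusted curator does: answer an aggregate query and add one calibrated dose of Laplace noise to the result before releasing it. The bounds, epsilon, and data are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_mean(values, lower, upper, epsilon):
    """Trusted-curator (global DP) mean query.

    Changing one row moves the mean of n values bounded in [lower, upper]
    by at most (upper - lower) / n, so the noise is calibrated to that.
    """
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    return values.mean() + rng.laplace(scale=sensitivity / epsilon)

ages = rng.integers(18, 90, size=10_000)  # stand-in for a sensitive column
print(dp_mean(ages, lower=18, upper=90, epsilon=0.5))
```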

In contrast, if the dataset itself needs to be shared with an untrusted party, local differential privacy principles come into play. Traditionally, this is done by adding noise to each row of a table or database. The amount of noise to add depends on:

  • the required level of privacy (the famous epsilon in DP literature),
  • the dataset size (a larger dataset needs less noise to achieve the same privacy level),
  • the datatype of each column (quantitative, categorical, ordinal).
Traditional local differential privacy. Source: Author
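For contrast, a toy local mechanism, continuing the snippet above: every row is noised before it leaves the data owner, and because a single bounded value has sensitivity equal to its full range, each record gets far more noise than the global aggregate did.

```python
def ldp_perturb(values, lower, upper, epsilon):
    """Local DP: Laplace noise on each row, calibrated to the full value range."""
    values = np.clip(values, lower, upper)
    scale = (upper - lower) / epsilon
    return values + rng.laplace(scale=scale, size=len(values))

noisy_ages = ldp_perturb(ages, lower=18, upper=90, epsilon=0.5)
print(noisy_ages.mean())  # the aggregate of many noisy rows still roughly recovers the mean
```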

In theory, for an equal level of privacy, a global DP mechanism (noise added to the result) will give more accurate results than a local mechanism (noise added at row level). Intuitively, the curator adds one dose of noise to a single aggregate, whereas the local mechanism noises every row; averaging n locally noised rows only shrinks that error as 1/√n, while the global mechanism’s noise on a mean already shrinks as 1/n.

Synthetic data generation techniques could thus be seen as a form of local DP.

For more in-depth information on these topics, I advise looking at the OpenDP and OpenMined resources mentioned below.

Recommendation

Let’s get more concrete. You want to share a table containing private information with an untrusted party. Right now, you can either add noise to rows of the existing data (local DP), set up and use a trusted system (global DP), or generate a synthetic dataset based on the original.

Add noise to rows of existing data if

  • you don’t know what operation will be done on the perturbed data once shared,
  • you need to periodically share an update of the original data (= have this workflow as part of a stable batch process),
  • you and the data owners trust the person/team/organisation that will add the noise to the original data.

The best starting point is the OpenDP set of tools.

The best-known case of differential privacy is the US Census (see https://databricks.com/session_na20/using-apache-spark-and-differential-privacy-for-protecting-the-privacy-of-the-2020-census-respondents). This data gets recomputed and released every ten years. It’s mostly numerical data that gets aggregated and published at several levels (county, state, nationwide).

Set up and use a trusted system if

  • the system you have in mind supports the tasks and operations that will be performed on it,
  • the underlying data lies in different places and cannot leave them (for example, different hospitals),
  • you and the data owners trust the system itself and the person/team/organisation setting it up.

As a user of the sensitive data, you will get more accurate results compared to the first approach.

Many of the frameworks don’t yet have all the features required to deploy this beast in a secure, scalable, auditable way. There is still a lot of engineering involved. But as adoption grows over time, this might become a good alternative for large organisations and consortiums.

The best starting point for this option is OpenMined.

Generate synthetic data if

  • the original table is relatively small (<1M rows, <100 columns),
  • ad-hoc generation is enough (no periodic re-generation needed),
  • you and the data owners trust the person/team/organisation that will generate synthetic data for you.

As the small experiment above suggests, the results are promising. The approach also doesn’t require much knowledge of DP systems in the first place. You could get started today if needed, let the model train overnight, and have a sharable synthetic set by tomorrow morning, so to speak.

The biggest downside is that these complex models can get expensive to train and maintain as the size of the dataset increases. Each table also requires its own full model training (transfer learning is not really a thing for tabular data). This won’t scale to hundreds of tables, even with a substantial computing budget.

Otherwise, you’re out of luck.

Conclusion

With privacy more important than ever, there are great techniques available today to either generate synthetic data or add noise to existing data. However, they all still have their limitations. Beyond a few niche cases, there is no enterprise-grade, scalable, and flexible tool yet that lets you share data containing private information with untrusted parties.

Lastly, data owners still have to trust the methods or systems set in place, which requires a non-trivial leap of faith. This is the biggest challenge.

For now, if you want to have a go at it (a proof of concept, or just playing around), check out any of the links mentioned above.

Acknowledgements

Thank you Kris Peeters and Gergely Soti of Data Minded for your feedback on the article.
