WhiteNoise: An Open Source Library to Protect Data with Differential Privacy
Ever been tempted to respond to one of those salary surveys? Even though you might consider your own salary to be a supremely private affair, it could still be interesting to learn how your salary compares to that of your peers. So you might be tempted to provide your salary to a trusted researcher, in the knowledge that it will be kept private and only used to calculate aggregate statistics like average salary or quartiles, perhaps broken down into subgroups like age or region or industry, or combinations of those.
But can your salary data really be kept private?
It turns out, having your raw data leaked by (or stolen from) the researcher isn’t the only thing you should be worried about. Recent research has discovered that even summary tables in reports can reveal individual information.
One obvious way this can happen is if the cross-tabulations are too narrow: if there are only three database administrators in ZIP code 90210 aged 30–39, then the median will reveal the exact salary of one of them. But the problem is even more subtle than that: if too many cross-tabulations are released, it may be possible to reverse-engineer many of the records in the source dataset, possibly including yours. With enough computing power and time, it may be possible to search through all the records the source dataset could hold, and discover that only certain records could have generated the values given in the tables.
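To make the leak concrete, here is a minimal sketch using made-up names and salaries (none of this data comes from a real survey). It shows the three-person median case, plus a simple "differencing" variant in which two exact totals that differ by one person reveal that person's salary.

```python
from statistics import median

# Hypothetical salaries for the only three people in one narrow cell
# (for example, one occupation in one ZIP code and one age band).
cell = {"alice": 82_000, "bob": 95_000, "carol": 121_000}

# The "aggregate" a report might publish for this cell.
released_median = median(cell.values())

# With three records, the median is exactly one person's salary.
exposed = [name for name, salary in cell.items() if salary == released_median]

# Differencing: two exact totals over groups that differ by one person.
total_all = sum(cell.values())
total_without_carol = sum(s for name, s in cell.items() if name != "carol")
carol_salary = total_all - total_without_carol

print(released_median, exposed, carol_salary)
```

Neither published number looks like an individual record, yet each one pins down a single person's salary exactly.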
To combat this problem, Harvard and Microsoft have collaborated on the OpenDP Project and released WhiteNoise, an open-source software framework that implements the latest techniques in Differential Privacy research. Differential Privacy is not an algorithm, but rather “an entire scientific field at the intersection of computer science, statistics, economics, law and ethics”. Differential Privacy offers strong guarantees about user data (as described in the OpenDP White Paper):
- Differential privacy protects an individual’s information essentially as if her information were not used in the analysis at all, in the sense that the outcome of a differentially private algorithm is approximately the same whether the individual’s information was used or not.
- Differential privacy ensures that using an individual’s data will not reveal essentially any personally identifiable information that is specific to her, or even whether the individual’s information was used at all. Here, specific refers to information that cannot be inferred unless the individual’s information is used in the analysis.
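The "approximately the same" in the first guarantee has a precise meaning. The standard definition of (pure) differential privacy can be written as follows; the symbols M, D, D', S and ε are the usual textbook notation rather than anything taken from the quoted white paper text.

```latex
% A randomized algorithm M is \varepsilon-differentially private if, for every
% pair of datasets D and D' that differ in one individual's record, and for
% every set S of possible outputs:
\[
  \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
\]
```

A smaller ε forces the two probabilities closer together, which means stronger privacy and, typically, noisier results.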
At its core, WhiteNoise allows researchers to generate reports from data in such a way that individual data records cannot be reverse-engineered. By adding a little noise to the result, and relying on some rigorous underlying mathematics, a researcher can be sure that the summaries they release do not compromise the privacy of any of the individuals represented in the underlying dataset. For example, this Python notebook generates a differentially-private histogram of salary data (the red bars). Compared to the true values (the green bars) a small amount of noise has been added, in such a way that individual salaries cannot be inferred.
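The idea of a noisy histogram can be sketched in a few lines of plain Python. This is not the WhiteNoise API; it is an illustration of the Laplace mechanism, assuming that adding or removing one person changes a single bin count by at most 1 (sensitivity 1). The salaries, bin edges, epsilon value and function names are all made up for the example, and this naive floating-point sampler is not suitable for production use.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_histogram(values, edges, epsilon, rng):
    """Return (true counts, noisy counts) for half-open bins [edge_i, edge_i+1)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break  # values outside all bins are ignored
    # Assumed sensitivity 1, so the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    return counts, noisy


# Hypothetical salary data and bins.
salaries = [31_000, 45_000, 52_000, 58_000, 61_000, 75_000, 88_000, 120_000]
edges = [0, 50_000, 100_000, 150_000]

rng = random.Random(42)  # seeded only so the sketch is repeatable
counts, noisy = dp_histogram(salaries, edges, epsilon=1.0, rng=rng)
print(counts)
print([round(x, 2) for x in noisy])
```

The noisy counts stay close to the true ones for well-populated bins, while the contribution of any single person is masked by the noise.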
More generally, WhiteNoise allows for the implementation of a “Trusted Curator”: a kind of gatekeeper for ad-hoc queries of data, that:
- checks whether a query is differentially private;
- checks whether too many queries have been made already, which would exhaust a pre-set “privacy budget”; and
- if both checks pass, returns differentially private results that are close to, but not exactly equal to, the real result.
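The gatekeeper steps above can be sketched as a small class. This is a toy model, not how WhiteNoise implements it: the names `TrustedCurator`, `BudgetExhausted` and `noisy_count` are hypothetical, the only query type is a Laplace-noised count, and the budget is tracked by simply adding up the epsilon spent on each query.

```python
import random


class BudgetExhausted(Exception):
    """Raised when a query would spend more epsilon than remains."""


class TrustedCurator:
    """Toy gatekeeper: answers counting queries until the budget runs out."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = float(total_epsilon)
        self._rng = random.Random(seed)

    @property
    def remaining_budget(self):
        return self._remaining

    def noisy_count(self, predicate, epsilon):
        # Check 1: only this noised count is exposed, so every accepted
        # query is differentially private by construction.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Check 2: refuse the query if it would overspend the budget.
        if epsilon > self._remaining:
            raise BudgetExhausted("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # Laplace(0, 1/epsilon) noise; a count has sensitivity 1.
        noise = (self._rng.expovariate(1.0) - self._rng.expovariate(1.0)) / epsilon
        return true_count + noise


# Hypothetical data and budget.
salaries = [48_000, 52_000, 61_000, 75_000, 90_000, 130_000]
curator = TrustedCurator(salaries, total_epsilon=1.0, seed=7)
first = curator.noisy_count(lambda s: s > 60_000, epsilon=0.5)
second = curator.noisy_count(lambda s: s < 50_000, epsilon=0.5)
print(first, second, curator.remaining_budget)
```

After these two queries the budget is spent, so any further call to `noisy_count` raises `BudgetExhausted` instead of returning a result.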
With a trusted curator in place, datasets that include private information can be used in contexts where privacy must be preserved. For example:
- Archival data repositories can offer researchers access to sensitive data while protecting the privacy of the individuals it describes. This enables the search and exploration of sensitive data (within privacy budgets) and allows for statistical analysis.
- Official government agencies can safely share sensitive data with researchers, policy makers and the public. Government reports can safely include rich statistical summaries of data.
- Companies can deploy predictive models based on users and customer data, while protecting privacy and complying with data-protection regulations.
The core of the WhiteNoise software system is a set of Rust libraries with a protocol buffer API, and language bindings for Python, C++ and R. You can download the WhiteNoise software from the OpenDP GitHub repository and run it on any compatible system, and in the next section we’ll show how you can quickly try it out using the Azure Machine Learning service.
Using WhiteNoise in Azure ML
Azure ML is a cloud-based environment for training and managing machine learning models, which makes it a convenient place to try out differential privacy. You can quickly and easily launch a cloud-based Compute Instance with the memory and compute capacity you need for your application, and run Python notebooks on the cloud instance to implement the differential privacy procedures available in WhiteNoise.
Within the Azure ML studio interface, create a new Compute Instance by clicking “New > Compute Instance”. You have several options for the power and capacity of the underlying virtual machine, but the default Standard_DS3_v2 instance type is sufficient to run the sample notebooks. (Note: you will be charged standard virtual machine rates until the compute instance is deleted.)
A good place to start is to upload the WhiteNoise sample notebooks. In that repository, click the “Clone or Download” button and download the ZIP file.
Unzip the ZIP file on your local machine. Now, we’ll upload the sample notebooks to Azure ML notebooks, where we can run them using the compute instance we just created.
Within the Azure Machine Learning studio, click on the Notebooks icon, and then the Upload folder button. Upload the folder you just extracted from the ZIP file.
Now you can open the sample notebooks by browsing to the whitenoise-samples-master folder you just uploaded, under My Files.
One last step before you get started: in the analysis/basic_data_analysis.ipynb notebook, add a new code cell by clicking the ➕ icon in the margin, and add this text:
!pip install --upgrade opendp-whitenoise opendp-whitenoise-core
Run that cell to install the WhiteNoise software on your compute instance. You’ll only need to do this once for each new compute instance.
Now, you’re ready to run the WhiteNoise samples on Azure ML, or even start your own differential privacy analysis.
For More Information
Here are some resources to help you learn more about WhiteNoise:
- Find more information about Differential Privacy and download the software from the WhiteNoise website.
- For an in-depth look at the software, download the white paper WhiteNoise: A Platform for Differential Privacy.
- For a demo of WhiteNoise in action, check out the presentation Responsible ML: Protect Privacy and Confidentiality with ML from the Build 2020 conference, or watch the AI Show episode Protecting Sensitive Data using Differential Privacy.
- Finally, to contribute to WhiteNoise or to find the latest updates, visit the OpenDifferentialPrivacy GitHub page where you can find all the repositories that make up the WhiteNoise system.