Sharing Data with Differential Privacy: A Primer

This article is the first in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click here to check out the other articles in this series.

“Human intuition about what is private is not especially good. Computers are getting more and more sophisticated at pulling individual data out of things that a naive person might think are harmless.”

Frank McSherry, Co-inventor of differential privacy, Co-founder and Chief Scientist, Materialize, Inc.

The collection and dissemination of personal data by institutions, governments, and organisations have become ubiquitous today, posing significant privacy concerns, especially for vulnerable communities. Traditional anonymisation techniques — ad-hoc methods including statistical disclosure limitation (e.g., generalisation, suppression, and k-anonymity) and the release of aggregates (e.g., counts, means) — have limitations and are not universally applicable for data protection. These techniques provide guarantees only in a heuristic or empirical sense and require careful analysis of the adversary’s computational power, auxiliary information, and current threats. Moreover, releasing multiple aggregates increases the risk of reconstructing the original data and requires assessing the risk that accumulates with each release. These limitations have led to real-world privacy debacles and increased privacy risks.

Introducing Differential Privacy

Sharing datasets with differential privacy.

The need for stronger privacy protection measures has resulted in increased support for and adoption of differential privacy by government institutions and commercial enterprises, enabling the release of datasets that could otherwise not be shared:

  • It provides a quantifiable, provable guarantee about the worst-case privacy risk.
  • Its privacy guarantee requires no assumption about auxiliary information: an attacker with detailed background information or access to future dataset releases still gains no advantage.
  • Its privacy guarantee requires no assumption about the adversary’s computational power (even if unlimited) and holds against arbitrary threats.
  • It allows quantifying the accumulated privacy risk of releasing multiple statistics about individuals.

The National Institute of Standards and Technology (NIST) recently added differential privacy to its proposed data anonymisation approaches for government data sharing.

What is Differential Privacy?

Differential privacy is a new notion of privacy that enables the release of data or results from data analysis with a mathematical guarantee of privacy. It focuses on data-releasing algorithms (such as computing the mean, sum, count, etc. of a dataset) that take in analytical queries and produce outputs that have been altered in a controlled, random manner. These algorithms are considered differentially private if, based on the outputs, it is difficult to determine whether any individual’s information was included in the original data. This is accomplished by guaranteeing that a differentially private algorithm’s behaviour hardly changes whether or not a single individual is a part of the data.

Differential privacy adds a calculated amount of noise to hide each individual’s contribution to data.

To achieve this similarity, differential privacy adds a calculated amount of randomness, or noise, to the results of analytical queries. The magnitude of the noise, which determines the degree of privacy, depends on the type of analysis: it must be sufficient to hide the largest contribution that any one individual can make to the output.

Global and local differential privacy differ in the level of trust placed in the data curator.

The noise can be added directly to the aggregates (global mode) or individual data points before aggregation (local mode). The former assumes a trusted data curator holds sensitive information. The latter assumes an untrusted data curator and maintains privacy at the source before data leaves the data subject’s control.

Global differential privacy is more widely adopted because it produces more accurate results than the local mode for the same level of privacy protection. However, the choice of mode depends on the level of trust in the data curator.
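To make this contrast concrete, here is a minimal toy sketch in Python (our own illustration with an assumed dataset, value bound, and ε; it is not taken from the article’s notebooks) comparing the error of the two modes for a noisy sum:

import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(20, 60, size=1000)       # toy sensitive values, assumed bounded by 60
epsilon = 1.0
sensitivity = 60                             # assumed upper bound on one record's contribution

# Global mode: a trusted curator adds noise once, to the aggregate.
global_sum = ages.sum() + rng.laplace(0, sensitivity / epsilon)

# Local mode: each record is perturbed before it leaves the data subject's control.
local_sum = (ages + rng.laplace(0, sensitivity / epsilon, size=ages.size)).sum()

print(abs(global_sum - ages.sum()))          # error from a single noise draw
print(abs(local_sum - ages.sum()))           # error accumulates across 1000 noisy records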

A Simple Mathematical Interpretation

We have seen that the outcome of a differentially private analysis is similar whether or not an individual’s data is included. By ‘similar,’ we mean that the probabilities of the outcomes are close. Let’s now translate this into a formal definition.

Suppose exactly one person’s record is added to or removed from a dataset. The two resulting datasets, denoted D and D′, are called neighbouring datasets: they differ in exactly one person’s record. A mechanism M (differential privacy lingo for an algorithm) adds random noise drawn from a probability distribution to analytical queries. M satisfies differential privacy if, for all neighbouring datasets D and D′ and every set of possible outputs O, its outputs are probabilistically similar. The degree of similarity depends on the privacy loss parameter ε (epsilon): the smaller it is, the more similar the outputs on neighbouring data. Mathematically, this is given as

Pr[M(D) ∈ O] ≤ e^ε × Pr[M(D′) ∈ O]

The output of mechanism M is probabilistically similar for the neighbouring datasets D and D′; the degree of similarity depends on the privacy parameter ε.
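As a quick numerical illustration of this inequality (a sketch with assumed numbers, using the Laplace mechanism introduced later in this article), the following Python snippet compares the output densities of a noisy count on two neighbouring datasets:

import numpy as np

def laplace_pdf(x, mu, b):
    # Density of the Laplace distribution with mean mu and scale b.
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

epsilon = 0.5
sensitivity = 1                        # a count changes by at most 1 per individual
b = sensitivity / epsilon              # noise scale used by the Laplace mechanism

count_D = 1000                         # true count on dataset D
count_D_prime = 999                    # true count on neighbouring dataset D' (one record removed)

# For any possible output o, the output densities differ by at most a factor of e^epsilon.
for o in [995.0, 1000.0, 1005.0]:
    ratio = laplace_pdf(o, count_D, b) / laplace_pdf(o, count_D_prime, b)
    print(o, round(ratio, 3), "bound:", round(np.exp(epsilon), 3))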

ε is a privacy parameter that acts like a tuning knob for the privacy-accuracy trade-off.

Lower ε value gives stronger privacy and less accuracy. Higher ε value gives weaker privacy and more accuracy.

Smaller ε increases the noise scale — privacy increases and accuracy degrades.

Now, what is a judicious choice of ε?

There is no consensus on the choice of ε; its value depends on the specific context and use case. It is generally recommended to set a positive ε value of less than 1 for conservative privacy. However, practitioners sometimes use higher ε values to achieve greater analysis accuracy. It is important to remember that a larger ε value weakens the privacy guarantee, underscoring the need to weigh privacy and accuracy carefully when selecting ε.

This article, which its author updates regularly, presents a thorough compilation of ε values (ranging from low to high) used in real-world applications.

A Simple Example of Applying Differential Privacy: Counting Survey Participants

Suppose we want to count the number of survey participants in a differentially private manner. To do this, we add random noise to the count. The noise must come from a probability distribution with certain properties so that the definition of differential privacy is satisfied. One such distribution is the Laplace distribution centred at zero, denoted here as laplace(noise_scale).

The mechanism that adds random noise drawn from the Laplace distribution to a query output (the count, in this case) is called the Laplace mechanism. For a query on dataset D, the Laplace mechanism is given as

laplacian_mechanism(D, noise_scale) = query(D) + laplace(noise_scale)

Here, noise_scale = sensitivity/ε, where sensitivity is the maximum possible change in the query’s output when an individual’s record is added or removed. This ensures that enough noise is added to hide each individual’s contribution to the dataset. Notice that the noise scale is directly proportional to the sensitivity and inversely proportional to ε.

For the count query, adding or removing an individual from the survey data changes the result by at most 1, so the sensitivity of the count query is always 1. Let ε = 0.01; then noise_scale = 1/0.01 = 100 for the Laplace distribution. The differentially private count is then

private_count = laplacian_mechanism(D, noise_scale)

Suppose the true count is 1000. Executing laplacian_mechanism multiple times produces different random values for private_count (e.g., 1042.3, 958.7, …) scattered around the true count; with this noise scale of 100, outputs typically land within a few hundred of 1000, and a larger ε would bring them closer. Even so, adding or removing one individual’s data barely changes the distribution of outputs, and the noisy counts remain accurate enough for meaningful analysis and decision-making.
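The worked example above can be written out in a few lines of Python (a minimal sketch using the article’s notation; the toy dataset is our own, and this is separate from the notebook referenced below):

import numpy as np

def query(D):
    # The analytical query: count the number of survey participants.
    return len(D)

def laplacian_mechanism(D, noise_scale):
    # Add Laplace noise to the query output.
    return query(D) + np.random.laplace(0, noise_scale)

D = ["participant"] * 1000       # toy dataset with a true count of 1000
epsilon = 0.01                   # privacy parameter
sensitivity = 1                  # a count changes by at most 1 per individual
noise_scale = sensitivity / epsilon

private_count = laplacian_mechanism(D, noise_scale)
print(private_count)             # a noisy count, different on every run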

Graph showing how the error decreases as the ε value increases, because the scale of the noise decreases.
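The trend described above can be reproduced with a rough simulation (the ε values and number of trials are assumed; this is a sketch, not the article’s actual figure):

import numpy as np

rng = np.random.default_rng(42)
sensitivity = 1                                    # count query

for epsilon in [0.01, 0.1, 0.5, 1.0, 2.0]:
    noise = rng.laplace(0, sensitivity / epsilon, size=10_000)
    mean_abs_error = np.abs(noise).mean()          # average error of the noisy count
    print(f"epsilon={epsilon:<5} mean absolute error={mean_abs_error:.2f}")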

An adversary may be able to determine which probability distribution the mechanism uses and thereby estimate how much the true output is distorted. However, the specific random value added by the mechanism in any given release remains unknown. This randomness, together with the similarity of outputs on neighbouring datasets, increases the uncertainty in inferring an individual’s contribution to a dataset.

Resource: We developed this Python notebook to help you understand two standard differential privacy mechanisms: Laplace and Gaussian.

Key Properties of Differential Privacy

In the following example, we will see the composition property of differential privacy, under which the privacy losses (ε values) of multiple releases add up, in practice as it applies to a real-world use case.

A Real-world Example of Applying Differential Privacy: LinkedIn Labor Market Insights (Who is hiring?)

LinkedIn used differential privacy to measure which employers were hiring the most and their percentage growth in hires, using data from the current and previous three-month spans, with a total ε of 14.4. Differential privacy protected LinkedIn users who may have changed jobs.

Each month’s report (covering both the current and previous spans) cost ε = 4.8, adding up to a total of 14.4 (= 4.8 + 4.8 + 4.8) over three months.

Note: To simplify our explanation, we have focused solely on the ε privacy parameter and have not included a complete breakdown of the total ε = 14.4 used. For more detailed information, please refer to the paper.
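The ε accounting above follows basic sequential composition, which can be sketched as a simple running total (our own illustration of the bookkeeping, not LinkedIn’s actual implementation):

class PrivacyAccountant:
    # Tracks cumulative privacy loss under basic sequential composition:
    # the epsilons of successive releases simply add up.
    def __init__(self):
        self.spent = 0.0

    def spend(self, epsilon):
        self.spent += epsilon
        return self.spent

accountant = PrivacyAccountant()
for month in ["month 1", "month 2", "month 3"]:     # three monthly reports
    total = accountant.spend(4.8)                   # each report costs epsilon = 4.8
    print(f"{month}: cumulative epsilon = {total:.1f}")
# Prints a cumulative epsilon of 14.4 after the third report.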

LinkedIn used differential privacy to measure the top employers in Software & IT Services hiring in the U.S. for the July 2020 report, sorted from top to bottom. The bars represent the percentage growth in hires, based on noisy counts (over the previous and current three-month spans) computed using differential privacy.

Resource: We developed this guided Python notebook to get you started with computing analytical queries (count, sum, mean, histogram, and contingency table) using differential privacy.

What is a privacy budget?

Leveraging the powerful composition property of differential privacy, which allows epsilons to be added up to quantify cumulative privacy loss, ε is also called a privacy budget. The budget makes it possible to restrict the type and number of data queries to prevent breaches. A good analogy for privacy budgeting is shopping with a fixed budget, where you have to prioritise which items are most important for your wardrobe. Similarly, with a privacy budget, you might prioritise the accuracy of certain queries and allocate a larger portion of the budget to them.

Uneven splitting of a privacy budget of ε = 1 (in this case) across the queries.

Setting a privacy budget depends on the data holder’s risk tolerance, and overspending it can increase the privacy risk, such as reconstruction attacks.
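As a sketch of budget allocation (the dataset, the 0.3/0.7 split, and the value bound are assumed for illustration), a total budget of ε = 1 could be divided unevenly across two Laplace-mechanism queries, with the mean then derived at no extra cost by post-processing:

import numpy as np

rng = np.random.default_rng(7)
incomes = rng.integers(0, 10_000, size=500)         # toy dataset, values assumed bounded by 10,000

total_budget = 1.0
allocation = {"count": 0.3, "sum": 0.7}              # spend more on the query whose accuracy matters most
assert abs(sum(allocation.values()) - total_budget) < 1e-9

def laplace_query(true_value, sensitivity, epsilon):
    return true_value + rng.laplace(0, sensitivity / epsilon)

noisy_count = laplace_query(len(incomes), 1, allocation["count"])
noisy_sum = laplace_query(incomes.sum(), 10_000, allocation["sum"])
noisy_mean = noisy_sum / noisy_count                 # post-processing: no additional budget is spent

print(noisy_count, noisy_sum, noisy_mean)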

Applications of Differential Privacy

Differential privacy can be used for applications ranging from simple aggregate analysis (e.g., count queries, histograms) to machine learning tasks such as clustering, classification, and synthetic data generation.

Some well-known applications of differential privacy can be seen in companies such as Apple, Google, Microsoft and the Wikimedia Foundation (as shown in the figure below). Even in the government sector, the US Census Bureau has adopted differential privacy to protect sensitive information in the summary statistics for the 2020 Decennial Census.

Real-world uses of differential privacy.

Conclusion

Differential privacy has emerged as a robust alternative to traditional anonymisation techniques, as it offers provable privacy guarantees while allowing for meaningful data analysis. By adding controlled noise to data, differential privacy can prevent the re-identification of sensitive information and protect individuals’ privacy. Moreover, it enables accurate and reliable analysis without the loss of data quality and analytical value that often results from traditional anonymisation techniques. Differential privacy also allows sensitive data to be explored across silos, potentially shortening data access times by simplifying data request processes, and can fulfil certain types of use cases.

The academic and industrial communities have developed various tools to facilitate the adoption of differential privacy. These tools provide a higher-level interface and abstract away implementation complexities. However, practitioners may need guidance in choosing the right tool for their needs, as each tool differs in functionality, security, performance, and usability guarantees.

Although these tools make it convenient to apply differential privacy to sensitive data, a gap remains in providing a user-friendly interface for non-experts to visualise and interactively explore privacy-utility trade-offs, such as tuning the ε value and adjusting accuracy. This is crucial for addressing the negotiation challenges between data curators and analysts.

Differential Privacy Series

GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools. The tools covered are PipelineDP by Google and OpenMined, Tumult Analytics by Tumult Labs, OpenDP by the privacy team at Harvard, and Diffprivlib by IBM. Our analysis can help ensure that these tools are used effectively in real-world applications of differential privacy.

The first three parts are also put together in this whitepaper. ✨

DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy and facilitate privacy-centric data sharing.

For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.

Thanks to Alan Tang (alantang@dsaid.gov.sg) and Ghim Eng Yap (ghimeng@dsaid.gov.sg) for their valuable inputs.

Author: Anshu Singh (anshu@dsaid.gov.sg)
