Evaluating Differential Privacy Tools’ Performance

Anshu Singh
AI Practice GovTech
13 min read · Apr 7, 2023

This article is the third in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click here to check out the other articles in this series.

In the previous article of this differential privacy series, we provided guidance for selecting differential privacy tools based on qualitative and quantitative analyses. However, we only covered the qualitative analysis in detail.

As promised, this article dives into the quantitative analysis. We evaluate, on synthetic datasets, the utility and scalability of the four benchmarked tools: OpenDP (library) by Harvard’s privacy team, Tumult Analytics (framework) by Tumult Labs, PipelineDP (framework) by Google and OpenMined, and Diffprivlib (library) by IBM. Our experimental setup is largely based on the work by Garrido et al. and builds upon its findings. For completeness, we provide a brief overview of the experimental settings.

All the experiments described in this article are open-sourced.🎊

We have included links to new and advanced concepts and supplementary material to help you follow along with the technical aspects of differential privacy presented in this article.

Contents

Experiment Settings

Evaluation principles

Diversity of dataset characteristics: We used synthetic datasets generated from the skew-normal distribution to provide scale, skew, and size diversity (refer to Figure 1; a generation sketch follows the figure).

  • Scale: Datasets of varying spread, with scale values of 50, 250, and 500.
  • Skew: Datasets of varying shape, with skew values of 0, 5, and 50.
  • Size: Datasets ranging from 1,000 to 1 billion data points. This wide range helps test the tools’ scalability on large datasets in both standalone and distributed computing environments.
Figure 1: Sample of the synthetic datasets’ distribution with varying skew and scale of size 100k data points.
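As a minimal sketch of how such datasets can be generated, the snippet below uses SciPy’s skew-normal distribution; the helper name and seed are illustrative, and the exact generation code lives in our open-sourced repository.

```python
# Sketch: synthetic data from a skew-normal distribution.
# The helper name and seed are illustrative; see our repository
# for the exact generation code.
import numpy as np
from scipy.stats import skewnorm

def generate_dataset(size, skew, scale, seed=0):
    """Draw `size` points from a skew-normal with shape `skew` and spread `scale`."""
    return skewnorm.rvs(a=skew, loc=0, scale=scale, size=size, random_state=seed)

# Example: 100k points with skew 5 and scale 250, as in Figure 1.
data = generate_dataset(size=100_000, skew=5, scale=250)
print(data.mean(), data.std())
```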

This diversity in dataset characteristics simulates a reasonable model of real-world data distributions. Exposing the tools to datasets with varying scales, skewness, and sizes lets us comprehensively understand their behaviour and performance. Such analysis is crucial for real-world applications, where data is often complex and diverse.

Diversity of ε (epsilon) values: We ran the experiments on 28 ε values ranging from 0.01 to 10 in logarithmically spaced steps. The choice of ε values is informed by a combination of academic recommendations and practical considerations: academic experts suggest positive ε values of less than 1, while practitioners often use values beyond this range in real-world applications. Additionally, our pilot experiments indicated that the tools’ performance differed only marginally for ε values greater than 10. As a result, we capped the ε values at 10 to provide a more focused evaluation of the tools’ performance.
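One way to produce such a grid is NumPy’s logspace; this is a sketch, and the exact values used in the experiments are in our repository.

```python
# Sketch of 28 logarithmically spaced epsilon values from 0.01 to 10.
import numpy as np

epsilons = np.logspace(np.log10(0.01), np.log10(10), num=28)
print(epsilons[0], epsilons[-1])  # 0.01 ... 10.0
```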

Diversity of analytics queries: We evaluated the commonly used queries supported by all the tools: count, sum, mean, and variance. Except for the count query, these queries depend on the content of the dataset and require setting clamping bounds (refer to Figure 2). To keep our evaluation consistent across all the tools, we set the clamping bounds to the maximum and minimum values in each dataset.

What are clamping bounds?

Figure 2: Example showing clamping bounds: value 3500 is clamped to 4000 (lower bound) and 15000 to 9000 (upper bound).

Clamping bounds limit data values within a specified range to minimise the influence of any individual, especially the outliers, on queries like sum, mean, etc. This is necessary to protect individual privacy and ensure the utility of the results.
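As a minimal illustration, clamping amounts to clipping each value into the given range; the benchmarked tools apply this internally once bounds are supplied.

```python
# Minimal illustration of clamping: values outside [lower, upper] are
# clipped to the nearest bound, limiting any individual's influence.
import numpy as np

def clamp(values, lower, upper):
    return np.clip(values, lower, upper)

# Mirrors Figure 2: 3500 -> 4000 (lower bound), 15000 -> 9000 (upper bound).
print(clamp(np.array([3500, 15000, 7000]), 4000, 9000))  # [4000 9000 7000]
```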

Metrics: We used the mean relative error (MRE) for the utility analysis to measure the accuracy of the queries’ output. MRE is widely used in the extant literature and allows comparison of results across datasets of different scales; lower MRE values indicate better performance. The baseline for the error calculation is the deterministic (noise-free) output of each query. For scalability, we measured the execution time of the queries in seconds.
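Assuming MRE averages the relative error of repeated noisy outputs against the deterministic baseline, a sketch of the metric looks like this:

```python
# Sketch of mean relative error (MRE), assuming it averages
# |noisy - true| / |true| over repeated runs of a query.
import numpy as np

def mean_relative_error(true_value, noisy_outputs):
    noisy = np.asarray(noisy_outputs, dtype=float)
    return float(np.mean(np.abs(noisy - true_value) / abs(true_value)))

print(mean_relative_error(100.0, [98.0, 103.0, 101.0]))  # 0.02
```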

Implementation details

The experiments were conducted on AWS Glue (Python environment) using a single DPU and 16 GB of memory. For the Spark-supported tools (Tumult Analytics and PipelineDP), experiments were conducted on AWS EMR using two r4.8xlarge instances. As PipelineDP is backend-agnostic and supports both standalone and distributed computing, we ran its experiments in both environments.

Note: Readers who want to get started with scalable differential privacy can check out the final article in this series.

Experiment Results

Concepts to understand the experimental results

Neighbouring definition: There are two definitions of neighbouring datasets: unbounded and bounded. Under the unbounded definition, neighbouring datasets differ by adding or removing one record, which hides the presence or absence of an individual in the dataset. For example, if a dataset contains information about patients in a hospital, the unbounded definition might be used to ensure that the presence or absence of a particular patient is not revealed in the released data. Under the bounded definition, neighbouring datasets differ by changing one record, which protects an individual’s attribute values when their presence in the dataset is already known. For example, if a dataset contains information about employees in a company, the bounded definition might be used to ensure that a specific employee’s salary is not revealed in the released data, even though the employee is known to be in the dataset.

Both definitions are important for different use cases of differential privacy and help ensure that the release of sensitive data is protected against privacy breaches.

Sensitivity: It is the maximum change in a query’s output when one record in a dataset changes; the noise scale of a distribution increases with sensitivity (refer to Figure 3). Sensitivity computation depends on the mechanism used, the query being computed, and the neighbouring definition. Our work focuses on the Laplace mechanism; refer to Table B in the Supplementary for the sensitivity computations of the experimented queries. We also refer readers to Harvard’s papers on calculating the sensitivity of a query under each neighbouring definition.

Figure 3: Noise scale increases with the sensitivity.
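For intuition, here is a textbook sketch of the Laplace mechanism, where the noise scale is sensitivity divided by ε. Production libraries use hardened samplers to avoid floating-point vulnerabilities (see Mironov’s paper in the references), so this sketch is for intuition only.

```python
# Textbook sketch of the Laplace mechanism: noise scale = sensitivity / epsilon.
# Real libraries use hardened samplers; do not use this for real releases.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1; at epsilon = 0.5 the noise scale is 2.
print(laplace_mechanism(1000, sensitivity=1, epsilon=0.5))
```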

Utility analysis

Setup: We conducted utility analysis on synthetic datasets with varying sizes (1k, 10k, and 100k data points), skew, and scale, resulting in 27 datasets evaluated on the four queries — count, sum, mean, and variance. Each query was further run with 28 different ε values.

Most tools have default mechanisms and neighbouring dataset definitions (bounded or unbounded) for query computation (see Table 1). One consideration for the default mechanism is the data type: integer-valued data points default to the Geometric mechanism (also called the discrete Laplace mechanism), which outputs integer random values, while float-valued data defaults to the Laplace mechanism. We resorted to the Laplace mechanism in cases with no default mechanism (pure differential privacy).

Table 1: Default mechanisms (DM) and neighbouring definitions (ND) for tools by queries. Tumult Analytics takes privacy definitions as input, which can be pure (PureDP) or zero-concentrated (RhoZCDP) differential privacy. For PureDP, the default mechanism is listed, while RhoZCDP defaults to the Discrete Gaussian mechanism. Refer to Table A in the Supplementary.

Moreover, our implementation (refer to our GitHub repository) is tailored for non-experts who might depend on automated parameter computation (such as sensitivity) and minimal inputs: the ε value, the clamping bounds (set to the maximum and minimum values of the dataset), and the maximum user contribution bound (set to 1).
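For instance, a minimal-input call in this spirit, using Diffprivlib’s tools API, needs only an ε and the clamping bounds; the dataset below is illustrative, and our repository holds the exact per-tool implementations.

```python
# Minimal-input usage in the spirit described above, via Diffprivlib's
# tools API. The dataset is illustrative; see our repository for the
# exact per-tool implementations.
import numpy as np
from diffprivlib import tools

data = np.random.default_rng(0).normal(loc=0, scale=250, size=10_000)

# Clamping bounds set to the dataset's min and max, as in our experiments.
dp_mean = tools.mean(data, epsilon=1.0, bounds=(data.min(), data.max()))
print(dp_mean)
```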

Impact of dataset skew, scale, size, and ε value on the accuracy of the queries: Our experimental results demonstrate a correlation between the noise scale and the error (MRE) of a query: a decreasing noise scale results in reduced error. Since the Laplace distribution’s noise scale is inversely proportional to ε and directly proportional to the sensitivity of the query:

  • The error is higher for ε values less than 1, the range recommended by some differential privacy pioneers; however, the error approaches zero as ε increases.
  • The sensitivity of queries increases as the spread of a distribution increases, and spread increases with low skew and high scale values. The neighbouring definition (bounded or unbounded) used to compute a query’s sensitivity also affects the noise scale (refer to Table B in the Supplementary). Notably, the noise scale is directly related to the privacy protection offered: a higher noise scale yields stronger privacy protection.

Our findings further show that query performance improves with larger dataset sizes, as the impact of noise diminishes. However, a large spread can still affect results due to likely outliers, even on large datasets.

We recommend that practitioners be aware of the high sensitivity of queries and compensate for it by selecting a higher value of ε, which improves accuracy. However, it’s important to consider the privacy implications carefully.

Performance of tools on the queries:

Count (refer to Figure 4): The sensitivity of the count query is always one, independent of the skew and scale of a dataset; hence, we excluded the skew and scale variables from the count query results. Diffprivlib outperforms the other tools for ε ≤ 0.1 and smaller datasets (observable for ≤10k data points) due to its underlying truncated Geometric mechanism (refer to Table A of the Supplementary), which improves accuracy by mapping negative count values to 0.
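A hedged sketch of this truncation effect, using Diffprivlib’s GeometricTruncated mechanism (the bounds below are illustrative):

```python
# Sketch of truncated geometric noise for counts, using Diffprivlib's
# GeometricTruncated mechanism; the bounds here are illustrative.
from diffprivlib.mechanisms import GeometricTruncated

mech = GeometricTruncated(epsilon=0.1, sensitivity=1, lower=0, upper=10_000)
print(mech.randomise(42))  # noisy count, truncated into [0, 10000]
```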

Figure 4: Mean relative error of the count query for experiments with synthetic datasets.

Sum (refer to Figure 5): Tumult Analytics and PipelineDP perform better than the other tools for ε less than 1. This is because Tumult Analytics and PipelineDP use the unbounded definition for sensitivity calculation, while the others use the bounded notion. Our synthetic datasets contain both negative and positive values, resulting in higher sensitivity under the bounded definition than the unbounded one. For example, if the maximum value is 100 and the minimum value is -5, the sensitivity of the sum query under the bounded definition is 100 - (-5) = 105, whereas under the unbounded definition it is max(abs(-5), abs(100)) = 100. Note, however, that bounded differential privacy with a higher noise scale gives stronger privacy protection than unbounded differential privacy.
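The worked numbers above are easy to check, following the sum-query sensitivities in Table B of the Supplementary:

```python
# Checking the worked example above (m = -5, M = 100), per Table B.
m, M = -5, 100
bounded = M - m                   # change one record: 105
unbounded = max(abs(m), abs(M))   # add or remove one record: 100
print(bounded, unbounded)         # 105 100
```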

Figure 5: Mean relative error of the sum query for experiments with synthetic datasets.

Mean (refer to Figure 6): The sensitivity calculation is similar for the bounded and unbounded definitions; therefore, we observe nearly identical performance among the tools. Furthermore, all the tools use the Laplace mechanism as the default (or we used it in the code). The impact of skewness and scale follows reasoning similar to the sum query because, except for Diffprivlib (which uses NumPy’s mean API), the tools use the sum and count queries as building blocks for the mean computation.

Figure 6: Mean relative error of the mean query for experiments with synthetic datasets.

Variance (refer to Figure 7): Tumult Analytics and PipelineDP perform better than the other tools on high scale and low skew values (higher noise distribution spread). OpenDP performs better on low scale and high skew values when ε is less than 1. The difference in performance can be attributed to the differing bounded and unbounded neighbouring definitions, which result in different sensitivity calculations for the variance query. Diffprivlib performs the worst on the variance query, although its outputs are consistent with the actual variance values and remain greater than zero. This is due to Diffprivlib’s use of the Laplace Bounded Domain mechanism (refer to Table A of the Supplementary), which adds more noise while keeping the variance output within its valid domain.

Figure 7: Mean relative error of the variance query for experiments with synthetic datasets.

Most tools use the sum and count queries as building blocks for the variance computation or use similar mechanisms. As a result, the tools’ overall performance on variance mirrors their behaviour on the sum query across the three independent variables (skew, scale, and size), except for Diffprivlib, which uses NumPy’s variance API.

Scalability analysis

Setup: In our scalability benchmark, we measured the execution time of the queries on datasets ranging from 1,000 to 1 billion data points. We used fixed values of ε (0.01), skew (0), and scale (250), as existing literature suggests scalability is largely independent of the sampled noise. Further, we measured only query execution time and excluded pre-processing steps, as practitioners’ implementations of these can vary.
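A minimal sketch of such a timing harness, assuming time.perf_counter and a pre-configured query function (the actual benchmark code is in our repository):

```python
# Sketch of per-query timing that excludes pre-processing; the query
# function is assumed to be fully configured before timing starts.
import time

def timed(query_fn, *args, **kwargs):
    """Run a pre-configured query and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = query_fn(*args, **kwargs)
    return result, time.perf_counter() - start
```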

Results: For scalability assessment, we compare the tools based on the environment they were executed in — standalone and distributed.

Figure 8 shows the execution time of the Diffprivlib, PipelineDP, and OpenDP tools in standalone environments. The missing results in the figure indicate instances where the query computation for a library led to a memory error.

Figure 8: Execution time of the tools — supporting the standalone environment, experimented on synthetic datasets of varying sizes.

The Diffprivlib library performed best across all queries and scaled relatively well with increasing dataset size. This is because Diffprivlib’s core functionality is implemented in NumPy, whereas PipelineDP and OpenDP are Python wrappers around core functionality written in C++ and Rust, respectively. In general, none of the tools scales to the largest datasets, resulting in memory errors or a substantial increase in execution time.

Also, our implementation of OpenDP uses a binary search to find the noise scale rather than requiring manual input. This ensures the differential privacy guarantee but adds to the execution time when running the queries. For PipelineDP, the version we experimented with is tailored for GROUP BY-style queries that create partitions (subsets of the dataset); it includes pre-processing steps for partition handling (such as adding empty partitions for public partitions) and for bounding individual contributions across and within partitions. The global aggregations we experimented with, i.e., statistics computed over all data points, are treated as a single partition, so these pre-processing steps add unnecessary computational overhead. Upcoming versions may add functionality to avoid this overhead for global aggregations, while the partition machinery remains valuable for practitioners whose use cases involve GROUP BY queries.

Figure 9 compares the execution times of Tumult Analytics and PipelineDP, the two tools that support distributed computing, on the Spark engine. Tumult Analytics predominantly outperforms PipelineDP on datasets with more than 1 million rows across the queries, while PipelineDP largely outperforms Tumult Analytics on datasets with fewer than a million data points. We also observed that Tumult Analytics has higher memory consumption than PipelineDP; one reason is that Tumult Analytics creates temporary tables so that repeated queries are not re-evaluated with fresh randomness.

Figure 9: Execution time of the tools — supporting the distributed environment, experimented on synthetic datasets of varying sizes.

Conclusions

In our empirical quantitative analysis, we provided insights into the tools’ implementations and into how factors beyond the ε value (the scale, size, and skew of the data) can impact a query’s accuracy. Our analysis focused on the four most commonly used queries; there are other queries to experiment with (e.g., quantile and GROUP BY queries), as well as running multiple queries in a single session, which can optimise privacy budget usage. In the future, we look forward to experimenting with other mechanisms (such as the Gaussian and Exponential mechanisms), advanced privacy definitions (e.g., zero-concentrated differential privacy), and capabilities (e.g., privacy amplification and private dataset JOINs) offered by the tools.

Differential Privacy Series

GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools: PipelineDP by Google and OpenMined, Tumult Analytics by Tumult Labs, OpenDP by the privacy team at Harvard, and Diffprivlib by IBM. Our analysis can help ensure these tools are used effectively in real-world applications of differential privacy.

The first three parts are also put together in this whitepaper. ✨

DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy and facilitate privacy-centric data sharing.

For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.

Thanks to
Syahri Ikram (syahri@dsaid.gov.sg) for the exceptional support and tireless efforts in helping me with the cloud-based experiments, and to
Alan Tang (alantang@dsaid.gov.sg), Ghim Eng Yap (ghimeng@dsaid.gov.sg), Damien Desfontaines (Staff Scientist, Tumult Labs), and Prof. Xiaokui Xiao (School of Computing, National University of Singapore) for their valuable inputs.

Author: Anshu Singh (anshu@dsaid.gov.sg)


References

  1. Bun, Mark, et al. “Differentially private release and learning of threshold functions.” 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.
  2. Alabi, Daniel, et al. “Differentially private simple linear regression.” arXiv preprint arXiv:2007.05157 (2020).
  3. Abadi, Martin, et al. “Deep learning with differential privacy.” Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 2016.
  4. Stemmer, Uri, and Haim Kaplan. “Differentially private k-means with constant multiplicative error.” Advances in Neural Information Processing Systems 31 (2018).
  5. Abowd, John M. “The US Census Bureau adopts differential privacy.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
  6. Garrido, Gonzalo Munilla, et al. “Do I get the privacy I need? Benchmarking utility in differential privacy libraries.” arXiv preprint arXiv:2109.10789 (2021).
  7. Garrido, Gonzalo Munilla, et al. “Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry.” arXiv preprint arXiv:2211.03898 (2022).
  8. https://projects.iq.harvard.edu/files/opendp/files/opendp_white_paper_11may2020.pdf
  9. Gaboardi, Marco, Michael Hay, and Salil Vadhan. “A programming framework for opendp.” Manuscript, May (2020).
  10. Berghel, Skye, et al. “Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy.” arXiv preprint arXiv:2212.04133 (2022).
  11. https://pipelinedp.io/
  12. Holohan, Naoise, et al. “Diffprivlib: the IBM differential privacy library.” arXiv preprint arXiv:1907.02444 (2019).
  13. Mironov, Ilya. “On significance of the least significant bits for differential privacy.” Proceedings of the 2012 ACM conference on Computer and communications security. 2012.
  14. https://desfontain.es/privacy/renyi-dp-zero-concentrated-dp.html
  15. Balle, Borja, Gilles Barthe, and Marco Gaboardi. “Privacy amplification by subsampling: Tight analyses via couplings and divergences.” Advances in Neural Information Processing Systems 31 (2018).
  16. Hay, Michael, et al. “Principled evaluation of differentially private algorithms using dpbench.” Proceedings of the 2016 International Conference on Management of Data. 2016.
  17. Bun, Mark, and Thomas Steinke. “Concentrated differential privacy: Simplifications, extensions, and lower bounds.” Theory of Cryptography: 14th International Conference, TCC 2016-B, Beijing, China, October 31-November 3, 2016, Proceedings, Part I. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016.
  18. Warner, Stanley L. “Randomized response: A survey technique for eliminating evasive answer bias.” Journal of the American Statistical Association 60.309 (1965): 63–69.

Supplementary

Table A: The privacy definitions and mechanisms referred to in the series. The table has been referenced from this paper.
Table B: Sensitivity calculations of the experimented queries with the bounded and unbounded neighbouring definitions. M is the upper bound, and m is the lower bound of a dataset. Refer to Harvard’s papers on calculating the sensitivity of a query based on the neighbouring definition.
