Practitioners’ Guide to Accessing Emerging Differential Privacy Tools

Anshu Singh
DSAID GovTech
Apr 7, 2023 · 11 min read


This article is the second in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click here to check out the other articles in this series.

Our previous article showed how differential privacy can be a robust alternative to traditional anonymisation techniques. We provided a friendly explanation of the fundamental concepts of differential privacy with examples and a simplified mathematical interpretation.

Readers seeking a more in-depth understanding of differential privacy can refer to the additional resources provided at the end of this article and to our guided Python notebooks to get started.

This article aims to guide practitioners in evaluating emerging differential privacy tools through a comparative analysis for real-world analytics. We have also included links to new and advanced concepts and supplementary material to help you follow along with the technical aspects of differential privacy presented here.

Contents

Introduction
Benchmarked Differential Privacy Tools — Libraries And Frameworks
Key Desiderata for Qualitative Analysis of the Tools
Guidance For Practitioners On Choosing A Tool
Conclusion

Introduction

We have seen an increase in the development of differential privacy tools that aim to improve accessibility and reduce the need for expert input. These tools provide a higher-level interface, abstracting away implementation complexities while ensuring meaningful differential privacy guarantees, automated utility optimisations, and more. However, practitioners can find it difficult to choose the right tool for their needs, as each tool offers different functionality, security, performance, and usability guarantees. To address this, we evaluated existing tools — libraries (which provide specific functionalities) and frameworks (which provide abstractions to build specific applications) — through a qualitative analysis of their features and characteristics and a quantitative analysis of their performance and accuracy across common analytical statistics (queries).

Note: To keep this article brief, we have deferred the quantitative analysis to the next article in the series, which will focus on a comprehensive experimental assessment of the accuracy (a proxy for utility) and execution time (performance) of common analytical queries. In this article, we have summarised the key findings in Figure 2 (below).

All the quantitative experiments are open-sourced. 🎉

Benchmarked Differential Privacy Tools — Libraries And Frameworks

We consider tools developed by prominent researchers and institutions with the potential for wider adoption — OpenDP (library) by the Harvard privacy team, Tumult Analytics (framework) by Tumult Labs, PipelineDP (framework) by Google and OpenMined, and Diffprivlib (library) by IBM.

Note: Frameworks are considered over their underlying differential privacy libraries because frameworks are intended for non-experts.

Figure 1: Benchmarked differential privacy tools — libraries and frameworks.

We selected these tools because they meet the following criteria:
— Written in Python, the most popular language among practitioners.
— Backed by an active community for maintenance and support.
— Suitable for non-experts.
— Open-source and developed by prominent researchers and institutions.
— Scalable to real-world use cases with large datasets.

OpenDP library, developed by Harvard’s Institute for Quantitative Social Science (IQSS) and School of Engineering and Applied Sciences (SEAS), is part of the larger OpenDP Project, a community effort to build trustworthy, open-source software tools for the analysis of private data. It is feature-rich and built atop a modular and extensible framework for expressing privacy-aware computations under several different privacy models. It is implemented in Rust, with Python bindings currently available.
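
To give a flavour of the library, below is a minimal sketch of a differentially private count using OpenDP’s combinator-style API. The exact imports and function names vary across versions (this reflects the v0.6-era Python bindings), so treat it as illustrative rather than definitive:

```python
# A minimal sketch (OpenDP v0.6-era Python API; names vary by version).
from opendp.transformations import make_count
from opendp.measurements import make_base_discrete_laplace
from opendp.mod import enable_features

enable_features("contrib")  # opt in to components not yet formally verified

# Chain a count transformation with a discrete Laplace mechanism
# (the >> operator composes them into a single measurement).
dp_count = make_count(TIA=int) >> make_base_discrete_laplace(scale=1.0)

print(dp_count([3, 7, 1, 12, 5]))  # noisy count of a toy dataset
```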

Tumult Analytics, developed by Tumult Labs — a startup specialising in differential privacy solutions and founded by pioneers in the field — is a user-friendly framework (with familiar Pandas, SQL, and PySpark APIs) that scales to large workloads (100M+ rows and large numbers of output statistics). Its underlying framework (similar in spirit to OpenDP’s), Tumult Core, is a collection of components and composition operators for implementing complex differential privacy mechanisms. Tumult Analytics is feature-rich and is used by organisations such as the U.S. Census Bureau and the Wikimedia Foundation.
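
As a flavour of the API, here is a minimal sketch of a Tumult Analytics session answering a differentially private count, based on the public tutorials; the exact API may differ across versions, and members.csv is a hypothetical input file:

```python
# A minimal sketch based on the Tumult Analytics tutorials (version-dependent).
from pyspark.sql import SparkSession
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
members_df = spark.read.csv("members.csv", header=True, inferSchema=True)  # hypothetical file

# The Session holds the private data and enforces the total privacy budget.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.0),
    source_id="members",
    dataframe=members_df,
)

count_query = QueryBuilder("members").count()
result = session.evaluate(count_query, privacy_budget=PureDPBudget(epsilon=0.5))
result.show()
```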

PipelineDP, developed by Google and OpenMined, is an end-to-end framework for generating differentially private aggregations over large datasets using batch-processing systems. PipelineDP’s differentiating factor is backend-agnostic data processing: pipelines can run on Apache Spark, Apache Beam, or locally. The framework is built atop Google’s differential privacy building-block library in C++. It is designed to handle SQL-style queries, with APIs for bounding user contributions. Practitioners should note that the framework is experimental and subject to change; it is not recommended for production systems. Nevertheless, as the framework evolves, it can serve various large-scale applications.

Recommendation from the creators of PipelineDP —

“Note that this project is still experimental and is subject to change. At the moment we don’t recommend its usage in production systems as it’s not thoroughly tested yet.”
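
For illustration, the sketch below computes per-partition differentially private counts with PipelineDP’s local backend, loosely following the project’s README; since the API is experimental, details may have changed:

```python
# A minimal sketch following the PipelineDP README (experimental API).
import pipeline_dp

backend = pipeline_dp.LocalBackend()
budget_accountant = pipeline_dp.NaiveBudgetAccountant(
    total_epsilon=1.0, total_delta=1e-6)
dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)

# Toy data: (user_id, day) visit records.
rows = [("user_1", "day_1"), ("user_1", "day_2"), ("user_2", "day_1")]

data_extractors = pipeline_dp.DataExtractors(
    privacy_id_extractor=lambda row: row[0],  # who contributed the record
    partition_extractor=lambda row: row[1],   # which group it belongs to
    value_extractor=lambda row: 1,            # the value to aggregate
)
params = pipeline_dp.AggregateParams(
    noise_kind=pipeline_dp.NoiseKind.LAPLACE,
    metrics=[pipeline_dp.Metrics.COUNT],
    max_partitions_contributed=2,       # bound cross-group contributions
    max_contributions_per_partition=1,  # bound per-group contributions
)

dp_result = dp_engine.aggregate(rows, params, data_extractors)
budget_accountant.compute_budgets()
print(list(dp_result))
```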

Diffprivlib, developed by IBM, is a general-purpose library offering an extensive collection of differential privacy mechanisms beyond the standard ones, including truncated and bounded variants of several mechanisms. However, it relies on NumPy for all underlying computations; users may want to note the potential scalability limitations and security issues currently associated with NumPy. Notably, it also supports differentially private machine learning, such as classification and clustering models that behave like scikit-learn estimators, with the option to track the privacy budget.
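
As an illustration, here is a minimal sketch of Diffprivlib’s NumPy-style tools and scikit-learn-style models; the toy data and parameter choices are ours:

```python
# A minimal sketch of Diffprivlib's NumPy-style tools and sklearn-style models.
import numpy as np
from diffprivlib.tools import mean
from diffprivlib.models import GaussianNB

ages = np.array([23, 45, 31, 62, 27])
# Bounds should be supplied explicitly; otherwise the library warns and
# infers them from the data, which itself leaks privacy.
dp_mean = mean(ages, epsilon=1.0, bounds=(18, 100))
print(dp_mean)

# Differentially private classification with a familiar fit/predict interface.
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)
clf = GaussianNB(epsilon=1.0, bounds=(np.zeros(2), np.ones(2)))
clf.fit(X, y)
print(clf.predict(X[:5]))
```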

Note: Our comparative analysis focuses on the common use of differential privacy to release aggregate statistics, such as mean and histogram.

Key Desiderata for Qualitative Analysis of the Tools

The following key desiderata, building on the work of Garrido et al., are considered for the qualitative analysis of the libraries and frameworks; they can be helpful to both basic and advanced users.

  1. Analytics — Implementing differentially private statistics can be challenging for several reasons, including calculating the sensitivity of a query, choosing the mechanism to use, and optimising accuracy. A good tool abstracts these computations behind high-level APIs. The tool supports:
  • Differentially private aggregate statistics
    For single-valued (e.g., count, mean) and multi-valued statistics (e.g., histogram, contingency table).
    For GROUP BY queries, the tool supports (i) public categories (known categories that require explicit user input) or private categories (unknown or sensitive categories that are discovered by spending privacy budget), (ii) bounding individual contributions across and within groups, (iii) handling of empty categories, which can otherwise leak privacy, and (iv) grouping by multiple columns (see the sketch after this list).
  • Adaptive/interactive querying — Dynamic querying where the output of a query can depend on the previous query’s result, including determining how much privacy budget is consumed.
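
To make the public-versus-private categories distinction concrete, the sketch below expresses a GROUP BY count over public categories using Tumult Analytics’ KeySet API, as documented in its tutorials; the column name and values are hypothetical, and “members” refers to the source from the earlier sketch:

```python
# A sketch of GROUP BY with *public* categories in Tumult Analytics
# (KeySet API from the public tutorials; version-dependent).
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.query_builder import QueryBuilder

# Public categories: the analyst declares the groups up front, so no
# privacy budget is spent discovering them. Column/values are hypothetical.
education_keys = KeySet.from_dict(
    {"education": ["primary", "secondary", "tertiary"]}
)

histogram_query = QueryBuilder("members").groupby(education_keys).count()
# With *private* categories, the tool would instead spend part of the
# privacy budget to decide which groups are safe to release.
```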

2. Security — The tool:

  • Blocks data visibility — Prevents the user from “seeing” the data, executing arbitrary code on it, or visualising it.
  • Uses cryptographically secure random number generation — A randomised algorithm is only as good as the randomness behind its noise generation (see the sketch after this list).
  • Protects against floating-point vulnerabilities — Computers have limited time and memory, so they round continuous numbers and arithmetic results. This limited floating-point representation can leak privacy and break differential privacy guarantees, which is known as a floating-point vulnerability.
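
To illustrate the randomness requirement, here is a small, purely illustrative sketch that draws Laplace noise from Python’s CSPRNG-backed secrets.SystemRandom rather than a seeded pseudo-random generator. Note that this alone does not address floating-point vulnerabilities, which call for techniques such as snapping or discrete noise:

```python
# Purely illustrative: Laplace noise drawn from a CSPRNG-backed generator.
# This alone does NOT protect against floating-point vulnerabilities.
import math
import secrets

_sysrand = secrets.SystemRandom()  # backed by os.urandom, not seedable

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling: X = -scale * sgn(u) * ln(1 - 2|u|), u ~ U(-0.5, 0.5).
    u = _sysrand.random() - 0.5
    while u == -0.5:  # re-draw the measure-zero edge case to avoid log(0)
        u = _sysrand.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

print(42 + laplace_noise(scale=1.0))  # a noisy version of the true value 42
```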

3. Usability — The tool supports:

  • Scalability: The user can input arbitrary-sized data. Many real-life datasets are too large to be loaded into memory and require distributed computing using engines such as Apache Spark.
  • Accuracy adjustment and error estimation: The user can set their desired accuracy level or get information about the noise scales and a confidence interval.
  • Parameter search: The user can control the accuracy of a query by defining the notion of neighbouring datasets and epsilon, estimating the smallest epsilon for the desired accuracy, and determining the necessary dataset size or maximum user contribution.
  • Loading multiple public/private data sources: The user can input multiple public and private datasets for advanced computations, such as performing a JOIN.
  • Pre-processing functionality: Data transformation might be needed before differentially private statistical computation. These operations might include imputing or dropping null values, grouping values into categories, type casting (e.g., string to float), etc.

Handling null values in a differential privacy setting can be challenging. If aggregations propagate nulls, the output reveals a non-differentially-private bit about the presence of nulls in the dataset; if nulls are implicitly dropped from aggregations (of known dataset size), the sensitivity of non-null individuals is underestimated. Therefore, aggregators must be fed completely non-null data, obtained by imputing or dropping nulls, as sketched below.
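
A minimal pandas sketch of both options (the column name and values are illustrative):

```python
# A minimal pandas sketch of producing fully non-null input for a DP aggregator.
import pandas as pd

df = pd.DataFrame({"income": [50_000.0, None, 72_000.0, None, 61_000.0]})

# Option 1: impute with a fixed, data-independent value inside the clamping
# bounds (a data-dependent value such as the true mean would itself leak).
imputed = df["income"].fillna(0.0)

# Option 2: drop nulls (note this changes the effective dataset size).
dropped = df["income"].dropna()
```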

  • Post-processing functionality: The user can apply operations to the computed differentially private statistics, usually to minimise noisy signals (see the sketch below).
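
For example, by the post-processing property of differential privacy, noisy counts can be rounded and clamped without affecting the privacy guarantee; a minimal sketch:

```python
# By the post-processing property, any function of a DP release remains DP.
import numpy as np

noisy_counts = np.array([4.7, -1.2, 10.3])  # e.g., a DP histogram release

# Round to integers and clamp negatives, since true counts are non-negative.
cleaned = np.clip(np.round(noisy_counts), 0, None).astype(int)
print(cleaned)  # [ 5  0 10]
```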

4. Differential privacy features — The privacy definitions, mechanisms, and related features each tool supports are summarised in the table below.

Table: Qualitative comparison of the tools on the key desiderata. ✔ denotes that the functionality exists; ✘ that it does not; ✝ that the creators have publicly stated it will be added in the future. Note that some functionalities may be supported by the underlying core libraries but missing from the frameworks in the versions we experimented with (refer to Table A in the Supplementary for an explanation of the concepts).

Guidance For Practitioners On Choosing A Tool

Our evaluation suggests that while no single tool excels in both the qualitative and quantitative analyses, the differences among the tools can be bridged with engineering effort. The teams behind the tools are responsive and active, and we expect the libraries and frameworks to evolve rapidly.

Practitioners’ needs may differ, with some prioritising high security for critical and production use cases, while others may value feature-richness, such as support for diverse analytical queries and differential privacy mechanisms. Some may prefer features that support accuracy adjustment and automated parameter search. Based on our experimental analysis, we have compiled desiderata in Figure 2 (below), which practitioners can use when choosing a tool that suits their needs. For practitioners prioritising utility and execution time, our detailed quantitative analysis (next article in the series) and the summarised findings in Figure 2 can be helpful.

Figure 2: Recommended tools based on empirical analysis and qualitative features that practitioners might gravitate towards. Note that the recommendation on utility and execution time is based on our experimental analysis with the default settings of the tools.

Notably, each tool also has a unique value proposition. Tumult Analytics and OpenDP have extensible underlying frameworks that can easily integrate new privacy computations and provide robust guarantees. Diffprivlib has differentially private machine learning capabilities. PipelineDP is still experimental but stands out for its backend-agnostic data processing and its privacy-utility trade-off visualisation features.

Conclusion

Our analysis is based on the latest version of the libraries and frameworks available at the time of our experiments. Significant changes are expected as they continue to mature with new features, improvements, and research in differential privacy. However, our analysis provides a helpful starting point for practitioners to get insights into these tools and choose the one that suits their needs.

As these tools evolve, active support from their creators is paramount. During our study, we received such support, notably in setting up the AWS environment for our experiments. We appreciate the creators’ assistance and recommend following future releases of the tools to take advantage of new features and improvements. We encourage researchers and library creators to build upon our findings and to help improve these tools for the betterment of society through responsible data sharing. We also welcome feedback from the community to further improve our analysis and ensure its accuracy.

Differential Privacy Series

GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools: PipelineDP by Google and OpenMined, Tumult Analytics by Tumult Labs, OpenDP by the privacy team at Harvard, and Diffprivlib by IBM. Our analysis can help ensure that these tools are used effectively in real-world applications of differential privacy.

The first three parts are also put together in this whitepaper. ✨

DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy and facilitate privacy-centric data sharing.

For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.

Thanks,
Syahri Ikram (syahri@dsaid.gov.sg), for the exceptional support and tireless efforts in helping me with cloud-based experiments.
Alan Tang (alantang@dsaid.gov.sg), Ghim Eng Yap (ghimeng@dsaid.gov.sg), Damien Desfontaines (Tumult Labs, Staff Scientist) and Prof. Xiaokui Xiao (School of Computing, National University of Singapore) for the valuable inputs.

Author: Anshu Singh (anshu@dsaid.gov.sg)

Additional Resources

Articles

Papers

Videos

References

  1. Bun, Mark, et al. “Differentially private release and learning of threshold functions.” 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.
  2. Alabi, Daniel, et al. “Differentially private simple linear regression.” arXiv preprint arXiv:2007.05157 (2020).
  3. Abadi, Martin, et al. “Deep learning with differential privacy.” Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 2016.
  4. Stemmer, Uri, and Haim Kaplan. “Differentially private k-means with constant multiplicative error.” Advances in Neural Information Processing Systems 31 (2018).
  5. Abowd, John M. “The US Census Bureau adopts differential privacy.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
  6. Garrido, Gonzalo Munilla, et al. “Do I get the privacy I need? Benchmarking utility in differential privacy libraries.” arXiv preprint arXiv:2109.10789 (2021).
  7. Garrido, Gonzalo Munilla, et al. “Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry.” arXiv preprint arXiv:2211.03898 (2022).
  8. https://projects.iq.harvard.edu/files/opendp/files/opendp_white_paper_11may2020.pdf
  9. Gaboardi, Marco, Michael Hay, and Salil Vadhan. “A programming framework for OpenDP.” Manuscript, May (2020).
  10. Berghel, Skye, et al. “Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy.” arXiv preprint arXiv:2212.04133 (2022).
  11. https://pipelinedp.io/
  12. Holohan, Naoise, et al. “Diffprivlib: the IBM differential privacy library.” arXiv preprint arXiv:1907.02444 (2019).
  13. Mironov, Ilya. “On significance of the least significant bits for differential privacy.” Proceedings of the 2012 ACM conference on Computer and communications security. 2012.
  14. https://desfontain.es/privacy/renyi-dp-zero-concentrated-dp.html
  15. Balle, Borja, Gilles Barthe, and Marco Gaboardi. “Privacy amplification by subsampling: Tight analyses via couplings and divergences.” Advances in Neural Information Processing Systems 31 (2018).
  16. Hay, Michael, et al. “Principled evaluation of differentially private algorithms using dpbench.” Proceedings of the 2016 International Conference on Management of Data. 2016.
  17. Bun, Mark, and Thomas Steinke. “Concentrated differential privacy: Simplifications, extensions, and lower bounds.” Theory of Cryptography: 14th International Conference, TCC 2016-B, Beijing, China, October 31-November 3, 2016, Proceedings, Part I. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016.
  18. Warner, Stanley L. “Randomized response: A survey technique for eliminating evasive answer bias.” Journal of the American Statistical Association 60.309 (1965): 63–69.

Supplementary

Table A: The privacy definitions and mechanisms referred to in the series. The table has been referenced from this paper.
