# Evaluating Differential Privacy Tools’ Performance

*This article is the **third** in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click **here** to check out the other articles in this series.*

In the previous article of this differential privacy series, we provided guidance on selecting differential privacy tools based on qualitative and quantitative analyses. However, we only provided details on the qualitative analysis.

As promised, we will dive into the quantitative analysis in this article. We will evaluate the utility and scalability of the benchmarked tools — OpenDP (library) by the Harvard privacy team, Tumult Analytics (framework) by Tumult Labs, PipelineDP (framework) by Google and OpenMined, and Diffprivlib (library) by IBM — on synthetic datasets. Our experimental setup is largely based on the work of Gonzalo et al. and builds upon its findings. For completeness, we provide a brief overview of the experimental settings.

All the experiments described in this article are open-sourced.🎊

We have included links to new and advanced concepts and supplementary material to help you follow along with the technical aspects of differential privacy presented in this article.

**Contents**

- Experiment Settings
  - Evaluation principles
  - Implementation details
- Experiment Results
  - Concepts to understand the experimental results
  - Utility analysis
  - Scalability analysis
- Conclusions

# Experiment Settings

## Evaluation principles

**Diversity of dataset characteristics:** We used synthetic datasets generated using the *skew-normal distribution* to provide *scale*, *skew*, and *size* diversity (refer to Figure 1).

- *Scale:* Datasets of varying spread, with values 50, 250, and 500.
- *Skew:* Datasets of varying shape, with values 0, 5, and 50.
- *Size:* Datasets of varying size, ranging from 1,000 to 1 billion data points. This wide range of sizes helps test the tools’ scalability on large datasets in standalone and distributed computing environments.

The diversity in dataset characteristics helps simulate a reasonable model of real-world data distributions. By exposing the tools to datasets with varying *scales*, *skewness*, and *sizes*, we can comprehensively understand their behaviour and performance. This analysis is crucial for real-world applications, where data is often complex and diverse in nature.
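As an illustration, skew-normal samples can be generated with a few lines of standard-library Python, using the identity X = δ|Z₀| + √(1 − δ²)·Z₁ (with δ = α/√(1 + α²)), which yields a standard skew-normal with shape α before location and scale are applied. The function and variable names here are ours, not from the benchmark code.

```python
import math
import random

def sample_skew_normal(n, skew=0.0, scale=1.0, loc=0.0, seed=42):
    """Draw n samples from a skew-normal distribution.

    Uses X = delta*|Z0| + sqrt(1 - delta^2)*Z1, where Z0, Z1 are
    independent standard normals and delta = skew / sqrt(1 + skew^2).
    """
    rng = random.Random(seed)
    delta = skew / math.sqrt(1.0 + skew * skew)
    out = []
    for _ in range(n):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = delta * abs(z0) + math.sqrt(1.0 - delta * delta) * z1
        out.append(loc + scale * x)
    return out

# One dataset per (scale, skew) combination used in the experiments.
datasets = {(sc, sk): sample_skew_normal(1000, skew=sk, scale=sc)
            for sc in (50, 250, 500) for sk in (0, 5, 50)}
```

A positive `skew` shifts mass to the right of the location parameter, which is what produces the long right tails probed in the utility analysis.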

**Diversity of ε (epsilon) values:** We ran the experiments on 28 **ε** values ranging from 0.01 to 10 in logarithmic steps. The choice of **ε** values is informed by a combination of academic recommendations and practical considerations: academic experts suggest positive **ε** values of less than 1, while practitioners often use values beyond this range in real-world applications. Additionally, our pilot experiments indicated that the tools’ performance differed only marginally for **ε** values greater than 10. As a result, we capped the **ε** values at 10 to provide a more focused evaluation of the tools’ performance.
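The 28-value logarithmic **ε** grid can be reproduced as follows (a sketch; the exact grid in our repository may differ slightly):

```python
import math

def epsilon_grid(start=0.01, stop=10.0, num=28):
    """Return `num` epsilon values spaced evenly on a log scale."""
    lo, hi = math.log10(start), math.log10(stop)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]

epsilons = epsilon_grid()  # 0.01, ..., 10.0 in logarithmic steps
```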

**Diversity of analytics queries:** We evaluated the commonly used queries supported by all the tools — *count*, *sum*, *mean*, and *variance*. Except for the *count* query, the queries depend on the content of the dataset and require setting *clamping bounds* (refer to Figure 2). To ensure our evaluation was consistent across all the tools, we set the maximum and minimum values in each dataset as the *clamping bounds*.

**What are clamping bounds?** Clamping bounds limit data values to a specified range to minimise the influence of any individual, especially outliers, on queries like *sum* and *mean*. This is necessary to protect individual privacy while ensuring the utility of the results.
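A minimal clamping sketch (note that in the experiments we used each dataset’s own min/max as bounds for consistency across tools; in practice, bounds should ideally be chosen without inspecting the private data):

```python
def clamp(values, lower, upper):
    """Clamp every value into [lower, upper] to bound individual influence."""
    return [min(max(v, lower), upper) for v in values]

data = [-12.0, 3.5, 7.0, 250.0]
clamped = clamp(data, lower=-5.0, upper=100.0)  # -> [-5.0, 3.5, 7.0, 100.0]
```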

**Metrics:** We used the *mean of relative error* (MRE) for the utility analysis to measure the accuracy of the queries’ output. MRE is widely used in the extant literature and allows results from datasets of different scales to be compared; lower MRE values indicate better performance. The baseline for the error calculation is the deterministic (noise-free) output of the queries. For the scalability analysis, we measured the execution time (in seconds) of the queries.
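Concretely, the MRE over repeated noisy releases can be computed as follows (our helper, not taken from any of the tools):

```python
def mean_relative_error(noisy_outputs, true_value):
    """Mean of |noisy - true| / |true| over repeated noisy releases."""
    return sum(abs(y - true_value) for y in noisy_outputs) / (
        abs(true_value) * len(noisy_outputs))

# e.g. three noisy answers to a query whose true (deterministic) answer is 100
mre = mean_relative_error([98.0, 103.0, 100.0], 100.0)  # (2 + 3 + 0) / 300
```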

## Implementation details

The experiments were conducted on AWS Glue (Python environment) using a single DPU with 16 GB of memory. For the Spark-supported tools — Tumult Analytics and PipelineDP — experiments were conducted on AWS EMR using two r4.8xlarge instances. As PipelineDP is backend-agnostic and supports both standalone and distributed computing, we ran its experiments in both environments.

Note: Readers who want to get started with scalable differential privacy can check out the final article in this series.

# Experiment Results

## Concepts to understand the experimental results

**Neighbouring definition:** There are two definitions of *neighbouring* datasets: *unbounded* and *bounded*. The *unbounded* definition considers datasets that differ by adding or removing a record, hiding the presence or absence of an individual in a dataset. For example, if a dataset contains information about patients in a hospital, the *unbounded* definition might be used to ensure that the presence or absence of a particular patient is not revealed in the released data. The *bounded* definition considers datasets that differ by changing a record, protecting an individual’s attribute values when their presence in the dataset is already known. For example, if a dataset contains information about employees in a company, the *bounded* definition might be used to ensure that the salary of a specific employee is not revealed in the released data, while still allowing the employee to be identified.

Both definitions are important for different use cases of differential privacy and help ensure that the release of sensitive data is protected against privacy breaches.

**Sensitivity:** The maximum change in a query’s output when one record in a dataset is changed; the noise scale of a distribution increases with *sensitivity* (refer to Figure 3). *Sensitivity* computation depends on the *mechanism* used, the query being computed, and the *neighbouring* definition. Our work focuses on the *Laplace mechanism*; refer to Table B in the Supplementary for the sensitivity computations of the experimented queries. We also refer readers to Harvard’s papers on calculating the *sensitivity* of a query based on the *neighbouring* definition.
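As a sketch, the Laplace mechanism adds noise with scale sensitivity/ε. The inverse-CDF sampler below is for illustration only; production libraries use hardened samplers to avoid floating-point vulnerabilities (see Mironov 2012 in the references).

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value + Laplace(scale = sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                       # uniform on (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

rng = random.Random(0)
# Count query: sensitivity is 1 regardless of the data.
noisy_count = laplace_mechanism(1000, sensitivity=1, epsilon=0.5, rng=rng)
```

Note how a larger sensitivity or a smaller ε both inflate the noise scale, which is exactly the relationship driving the utility results below.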

## Utility analysis

**Setup:** We conducted the utility analysis on synthetic datasets with varying *sizes* (1k, 10k, and 100k data points), *skews*, and *scales*, resulting in 27 datasets evaluated on the four queries — *count*, *sum*, *mean*, and *variance*. Each query was further run with 28 different **ε** values.

Most tools have default *mechanisms* and *neighbouring* dataset definitions (*bounded* or *unbounded*) for query computation (see Table 1). One consideration for the default *mechanism* is the data type: integer-valued data points default to the *Geometric mechanism* (also called the *discrete Laplace mechanism*), which outputs integer random values, while the float type defaults to the *Laplace mechanism*. We resort to the *Laplace mechanism* in cases with no default *mechanism* (*pure* differential privacy).

Moreover, our implementation (refer to our GitHub repository) is tailored for a non-expert who might depend on automated parameter computation (such as *sensitivity*) and minimal inputs: the **ε** value, the *clamping bounds* (set as the maximum and minimum values of the dataset), and the maximum user influence bound (set as 1).

**Impact of dataset skew, scale, size, and ε value on the accuracy of the queries:** Our experimental results demonstrate a correlation between the noise scale and the error (MRE) of a query: a decreasing noise scale results in reduced error. Given that the *Laplace distribution*’s noise scale is inversely proportional to **ε** and directly proportional to the *sensitivity* of the query:

- The error is higher for **ε** values less than 1, which is the range recommended by some differential privacy pioneers. However, the error approaches zero as **ε** increases.
- The *sensitivity* of queries increases as the spread of a distribution increases, and spread increases with low *skew* and high *scale* values. The *neighbouring* definition (*bounded* or *unbounded*) used to compute queries’ sensitivities also affected the noise scale (refer to Table B in the Supplementary). **Notably, the noise scale is directly related to the privacy protection offered: a higher noise scale results in higher privacy protection.**

Our findings further show that query performance improves with larger dataset *sizes* as the impact of noise diminishes. However, a large spread can still affect results due to probable outliers, even on large datasets.

We recommend that practitioners be aware of queries with high *sensitivity* and compensate by selecting a higher **ε** value, which improves accuracy. However, the privacy implications of doing so should be considered carefully.

**Performance of tools on the queries:**

*Count* (refer to Figure 4): The *sensitivity* of the *count* query is always one, independent of the *skew* and *scale* of a dataset. Hence, we excluded the *skew* and *scale* variables from the *count* query results. Diffprivlib outperforms the other tools for **ε** ≤ 0.1 and smaller datasets (observable for ≤10k data points) due to its underlying *mechanism* — *Geometric truncated* (refer to Table A of the Supplementary) — which improves accuracy by mapping negative *count* values to 0.

*Sum* (refer to Figure 5): Tumult Analytics and PipelineDP perform better than the other tools for **ε** less than 1. This is due to the *unbounded* definition for *sensitivity* calculation used by Tumult Analytics and PipelineDP, while the others use the *bounded* notion. Our synthetic datasets have negative and positive values, resulting in higher *sensitivity* under the *bounded* definition than under the *unbounded* one. For example, if the maximum value is 100 and the minimum value is -5, the *sensitivity* of the *sum* query under the *bounded* definition is 100 - (-5) = 105, whereas under the *unbounded* definition it is max(abs(-5), abs(100)) = 100. However, please note that *bounded* differential privacy, with its higher noise scale, gives stronger privacy protection than *unbounded* differential privacy.
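The two sum sensitivities from the example above can be checked directly; this is a sketch of the formulas in Table B using our own helper names:

```python
def sum_sensitivity_bounded(lower, upper):
    """Changing one clamped record moves the sum by at most upper - lower."""
    return upper - lower

def sum_sensitivity_unbounded(lower, upper):
    """Adding or removing one clamped record moves the sum by at most
    max(|lower|, |upper|)."""
    return max(abs(lower), abs(upper))

print(sum_sensitivity_bounded(-5, 100))    # 105
print(sum_sensitivity_unbounded(-5, 100))  # 100
```

With all-positive data the two coincide (e.g. bounds (0, 100) give 100 either way); it is the mix of negative and positive values in our datasets that separates them.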

*Mean* (refer to Figure 6): The *sensitivity* calculation is similar for the *bounded* and *unbounded* definitions; therefore, we observe nearly identical performance among the tools. Furthermore, all the tools use the *Laplace mechanism* as the default (or we used it in the code). The impact of *skewness* and *scale* follows the same reasoning as for the *sum* query, because, except for Diffprivlib (which uses NumPy’s mean API), the tools use the *sum* and *count* queries as building blocks for the computation.

*Variance* (refer to Figure 7): Tumult Analytics and PipelineDP perform better than the other tools on large-*scale* and low-*skew* values (higher noise distribution spread). OpenDP performs better on lower *scale* and larger *skew* values where **ε** is less than 1. The difference in performance can be attributed to the differing *bounded* and *unbounded* *neighbouring* definitions, which result in different *sensitivity* calculations for the *variance* query. Diffprivlib performs the worst on the *variance* query, although its outputs are consistent with the actual *variance* values and are greater than zero. This is due to Diffprivlib’s use of the *Laplace Bounded Domain* mechanism (refer to Table A of the Supplementary), which increases noise while maintaining output consistency for the *variance* query.

Most tools use *sum* and *count* queries as building blocks for variance computation or use similar *mechanisms*. As a result, the overall performance of the tools on *variance* demonstrates similar behaviour to the *sum* query across the three independent variables — *skew, scale,* and *size*, except for Diffprivlib, which uses NumPy’s *variance* API.

## Scalability analysis

**Setup:** In our scalability benchmark, we measured the execution time of the queries on datasets ranging from 1,000 to 1 billion data points. We used fixed values of **ε** (0.01), *skew* (0), and *scale* (250), as existing literature suggests scalability is largely independent of the sampled noise. Further, our timing measured only the query execution and excluded pre-processing steps, as practitioners’ implementations can vary.
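The timing measurement can be sketched as below: the clock brackets only the query call itself, with data loading and pre-processing kept outside the timed region. This is our illustrative harness, not the benchmark’s exact code.

```python
import time

def time_query(query_fn, *args, repeats=3):
    """Return the best-of-`repeats` wall-clock time (seconds) for query_fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn(*args)                 # only the query execution is timed
        best = min(best, time.perf_counter() - start)
    return best

data = list(range(100_000))             # pre-processing stays outside the timer
elapsed = time_query(sum, data)
```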

**Results:** For scalability assessment, we compare the tools based on the environment they were executed in — standalone and distributed.

Figure 8 shows the execution time of the Diffprivlib, PipelineDP, and OpenDP tools in standalone environments. The missing results in the figure indicate instances where the query computation for a library led to a memory error.

The Diffprivlib library performed best across all queries and scaled relatively well with increasing dataset size. This is because the core functionalities of Diffprivlib are implemented in NumPy, whereas PipelineDP and OpenDP are Python wrappers around core functionalities in C++ and Rust, respectively. In general, none of the tools scale to the largest datasets, resulting in memory errors or a substantial increase in execution time.

Also, our implementation of OpenDP uses a binary search algorithm to find the noise scale and avoid manual input. This feature ensures the differential privacy guarantee but adds to the execution time when running the queries. For PipelineDP, the version we experimented with is tailored for GROUP BY-type queries that create *partitions* (subsets of the dataset). There are pre-processing steps for partition handling (such as adding empty *partitions* for public *partitions*) and for bounding individual contributions across and within *partitions*. Global aggregations (which we experimented with), i.e., computing statistics over all the data points, are treated as a single *partition*. These unnecessary pre-processing steps add some computational overhead, but upcoming versions may add functionality to avoid this overhead for global aggregations, which will benefit practitioners whose use case involves running GROUP BY queries.
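The idea behind the noise-scale search can be sketched generically: given a predicate that reports whether a candidate scale satisfies the target privacy guarantee (and that, for Laplace-style mechanisms, is monotone in the scale), bisect for the smallest acceptable scale. This is a simplified illustration of the approach, not OpenDP’s actual implementation.

```python
def smallest_passing_scale(satisfies, lo=1e-8, hi=1e8, iters=100):
    """Bisect for the smallest noise scale where `satisfies(scale)` is True.

    Assumes `satisfies` is monotone: once True, it stays True as scale grows.
    """
    if not satisfies(hi):
        raise ValueError("no passing scale in the search range")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if satisfies(mid):
            hi = mid                    # mid passes; try a smaller scale
        else:
            lo = mid                    # mid fails; need a larger scale
    return hi

# e.g. Laplace sum query: privacy loss sensitivity/scale must not exceed epsilon
sensitivity, epsilon = 105.0, 0.5
scale = smallest_passing_scale(lambda s: sensitivity / s <= epsilon)
# converges to sensitivity / epsilon = 210
```

The repeated predicate evaluations are what add to OpenDP’s execution time in our runs.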

Figure 9 compares the execution times of Tumult Analytics and PipelineDP, the two tools that support distributed computing, on the Spark engine. Tumult Analytics predominantly outperforms PipelineDP on datasets with more than 1 million rows across the queries, while for datasets of fewer than a million data points, PipelineDP largely outperforms Tumult Analytics. We also observed that Tumult Analytics has higher memory consumption than PipelineDP; one reason is that Tumult Analytics creates temporary tables so that repeated queries are not re-evaluated with fresh randomness.

# Conclusions

In our empirical quantitative analysis, we provided insights into the tools’ implementations and into how factors beyond the **ε** value — namely the *scale*, *size*, and *skew* of the data — can impact the accuracy of a query. Our analysis focused on the four most commonly used queries; there are other queries to experiment with (e.g., *quantile* and GROUP BY queries), including running multiple queries in one session to optimise privacy budget usage. In the future, we look forward to experimenting with other mechanisms (such as the *Gaussian* and *Exponential* mechanisms), advanced privacy definitions (e.g., *zero-concentrated* differential privacy), and capabilities (e.g., privacy amplification and private dataset JOINs) offered by the tools.

## Differential Privacy Series

GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools. These tools include *PipelineDP* by Google and OpenMined, *Tumult Analytics* by Tumult Labs, *OpenDP* by the privacy team at Harvard, and *Diffprivlib* by IBM. Our analysis can help ensure that the tools are used effectively in real-world applications of differential privacy.

- **Part 1: Sharing Data with Differential Privacy: A Primer** — A beginner’s guide to understanding the fundamental concepts of differential privacy with simplified mathematical interpretation.
- **Part 2: Practitioners’ Guide to Accessing Emerging Differential Privacy Tools** — Explore the emerging differential privacy tools developed by prominent researchers and institutions, with practical guidance on their adoption for real-world use cases.
- **Part 3: Evaluating Differential Privacy Tools’ Performance** *(this article)*
- **Part 4: Getting Started with Scalable Differential Privacy Tools on the Cloud** — A step-by-step guide to deploying differential privacy tools in a distributed environment on AWS services, specifically AWS Glue and Amazon EMR, to support the analysis of large datasets.

The first three parts are also put together in this whitepaper. ✨

DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy and facilitate privacy-centric data sharing.

For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.

*Thanks to Syahri Ikram (syahri@dsaid.gov.sg) for the exceptional support and tireless efforts in helping me with the cloud-based experiments, and to Alan Tang (alantang@dsaid.gov.sg), Ghim Eng Yap (ghimeng@dsaid.gov.sg), Damien Desfontaines (Staff Scientist, Tumult Labs), and Prof. Xiaokui Xiao (School of Computing, National University of Singapore) for the valuable inputs.*

**Author**: Anshu Singh (anshu@dsaid.gov.sg)

## Additional Resources

**Articles:**

- Damien Desfontaines differential privacy series
- Programming differential privacy: A book about differential privacy
- Differential privacy for dummies

**Papers**

- Differential Privacy: A primer for a non-technical audience
- The algorithmic foundations of differential privacy by Cynthia Dwork

**Videos**

- A friendly video on reconstruction attack by MinutePhysics
- A practical beginners’ guide to differential privacy
- (Course) Differential privacy course by Gautam Kamath of University of Waterloo

## References

- Bun, Mark, et al. “Differentially private release and learning of threshold functions.” 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.
- Alabi, Daniel, et al. “Differentially private simple linear regression.” arXiv preprint arXiv:2007.05157 (2020).
- Abadi, Martin, et al. “Deep learning with differential privacy.” Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 2016.
- Stemmer, Uri, and Haim Kaplan. “Differentially private k-means with constant multiplicative error.” Advances in Neural Information Processing Systems 31 (2018).
- Abowd, John M. “The US Census Bureau adopts differential privacy.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
- Garrido, Gonzalo Munilla, et al. “Do I get the privacy I need? Benchmarking utility in differential privacy libraries.” arXiv preprint arXiv:2109.10789 (2021).
- Garrido, Gonzalo Munilla, et al. “Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry.” arXiv preprint arXiv:2211.03898 (2022).
- https://projects.iq.harvard.edu/files/opendp/files/opendp_white_paper_11may2020.pdf
- Gaboardi, Marco, Michael Hay, and Salil Vadhan. “A programming framework for OpenDP.” Manuscript, May 2020.
- Berghel, Skye, et al. “Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy.” arXiv preprint arXiv:2212.04133 (2022).
- https://pipelinedp.io/
- Holohan, Naoise, et al. “Diffprivlib: the IBM differential privacy library.” arXiv preprint arXiv:1907.02444 (2019).
- Mironov, Ilya. “On significance of the least significant bits for differential privacy.” Proceedings of the 2012 ACM conference on Computer and communications security. 2012.
- https://desfontain.es/privacy/renyi-dp-zero-concentrated-dp.html
- Balle, Borja, Gilles Barthe, and Marco Gaboardi. “Privacy amplification by subsampling: Tight analyses via couplings and divergences.” Advances in Neural Information Processing Systems 31 (2018).
- Hay, Michael, et al. “Principled evaluation of differentially private algorithms using dpbench.” Proceedings of the 2016 International Conference on Management of Data. 2016.
- Bun, Mark, and Thomas Steinke. “Concentrated differential privacy: Simplifications, extensions, and lower bounds.” *Theory of Cryptography: 14th International Conference, TCC 2016-B*, Beijing, China, October 31–November 3, 2016, Proceedings, Part I. Berlin, Heidelberg: Springer, 2016.
- Warner, Stanley L. “Randomized response: A survey technique for eliminating evasive answer bias.” *Journal of the American Statistical Association* 60.309 (1965): 63–69.