Visualizing Privacy Trade-Offs for Sensitive Data

Published in Technically Social · Apr 5, 2022

By Priyanka Nanayakkara

Based on work by Priyanka Nanayakkara, Johes Bater, Xi He, Jessica Hullman, and Jennie Rogers

After the 2010 Census, the U.S. Census Bureau found that responses from over 46% of the population could be reconstructed exactly using published statistics. In other words, it was possible to go from published statistics to individual-level records describing ethnicity, race, age, sex, and census block. Statistics can help us learn patterns in the population, but what happens when they also reveal sensitive information about individuals? This is a challenge inherent in releasing statistics: each one “leaks” information about the underlying dataset, increasing the probability that someone in the dataset can later be re-identified.

To address these issues, some organizations have started applying differential privacy. Differential privacy is a definition of privacy that limits how much the result of an analysis can vary depending on whether any given individual’s data are included. To date, it has been applied by organizations like Apple, Google, Facebook (now Meta), Microsoft, and the U.S. Census Bureau for a range of use cases. For example, Apple uses differential privacy to learn which emojis are generally popular while limiting how much can be learned about any individual’s emoji preferences. The U.S. Census Bureau is using differential privacy to release results of the 2020 Census.
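For reference (this is the standard textbook definition, added here rather than taken from the original post), a randomized mechanism M satisfies ε-differential privacy if, for any two datasets D and D′ that differ in one individual’s record and any set of possible outputs S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The parameter ε is the privacy budget discussed below: the smaller it is, the closer the two output distributions must be, and the less any one person’s data can influence what is released.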

Analyses Under Differential Privacy

Conducting an analysis under differential privacy often means adding a calibrated amount of noise to results, where more expected noise implies stronger privacy protection. A “privacy budget” parameter controls this trade-off: the higher the chosen privacy budget, the less noise we expect to be added to the result, and the higher the accuracy of the release (i.e., how close the release is to the actual result). The privacy-accuracy trade-off also exists for confidence intervals constructed under differential privacy, but setting privacy budgets for confidence intervals additionally requires considering how two sources of error (measurement error and noise from differential privacy) combine to affect the released interval.
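ViP’s underlying mechanisms are described in the paper; as a rough illustration of how the privacy budget controls noise, here is a minimal Python sketch of the widely used Laplace mechanism (the sample size, hypertension rate, and budget values are made up for illustration):

```python
import numpy as np

def laplace_release(true_value, sensitivity, epsilon, rng=None):
    """Return a privacy-preserving release of true_value via the Laplace
    mechanism: noise is drawn with scale = sensitivity / epsilon, so a larger
    privacy budget (epsilon) means less expected noise and higher accuracy."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Illustrative numbers only: a proportion computed over n records has sensitivity 1/n.
n = 5000                 # hypothetical number of patient records
true_rate = 0.31         # hypothetical rate of hypertension in a subgroup
for eps in (0.1, 0.5, 1.0):
    print(eps, round(laplace_release(true_rate, sensitivity=1 / n, epsilon=eps), 4))
```

Because the noise scale is the sensitivity divided by the budget, doubling the budget halves the expected noise; accuracy therefore improves as the budget grows, but not linearly.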

So how much differential privacy noise should be added? What is the right privacy budget for a given context? We designed a visualization interface, Visualizing Privacy (ViP), that lets a user experiment with different privacy budgets and see the implications for the accuracy of both point estimates and confidence intervals, and for privacy in terms of disclosure risk, so that they can weigh accuracy against privacy for a specific scenario.

ViP: An Interactive Visualization Interface for Setting Privacy Budgets

Imagine that a practitioner at a hospital wants to publish rates of hypertension for different subgroups (defined by ethnicity, age, race, and zip code). They think the public could benefit from these statistics but also need to protect the privacy of patients whose data are in the hospital’s database. In short, the practitioner must weigh the accuracy of the publishable statistics against the privacy afforded to patients.

They turn to ViP (demo here) to navigate the trade-off. Below, ViP shows the practitioner’s queries about rates of hypertension. The practitioner also wants to release confidence intervals for the population proportions. Taking all of this into account, they must decide how much privacy budget to allocate to each query using the sliders in the “Privacy Use” tab; adjusting a query’s slider dynamically updates its visualizations of accuracy and privacy.

Demo of ViP
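ViP’s exact privacy accounting is described in the paper; the snippet below is only a sketch, with made-up query names and budget values, of the basic (sequential) composition rule under which per-query budgets add up when every result is released:

```python
# Hypothetical per-query privacy budgets, e.g., as set with ViP's sliders.
query_budgets = {
    "hypertension by ethnicity": 0.25,
    "hypertension by age":       0.50,
    "hypertension by race":      0.25,
    "hypertension by zip code":  0.10,
}

# Under basic sequential composition, releasing all results consumes the sum.
total_budget = sum(query_budgets.values())
print(f"Total privacy budget if all queries are released: {total_budget:.2f}")
```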

Visualizing Accuracy

Showing accuracy in a differential privacy context is challenging because accuracy is probabilistic: some privacy-preserving releases are more likely than others, and until the privacy budget is finalized and the actual release is drawn, we can only reason about the distribution of possible releases. Uncertainty visualization can be a useful tool for conveying probability distributions and making uncertainty more visceral. ViP uses two such techniques to show accuracy: 1) quantile dotplots and 2) hypothetical outcome plots (HOPs).

Quantile dotplots are discrete representations of continuous probability distributions that ViP uses to show the distribution of potential privacy-preserving releases. They allow readers to quickly estimate how likely a privacy-preserving release is to fall into a given range under the specified privacy budget.

A quantile dotplot shows the distribution of potential privacy-preserving releases for a subquery; hovering over a bin brings up an explanatory tooltip.
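As a sketch of the idea behind quantile dotplots (assuming the Laplace mechanism from the earlier snippet; ViP’s actual release distributions come from its own mechanisms), each dot corresponds to one quantile of the release distribution, so every dot carries equal probability:

```python
import numpy as np
from scipy import stats

def quantile_dots(true_value, sensitivity, epsilon, n_dots=50):
    """Return n_dots equally likely values, one per quantile of the Laplace
    release distribution; stacking them into bins yields a quantile dotplot."""
    scale = sensitivity / epsilon
    probs = (np.arange(n_dots) + 0.5) / n_dots   # midpoints of equal-probability bins
    return stats.laplace.ppf(probs, loc=true_value, scale=scale)

dots = quantile_dots(true_value=0.31, sensitivity=1 / 5000, epsilon=0.5)
# Because every dot is equally likely, the share of dots inside a range
# approximates the chance that the eventual release lands in that range.
print(np.mean(np.abs(dots - 0.31) <= 0.001))
```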

HOPs, unlike quantile dotplots, are animated. Animated vertical lines over each dotplot show potential privacy-preserving releases, further driving home the idea that once the privacy budget allocation is finalized, only one privacy-preserving release will be made available to the public. Static vertical lines marking each result with no noise added are shown for reference.

A HOP over a quantile dotplot animates potential privacy-preserving releases.
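A HOP is driven by random draws from the same release distribution; a minimal sketch (again assuming the Laplace mechanism above) simply samples a sequence of hypothetical releases to cycle through as animation frames:

```python
import numpy as np

def hypothetical_outcomes(true_value, sensitivity, epsilon, n_frames=20, rng=None):
    """Draw n_frames hypothetical privacy-preserving releases; a HOP animates
    them one at a time, e.g., as a moving vertical line over the dotplot."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(0.0, sensitivity / epsilon, size=n_frames)

frames = hypothetical_outcomes(0.31, sensitivity=1 / 5000, epsilon=0.5)
# In practice, only one of these hypothetical outcomes would ever be released.
```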

ViP also uses HOPs to show potential confidence intervals constructed under differential privacy. The 50%, 80%, and 95% privacy-preserving confidence intervals are shown as a gradient and animate through potential sets of intervals; static confidence intervals (constructed in the traditional way, without added differential privacy noise) are shown below for reference.

Traditional confidence intervals remain static while potential privacy-preserving confidence intervals are animated as a HOP.
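ViP’s privacy-preserving confidence intervals are constructed as described in the paper, which is not reproduced here. Purely to illustrate how the two error sources interact, the sketch below naively combines the sampling variance of a proportion with the variance of Laplace noise into one normal-approximation interval; this is a simplification, not the paper’s method:

```python
import numpy as np
from scipy import stats

def naive_dp_confidence_interval(noisy_p, n, sensitivity, epsilon, level=0.95):
    """Illustrative only: combine the sampling variance of a proportion with
    the variance of Laplace noise (2 * scale**2) into a single
    normal-approximation interval."""
    scale = sensitivity / epsilon
    var = noisy_p * (1 - noisy_p) / n + 2 * scale ** 2
    z = stats.norm.ppf(0.5 + level / 2)
    half_width = z * np.sqrt(var)
    return noisy_p - half_width, noisy_p + half_width

print(naive_dp_confidence_interval(noisy_p=0.31, n=5000, sensitivity=1 / 5000, epsilon=0.5))
```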

Visualizing Disclosure Risk

Accuracy, however, is only one side of the trade-off. ViP shows privacy as disclosure risk under an attack model that assumes an attacker has access to all records in a dataset and is trying to guess whether an individual’s information was used in a computation based on some sensitive attribute. ViP plots an upper bound on the probability that the attacker correctly guesses whether an individual’s information was used in the computation after seeing the computation’s result. Each query has its own dot on the curve corresponding to the disclosure risk if only that query’s results are released. The black dot shows the disclosure risk if results from all the queries are released.

Updating the sliders dynamically adjusts dots on the curve.

Disclosure risk as a function of the privacy budget; dots on the curve show the risk for each individual query and for all queries combined.
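The exact risk curve ViP plots is defined in the paper; as a hedged illustration with the same qualitative shape, the function below computes the standard hypothesis-testing bound on an attacker’s probability of guessing correctly under ε-differential privacy with a 50/50 prior, which rises from 0.5 toward 1 as the budget grows (the per-query budgets are made up, and they are combined here with basic composition):

```python
import numpy as np

def disclosure_risk_bound(epsilon):
    """Upper bound on the probability that an attacker with a 50/50 prior
    correctly guesses whether an individual's record was used, under
    epsilon-differential privacy (standard hypothesis-testing bound)."""
    return np.exp(epsilon) / (1 + np.exp(epsilon))

per_query = {"ethnicity": 0.25, "age": 0.50, "race": 0.25, "zip code": 0.10}
for name, eps in per_query.items():
    print(f"{name}: {disclosure_risk_bound(eps):.3f}")

# Releasing all queries: under basic composition, the budgets simply add.
print(f"all queries: {disclosure_risk_bound(sum(per_query.values())):.3f}")
```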

Qualitative User Study

We conducted both a preliminary qualitative study with a version of ViP that supported only one query and an evaluative study with the multi-query version of ViP. In this blog post, we’ll very briefly go over some findings from the preliminary study, in which we asked six clinical research professionals with little to no differential privacy background to answer questions about key differential privacy trade-offs using ViP. We asked participants to think aloud while answering. (For results from the evaluative study, see the paper or this blog post.) Findings from the qualitative study offer insight into how interfaces may help domain experts grapple with differential privacy.

For example:

We found that all participants could articulate key differential privacy relationships, including more nuanced aspects, such as the fact that accuracy does not increase linearly as the privacy budget increases.

Participants employed different strategies when setting privacy budgets without specific requirements.

  • Two participants focused on disclosure risk. One participant, for example, said they would consider “some of the non-mathematical features of the population” like whether the data described “illegal activities, sexual practices,” indicating that risk considerations may depend on the societal context of the data.
  • Four participants focused on both accuracy and risk considerations.

Three participants expressed confusion or concern over how to interpret disclosure risk, with one saying that “[it’s like] putting an absolute number on something that’s hard to quantify.”

Conclusion

Our work suggests that interfaces for differential privacy, like ViP, can help users keep track of key differential privacy relationships. Think-aloud results from our preliminary study surfaced some potential challenges with using differential privacy in practice, for instance around how to interpret disclosure risk and what it means in a real-world context. Interfaces may be one way of identifying these issues together with domain experts, since they offer a way to talk through relevant concepts without requiring extensive differential privacy background or expertise.

For more details, please see our full paper:

Nanayakkara, P., Bater, J., He, X., Hullman, J., & Rogers, J. (2022). Visualizing Privacy-Utility Trade-Offs in Differentially Private Data Releases. To appear in Proceedings on Privacy Enhancing Technologies (PoPETS). https://arxiv.org/abs/2201.05964

Priyanka Nanayakkara is a PhD student in Technology and Social Behavior at Northwestern University.