Why Differential Privacy?

Sudip Kar
Published in BerkeleyISchool
Jun 4, 2024

Image credit: iQoncept / Adobe Stock

Thanks to folks who read and liked my previous article on preventing information disclosure by GenAI using differential privacy. While I am still working on implementing the next phase of that project, I know that people familiar with privacy mechanisms will ask, "Why can't we just use simpler privacy approaches such as k-anonymity, l-diversity, or t-closeness? Why differential privacy?"

This article answers that question in detail. For simplicity, and to keep the article's length in check, I will focus only on k-anonymity. Research and experiments (such as "Composition Attacks and Auxiliary Information in Data Privacy" by Srivatsava et al.) show similar shortcomings in other models such as l-diversity and t-closeness.

First, privacy algorithms such as k-anonymity were designed for non-interactive datasets. A non-interactive dataset is static and does not change after it is published publicly. However, the large language models (LLMs) used in GenAI are self-learning models that work in real time: they study large amounts of real-time data and update their parameters to encode the relationships in that data. In short, the datasets behind LLMs are dynamic.

Second, privacy models built for static datasets struggle when anonymized datasets are released by multiple independent sources for overlapping populations. So conventional mechanisms won't work for protecting privacy with LLMs.

Third, these privacy mechanisms can be broken by auxiliary information: extra knowledge an attacker has gained that helps them identify a person in a dataset, to some degree or with 100% confidence.

Fourth and most important, privacy mechanisms such as k-anonymity are vulnerable to composition attacks.

When we combine these shortcomings, it comes down to "static datasets released by multiple independent sources for overlapping populations, where the attacker already has auxiliary information that can be used for a composition attack." In this article, I will focus on this scenario; in coming articles, I will show how differential privacy can mitigate this privacy risk.

To demonstrate that, let's assume we have two anonymized datasets from independent sources covering overlapping populations, as shown in Figures 1 and 2. {Sex, Age, ZipCode} are the quasi-identifiers (attributes to which we can apply anonymization), and {Diagnosis} is the sensitive attribute (the attribute we do not anonymize; it is published as-is). Dataset A is 3-anonymous whereas Dataset B is 2-anonymous, so both datasets satisfy k-anonymity with k = 3 and k = 2, respectively.

Figure 1: Dataset A from Source 1
Figure 2: Dataset B from Source 2
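
Since the contents of Figures 1 and 2 are not reproduced here, the sketch below builds two hypothetical tables consistent with the description: Dataset A is 3-anonymous and Dataset B is 2-anonymous over {Sex, Age, ZipCode}, and a 27-year-old male appears in both. The k_of helper (my own name, not from any library) simply reports the size of the smallest equivalence class, which is the k each release satisfies.

```python
# Hypothetical stand-ins for the tables in Figures 1 and 2.
# Quasi-identifiers are generalized (age ranges, zip-code prefixes), as k-anonymity requires.
import pandas as pd

QI = ["Sex", "Age", "ZipCode"]  # quasi-identifiers

df_A = pd.DataFrame([          # Dataset A: every equivalence class has >= 3 rows
    {"Sex": "M", "Age": "20-30", "ZipCode": "130**", "Diagnosis": "AIDS"},
    {"Sex": "M", "Age": "20-30", "ZipCode": "130**", "Diagnosis": "Flu"},
    {"Sex": "M", "Age": "20-30", "ZipCode": "130**", "Diagnosis": "Cancer"},
    {"Sex": "F", "Age": "30-40", "ZipCode": "148**", "Diagnosis": "Diabetes"},
    {"Sex": "F", "Age": "30-40", "ZipCode": "148**", "Diagnosis": "Flu"},
    {"Sex": "F", "Age": "30-40", "ZipCode": "148**", "Diagnosis": "Asthma"},
])

df_B = pd.DataFrame([          # Dataset B: every equivalence class has >= 2 rows
    {"Sex": "M", "Age": "25-29", "ZipCode": "1300*", "Diagnosis": "AIDS"},
    {"Sex": "M", "Age": "25-29", "ZipCode": "1300*", "Diagnosis": "Diabetes"},
    {"Sex": "F", "Age": "35-39", "ZipCode": "1485*", "Diagnosis": "Cancer"},
    {"Sex": "F", "Age": "35-39", "ZipCode": "1485*", "Diagnosis": "Flu"},
])

def k_of(df: pd.DataFrame) -> int:
    """k-anonymity level = size of the smallest equivalence class over the quasi-identifiers."""
    return int(df.groupby(QI).size().min())

print(f"Dataset A is {k_of(df_A)}-anonymous")  # 3
print(f"Dataset B is {k_of(df_B)}-anonymous")  # 2
```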

Also, as mentioned earlier, in this scenario, an attacker knows the following (auxiliary information) about a person P:

  1. Person P is a 27-year-old male.
  2. Person P is in both datasets.

The attacker looks at the rows consistent with a 27-year-old male in each release; since AIDS is the only diagnosis that appears in both of those groups, the attacker learns that person P is diagnosed with AIDS with 100% confidence. So, even though these datasets came from different sources and each satisfied k-anonymity, they could not protect one person's sensitive attribute because of the auxiliary information the attacker had. In most cases an attacker might not identify a person with 100% confidence, but auxiliary information does help narrow down who has a certain sensitive attribute. This attack is called an intersection attack, a type of composition attack.
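
A minimal sketch of the intersection attack, continuing the hypothetical tables from the previous snippet (df_A, df_B, and the generalized age ranges defined there): restrict each release to the rows that could describe a 27-year-old male, then intersect the possible diagnoses.

```python
# Continuation of the previous sketch: df_A and df_B are the hypothetical tables defined above.

def age_matches(generalized: str, age: int) -> bool:
    """True if a concrete age falls inside a generalized range such as '20-30'."""
    lo, hi = (int(x) for x in generalized.split("-"))
    return lo <= age <= hi

# Rows consistent with the attacker's auxiliary information: person P is a 27-year-old male.
candidates_A = df_A[(df_A["Sex"] == "M") & df_A["Age"].apply(lambda r: age_matches(r, 27))]
candidates_B = df_B[(df_B["Sex"] == "M") & df_B["Age"].apply(lambda r: age_matches(r, 27))]

# Each release on its own only narrows P down to an equivalence class of diagnoses...
print(set(candidates_A["Diagnosis"]))   # {'AIDS', 'Flu', 'Cancer'}
print(set(candidates_B["Diagnosis"]))   # {'AIDS', 'Diabetes'}

# ...but because P is known to be in BOTH releases, intersecting the two sets
# pins down the sensitive value exactly.
print(set(candidates_A["Diagnosis"]) & set(candidates_B["Diagnosis"]))  # {'AIDS'}
```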

So how does differential privacy (DP) help mitigate this information disclosure risk? A DP algorithm typically introduces a small, controlled amount of noise into the analyzed data. This noise is calibrated to the magnitude of a single individual's possible contribution to the dataset, effectively masking that contribution (or, in our case, the presence of any one individual in the dataset). The amount of information revealed is parameterized by a variable epsilon, which also controls how much noise is added. A lower epsilon means less can be learned because more noise is added (good for privacy, but utility takes a hit). A higher epsilon means more can be learned because less noise is added (good for utility, but privacy takes a hit). Therefore, epsilon should be tuned to strike the right privacy-utility balance.
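
As a minimal sketch of that idea (not the implementation from my project), the snippet below answers a counting query, "how many people in the data have an AIDS diagnosis?", with the classic Laplace mechanism. A count changes by at most 1 when any single person is added or removed, so Laplace noise with scale 1/epsilon is enough to mask that person's presence; the loop shows how the noise shrinks as epsilon grows. The dp_count helper and the toy data are my own, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the illustration is reproducible

def dp_count(values, target, epsilon, sensitivity=1.0):
    """Differentially private count via the Laplace mechanism.

    A counting query changes by at most `sensitivity` (here 1) when one record
    is added or removed, so Laplace noise with scale sensitivity/epsilon hides
    any single individual's presence.
    """
    true_count = sum(v == target for v in values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

diagnoses = ["AIDS", "Flu", "Cancer", "Diabetes", "Flu", "Asthma"]  # toy data; true count of "AIDS" is 1

for eps in (0.1, 1.0, 10.0):
    answers = [round(dp_count(diagnoses, "AIDS", eps), 2) for _ in range(5)]
    print(f"epsilon={eps:>4}: {answers}")
# Low epsilon  -> large noise: strong privacy, poor utility.
# High epsilon -> small noise: weak privacy, answers close to the true count of 1.
```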

I will demonstrate how DP mitigates this risk in future articles.

Final Thoughts

Lastly, a big THANK YOU to Prof. Daniel Aranki for reviewing this article and to the School of Information at UC Berkeley for publishing it.

Thank you for reading this post. If you like this article, please share it with your friends and colleagues. Also, if you have any questions or feedback, feel free to leave a comment below!
