Differential Privacy considered not practical
A criticism of Differential Privacy and the “Privacy-first” approach.
Differential Privacy is an elegant & beautiful mathematical theory. Much like Statistical Learning Theory (by Vapnik & Chervonenkis) and the Probably Approximately Correct learning framework (by Leslie Valiant), it provides fundamental insights about privacy and data mining.
The “Privacy-first vs Utility-first” distinction was introduced in a now six-year-old blog post by Arvind Narayanan. Sensing a “Privacy-first” paradigm shift, Arvind declared a consensus on Differential Privacy. However, in the six years since, it has failed to gain any traction:
- MSFT holds several patents that remain in force for at least another decade. Unless your hobbies include negotiating with them and forgoing part of your revenue, this alone should dissuade you.
- There are zero actively maintained code bases; Airavat, Gupt and PINQ are all dead.
- No deployed applications.
- No algorithms for answering queries on large & complex datasets in a reasonable time and with reasonable accuracy.
- Fundamental problems about allocation of budgets are completely unexplored.
- Papers such as “No free lunch in Data Privacy” have shown that the widely accepted practice of releasing deterministic statistics (e.g. the number of male & female patients affected by Zika in the USA) is incompatible with the guarantees offered by Differential Privacy.
- Microsoft Research, the birthplace of Differential Privacy, has abandoned it.
- Arvind himself has abandoned his own blog and shifted his research focus to cryptocurrencies. Honestly, Arvind was never a strong proponent of Differential Privacy. In a recent paper he recommended a combination of approaches, including de-identification in instances where the trade-offs are reasonable.
In my opinion Arvind conflates “Theory-first” with “Privacy-first”. Differential Privacy is not necessarily about putting “Privacy” first, but rather about putting algorithms with only theory behind them first.
“Theory-first” approaches are typically favored by Computer Scientists, since they allow for sophisticated mathematical analysis, over “Utility-first” engineered solutions. While theory surely has its place in academia, it would be dangerous to rely on it completely while ignoring the practical trade-offs. In a world where Snapchat is popular in spite of its analog hole, one must revisit the strong underlying assumptions about attackers in Differential Privacy. Do they truly match social norms?
Lessons from Machine Learning
Machine Learning offers an interesting parallel to Data Privacy. As the current resurgence of Deep Convolutional Neural Networks shows, elegant theories and practical engineered solutions often diverge. Solving problems involves dealing with noise and with the fundamental uncertainties that arise with each type of problem. More often than not, the tools provided by theories don’t encode this information efficiently.
The machine learning community spent multiple years trying out a “Theory-first” approach with SVMs and Kernel methods, only to later abandon it and adopt a more empirical approach. The empirical approach resulted in an interesting phenomenon: solutions that had been ignored for lack of a good theory to explain their performance (and, to an extent, because of technical challenges in their implementation) suddenly became popular.
Assuming that Differential Privacy is the only solution and abandoning all other approaches is equivalent to using only Machine Learning algorithms that have a small VC dimension and a convex error, while ignoring practical solutions such as Dropout regularization, backpropagation and non-convex error functions.
Rather than adopting a “theory-first” approach that starts with a mathematically sound theory of privacy and waits for the development of “good enough” algorithms for anonymized analysis, policymakers & researchers ought to adopt a “utility-first” approach that is grounded in availability, empirical accuracy measures and reasonable models of adversaries.
Budgeting ain’t easy
Even if one ignores all the other practical issues, such as the availability of external statistics, unbounded noise distributions and the lack of well-maintained code, one still runs into the problem of budgeting.
Almost all differential privacy papers include a disclaimer that roughly translates to the following:
Requires more investigation for recommended values of privacy budget, epsilon, delta and their impact on accuracy. Authors plan to investigate this in future.
However, any discussion of accuracy, and even of budgets, is highly dependent on context. In order to assess the accuracy/utility/privacy trade-off correctly, one must possess enough information about the queries expected and how much privacy budget should be allocated to them. Keeping track of queries and budget consumed is even more difficult. Certain tasks require answers to be computed as exactly as possible. The impact of answering such queries, and the resulting depletion of the privacy budget, has never been examined in the literature.
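To make the bookkeeping problem concrete, here is a minimal sketch of a budget accountant, assuming basic sequential composition (the epsilons of successive queries simply add up) and a Laplace mechanism for counting queries with sensitivity 1. The names `PrivacyBudget`, `BudgetExhausted` and `noisy_count` are hypothetical and do not come from any real library, and the noise sampler is not hardened against floating-point attacks.

```python
import math
import random


class BudgetExhausted(Exception):
    """Raised when a query would push spending past the total budget."""


class PrivacyBudget:
    """Hypothetical accountant: epsilons add up under sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so float rounding alone does not reject a query.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExhausted(
                f"need {epsilon}, only {self.remaining} left"
            )
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, budget, rng):
    """Answer a counting query (sensitivity 1) and charge the budget."""
    budget.charge(epsilon)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=1.0)
    print(noisy_count(100, 0.4, budget, rng))
    print(noisy_count(100, 0.4, budget, rng))
    print("remaining:", budget.remaining)
    # A third query at epsilon=0.4 would raise BudgetExhausted.
```

The code itself is trivial; the hard part is everything it takes as given. Someone still has to pick `total_epsilon` and the per-query epsilon. An exact answer corresponds to no noise at all, that is `charge(math.inf)`, which no finite budget can afford.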
Maybe you decide to hard-code rules to determine the budget for queries, since some attributes are more important than others. Soon you realize that you have ended up hard-coding preferences, which is the very task you tried to avoid when you decided against L-Diversity and K-Anonymity type approaches.
There have been a few attempts at rigorously evaluating privacy budgets and parameters such as epsilon. However, they have only shown Differential Privacy to be impractical.
Several other “gotchas” appear the moment you start thinking about your data. Maybe each row in your database is a “receipt with a list of items bought and other metadata”, but you need to protect the privacy of buyers (people), each of whom might have several receipts. Maybe the adversary is searching for Walter White buying wooden matches in bulk. Differential Privacy, in spite of its intentions, isn't going to protect him. It will protect one of his purchases but not all of them.
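The receipt problem can be quantified with the standard group privacy property: a mechanism that is epsilon-differentially private for a single row is only guaranteed to be (k × epsilon)-differentially private for a group of k rows. The sketch below, with a hypothetical helper `user_level_epsilon` and made-up sample data, computes the worst-case guarantee per buyer when rows are receipts.

```python
from collections import Counter


def user_level_epsilon(receipt_owners, record_epsilon):
    """Worst-case per-buyer epsilon when the mechanism protects single receipts.

    receipt_owners: one entry per receipt row, naming the buyer.
    record_epsilon: the epsilon guaranteed for any single row.
    """
    receipts_per_buyer = Counter(receipt_owners)
    # Group privacy: k rows belonging to one person degrade the
    # guarantee from epsilon to k * epsilon.
    max_receipts = max(receipts_per_buyer.values(), default=0)
    return max_receipts * record_epsilon


if __name__ == "__main__":
    # Illustrative data only: three receipts for one buyer.
    owners = ["walter", "walter", "walter", "skyler", "jesse", "jesse"]
    print(user_level_epsilon(owners, record_epsilon=0.1))
```

A frequent buyer with hundreds of receipts ends up with a bound so loose that it says very little, unless the analysis is redone at the level of people, which in turn requires much more noise.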
If you have come across a glowing article about Differential Privacy, or are enamored of the beautiful math behind it, here is the sad conclusion. There is no
from differential_privacy import Budget, LaPlaceMechanism, MWEM
to help you out. There are zero large scale systems currently deployed. No one is using Differential Privacy in the setting that is typically recommended by the news/magazine articles. Privacy budget allocation remains a complete mystery.
If you need to build systems that actually work and are useful in practice, you are better off using K-Anonymity, L-Diversity, T-Closeness, etc., and maybe wrapping them inside a query interface. As of 2016, Differential Privacy isn't a practical solution.
If you are looking for a scientific paper that says basically the same thing, the following is a good reference:
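For comparison, checking K-Anonymity takes only a few lines. The sketch below assumes the table is a list of dictionaries and that the quasi-identifier columns have already been generalized. The function names and sample rows are illustrative, and the check says nothing about L-Diversity or T-Closeness.

```python
from collections import Counter


def smallest_class_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    classes = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(classes.values(), default=0)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    return smallest_class_size(rows, quasi_identifiers) >= k


if __name__ == "__main__":
    # Made-up, already-generalized records.
    rows = [
        {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
        {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
        {"zip": "946**", "age": "40-49", "diagnosis": "flu"},
        {"zip": "946**", "age": "40-49", "diagnosis": "diabetes"},
    ]
    print(is_k_anonymous(rows, ["zip", "age"], k=2))
```

The choice of quasi-identifiers and of k is still a judgment call, but it is one that practitioners in a given field already know how to make and audit.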
On syntactic anonymity and differential privacy
C. Clifton, T. Tassa. 2013 IEEE 29th International Conference on Data Engineering Workshops, 2013.
P.S. You will encounter strong resistance, and in some cases outright bullying, from vocal supporters of Differential Privacy. These people will likely be completely unfamiliar with your area of application and its existing practices. They will barge into public meetings proclaiming that Differential Privacy is the only real solution and that researchers who have spent years working with data in that field should be ignored, even though several papers have called for further study of Differential Privacy before it is adopted in practice.
Sadly, you will have to fight them. In which case, “May the Force be with you.”
Reference for the image: https://www.usenix.org/sites/default/files/conference/protected-files/sec14_slides_fredrikson.pdf