A collaborative analytics solution that ensures input secrecy, output privacy and high accuracy
It is well known that data collection is the focus and the monetization basis for some of the largest companies in the world. Moreover, machine learning solutions are increasingly deployed, and they require massive amounts of training data. Such valuable data, however, often contain sensitive, personally identifiable information, e.g., health statistics, telemetry and behavioral data, biometric information, or financial transactions. They therefore require special processing and cannot easily be collected or shared under recent data protection legislation such as the GDPR. To complicate things further, much valuable data remains hidden in the distributed data silos of different institutes and companies that cannot, and do not want to, combine their sensitive data.
However, holistic insights can be gained by combining such distributed data when privacy can be ensured. Some real-world privacy-preserving use cases can be found at government institutes, e.g., detecting tax fraud by combining the data of different government institutes (that otherwise could not collaborate) or surveying wage gaps while assuring employers that they can safely participate in such studies. Large companies like Google and Mastercard have already linked online ads with offline purchases to compute aggregated ad conversions. However, these examples mostly support simple aggregate statistics (e.g., sum, mean), not more complex order statistics (e.g., max, median), and they protect only the inputs without additional privacy guarantees for the output.
To increase the potential use cases and to unlock and combine more data hidden in data silos, the following question needs to be answered:
How can one expand and improve collaborative analytics to support data-driven innovation while protecting fundamental privacy rights?
Let’s use order statistics
Existing solutions for collaborative analytics mainly focus on aggregate statistics, such as the sum of private values. Informally, such solutions work by securely adding all private values together, which provides some informal (and insufficient) privacy protection by making individual contributions harder to isolate (“hiding in the crowd”). Improved solutions, which satisfy a formal privacy guarantee, also add noise to the sum.
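To make the secure-sum idea concrete, here is a minimal sketch of two-party additive secret sharing, a standard building block for secure aggregation. All names and the modulus are illustrative assumptions for this sketch, not the protocol from any specific system:

```python
import secrets

# Illustrative sketch of two-party additive secret sharing (not a
# production protocol): each value is split into two random-looking
# shares that only reveal the value when recombined.
MOD = 2**32

def share(value):
    """Split a private value into two additive shares mod MOD."""
    r = secrets.randbelow(MOD)
    return r, (value - r) % MOD

# Each party holds a private value and sends one share to the other side.
alice_value, bob_value = 42, 58
a1, a2 = share(alice_value)   # Alice keeps a1, sends a2 to Bob
b1, b2 = share(bob_value)     # Bob keeps b1, sends b2 to Alice

# Each side adds the shares it holds; a single share is uniformly
# random, so neither side learns the other's input from it.
alice_partial = (a1 + b2) % MOD
bob_partial = (b1 + a2) % MOD

# Combining the partial sums reveals only the aggregate.
total = (alice_partial + bob_partial) % MOD
print(total)  # 100
```

A differentially private variant would additionally add calibrated noise to `total` before releasing it, as the improved solutions above do.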
But things get more complex for order statistics: one needs to know the position (order) of each element in the sorted data, and the result is one of the private input values rather than some aggregate. Order statistics are robust and versatile measures that include the minimum and maximum value, the interquartile range (IQR), and the median.
SAP Security Researcher Jonas Böhler and Prof. Florian Kerschbaum from the University of Waterloo investigated how to enable privacy-preserving order statistics over distributed data while protecting inputs as well as outputs. Their work considered general order statistics, but for the moment let’s focus on the median, which is the element in the middle of a sorted data set:
The median is an important robust statistical measure that represents a “typical” value in the data: insurance companies use the median life expectancy to adjust insurance premiums, and financial surveys report the median income because it is more robust to outliers than the mean. To illustrate why outliers make a drastic difference for the mean but not the median, consider the income in Medina, Washington, a small suburb near Seattle (close to Amazon’s and Microsoft’s headquarters) with a population of about 3,000. The median income is around $186,000. The mean income, however, exceeds $1,000,000,000. Why? Because of just two outliers living in Medina: Jeff Bezos and Bill Gates.
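The effect is easy to reproduce with toy numbers. The incomes below are invented for illustration (most values near the typical range, two extreme outliers):

```python
from statistics import mean, median

# Toy incomes: six typical values and two extreme outliers.
# The outliers dominate the mean but barely move the median.
incomes = [150_000, 170_000, 180_000, 186_000, 190_000, 200_000,
           10_000_000_000, 20_000_000_000]

print(median(incomes))  # 188000.0 — still a "typical" value
print(mean(incomes))    # 3750134500 — dominated by the two outliers
```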
Input Secrecy & Output Privacy
Existing research on computing a privacy-preserving median relies on an additional trusted third party that has access to the combined data in the clear. However, such a trusted third party is a single point of attack and might not resolve all regulatory issues.
An alternative is secure computation, which uses cryptography to protect the sensitive, secret inputs and reveals only the exact output of the computation, i.e., it provides “input secrecy”. Unfortunately, the output of some computations, such as order statistics, is one individual’s private input. Thus, learning the exact result already leaks private information and offers no privacy protection for the output.
There are solutions that provide “output privacy” by satisfying a strong privacy guarantee called “differential privacy”. However, these solutions also require a trusted third party or multiple randomizations (e.g., each party adds random noise), which reduces the accuracy of the computation output.
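One standard way to release a differentially private median is the exponential mechanism: instead of returning the exact median, a candidate is sampled with probability that decays exponentially in how far its rank is from the middle. The sketch below is a generic textbook-style illustration under assumed parameters, not the protocol from the paper discussed next:

```python
import math
import random

def dp_median_exponential(data, epsilon, domain):
    """Differentially private median via the exponential mechanism.
    The utility of a candidate is minus the distance of its rank from
    n/2; this utility has sensitivity 1 when one element is added or
    removed, so sampling proportional to exp(eps * u / 2) is eps-DP."""
    data = sorted(data)
    n = len(data)

    def utility(r):
        rank = sum(1 for x in data if x <= r)
        return -abs(rank - n / 2)

    # Sample each candidate r with probability ∝ exp(eps * u(r) / 2).
    weights = [math.exp(epsilon * utility(r) / 2) for r in domain]
    return random.choices(list(domain), weights=weights)[0]

data = [3, 5, 6, 7, 8, 9, 12]
# With a large epsilon the result concentrates near the true median;
# with a small epsilon it is noisier but leaks less about any input.
print(dp_median_exponential(data, epsilon=2.0, domain=range(1, 16)))
```

Note the accuracy/privacy trade-off mentioned above: the smaller the privacy parameter `epsilon`, the more likely the mechanism returns a value far from the true median.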
Can we protect Inputs and Outputs?
Böhler and Kerschbaum propose a hybrid solution that provides input secrecy as well as output privacy with high accuracy for two parties. Their research paper “Secure Sublinear Time Differentially Private Median Computation” was published and presented at the Network and Distributed System Security Symposium (NDSS), held February 2020 in San Diego.
The research was done in the course of the ongoing EU project MOSAICrOWN (“Multi-Owner data Sharing for Analytics and Integration respecting Confidentiality and OWNer control”), which “aims to enable data sharing and collaborative analytics in multi-owner scenarios in a privacy-preserving way”.
The idea is to combine different cryptographic techniques to efficiently compute a differentially private median. Their privacy-preserving protocol is optimized by computing as little as possible with cryptographic tools and by applying dynamic programming with a static, i.e., data-independent, access pattern, which reduces the complexity of the secure computation. A comprehensive evaluation on a large real-world payment data set with millions of records achieves a practical runtime of less than 500 milliseconds in a LAN and less than 7 seconds in a network with parties in Frankfurt and Ohio (100 milliseconds latency, 100 Mbit/s bandwidth).
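To illustrate what a “data-independent access pattern” means, here is a tiny generic building block of oblivious algorithms: a compare-and-swap that performs the same sequence of operations regardless of its inputs, so observing memory accesses or branches reveals nothing about the data. This is a general illustration of the concept, not the paper’s protocol:

```python
# A data-independent (oblivious) compare-and-swap: the same operations
# run for every input, with no data-dependent branch or memory access.
# Generic illustration of a static access pattern, not the paper's code.
def oblivious_min_max(a, b):
    swap = int(a > b)                # 1 if out of order, else 0
    lo = swap * b + (1 - swap) * a   # arithmetic select instead of if/else
    hi = swap * a + (1 - swap) * b
    return lo, hi

print(oblivious_min_max(7, 3))  # (3, 7)
print(oblivious_min_max(2, 9))  # (2, 9)
```

In a secure computation, every data-dependent branch must be evaluated under encryption; structuring the algorithm so its control flow and accesses are fixed in advance is what keeps the cryptographic cost low.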
Privacy-preserving collaborative analytics has the potential to enable new, holistic insights over distributed sensitive data, and SAP Security Research plans to expand such analytics in future work, e.g., to support more than two parties.
For more details on the secure computation of differentially private statistics, you can read the paper or watch the recorded NDSS presentation.
Discover how SAP Security Research serves as a security thought leader at SAP, continuously transforming SAP by improving security.
This article was first posted on SAP Community Blogs