Privacy-preserving Collaborative Machine Learning
Robin Geyer, Moin Nabi and Tassilo Klein
In 2006, a major US online service provider released a large number of its users' search logs for academic purposes. Even though users were not identified by name, only two days after the release the New York Times was able to trace an individual search log back to Thelma A., a then 62-year-old woman from Georgia. The incident led to the resignation of the company's chief technology officer (CTO).
Given this experience, data sets that are publicly released should be sanitized of personally traceable information. But what if there is no trusted party to take care of such sanitization and ensure privacy?
Learning patterns from private data distributed among multiple parties, without relying on an overly trusted curator, is a challenge that has gained a lot of attention in recent years. Examples range from hospital servers holding patient records to mobile phones and wearable devices holding personal data such as images, messages, or behavioral patterns. A variety of approaches from different communities try to tackle the resulting problem: learning as much as possible from a crowd without revealing information about any individual member or involving an overly trusted party. This post gives a short overview of three of these trends:
Secure Multi Party Computation
Suppose a group of colleagues wants to compute their average salary without revealing their individual ones. In this type of problem, secure Multi-Party Computation (MPC) comes in handy.
At the end of a secure MPC protocol, each colleague knows only as much about the others as can be inferred from his or her own salary and the computed average. In short, each participant learns whether colleagues earn more on average, but nothing further about any individual colleague.
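As an illustration, the salary example can be sketched with additive secret sharing, a common MPC building block. This is a minimal sketch, not a production protocol: the `share` and `secure_average` helpers are hypothetical names invented for this post, and real MPC systems add authenticated channels and protection against malicious parties.

```python
import random

PRIME = 2**61 - 1  # all arithmetic happens in this finite field

def share(value, n_parties):
    """Split a secret into n additive shares that sum to the value mod PRIME.
    Any subset of fewer than n shares looks uniformly random."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_average(salaries):
    """Each party splits its salary into shares, one per colleague.
    Each colleague sums the shares they hold locally; combining those
    local sums reveals only the total (and hence the average)."""
    n = len(salaries)
    all_shares = [share(s, n) for s in salaries]
    # party i holds the i-th share of every salary
    local_sums = [sum(held) % PRIME for held in zip(*all_shares)]
    total = sum(local_sums) % PRIME  # reconstruction reveals only the sum
    return total / n

print(secure_average([52_000, 61_000, 48_000, 75_000]))  # 59000.0
```

No single party ever sees another party's salary, only uniformly random shares of it.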
Under an assumption about the maximum number of malicious or colluding parties, secure MPC provides exceptional security. However, its use for more complex computations, such as training a neural network, is very limited: computation and communication costs explode with the complexity of the function to be computed (and with the number of parties). Training even simple machine learning models becomes extremely expensive in terms of communication.
Federated Learning
When it comes to training a model on private data, one of the critical factors is data centralization. On the one hand, the central collection point can be compromised; on the other hand, there might be no party trusted enough to centralize the data in the first place. Coming back to the colleagues calculating their average salary: whom should they all trust? Federated Learning tackles this bottleneck by shifting centralization from data space to parameter space: it enables learning from data without centralizing it, moving the optimization steps to the clients' devices and centralizing only the learned parameters instead of the data itself (which, in most cases, also reduces communication costs).
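A single round of this parameter-averaging scheme can be sketched as follows. This is a toy illustration, not a library API: `local_update` and `federated_round` are invented names, and the local step is a single gradient step of linear regression standing in for real on-device training.

```python
import numpy as np

def local_update(global_weights, client_data, lr=0.1):
    """One local gradient step of linear regression on a client's
    private data. Only the resulting parameters leave the device."""
    X, y = client_data
    grad = 2 * X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

def federated_round(global_weights, clients):
    """The server collects locally trained parameters and averages
    them, weighted by each client's data set size."""
    updates = [local_update(global_weights, data) for data in clients]
    sizes = [len(data[1]) for data in clients]
    return np.average(updates, axis=0, weights=sizes)
```

Repeated over many rounds, the averaged parameters approximate a model trained on the pooled data, even though the server never sees a single raw data point.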
The clients that receive the combined final model, however, can infer less about the original data than the central party, which received the individual models separately before combining them. This disparity in information makes federated learning non-secure from an MPC perspective.
Differential Privacy
Let's recap the average-salary example once more: releasing only the average seems privacy-preserving, right? Well, now imagine you checked this value before and after an employee joins or leaves the company. Together with the head count, you would have perfect information about that person's salary.
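In concrete (made-up) numbers, this differencing attack is just one subtraction:

```python
# Made-up salaries; the 75k earner leaves between the two releases.
salaries_before = [52_000, 61_000, 48_000, 75_000]
salaries_after = [52_000, 61_000, 48_000]

# The attacker only sees the released averages and the head counts.
avg_before = sum(salaries_before) / len(salaries_before)  # 59000.0
avg_after = sum(salaries_after) / len(salaries_after)

# Two exact averages plus two head counts pin down the missing salary.
leaked = round(4 * avg_before - 3 * avg_after)
print(leaked)  # 75000
```

Even though each release on its own looks harmless, their difference identifies one individual exactly.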
A differentially private method protects each member of the group against exactly this kind of information retrieval. A differentially private algorithm that estimates the average salary of all employees would be required to give roughly the same output whether or not a particular employee is in the data set.
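A minimal sketch of such an estimator uses the Laplace mechanism: clip each salary to a fixed range so that no single person can change the sum by more than a known bound, then add noise calibrated to that bound. The function name `dp_average` and the clipping bound are assumptions made for illustration.

```python
import numpy as np

def dp_average(salaries, epsilon, upper=200_000):
    """Epsilon-differentially private average of non-negative salaries.
    Clipping bounds each individual's influence on the sum by `upper`,
    so Laplace noise with scale upper/epsilon makes the sum private;
    dividing by the (public) head count yields a private average."""
    clipped = np.clip(salaries, 0, upper)
    noisy_sum = clipped.sum() + np.random.laplace(scale=upper / epsilon)
    return noisy_sum / len(salaries)
```

Smaller epsilon means noisier answers but stronger privacy: the differencing attack above now recovers only a noisy, deniable estimate rather than an exact salary.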
Neither secure MPC nor the federated learning setting guarantees differential privacy by itself: the dropout of an individual party or data point might influence the model so strongly that private information about that party becomes inferable.
Machine learning under security and privacy constraints has attracted a lot of attention in recent years, and this post has covered only a couple of the trends. Other important security- and privacy-related areas of research not discussed here include adversarial attacks and the fairness of algorithms, to name only a few.
At SAP Machine Learning Research, a team of researchers and students is tackling the broad range of challenges that machine learning under privacy constraints entails. We try to find secure and privacy-preserving ways of learning from large amounts of sensitive but potentially valuable data.
Robin Geyer, an M.Sc. student at ETH Zurich, will be focusing his attention on the challenges outlined in this post.