by Justin Sherman and Ryan Sherman
Contemporary machine learning algorithms typically operate at the expense of the privacy of user data. This is arguably the primary reason why tech firms like Facebook, Google, Amazon, Netflix, Alibaba, Polar, and others run mass data-collection programs that target consumers: more “training data” lets the companies bolster their algorithms, which translates into higher profits.
As our world becomes increasingly run by automated decision-making tools — specifically, artificially intelligent algorithms — we need to protect the privacy of citizens’ data; this is particularly true for already oppressed or disadvantaged communities for whom privacy violations yield disproportionate adverse impacts. Thankfully, novel research in what we term “privacy-centric AI” shows promise for automating our world while simultaneously protecting the confidentiality of individuals’ information.
What is “privacy-centric” AI?
Broadly speaking, we use this term to refer to AI algorithms that are prevented from viewing, or from unsafely storing or using, the personally identifiable information of specific individuals. The goal is to maintain the accuracy, precision, and efficiency of contemporary machine learning models while making privacy an integrated design feature — in short, yielding more of a win-win than exists today.
This phenomenon is well-summarized by Florian Tramèr, PhD student in Computer Science at Stanford University. “The amazing thing about privacy-preserving machine learning is that there need not be a fundamental tension between privacy and utility,” he told us. “In principle, privacy constraints (e.g., the learned model should not leak any of my individual data) can act as strong regularizers and thus benefit generalization.”
Research advances in this area, as Tramèr hinted, could positively impact a company’s profits by diminishing or eliminating fundamental tensions between privacy and utility. But there should also be benefits for society at large: improved privacy tools “will allow for innovation in fields that previously were too sensitive (or too tied up with red tape) to address,” says Andrew Trask, leader of OpenMined and PhD student at the University of Oxford. “[This] will undoubtedly lead to progress on challenging social problems.”
Those using machine learning in data-sensitive environments — often working on challenging social problems like medical diagnostics — could certainly benefit from advancements in privacy-centric artificial intelligence.
Older techniques that strove for privacy-centric AI yielded mixed results, such as attempts to “anonymize” training data generated from users. “When Netflix posted 10 million movie rankings by 50,000 unnamed customers, UT-Austin researchers ‘outed’ some movie watchers by mapping the data to publicly available IMDb data,” recounts Bob Sullivan, a veteran journalist and advisor to Ethical Tech. “Earlier, AOL released ‘anonymized’ search queries, only to have The New York Times cross-reference the data with phone book listings. There are better ways to scrub data, but there are better ways to unmask it, too.”
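The linkage attacks Sullivan describes can be illustrated with a toy sketch. The records, names, and fields below are entirely made up; the point is only that quasi-identifiers surviving in an “anonymized” release (here, ZIP code and birth year) can be joined against a public directory to recover identities.

```python
# Toy re-identification ("linkage") attack: an anonymized dataset keeps
# quasi-identifiers (ZIP code, birth year) that can be joined against a
# public directory to recover identities. All records here are fictional.

anonymized_ratings = [
    {"zip": "27708", "birth_year": 1990, "movie": "Vertigo", "rating": 5},
    {"zip": "10001", "birth_year": 1985, "movie": "Heat",    "rating": 4},
]

public_directory = [
    {"name": "Alice", "zip": "27708", "birth_year": 1990},
    {"name": "Bob",   "zip": "10001", "birth_year": 1985},
]

def reidentify(record, directory):
    """Return the names of directory entries matching the record's quasi-identifiers."""
    return [p["name"] for p in directory
            if p["zip"] == record["zip"] and p["birth_year"] == record["birth_year"]]

for r in anonymized_ratings:
    matches = reidentify(r, public_directory)
    if len(matches) == 1:  # a unique match "outs" the supposedly anonymous viewer
        print(f'{matches[0]} rated "{r["movie"]}" {r["rating"]}/5')
```

With realistic datasets the join runs over millions of rows, but the mechanics are the same: scrubbing names is not enough when quasi-identifiers remain.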
While anonymization may still be a shaky “solution” at best, there are thankfully several alternatives which hold promise for the construction of privacy-centric AI: integrating encryption with machine learning; implementing differential privacy with machine learning; and using trusted hardware to train AI algorithms.
Machine learning and encryption
Both machine learning and cryptography have received significant attention over the last decade, but there is comparatively little ongoing work at their intersection. Data is usually left unencrypted during the training of a machine learning model, which leaves (often sensitive) information, like medical histories or spending patterns, vulnerable. However, homomorphic encryption — where operations (e.g., addition, subtraction) can be performed on data without decrypting it — can enable machines to execute the intensive computations required to train a machine learning model while lowering the risk of data-confidentiality breaches.
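To make the idea concrete, here is a minimal, deliberately insecure sketch of the Paillier cryptosystem, a well-known additively homomorphic scheme: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, so a server can aggregate values it never sees in the clear. The tiny primes are for illustration only; real deployments use keys thousands of bits long.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic).
# WARNING: the primes below are far too small for any real use.

p, q = 9973, 9967                 # toy primes; real keys use ~1024-bit primes
n = p * q
n2 = n * n
g = n + 1                         # standard choice of generator
lam = math.lcm(p - 1, q - 1)      # Carmichael's lambda(n)
mu = pow(lam, -1, n)              # decryption constant, valid when g = n + 1

def encrypt(m):
    r = random.randrange(1, n)    # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n   # the "L function": L(x) = (x - 1) / n
    return L * mu % n

# Homomorphic property: Enc(a) * Enc(b) mod n^2 decrypts to a + b,
# so sums can be computed without ever decrypting the inputs.
a, b = 3, 4
total = decrypt(encrypt(a) * encrypt(b) % n2)
print(total)  # 7
```

Raising a ciphertext to a power likewise multiplies the plaintext by a constant, which is already enough for computing encrypted linear models and aggregate statistics; fully homomorphic schemes extend this to arbitrary computation at much greater cost.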
The future of the field is in the hands of machine learning researchers as well as cryptographers: ML researchers must develop faster and better algorithms while cryptographers build faster and safer encryption. As Andrew Trask told us, “the next big wave of AI-related research and subsequent entrepreneurship is a convergence with the field of cryptography, empowering owners of data and models to create value while better retaining privacy and ownership over their assets in the process.”
By integrating privacy into their machine learning applications, organizations can better protect their data from theft (e.g., resisting attacks against neural networks). In fact, Numerai, an open-sourced hedge fund, puts machine learning and encryption at the core of its work. Numerai releases encrypted stock datasets for developers to train their machine learning models, and developers are paid based on their models’ performance. In the words of its founder, Richard Craib:
Once you have a model in finance that works, you hide it. You hide the techniques you used to build it. You hide the methods you used to improve your data. And most importantly, you hide the data. The financial incentive for secrecy is strong.
Leveraging encryption for machine learning keeps individuals’ information private while also guarding valuable secrets from an organization’s competitors. The encrypted machine learning field may be starting to take off, which makes it an area ripe for research and innovation. Companies, agencies, and researchers should therefore take active steps to explore the intersection of encryption and machine learning.
Machine learning and differential privacy
“Many industry players, including Google and Apple,” says Florian Tramèr, “are experimenting (and starting to deploy) differentially private learning today.” Differential privacy is a breakthrough in computer science theory, particularly as applied to machine learning.
Differential privacy allows companies to collect user data while (a) minimizing an observer’s ability to determine whether any particular individual’s data is part of the larger set and (b) still preserving a useful level of accuracy.
In other words, we could have our data used in a machine learning algorithm — say, for instance, one that curates our social media news feed — but there would be some “noise” added to the data before it was stored. This would still enable the algorithm to identify broad trends over everyone’s data (e.g., that we love news confirming our beliefs), but it would protect an individual’s privacy in the process.
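The standard way to add such noise is the Laplace mechanism: perturb a query’s true answer with Laplace-distributed noise whose scale is the query’s sensitivity divided by the privacy budget ε. The sketch below (dataset and ε are illustrative) shows it for a counting query, whose sensitivity is 1:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via the inverse-CDF transform."""
    u = random.random() - 0.5
    while u <= -0.5:              # avoid log(0) on the boundary
        u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon):
    """Epsilon-differentially-private count. A counting query has
    sensitivity 1 (adding or removing one person changes the count
    by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative query: how many users clicked belief-confirming news?
clicks = [{"confirming": True}] * 60 + [{"confirming": False}] * 40
noisy = dp_count(clicks, lambda r: r["confirming"], epsilon=1.0)
print(round(noisy))  # close to the true count of 60, but randomized
```

Any single answer is noisy, yet averaged over a population the broad trend survives, which is exactly the trade-off described above: useful aggregate signal, protected individuals.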
It’s a subtle topic, and research in differential privacy for machine learning is still emerging. But in the larger scheme of privacy-centric AI, these techniques hold much promise for undermining the supposed privacy-versus-accuracy trade-off. We may not have to sacrifice user privacy, as we previously mentioned, to get equal or better ML performance.
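Within model training itself, the best-known approach is differentially private SGD: clip each example’s gradient to a fixed norm, so no one individual can dominate an update, then add noise before applying the averaged gradient. Here is a schematic, pure-Python sketch of one such step on a linear least-squares model; the clipping norm, noise scale, and data are illustrative, not a calibrated privacy guarantee.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm,
    bounding how much any single individual can influence the update."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_step(w, batch, lr=0.1, max_norm=1.0, noise_std=0.5):
    """One DP-SGD-style step for least squares on a linear model y ~ w.x:
    clip per-example gradients, sum them, add Gaussian noise, average."""
    summed = [0.0] * len(w)
    for x, y in batch:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        grad = [2 * (pred - y) * xi for xi in x]      # per-example gradient
        for i, g in enumerate(clip(grad, max_norm)):  # clip each example
            summed[i] += g
    noisy = [(s + random.gauss(0.0, noise_std * max_norm)) / len(batch)
             for s in summed]                         # noise the sum, then average
    return [wi - lr * g for wi, g in zip(w, noisy)]

batch = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)]
w = dp_sgd_step([0.0, 0.0], batch)
```

The clipping is what makes the noise meaningful: because each example contributes at most `max_norm` to the sum, a fixed amount of noise hides any individual’s presence, echoing Tramèr’s point that the privacy constraint doubles as a regularizer.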
Among those recognizing this fact is Apple, which proclaims that “differential privacy is used as the first step of a system for data analysis that includes robust privacy protections at every stage.” In December 2017, the company even released information on its differential privacy implementation in the iOS system. This is a direction that organizations large and small should head: protecting user privacy without having to completely compromise on the utility of their machine learning models.
Machine learning and trusted hardware
Over the last few decades, there has been a growing recognition of the need to secure devices at lower and lower levels: moving from users to firewalls, from firewalls to secure operating systems, and so on. Techniques aiming to “bootstrap” trust into machines have thus made their way down to the computer hardware itself. Researchers are working on ways to ensure machines are resistant to attacks that, for instance, turn a laptop’s electromagnetic field into a record of a user’s keystrokes.
Because machine learning models need to be trained with intensive computing power, often on cloud servers — and because cloud computing brings a plethora of security threats, many of them hardware-related — trusted hardware yields further promise for privacy-centric AI.
“Trusted hardware is likely to become the pragmatic approach to secure outsourced machine learning (and secure cloud computing more generally),” says Florian Tramèr. “While cryptography can be used to tackle this problem, a major breakthrough would be required to make this practical for modern workloads. In contrast, trusted hardware solutions scale gracefully to today’s AI computing needs.”
This is not to say that trusted hardware solutions are a panacea that encrypts all training data and prevents every compromise of individuals’ privacy; techniques in this area are still developing, as Tramèr emphasized in our conversation. But if we can place greater trust in hardware devices, specifically the cloud computers that increasingly train ML models, it’s another way in which we can construct effective, privacy-centric AI.
For machine learning, privacy and utility are often viewed in diametric opposition to one another. We now see, however, that this doesn’t have to be the case; ML algorithms can in fact protect the privacy of user data while maintaining or even improving upon current levels of efficiency, accuracy, and precision. And from the near-constant news headlines on data breaches, to political manipulation such as the Cambridge Analytica scandal, to instances of highly prejudiced algorithms exacerbating societal inequality, the need for data privacy has never been greater than it is today.
We repeat: the need for data privacy has never been greater than it is today.
As Bob Sullivan articulated, “companies working on sensitive data projects need to take far more precautions than they often do — precautions which might seem to be cost-prohibitive.” The techniques we just discussed are examples of such precautions. While there may be short-term financial costs to research and implement these processes, there are also short-term benefits for competitive differentiation and long-term benefits for society. Privacy is important, which means relevant safeguards must be implemented in our increasingly automated world.
Justin Sherman is a student at Duke University and the co-founder and Vice President of social venture Ethical Tech (@ethicaltechorg). Ryan Sherman is a high school senior and independent deep learning researcher working on machine learning’s applications in drug development and safe artificial intelligence.