Privacy-Preserving Machine Learning 2018: A Year in Review

With Ben DeCoste and Gavin Uhma.

2018 was a breakout year for privacy-preserving machine learning.

From public awareness of data breaches and privacy violations to breakthroughs in cryptography and deep learning, we now see the necessary conditions for investment in privacy-preserving machine learning.

It has become clear that people care about data privacy. A recent study from the Future of Humanity Institute suggests that the American public prioritizes “preventing AI-assisted surveillance from violating privacy and civil liberties” highest among critical AI governance issues. Data privacy impacts our politics, security, businesses, relationships, health, and finances.

In this post, we highlight the news, research, code, communities, organizations, and economics that made 2018 the breakout year for privacy-preserving machine learning.

News

This year’s news cycle had several major stories surface around data privacy, making 2018 the most relevant year for privacy since the Snowden leaks in 2013.

Google Trends for “data privacy”, 2013 — 2019

The General Data Protection Regulation (GDPR) came into effect in the European Union in May, marking the first extensive rewrite of privacy law for a major world power. China followed suit by passing its own updated data protection law, although it is currently established as a national standard that’s not yet legally binding. Privacy law is quickly evolving in an attempt to keep up with technological innovation.

Meanwhile, global consumers are starting to see why such regulation is necessary to protect society at large. It’s difficult to rank the corporate data leak scandals of 2018.

A report from the Guardian in March detailed a massive perception manipulation campaign from Cambridge Analytica, made possible with Facebook user data that was collected and mined in violation of Facebook’s Terms of Service. The authors questioned the degree to which the 2016 U.S. elections were affected by this campaign. Special counsel Robert Mueller’s indictment of Russian troll farms like the Internet Research Agency fueled further concerns.

Meanwhile, more details around the method and scope of the Equifax hack from late 2017 surfaced. Facebook experienced its largest data breach in history, affecting nearly 50 million users. A vulnerability in Google+ exposed the personal data of 500,000 users. Marriott International experienced what may have been the largest known data breach in history, affecting as many as 500 million users with 25 million unique passport numbers stolen. Uber finalized a $148 million settlement from the cover-up of its 2016 data breach, which affected 57 million drivers and riders. These all seem to have been understood as concrete threats to individual citizens and consumers, further focusing the public eye on data privacy practice and regulation.

As a consequence of this renewed data privacy interest, the media has been raising more questions about what corporations do with consumer data that hasn’t been breached. Consumers are discovering more about the practices of a pervasive and thriving data trade. GlaxoSmithKline announced a partnership with 23andMe, while it was revealed that Google has data on 70% of all US credit card transactions through a secret agreement with Mastercard and other financial partners.

Meanwhile, a front-page article from the New York Times demonstrated how a person’s location data can be pieced together to unveil invasive, detailed personal habits and character traits. The same article investigated how users’ location data is being sold on the market; one weather app had sold its users’ raw location history to over 30 different third-party services. Similarly, two leading privacy advisors to the Sidewalk Toronto smart city project by Alphabet’s Sidewalk Labs quit over surveillance concerns.

Others have noticed the heightened concerns around data privacy and responded accordingly. Both Apple and Microsoft have declared privacy to be a fundamental human right. Apple has included this statement in their iOS and OS X on-boarding, while Microsoft has pledged not to profit from users’ personal data. Both companies have published work from their internal privacy research and engineering teams.

How does machine learning impact data privacy?

Machine learning powers many of the products we use today, like social feeds, voice assistants, maps, ads, and auto-complete.

Because machine learning is data hungry, it can have a negative impact on data privacy. Accessing data for machine learning increases the surface area for attack. Data scientists are given access to data that may have been previously siloed. Neural networks can memorize information from their training data, and that information can be extracted through statistical inference attacks and GANs. Anonymized data can be de-anonymized through what are known as re-identification attacks.

The increasing utility of data from machine learning has a negative effect on privacy too. Data that seemed innocuous to a human suddenly produces insights and predictions that only a machine could infer. The mosaic effect states “disparate pieces of information — although individually of limited utility — become significant when combined with other types of information”. What happens when an insurance company pieces together your credit card transactions, location history, genetic profile, and browser history?

These properties of machine learning seem quite negative from the perspective of data privacy, but they are also prerequisites for deep learning to be able to positively transform our society. In healthcare, for example, this technology can save lives, but we shouldn’t have to risk our own sensitive data.

Given so many examples of severe data breaches, privacy-preserving machine learning seems more urgent than ever. What progress was made in the technology of privacy-preserving machine learning in 2018?

Research

This year saw a flurry of work across private ML, including significant advances in secure multi-party computation, homomorphic encryption, differential privacy, federated learning, and secure enclaves. Many researchers have begun mixing techniques from each of these traditionally separate fields to achieve stronger security models, faster runtime, or improved generalization performance.

Perhaps the most popular topic of the year was differential privacy. Early in the year, the Google security and privacy team released a follow-up to the Private Aggregation via Teacher Ensembles (PATE) framework. The follow-up improved on several limitations of the original paper by presenting new aggregation mechanisms that help manage the differential privacy budget more efficiently during student training. Later in the year, lead PATE author Nicolas Papernot published his Marauder’s Map, a set of best practices for research into privacy and security of machine learning.
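To make the idea of noisy teacher aggregation concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name, the single shared noise scale sigma, and the optional confidence threshold are simplifying assumptions made for illustration. Each teacher votes for a label, noise is added to the vote counts, and the student only ever sees the noisy winner (or no answer at all when the teachers do not agree strongly enough).

```python
import random

def aggregate_votes(votes, num_classes, sigma, rng, threshold=None):
    """Return a noisy majority label, or None if the ensemble declines to answer.

    votes: list of integer class labels, one per teacher (hypothetical input).
    sigma: standard deviation of the Gaussian noise (assumed, same for both steps).
    threshold: optional minimum noisy top count required to answer at all.
    """
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1

    if threshold is not None:
        # Confidence check: skip queries where teachers disagree,
        # so that privacy budget is not spent on them.
        if max(counts) + rng.gauss(0.0, sigma) < threshold:
            return None

    # Noisy max: perturb every count, then release only the winning label.
    noisy = [c + rng.gauss(0.0, sigma) for c in counts]
    return max(range(num_classes), key=noisy.__getitem__)

rng = random.Random(0)
teacher_votes = [2] * 90 + [0] * 10   # 100 teachers, strong consensus on class 2
print(aggregate_votes(teacher_votes, num_classes=3, sigma=1.0, rng=rng))
```

When the teachers agree strongly, the noise rarely changes the answer, which is why high-consensus queries are cheap in terms of privacy budget.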

A majority of the invited and contributed talks from the Privacy Preserving Machine Learning workshop at this year’s NeurIPS conference focused on differential privacy. Much of this work has investigated the theoretical guarantees around different definitions and security models of differential privacy. Check out the workshop homepage for its informal proceedings and more.

In the world of secure computation, much work has been done in combining previously distinct protocols. Gazelle, Faster CryptoNets, TAPAS, and Slalom all improved numbers for running secure inference with homomorphic encryption — Gazelle and Slalom combined HE with Garbled Circuits and Secure Enclaves respectively, while Faster CryptoNets used work from machine learning on sparsity and quantization to encourage faster execution for the encrypted computations.
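As a rough illustration of what computing on encrypted data means, the sketch below evaluates a dot product (the core of a linear layer) over encrypted inputs. It deliberately swaps in a toy Paillier scheme, which is only additively homomorphic, rather than the lattice-based schemes these papers use. The tiny primes and all function names are assumptions chosen for readability, and nothing here is secure. It needs Python 3.8+ for the modular inverse via pow.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1 (insecure key sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # modular inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1               # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def encrypted_dot(n, enc_x, weights):
    """Compute Enc(sum(w_i * x_i)) from Enc(x_i) and plaintext non-negative w_i."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, w in zip(enc_x, weights):
        # Multiplying ciphertexts adds plaintexts; exponentiation scales them.
        acc = (acc * pow(c, w, n2)) % n2
    return acc

n, priv = keygen(61, 53)                      # textbook-sized primes, illustration only
enc_x = [encrypt(n, v) for v in [3, 1, 4]]    # the client encrypts its input
enc_y = encrypted_dot(n, enc_x, [2, 7, 1])    # the server never sees x
print(decrypt(n, priv, enc_y))                # 17
```

Non-linear layers such as ReLU cannot be evaluated this way, which is one reason to pair homomorphic encryption with a second technique.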

The fastest software-based methods for secure computation so far are still based on secure multi-party computation via garbled circuits and secret sharing. ABY3 combined secret sharing with garbled circuits and optimized the transition between these protocols, significantly improving on previous work. SecureNN further improved state-of-the-art numbers for several benchmarks by computing comparison-based operations like ReLU and MaxPooling via bit extraction and secret sharing with the help of a third-party crypto producer.
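For readers new to secret sharing, the following Python sketch shows additive sharing between two parties, with multiplication assisted by a helper that hands out precomputed random triples, in the spirit of the third-party crypto producer mentioned above. It is a generic textbook construction and not the ABY3 or SecureNN protocol. The modulus Q and every function name are assumptions made for illustration.

```python
import secrets

Q = 2**61 - 1  # public modulus (hypothetical choice)

def share(x, n=2):
    """Split x into n random-looking shares that sum to x mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def add(xs, ys):
    # Addition is local: each party adds its own two shares.
    return [(x + y) % Q for x, y in zip(xs, ys)]

def make_triple(n=2):
    """Helper (crypto producer) samples a, b and shares a, b, and c = a*b."""
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(a, n), share(b, n), share(a * b % Q, n)

def mul(xs, ys, triple):
    a_sh, b_sh, c_sh = triple
    # Parties open the masked values d = x - a and e = y - b.
    # These reveal nothing about x and y because a and b are uniformly random.
    d = reconstruct([(x - a) % Q for x, a in zip(xs, a_sh)])
    e = reconstruct([(y - b) % Q for y, b in zip(ys, b_sh)])
    # x*y = c + d*b + e*a + d*e; the public d*e term is added by one party only.
    zs = [(c + d * b + e * a) % Q for a, b, c in zip(a_sh, b_sh, c_sh)]
    zs[0] = (zs[0] + d * e) % Q
    return zs

x_sh, y_sh = share(6), share(7)
print(reconstruct(add(x_sh, y_sh)))                  # 13
print(reconstruct(mul(x_sh, y_sh, make_triple())))   # 42
```

Additions cost no communication and each multiplication costs one round of opening masked values, which helps explain why these protocols are fast in practice. Comparisons such as ReLU need extra machinery.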

Finally, the fields of federated learning and secure enclaves have been converging this year. In particular, researchers seem to be working to enable data marketplaces for decentralizing the training of ML models. Notable work on secure enclaves includes Slalom, Chiron, and Ekiden. In the case of Ekiden, this work is tied in with the blockchain research community aiming to decentralize machine learning and AI, so notions of distributed computation are generally addressed, while some even go so far as to present tokenization protocols and consider the incentives of all parties.

Code

A very special thanks to the researchers who managed to open source code repositories associated with this research. This isn’t always easy to do, and we love open-source!

First, Google has open-sourced several privacy-related projects, most notably within the TensorFlow GitHub organization. The Google Brain Security and Privacy team contributed a repository on differential privacy for using PATE with TensorFlow, building on work in the main repository from earlier in the year. Cleverhans, the adversarial security library, also made its way into the TensorFlow organization.

TensorFlow has also experienced plenty of ecosystem activity from the private ML community. Earlier in the year, coMind published a series of tutorials on federated learning in TensorFlow. Morten Dahl’s tf-encrypted project picked up steam with major contributions from Dropout Labs, with the goal of building a research platform for secure computation on top of the TensorFlow engine. It’s currently focused on multi-party computation, but plans for other protocols of secure computation are on the roadmap. Intel also released a codebase called he-transformer which takes advantage of their nGraph compiler to run TensorFlow scripts with a homomorphic encryption backend using Microsoft’s SEAL.

Outside of TensorFlow, the PySyft framework from OpenMined has been building a platform for private machine learning based on PyTorch. Other projects introduced this year for secure computation include cuFHE (fully homomorphic encryption on CUDA), Slalom (secure enclaves that outsource linear layers to an untrusted GPU using homomorphic encryption), and Keystone (an open source secure enclave).

Community

This year’s Privacy-Preserving Machine Learning workshop at NeurIPS saw growth in attendance, submissions, and sponsorship. The workshop this year was sponsored by The Alan Turing Institute, Google, Microsoft, and Amazon. This represents a more general upward trend in interest and attention from the major tech companies. Major industry labs working on privacy-preserving technologies now include Visa Research, Vector Institute, Google Brain, DeepMind, Microsoft Research, Intel AI, and Element AI. Similarly, academic groups at Stanford, MIT, UC Berkeley and Penn State have pushed the state of the art in private and secure machine learning research.

Startups have also formed around VC interest at the intersection of data privacy and machine learning. A Canadian VC, Georgian Partners, raised a US$550-million fund and developed their own platform for differential privacy to support their AI investments. Inpher and Leap Year Technologies both raised Series A funding in the $10M+ range, while Oasis Labs gained a stunning $45M Series A. A number of other ventures also formed this year, including our own Dropout Labs!

Meanwhile, the OpenMined community grew to over 3.5k members on Slack, with the PySyft project drawing over 2.5k stars on GitHub. Up-and-coming developers are excited about the technology and community, and are on the lookout for tools they can use to get started with private machine learning.

Looking Forward

2018 was a busy year for privacy-preserving machine learning, and there is no way we captured everything. Please let us know about any notable articles or projects and we’ll update our list!

As more organizations recognize consumers’ data privacy concerns, they’ll look for ways to make their current operations more private and secure. Privacy-preserving machine learning will be fundamental to that process. We hope you’ll join us for another impactful year for privacy in 2019.


About Dropout Labs

We’re a team of machine learning engineers, software engineers, and cryptographers spread across the United States, France, and Canada. We’re working on secure computation to enable training, validation, and prediction over encrypted data. We see a near future where individuals and organizations will maintain control over their data, while still benefiting from cloud-based machine intelligence.

Follow Dropout Labs on Twitter and tf-encrypted on GitHub.

If you’re passionate about data privacy and AI, we’d love to hear from you.