Privacy-Preserving Machine Learning 2019: A Year in Review

Jason Mancuso
Cape Privacy (formerly Dropout Labs)
Jan 10, 2020 · 11 min read

Last year’s post discussed what made 2018 a breakout year for privacy-preserving machine learning (PPML). That momentum carried over into 2019, with research and enterprise interest in PPML showing no signs of slowing down.

In this post, I’ll highlight some of the top news, research, code, and community events that impacted PPML in 2019. We’ll cover advancements in differential privacy, federated learning, and secure computation, along with an overview of new projects and investments from academia, startups, and large tech companies like Google, Facebook, and Microsoft.

If 2018 was the year the AI community at large became aware of itself, 2019 was the year we decided to do something about what we found. A variety of new workshops at the major ML conferences focused on beneficial applications of AI, including AI for the developing world, AI for social good, and AI for tackling climate change. Attendance at the Fairness, Accountability, and Transparency (FAT*) conference doubled and is expected to swell again in 2020. Attention toward the safety, privacy, security, fairness, and robustness of machine learning has expanded significantly. Though the community has worked hard to expose and overcome its own biases, prejudices, and misconduct, there’s still plenty of work to be done.

Why is privacy so important for machine learning? If the field of machine learning is set to revolutionize industries in the ways many expect, it will need massive amounts of data. Two of the highest barriers to such a revolution would be the high costs of both accessing and operationalizing that data. Many in our PPML community have maintained that improved privacy and security infrastructure for machine learning will be necessary to overcome these barriers. This is especially relevant for sensitive datasets that could, for example, accelerate the discovery of life-saving treatments, or help diagnose and correct prejudiced behaviour in existing systems. By building the infrastructure to enable secure and privacy-preserving access to data, the PPML community can create a beneficial and equitable future for machine learning in society.

News

The year started off with a bang: Shoshana Zuboff released “The Age of Surveillance Capitalism”, the culmination of her nearly decade-long effort to describe the emergence of a massive new market for operationalizing and commodifying consumer behavioural data. While many may find the exposition dense and the claims radical, it has been lauded by critics as an impressive achievement of modern socio-technological research. Future research on the sociology, politics, and economics of modern data science practice will likely either respond to this work or risk obscurity. Meanwhile, the numerous pieces of investigative journalism from The New York Times’ Privacy Project have provided further context for Zuboff’s work, demonstrating just how commonly our private information is traded in industry.

On the legal side of things, big tech felt the pressure of GDPR, which went into effect in 2018, with both Google and Facebook facing hefty fines and lawsuits. Further privacy regulations have been discussed and proposed elsewhere in the world, with the California Consumer Privacy Act (CCPA) being a major example that has already gone into effect as of January 2020. As with the ongoing development of GDPR best practices, companies and legal experts are finding the translation between regulatory text and practical implementation steps to be difficult. How regulators will interpret and enforce the law remains to be seen.

But perhaps the noisiest news in the world of private ML came from Facebook and its controversial new commitment to privacy, with scattered signs of this shift appearing throughout the year.

There had been so much of this activity throughout the year that, by the time “The Great Hack” was released on Netflix, the company’s privacy focus had already been thoroughly discussed within the privacy community. This development will have important implications for the field, so it’s definitely worth watching.

Research

Figure: papers referencing privacy + ML jumped in 2019.

As anticipated, the machine learning research community’s interest in privacy has continued to evolve. Last year’s research emphasized combining cryptographic methods in PPML to improve efficiency, and this continued into the new year. New research can be broadly classified into two categories: (1) improved methods for PPML, and (2) improved applications of PPML.

During this year’s Privacy in ML workshop [1], I asked a question about an apparent gap between the theory and practice of PPML, and the consensus among the experts was that the gap is huge (see 1:19:20 in this video). The concern is that the research community is working on toy versions of known important problems while mostly overlooking the problems most critical to driving adoption of PPML tech in industry.

It seems clear that the breadth of research problems will expand dramatically as various labs and companies work to engineer privacy into existing ML workflows. While there’s still much work to be done, there’s hope that this continued collaboration with industry partners will allow for joint development of pragmatic solutions. Meanwhile, much work has gone into the well-known problems, and there’s been significant progress on long-standing issues.

Differential Privacy

I’m particularly keen on new definitions of differential privacy (DP) and new primitives for DP algorithm design. Shuffling is one such technique, in which users’ locally randomized reports are anonymously shuffled before being centralized. It has become an interesting new model for “distributed DP”, requiring less trust in the central curator than standard DP while offering better utility than local DP. Such improvements are particularly important in machine learning after recent impossibility results showing that pure local DP can never be sufficient for learning from non-trivial datasets.
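
As a rough illustration of the shuffle model (a minimal sketch using randomized response as the local randomizer, not the protocol from any particular paper): each user perturbs their own bit locally, a shuffler strips identifiers and permutes the reports, and only then does the analyst aggregate and debias.

```python
import random

def local_randomize(bit, p_flip=0.25):
    """Each user applies randomized response to their own bit before it
    ever leaves their device (the local randomizer)."""
    return bit if random.random() > p_flip else 1 - bit

def shuffle(reports):
    """The shuffler strips identifiers and uniformly permutes the reports,
    so the analyst cannot link any report back to a user."""
    shuffled = list(reports)
    random.shuffle(shuffled)
    return shuffled

def estimate_mean(shuffled_reports, p_flip=0.25):
    """The analyst debiases the randomized responses to estimate the true
    fraction of 1s in the population."""
    observed = sum(shuffled_reports) / len(shuffled_reports)
    return (observed - p_flip) / (1 - 2 * p_flip)

# Toy population: 10,000 users, 30% of whom hold a sensitive "1".
true_bits = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
reports = [local_randomize(b) for b in true_bits]
print(estimate_mean(shuffle(reports)))  # close to 0.3
```

The privacy amplification results in this line of work quantify how much the shuffling step strengthens each user’s local guarantee.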

Gaussian differential privacy is another interesting development. We often describe differential privacy informally as adding enough noise to an algorithm that an attacker can’t reliably tell apart the outputs produced on neighbouring databases. This language suggests formalizing the DP criterion as a hypothesis test from the attacker’s perspective, and prior work has demonstrated this is not only possible but useful. Gaussian differential privacy preserves this interpretation while adding benefits over classic DP related to mechanism composition. Furthermore, initial work extending this to training deep neural networks shows that the Gaussian DP definition leads to improved utility.
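
For a sense of why the composition property matters in practice, here is a small sketch of the bookkeeping Gaussian DP makes easy, following the parameters in Dong, Roth, and Su’s formulation (the numbers are illustrative only):

```python
from math import erfc, exp, sqrt

def std_normal_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * erfc(-x / sqrt(2))

def gdp_of_gaussian_mechanism(l2_sensitivity, sigma):
    """Adding N(0, sigma^2) noise to a statistic with the given L2
    sensitivity yields a mu-GDP mechanism with mu = sensitivity / sigma."""
    return l2_sensitivity / sigma

def compose(mus):
    """Composition stays closed-form: mu_1-, ..., mu_k-GDP mechanisms
    compose to sqrt(mu_1^2 + ... + mu_k^2)-GDP."""
    return sqrt(sum(mu ** 2 for mu in mus))

def to_approx_dp(mu, epsilon):
    """Convert a mu-GDP guarantee into the delta achieved at a chosen
    epsilon under classic (epsilon, delta)-DP."""
    return (std_normal_cdf(-epsilon / mu + mu / 2)
            - exp(epsilon) * std_normal_cdf(-epsilon / mu - mu / 2))

# Release the same sensitivity-1 statistic 100 times with sigma = 10 noise:
mu_total = compose([gdp_of_gaussian_mechanism(1.0, 10.0)] * 100)
print(mu_total)                     # 1.0
print(to_approx_dp(mu_total, 3.0))  # corresponding delta at epsilon = 3
```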

Finally, I’m excited about work applying differential privacy to novel data science and machine learning scenarios. One example is this ICLR acceptance from Carnegie Mellon University extending differentially private deep learning to the meta-learning frameworks of MAML and others. The extension shows significant improvements over the standard DP-SGD and DP-FedAvg algorithms while also expanding the techniques to more realistic learning scenarios.

Another interesting application showed up at this year’s ML with Guarantees workshop: analyzing private datasets of unstructured text. The idea is to fine-tune a language model like GPT-2 with a differentially private optimizer, then seed the model with a prompt to generate semantically meaningful text from the same distribution as the private dataset. The sampled text is guaranteed to satisfy the differential privacy constraint while hopefully still capturing higher-level patterns in the private data. While the results are preliminary, works like this cast new light on the kinds of overlooked problems that are critically important for adoption of PPML in industry.
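
Under the hood, a differentially private optimizer of this kind typically means DP-SGD: clip each example’s gradient to bound any individual’s influence, then add Gaussian noise calibrated to the clipping bound. A minimal NumPy sketch of a single update step (the gradients here are placeholders standing in for the language model’s backward pass, not the authors’ code):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each per-example gradient to clip_norm,
    sum them, add Gaussian noise scaled to the clipping bound, average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + np.random.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(per_example_grads)

# Toy usage with random "gradients" in place of a real backward pass.
params = np.zeros(8)
fake_grads = [np.random.randn(8) for _ in range(32)]
params = dp_sgd_step(params, fake_grads)
```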

Federated Learning

This year’s federated learning (FL) workshop at NeurIPS [2] was not strictly focused on PPML research, as improved privacy can be a byproduct of federated learning just as often as it can be a goal itself. Nevertheless, nearly all of the federated learning work can be repurposed for privacy-focused applications, and it’s one of the fastest-growing areas of research to keep an eye on for new PPML work. I live-tweeted the workshop, and the talks were also recorded (see [2]). Here are some of my highlights.

This year I’ve been excited by work that generalizes PPML methods to the meta-learning setting, and the workshop included several novel works in this direction for federated settings. Meta-learning is useful for FL because federated data is often non-i.i.d. (not independent and identically distributed), which throws a wrench into the assumptions of most modern machine learning algorithms. The best paper award went to research applying domain adaptation to this problem with impressive results, and another paper showed that MAML and related first-order meta-learning algorithms are also competitive.
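
For context on the baseline these papers extend, federated averaging (FedAvg) has each client run a little local training and then averages the resulting weights in proportion to how much data each client holds. A bare-bones NumPy sketch under a toy linear-regression setup (the local update is a stand-in for whatever model the clients actually train):

```python
import numpy as np

def local_update(global_weights, client_data, lr=0.01, epochs=1):
    """Placeholder for local training: each client starts from the global
    model and runs a few epochs of SGD on its own (possibly non-i.i.d.) data."""
    weights = global_weights.copy()
    for _ in range(epochs):
        for x, y in client_data:
            grad = 2 * x * (np.dot(weights, x) - y)  # squared-error gradient
            weights -= lr * grad
    return weights

def federated_averaging(global_weights, clients):
    """One FedAvg round: average the clients' locally trained weights,
    weighted by the number of examples each client contributed."""
    total = sum(len(data) for data in clients)
    new_weights = np.zeros_like(global_weights)
    for data in clients:
        new_weights += (len(data) / total) * local_update(global_weights, data)
    return new_weights

# Toy run: 3 clients with different amounts of synthetic regression data.
rng = np.random.default_rng(0)
clients = [[(rng.normal(size=4), rng.normal()) for _ in range(n)] for n in (10, 50, 200)]
w = np.zeros(4)
for _ in range(5):
    w = federated_averaging(w, clients)
```

The meta-learning view essentially treats the averaged model as an initialization that each client then personalizes with a few more local steps.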

Google continues to be a leader in the federated learning space since introducing the term in 2016, and several of the invited talks focused on how they do federated learning in production. In my opinion, Google’s system for FL is by far the most successful publicly-known deployment to date. Their talks usually centre around the research they’ve done to improve the efficiency, privacy, and capabilities of their system to better enable working with federated data. Some of my favourite work focuses on how to reproduce existing aspects of the data science workflow in these new contexts, and they gave a great example with their work on training differentially private generative models on federated datasets. The key idea is the same as in the private text paper mentioned above; however, they use GANs to handle settings where self-supervision with text data isn’t possible.

Secure Computation

Quite a bit has happened on the border of secure computation and machine learning. As a reminder, the main lines of work in this area are based on secure multi-party computation with secret sharing or garbled circuits, homomorphic encryption, and secure enclaves. My prediction from last year was that researchers would be working to combine these techniques in an attempt to make them more practical, which turned out to be true. CrypTFlow is perhaps the most notable example of this — the authors use secure enclaves in combination with a secret-sharing scheme to scale up secure neural network inference to ResNet-sized models on ImageNet-scale data, yielding an impressive improvement in the state-of-the-art.
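
To make the secret-sharing building block concrete, here is the textbook additive scheme these protocols build on (a minimal sketch over a prime field, not CrypTFlow’s actual protocol): a value is split into random shares that sum to it, each party holds only one share, and additions happen share-wise with no communication.

```python
import random

PRIME = 2**61 - 1  # all arithmetic happens modulo a prime

def share(secret, n_parties=3):
    """Split a secret into n additive shares: each share alone is uniformly
    random, but together they sum to the secret mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

def add_shares(a_shares, b_shares):
    """Each party adds its own shares locally; no party ever sees the
    underlying values, yet the result reconstructs to the true sum."""
    return [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]

x_shares, y_shares = share(123), share(456)
assert reconstruct(add_shares(x_shares, y_shares)) == 579
```

Multiplication is where the real cost shows up, since it requires interaction between the parties (for example via Beaver triples), which is why so much of this research focuses on cutting communication.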

There was also interesting work on the classic problem of training linear models. Helen and a concise note from Jon Bloom showed alternative ways of training linear models at approximately plaintext speed, and CodedPrivateML showed a way to further close the gap with a custom protocol trading off the strength of the privacy guarantee for improved training convergence.

While these foundational problems of the field are always interesting and worth advancing, I’m most excited by improvements that have come from integrating machine learning techniques in a privacy- and protocol-aware manner. This work from Aarhus University & Data61 shows that the now-standard practice of quantizing neural network weights for inference before secret-sharing them can yield strong improvements in encrypted inference speed without sacrificing accuracy. Another paper, from a group at the Hebrew University of Jerusalem, showed that by training with MPC in mind, one can modify neural net architectures to improve communication and round complexity by more than 50%, again without sacrificing accuracy. While this line of work is in its infancy, I’m expecting exciting advancements further closing the efficiency gap between plaintext and encrypted ML over the next year.
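
The quantization trick is easy to sketch: scale the float weights down to small integers, embed them in the ring the MPC protocol computes over, and secret-share those integers instead of fixed-point encodings of the full-precision values. The example below is a generic illustration of that recipe, not the construction from either paper.

```python
import numpy as np

RING = 2**32  # integer ring commonly used by secret-sharing protocols

def quantize(weights, scale=127):
    """Map float weights in roughly [-1, 1] to int8-style integers; smaller
    integers mean cheaper arithmetic inside the MPC protocol."""
    return np.clip(np.round(weights * scale), -128, 127).astype(np.int64)

def share(int_weights, n_parties=2):
    """Additively secret-share the quantized weights over the ring."""
    shares = [np.random.randint(0, RING, size=int_weights.shape, dtype=np.int64)
              for _ in range(n_parties - 1)]
    shares.append((int_weights - sum(shares)) % RING)
    return shares

def reconstruct(shares, scale=127):
    total = sum(shares) % RING
    total = np.where(total >= RING // 2, total - RING, total)  # back to signed
    return total.astype(np.float64) / scale

w = np.array([0.5, -0.25, 0.99])
assert np.allclose(reconstruct(share(quantize(w))), w, atol=1 / 127)
```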

Code

The privacy teams at Google released two new libraries this year, one for doing differentially private data analysis at scale and one for private join and compute. Releases like these, tested at scale, are few and far between, so even though they’re not specific to machine learning they’re still worth mentioning.

The biggest PPML code release of the year was likely TensorFlow Federated. According to their blog, the code was heavily influenced by Google’s experiences with federated learning in production, and the team hopes it will become the industry-standard platform for experimenting with new federated learning research. After diving into the API, it looks quite promising! There were a number of lesser-known FL code releases this year as well, including the Substra Foundation’s containerized and traceable FL framework and the FATE project from WeBank.

Meanwhile, the OpenMined contributors made a big push to demonstrate federated learning in PySyft this year, with much of this material finding its way into the Secure & Private AI course. Several other OpenMined projects around federated learning and encrypted ML have spun up throughout the year as well, and the community has grown considerably.
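
For a flavour of the basic PySyft workflow (a rough sketch against the 0.2-era API used in the course, which may have changed since):

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)                  # extends torch tensors with .send()/.get()
alice = sy.VirtualWorker(hook, id="alice")  # simulated remote data owner

data = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
data_ptr = data.send(alice)      # the data now lives on alice's worker
result = (data_ptr * 2).get()    # computed remotely, then retrieved
print(result)
```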

In addition, there were quite a few new releases related to secure computation. CrypTen was perhaps the biggest surprise for many — a new library for encrypted machine learning in PyTorch from Facebook AI. We were happy to see a number of similarities to our work on TF Encrypted, with some interesting improvements around non-linearity approximation.
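
A quick taste of the API, sketched from the initial CrypTen release (exact calls may have evolved since):

```python
import torch
import crypten

crypten.init()  # set up the (simulated) parties and communication backend

x = crypten.cryptensor(torch.tensor([1.0, 2.0, 3.0]))  # secret-shared tensor
y = crypten.cryptensor(torch.tensor([4.0, 5.0, 6.0]))

z = x * y + x                  # arithmetic runs on the shares
print(z.get_plain_text())      # the result is only revealed here
```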

Speaking of TF Encrypted, the core team has been hard at work expanding our initial release from last year. We’ve introduced TF Trusted (running TensorFlow graphs in secure enclaves), TF SEAL (a TensorFlow bridge to Microsoft’s SEAL library), and TF Big (a big-integer implementation in TensorFlow). We also added a Keras interface to TF Encrypted, which was featured in Udacity’s Secure & Private AI course. Looking ahead at the needs of next-generation PPML frameworks, we’re excited to see what the year will bring!

Community

Our community has grown quite a bit in a year, often driven by big tech’s interest in PPML. Early in the year, Google Cloud and Intel partnered to sponsor the Google Confidential Cloud Computing Challenge for private and secure cloud computing. There were a number of awesome submissions, along with our winning one based on TF Trusted. We were also excited to see the Alibaba Gemini Lab take first place in the ML training track of the iDASH competition with their submission built on TF Encrypted!

The OpenMined community partnered with Udacity to develop the Secure & Private AI course, and also worked with RAAIS to provide several open source development grants. Facebook played a part as well, by providing scholarships to students for the Secure & Private AI course and funding several additional open-source grants in OpenMined.

A group of academic, government, and industry partners working on homomorphic encryption convened in August to continue development toward a set of standards for future work. While it’s still early days for such standards, the evolution of the field depends on collaboration between such diverse sets of stakeholders.

Finally, Owkin has launched a major effort to deploy federated learning between hospitals, through both the Substra Foundation and a partnership with NVIDIA and King’s College London. Since there are few examples of federated learning in the wild, the community will be closely following developments like this over the next year.

Looking Forward

Although we couldn’t include all of our favourite works from the year, please let us know in the comments if we’ve made any glaring omissions!

This year was all about continuing to advance the groundbreaking work in PPML from 2018, as well as beginning to use PPML to solve real-world use cases. We’ve discovered a gulf between the research and practice of PPML, unearthing a variety of new problems as we aim for industry adoption. Ultimately, I’d encourage further collaboration between PPML researchers and the variety of organizations encountering security and privacy problems on their machine learning journeys. Whether you’re one of the former or the latter, we hope you’ll consider dropping us a line!

About Dropout Labs

We’re a team of machine learning engineers, software engineers, and cryptographers spread across the United States, France, Germany, and Canada. We’re building a platform for managing data privacy within machine learning pipelines, enabling compliance and security teams to enforce policy while reducing friction for data scientists accessing the data they need. By applying secure, encrypted, and privacy-preserving machine learning techniques, our platform unlocks more valuable data and enables the creation of better models.

Visit our website, blog, or product page for more information, or follow us on Twitter for up-to-date announcements.

If you’re passionate about data privacy and AI, we’d love to hear from you.
