Machine Unlearning: Fighting for the Right to Be Forgotten

Synced
Feb 5 · 5 min read

Data protection and privacy have been discussed nonstop as more and more people come to realize just how much personal information they are sharing through the countless apps and websites they regularly visit. It’s no longer so surprising to see products you’ve talked about with friends or concerts you’ve searched on Google promptly appear as advertisements in your social media feeds. And that has many people concerned.

Recent government initiatives such as the EU’s General Data Protection Regulation (GDPR) are designed to protect individuals’ data privacy, with a core concept being “the right to be forgotten.”

The bad news is, it’s generally difficult to revoke things that have already been shared online or to properly delete such data. Facebook for example recently launched an “Off-Facebook Activity” tool — previously called “Clear History” — which the company says enables users to delete data that third-party apps and websites have shared with Facebook. But as the MIT Technology Review notes, “it’s a bit misleading — Facebook isn’t deleting any data from third-parties, it’s just de-linking it from its own data on you.”

Machine learning (ML) is increasingly viewed as exacerbating this privacy problem. Data is the fuel that drives ML applications, and this can include collecting and analyzing information such as personal emails or even medical records. Once fed into an ML model, such data can be retained forever, putting users at risk of all sorts of privacy breaches.

From a researcher’s perspective, the concern is that removing even a single data point from an ML training set may require retraining downstream models from scratch.

In a new paper, researchers from the University of Toronto, Vector Institute, and University of Wisconsin-Madison propose SISA training, a new framework that helps models “unlearn” information by reducing the number of updates that need to be computed when data points are removed.

“The unprecedented scale at which ML is being applied on personal data motivates us to examine how this right to be forgotten can be efficiently implemented for ML systems,” the researchers explain in the paper Machine Unlearning.

Having a model forget certain knowledge requires that some particular training points be made to have zero contribution to the model. But data points are often interdependent and can hardly be removed independently. Existing data also continuously works with newly added data to refine models.

One solution is to understand how individual training points contribute to model parameter updates. But as previous studies have shown, this approach is only practical when the learning algorithm queries data in an order that’s been decided prior to the start of learning. So if a dataset is queried adaptively — meaning a given query depends on any queries made in the past — this approach becomes exponentially more challenging and thus can hardly scale to complex models such as deep neural networks.

The researchers therefore propose a framework called Sharded, Isolated, Sliced, and Aggregated (SISA) training, which they say can be implemented with minimal modification to existing pipelines.

During SISA training, the training data is first divided into multiple shards so that each training point is included in only a small number of shards, ideally a single one. Models are then trained in isolation on each of these shards, which limits the influence of any one data point to the models trained on the shard(s) containing that point. At inference time, the predictions of the per-shard models are aggregated, for example by majority vote. Finally, when a request to unlearn a training point is made, only the affected models need to be retrained. Because each shard is smaller than the entire training set, this also shortens the retraining time needed to achieve unlearning.
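The sharding and isolation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `train` function is a toy 1-D nearest-centroid classifier standing in for any real learner, the round-robin shard assignment and the majority-vote aggregation are choices made for the example, and the names `ShardedEnsemble` and `unlearn` are hypothetical.

```python
from collections import Counter, defaultdict


def train(points):
    """Toy stand-in for model training (assumption): per-label centroids.

    points is a list of (feature, label) pairs; returns {label: centroid}.
    """
    sums, counts = defaultdict(float), Counter()
    for x, y in points:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict_one(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


class ShardedEnsemble:
    """Hypothetical sketch: one isolated model per disjoint shard."""

    def __init__(self, data, num_shards):
        # Each point lands in exactly one shard (round-robin is an assumption).
        self.shards = [list(data[i::num_shards]) for i in range(num_shards)]
        # Each model sees only its own shard.
        self.models = [train(shard) for shard in self.shards]

    def unlearn(self, point):
        """Remove point and retrain only the shard models that contained it."""
        retrained = []
        for i, shard in enumerate(self.shards):
            if point in shard:
                shard.remove(point)
                self.models[i] = train(shard)
                retrained.append(i)
        return retrained

    def predict(self, x):
        """Aggregate the per-shard predictions by majority vote."""
        votes = Counter(predict_one(m, x) for m in self.models if m)
        return votes.most_common(1)[0][0]
```

Calling `unlearn` returns the indices of the shards that were retrained; every other model object is left untouched, which is where the savings over retraining on the full dataset come from.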

Each shard can also be further divided into slices, which are presented incrementally during training. The researchers save the state of the model parameters before introducing each new slice, which allows them to start retraining from the last saved state that does not include the point to be unlearned. Slicing therefore further reduces the time required for the model to unlearn data.
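The checkpointing idea behind slicing can be sketched as follows. This is a simplified illustration, not the paper's procedure: the "model state" is a toy dictionary of running per-label sums and counts, `update` is a hypothetical incremental training step, and each slice is introduced exactly once. The class `SlicedShard` and its `slices_trained` counter are names invented for the example.

```python
import copy


def update(state, points):
    """Toy incremental training step (assumption).

    state maps label -> [feature_sum, count]; a new state is returned so
    that previously saved checkpoints are never mutated.
    """
    state = copy.deepcopy(state)
    for x, y in points:
        entry = state.setdefault(y, [0.0, 0])
        entry[0] += x
        entry[1] += 1
    return state


class SlicedShard:
    """Hypothetical sketch: one shard trained slice by slice with checkpoints."""

    def __init__(self, slices):
        self.slices = [list(s) for s in slices]
        self.checkpoints = []    # checkpoints[k] = state saved before slice k
        self.slices_trained = 0  # counts training work, for illustration only
        self._train_from(0, {})

    def _train_from(self, k, state):
        # Discard checkpoints that may depend on data from slice k onward.
        del self.checkpoints[k:]
        for sl in self.slices[k:]:
            self.checkpoints.append(state)  # save state before this slice
            state = update(state, sl)
            self.slices_trained += 1
        self.state = state

    def unlearn(self, point):
        """Remove point, then resume from the checkpoint saved before its slice."""
        for k, sl in enumerate(self.slices):
            if point in sl:
                sl.remove(point)
                self._train_from(k, self.checkpoints[k])
                return k
        return None
```

If the point to be unlearned sits in a late slice, only the final few slices are replayed; a point in the first slice still forces the whole shard to be retrained.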


The researchers evaluated SISA on two datasets from different application domains. Results show that by sharding alone the framework speeds up the retraining process by 3.13 times on the Purchase dataset and 1.66 times on the Street View House Numbers dataset. Additional speed-up can be achieved on both sets with further slicing, according to the paper.

By demonstrating SISA’s ability to speed up model unlearning and to generalize in different scenarios, the researchers hope to provide solutions for practical data governance in ML and to help relieve growing personal data concerns.

The paper Machine Unlearning is on arXiv.

Journalist: Yuan Yuan | Editor: Michael Sarazen
