Paper Summary: An Empirical Study of Example Forgetting During Deep Neural Network Learning (ICLR 2019)

Anthony Chen
4 min read · Apr 22, 2019


Authors: Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon

The problem of catastrophic forgetting, i.e., forgetting how to perform a previously learned task after learning a new one, has been fairly well studied. For a review of this phenomenon in NLP, see this. But what about the learning dynamics when training on a single task?

This paper explores the learning dynamics that take place when learning a single task and shows that forgetting can happen even in this setting. Even after a model has correctly learned to classify a training example, it can undergo a “forgetting” event, in which, over the course of training, the model comes to misclassify that example again. The paper also shows that there exist “unforgettable” examples which, once learned, are never forgotten.

Definitions

A forgetting event is said to have occurred for a data point x_i if, after t steps of SGD, x_i was classified correctly, but after an additional step of SGD, at time t+1, x_i is misclassified.

A learning event occurs if a data point was previously classified incorrectly but after a step of SGD, is now classified correctly.

A training example is “unforgettable” if, after it has undergone a learning event, it never undergoes a forgetting event for the rest of training.
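
To make these definitions concrete, here is a minimal NumPy sketch (my own, not the authors’ code) that counts forgetting events and flags unforgettable examples given a binary correctness history for each training example; the name and shape of acc_history are assumptions for illustration.

```python
import numpy as np

def forgetting_stats(acc_history):
    """acc_history: array of shape (num_checks, num_examples); entry [t, i]
    is 1 if example i was classified correctly at check t, else 0."""
    acc = np.asarray(acc_history)
    # A forgetting event is a correct -> incorrect (1 -> 0) transition.
    forgetting = (acc[:-1] == 1) & (acc[1:] == 0)
    num_forgetting = forgetting.sum(axis=0)
    # An example is "unforgettable" if it is learned at some point
    # (classified correctly at least once) and never forgotten afterwards.
    learned = acc.any(axis=0)
    unforgettable = learned & (num_forgetting == 0)
    return num_forgetting, unforgettable
```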

Experiments

The authors analyze forgetting events on three datasets: MNIST, permutedMNIST, and CIFAR-10.

Ideally, the authors would check after every step of SGD whether the classification of each training example has changed; however, this is computationally prohibitive. Instead, they check whether a training example’s prediction has changed only when the example appears in the mini-batch, just before the SGD update. The resulting count is therefore a lower bound on the true number of forgetting events.
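
Here is a rough sketch of how such tracking could be woven into a PyTorch training loop. All names (model, loader, criterion, etc.) are illustrative assumptions rather than the authors’ implementation, and the data loader is assumed to also yield each example’s index in the training set.

```python
import torch

def train_with_forgetting_stats(model, loader, optimizer, criterion,
                                num_train, num_epochs):
    """Minimal sketch (not the authors' code) of tracking forgetting events.
    `loader` is assumed to yield (images, labels, idx), where idx gives each
    example's position in the training set."""
    prev_correct = torch.zeros(num_train, dtype=torch.bool)
    forgetting_counts = torch.zeros(num_train, dtype=torch.long)

    for _ in range(num_epochs):
        for images, labels, idx in loader:
            logits = model(images)
            correct = logits.argmax(dim=1) == labels
            # Forgetting event: correct the last time this example was
            # sampled, misclassified now (checked before this SGD update).
            forgetting_counts[idx] += (prev_correct[idx] & ~correct).long()
            prev_correct[idx] = correct

            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return forgetting_counts
```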

Results

For each dataset, the training examples are sorted by the number of forgetting events they have undergone.

Unforgettable Examples

To me, the most interesting finding is the large fraction of data points that are unforgettable. For MNIST, permutedMNIST, and CIFAR-10, 91.7%, 75.3%, and 31.3% of training examples, respectively, are unforgettable. These percentages are computed over the intersection of the unforgettable sets across five different random seeds.
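
For reference, the seed intersection could be computed along these lines; unforgettable_runs is a hypothetical list holding one boolean mask per seed (e.g., produced by the forgetting_stats sketch above).

```python
import numpy as np

def shared_unforgettable_fraction(unforgettable_runs):
    """unforgettable_runs: list of boolean masks of shape (num_examples,),
    one per random seed (hypothetical variable name)."""
    mask = np.logical_and.reduce(unforgettable_runs)  # unforgettable in every run
    return mask.mean()                                # fraction of the training set
```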

Visual Inspection

The authors show that unforgettable images are generally clear and unobstructed, whereas images that have undergone a forgetting event tend to exhibit some form of noise (e.g., occlusion, multiple objects).

Continual Learning Setup

The previously mentioned results raise an interesting question: does catastrophic forgetting happen when training on a single task? To answer this, the authors sample 10K examples from CIFAR-10’s training set and randomly split them into two 5K partitions. They then train on the partitions in an alternating fashion and observe the training accuracies on both partitions. As we can see in the following image, some amount of forgetting occurs despite the data being drawn from the same distribution.

Training accuracies with a random partition. The background color indicates which partition is currently being trained.
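
A sketch of this alternating-partition setup in PyTorch might look as follows; the phase count, batch size, and function name are illustrative assumptions, not values from the paper.

```python
import random
from torch.utils.data import Subset, DataLoader

def alternating_partition_training(model, cifar_train, optimizer, criterion,
                                   num_phases=20, batch_size=128):
    """Sketch of the alternating-partition experiment (not the authors' code)."""
    indices = random.sample(range(len(cifar_train)), 10_000)
    partitions = [indices[:5_000], indices[5_000:]]
    loaders = [DataLoader(Subset(cifar_train, p), batch_size=batch_size,
                          shuffle=True) for p in partitions]

    for phase in range(num_phases):
        active = loaders[phase % 2]            # train the two partitions in turn
        for images, labels in active:
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # After each phase, one would evaluate training accuracy on both
        # partitions to measure how much the idle partition was forgotten.
```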

What are the forgetting dynamics in this continual learning setup for examples that do and do not undergo a forgetting event? The authors split the 10K examples into two partitions: one containing examples that, based on a previous training run, never undergo a forgetting event, and one containing examples that undergo at least one forgetting event. They find that for the never-forgotten partition, catastrophic forgetting occurs at a far lower magnitude.

Training accuracies with a partition based on whether each example underwent a forgetting event in a previous training run.
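
The partitioning itself is straightforward once forgetting statistics from an earlier run are available; a hypothetical helper might look like this.

```python
def split_by_forgetting(indices, forgetting_counts):
    """Split sampled example indices using forgetting counts recorded in a
    previous training run (hypothetical helper, not the authors' code)."""
    never_forgotten = [i for i in indices if forgetting_counts[i] == 0]
    forgotten = [i for i in indices if forgetting_counts[i] > 0]
    return never_forgotten, forgotten
```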

Removing Unforgettable Examples

The paper then examines the effect of removing unforgettable examples from the training set.

Effect of removing a percentage of CIFAR-10’s training set when training ResNet18. The green line shows removing unforgettable examples first; the blue line shows removing examples at random. The vertical line marks the point at which all unforgettable examples have been removed. Note that the y-axis is the test accuracy.

They find that removing unforgettable examples has a minimal effect on generalization. Once all unforgettable examples have been removed and examples with at least one forgetting event start being removed, test accuracy drops much more rapidly.
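
As a sketch, the removal experiment amounts to ranking the training set by forgetting counts (unforgettable examples come first, since they have zero events) and dropping a prefix; the helper below uses assumed names and is not the paper’s code.

```python
import numpy as np
from torch.utils.data import Subset

def remove_low_forgetting(dataset, forgetting_counts, fraction_to_remove):
    """Drop a fraction of the training set, removing the examples with the
    fewest forgetting events first (unforgettable examples have zero)."""
    order = np.argsort(forgetting_counts)            # ascending forgetting counts
    num_remove = int(fraction_to_remove * len(dataset))
    keep = order[num_remove:]
    return Subset(dataset, keep.tolist())
```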

Conclusion

There were a number of experiments that I didn’t cover. Overall, the empirical results are very thorough and open up a number of research directions. I wonder whether forgetting dynamics could be used to analyze something generative like language modeling. Could forgetting statistics also be used on the fly to reduce training time by ignoring unforgettable examples?
