End to End Incremental Learning

AI Club @IIITB
Aug 20, 2018

This article discusses the ECCV 2018 paper “End-to-End Incremental Learning”.

Following is the outline of the article:

  1. Incremental Learning.
  2. Catastrophic forgetting.
  3. Representative Memory.
  4. Model Architecture.
  5. Loss function.
  6. Training.
  7. Results.

Incremental Learning:

Let’s say there are 100 products in a supermarket, and the owner asks you to build a model that classifies these products so the cash counter can be automated.

You build a deep-learning model that classifies the products with 98% accuracy, using all your GPU resources. (Let’s call this model A.)

The owner of the supermarket is happy with the result, and you are finally happy because your model is in production.

Well the problem starts after a MONTH!!

After a month, the owner bought 10 new products and came to you to update the model.

After one more month, the owner bought 20 more products and came to you again to update the model.

This process of adding new classes (here, products) to the existing dataset and updating the model is called “Incremental Learning”.

Catastrophic forgetting:

So how do you update the model after adding new products every month?

One way is, each time new products are added, to replace the softmax layer with one that has the new number of classes and re-train the model only on the new dataset (in our case, the 10 new products), without the old dataset containing the 100 products.

The problem with this method is that the model now gives good accuracy on the newly added 10 products, but its accuracy on the old 100 products decreases. So basically the model is forgetting the information it learned previously.

This forgetting of the original products (or classes) is called “Catastrophic forgetting”.

The other way of updating the model is to combine the old dataset (100 products) and the new one (10 products) and build a model from scratch on the resulting 110-product dataset. This looks fine, but each time new products are added you need to rebuild the dataset and train the model again from scratch, which is computationally expensive.

So the paper proposes an approach to train deep neural networks incrementally, using the new data and only a small subset of the old dataset.

It is based on a loss composed of a distillation measure, to retain the knowledge acquired from the old classes, and a cross-entropy loss, to learn the new classes.

So, what is this distillation measure to retain the knowledge acquired from the old classes?

The paper “Distilling the Knowledge in a Neural Network” by Geoffrey Hinton et al. discusses how you can transfer knowledge from a large, cumbersome model to a small model. It introduces a “distillation loss” to retain the knowledge acquired by the cumbersome model.

You can check out this blog post, which discusses that paper in detail.

We recommend reading it if you haven’t read the “Distilling the Knowledge in a Neural Network” paper before.

Representative Memory:

To build the incremental model, the authors create a new training dataset that consists of the new data corresponding to the new classes to be added to the model (the 10 products here), plus only a small exemplar set of samples corresponding to the old classes (the 100 products here). This subset of the old dataset is called the representative memory.

Wait a second, so is the representative memory always just a subset of the old (100 products) dataset?

The answer is “NO”.

Let’s say your model now covers 110 products and you receive data for 20 new products. To update the model again, you create a new training dataset that consists of the full dataset of the 20 new products plus a subset of the 110-product dataset. In this case, the subset of the 110-product dataset is the representative memory.

So the representative memory keeps changing as you update the model incrementally.

Two approaches are used to create Representative memory.

Static Memory:

In this approach, every class stores floor(memory size / number of classes) images. Whenever new classes are added, this per-class quota is recomputed with the same formula: some images of the previous classes are removed and images of the new classes are added.

E.g., suppose you initially have 5 classes and the memory can hold 50 images, so we keep floor(50 / 5) = 10 images per class. Now say we add 2 more classes; after the incremental training step we have 7 classes in total, so floor(50 / 7) = 7 images per class. For each of the old 5 classes we remove 3 images, and for each of the 2 new classes we add 7 images.
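As a quick sketch of this bookkeeping (the helper name samples_per_class is ours, not from the paper), the per-class quota is just an integer division of the memory budget:

```python
def samples_per_class(memory_budget: int, num_classes: int) -> int:
    """Static memory: each class keeps floor(memory_budget / num_classes) images."""
    return memory_budget // num_classes


# Reproducing the example above: a 50-image budget with 5 classes...
print(samples_per_class(50, 5))  # 10 images per class

# ...and after adding 2 new classes (7 in total): each old class drops
# 3 images (10 -> 7) and each new class contributes 7 images.
print(samples_per_class(50, 7))  # 7 images per class
```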

Dynamic Memory:

In this approach, we store a constant number of images per class. Thus, the size of the memory grows with the number of classes.

How should we select particular set of images that represent a class?

To fill the representative memory we want, for each class, the n images that are most representative of that class. To find them, we take the mean of all the images of the class, compute the distance between each image and this mean, sort the images of the class by this distance, and pick the first n images.

How should we remove samples from representative memory?

Since the samples of each class are already stored in a sorted list, we just remove the required number of images from the end of each class’s sample set.
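Here is a minimal sketch of the selection and removal steps, assuming each image is represented by a feature vector (e.g. the output of the feature extractor); the function names and the use of NumPy are ours, not the paper’s:

```python
import numpy as np


def select_exemplars(features: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n samples closest to the class mean,
    sorted so that the most representative sample comes first."""
    class_mean = features.mean(axis=0)
    distances = np.linalg.norm(features - class_mean, axis=1)
    return np.argsort(distances)[:n]


def shrink_exemplars(sorted_indices: np.ndarray, n_keep: int) -> np.ndarray:
    """Exemplars are stored sorted by representativeness, so shrinking the
    memory just drops samples from the end of the list."""
    return sorted_indices[:n_keep]


# Toy usage: 500 samples of one class with 256-d features.
feats = np.random.randn(500, 256).astype(np.float32)
exemplars = select_exemplars(feats, n=10)          # fill the memory for this class
exemplars = shrink_exemplars(exemplars, n_keep=7)  # after new classes arrive
```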

Let’s now discuss how to update the model incrementally.

Model Architecture:

The architecture figure in the paper gives an intuition of how you incrementally update the model.

The model contains two types of blocks:

  1. CLi blocks: these correspond to the existing model built on the old classes (grey-coloured classification layers).
  2. CLN block: this corresponds to the newly added classes (green-coloured classification layer).

You need to add a new CLN block each time you want to update the model with new classes.

The outputs of the grey classification layers are the logits for the old classes, while the green classification layer produces the logits for the new classes.

The feature extractor (say, AlexNet) consists of all the layers except the last fully connected layer.

For a given input image, the feature extractor produces a set of features, which are used by the classification layers (the last fully connected layers) to generate a set of logits.
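Below is a minimal PyTorch sketch of this architecture: a shared feature extractor plus one classification layer per increment, whose logits are concatenated. The class and method names (IncrementalNet, add_classes) are ours, and the toy backbone only stands in for something like AlexNet with its last fully connected layer removed.

```python
import torch
import torch.nn as nn


class IncrementalNet(nn.Module):
    """A shared feature extractor followed by one classification head per increment."""

    def __init__(self, feature_extractor: nn.Module, feature_dim: int):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.feature_dim = feature_dim
        self.heads = nn.ModuleList()  # CL1 ... CLN (grey + green blocks)

    def add_classes(self, num_new_classes: int):
        # Each increment appends a new classification layer (the green CLN block).
        self.heads.append(nn.Linear(self.feature_dim, num_new_classes))

    def forward(self, x):
        feats = self.feature_extractor(x)
        # Concatenate the logits of the old (grey) and new (green) heads.
        return torch.cat([head(feats) for head in self.heads], dim=1)


# Toy usage with a stand-in backbone.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
model = IncrementalNet(backbone, feature_dim=256)
model.add_classes(100)  # the initial 100 products
model.add_classes(10)   # first incremental update
logits = model(torch.randn(4, 3, 32, 32))  # shape: (4, 110)
```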

Loss function:

Cross-distillation loss = cross-entropy loss + distillation loss:

L(w) = Lc(w) + Σ_{f ∈ F} Ldf(w)

Lc(w) -> the cross-entropy loss, computed over both the old and the new classes.

Ldf(w) -> the distillation loss of old classification layer f; F is the set of old classification layers.

So the cross-distillation loss includes a distillation loss for all the classification layers of the old classes and a classification loss over the old and new classes.

As mentioned earlier, the idea behind applying a distillation loss to the old classes is that it helps the model retain the knowledge it acquired about those classes.
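A minimal PyTorch sketch of this combined loss, assuming a single old classification head covering the first num_old_classes logits; the distillation term here is the usual soft cross-entropy between temperature-softened outputs of the frozen old model and the updated model, and the function name is ours:

```python
import torch
import torch.nn.functional as F


def cross_distillation_loss(new_logits, old_logits, targets, num_old_classes, T=2.0):
    """Cross-entropy over all classes + distillation on the old-class logits."""
    # Classification term: standard cross-entropy over old + new classes.
    ce = F.cross_entropy(new_logits, targets)

    # Distillation term: match the temperature-softened old-class outputs of the
    # updated model to the soft targets produced by the frozen old model.
    log_p_old = F.log_softmax(new_logits[:, :num_old_classes] / T, dim=1)
    q_old = F.softmax(old_logits / T, dim=1)
    distill = -(q_old * log_p_old).sum(dim=1).mean()

    return ce + distill


# Toy usage: batch of 4 images, 100 old classes + 10 new classes.
new_logits = torch.randn(4, 110)   # from the updated model
old_logits = torch.randn(4, 100)   # from the frozen old model
targets = torch.randint(0, 110, (4,))
loss = cross_distillation_loss(new_logits, old_logits, targets, num_old_classes=100)
```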

Training:

Suppose we have a model that is already trained on 100 classes and a representative memory for those classes, and 10 new classes (with all their samples) need to be added. One incremental update then consists of four steps:

  1. Construct the training set: it contains all the samples of the 10 new classes plus the exemplars of the 100 old classes stored in the representative memory, so it covers images from all 110 classes.
  2. Training: data augmentation is applied to this training set, and the augmented set is used to train the modified network, which is the old network with the new classes added in the last layer.
  3. Balanced fine-tuning: the samples available for the old classes (only the exemplars) can be significantly fewer than those of the new classes, which makes the training set imbalanced. To deal with this, the authors add a fine-tuning step on a training subset that contains the same number of samples per class, regardless of whether they belong to the old or the new classes.
  4. Update the representative memory, depending on whether we are using static or dynamic memory.
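The whole step can be summarised with a small, purely illustrative sketch: the dict-of-lists data representation and the train_fn / fine_tune_fn callbacks are ours (a real implementation would train the network above with the cross-distillation loss), and the memory update shown uses the static policy, assuming each exemplar list is already sorted by representativeness.

```python
import random


def incremental_step(new_data, memory, train_fn, fine_tune_fn, memory_budget):
    """One incremental update: {class id: list of samples} in, updated memory out."""
    # 1. Training set = all new-class samples + old-class exemplars.
    train_set = {**memory, **new_data}

    # 2. Train the whole network on the (augmented) combined set
    #    with the cross-distillation loss.
    train_fn(train_set)

    # 3. Balanced fine-tuning: the same number of samples per class.
    n = min(len(samples) for samples in train_set.values())
    fine_tune_fn({cls: random.sample(samples, n) for cls, samples in train_set.items()})

    # 4. Update the representative memory (static policy: floor(budget / classes)
    #    samples per class, keeping the most representative ones first).
    per_class = memory_budget // len(train_set)
    return {cls: samples[:per_class] for cls, samples in train_set.items()}


# Toy usage with dummy data and no-op training callbacks.
old_memory = {c: list(range(10)) for c in range(100)}       # 100 old classes
new_data = {100 + c: list(range(500)) for c in range(10)}   # 10 new classes
memory = incremental_step(new_data, old_memory, train_fn=lambda s: None,
                          fine_tune_fn=lambda s: None, memory_budget=1000)
```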

Results:

ImageNet:

From the graphs, we can see that whether 10 classes are added at each step (figure (a)) or 100 classes at each step (figure (b)), the Our-CNN model outperforms all the previous approaches.

CIFAR-100:

From the graphs, we can see that whether 2 classes are added at each step (figure (a)) or 5 classes at each step (figure (b)), the Our-CNN model outperforms all the previous approaches.

Authors of the article:

  1. Ravi Theja: https://www.linkedin.com/in/ravidesetty/
  2. Utkarsh Agarwal: https://www.linkedin.com/in/utkarsh2610/
