Learning multiple tasks with gradient descent

David Mack
Jan 17, 2018 · 7 min read

Learning multiple tasks is an important requirement for artificial general intelligence. I apply a basic convolutional network to two different tasks. The network is shown to exhibit successful multi-task learning. The frequency of task switchover is shown to be a crucial factor in the rate of network learning. Stochastic Gradient Descent is shown to be better at multi-task learning than Adam. Narrower networks are more forgetful.

[Image: a robot struggles to learn multiple tasks]

Introduction

It’s often remarked that machine learning struggles to learn and retain multiple tasks (“continual learning”). This raises a few questions:

  • Can today’s architectures learn multiple tasks?
  • How does task duration affect task forgetfulness?
  • Are there any simple ways to reduce task forgetfulness?
  • Which optimiser forgets tasks the least?

Experimental setup

I’m using a very simple network for these experiments (a rough sketch follows the list):

  • Two hidden layers: 1 convolutional layer, 1 fully connected layer
  • ReLU activation, max pooling, batch normalisation and dropout are used
  • Output: Softmax classifier over 10 classes
  • Stochastic Gradient Descent is used to train the network with a learning rate of 0.1
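
A minimal sketch of this setup, assuming TensorFlow 1.x and 28×28 single-channel inputs; the filter count, hidden-layer width, dropout rate and pooling size are illustrative choices of mine rather than the exact values used in these experiments:

    import tensorflow as tf

    def build_network(images, labels, is_training, num_classes=10):
        # Hidden layer 1: convolution + ReLU, batch normalisation, max pooling
        net = tf.layers.conv2d(images, filters=32, kernel_size=3, activation=tf.nn.relu)
        net = tf.layers.batch_normalization(net, training=is_training)
        net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)

        # Hidden layer 2: fully connected + ReLU, with dropout
        net = tf.layers.flatten(net)
        net = tf.layers.dense(net, units=128, activation=tf.nn.relu)
        net = tf.layers.dropout(net, rate=0.5, training=is_training)

        # Output: softmax classifier over the 10 classes
        logits = tf.layers.dense(net, units=num_classes)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

        # Stochastic Gradient Descent with a learning rate of 0.1
        # (the control dependency keeps the batch-norm statistics updating)
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

        return logits, loss, train_op

Here images, labels and is_training would be fed via placeholders at training and test time.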

Baseline: MNIST

MNIST is a popular toy dataset of hand-written digits. There are 60,000 labelled training images and 10,000 labelled test images. Given its popularity and familiarity, it’s a good task for this investigation.

Baseline: ImageNet Dogs

I’ve extracted ten dog breeds from the Stanford Dogs dataset (a subset of ImageNet) to train the network on. [# of test and train examples]
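
As a rough illustration of the data preparation (not the exact code used), a ten-breed subset can be carved out of the Stanford Dogs image directory with a few lines of standard Python; the breed selection and the 90/10 train/test split below are placeholder choices of mine:

    import os
    import random

    STANFORD_DOGS_DIR = "./Images"   # one sub-folder per breed, e.g. "n02085620-Chihuahua"

    # Take ten breed folders (the exact breeds used here are not listed)
    breeds = sorted(os.listdir(STANFORD_DOGS_DIR))[:10]

    examples = []
    for label, breed in enumerate(breeds):
        breed_dir = os.path.join(STANFORD_DOGS_DIR, breed)
        for filename in os.listdir(breed_dir):
            examples.append((os.path.join(breed_dir, filename), label))

    random.shuffle(examples)
    split = int(0.9 * len(examples))                 # illustrative 90/10 split
    train_examples, test_examples = examples[:split], examples[split:]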

Learning multiple tasks

Now let’s train and test the network on both tasks; a sketch of the alternating training loop follows the observations below.

  • We’ll measure the training accuracy of the current task
  • We’ll measure the test accuracy on each task, and on all tasks combined (equivalent to the average of the individual tasks’ test accuracies)
(In the charts, “id” refers to the MNIST task; the task dictionary contains many variations of MNIST.)
  • The network partially forgets each previous task
  • Test accuracy improves over time
  • The network progressively forgets each task less with each switchover, until almost no forgetting occurs
  • It appears the network learns to co-habit the two tasks’ training over time. This could be because only the co-habitable weights survive each task switchover, leading to a sort of natural selection of those weights.
[Chart: test accuracy on the combined task for the Gradient Descent runs]
  • Contrary to the general consensus, a simple machine learning model using SGD is able to learn multiple tasks quite well
  • The task switchover may act as an injection of noise into the system that helps it explore more of the loss landscape, resulting in better test accuracy for MNIST.
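
For concreteness, here is my own reconstruction of the alternating-task training loop, assuming TensorFlow 1.x. train_op and accuracy would come from a graph like the sketch in the setup section; next_batch, test_set and the images_ph, labels_ph and is_training placeholders are hypothetical helpers of mine, and the switchover interval is illustrative:

    import tensorflow as tf

    TASKS = ["mnist", "dogs"]
    STEPS_PER_TASK = 2000        # illustrative switchover interval
    NUM_SWITCHOVERS = 20

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for switchover in range(NUM_SWITCHOVERS):
            task = TASKS[switchover % len(TASKS)]      # alternate between the two tasks
            for _ in range(STEPS_PER_TASK):
                batch_images, batch_labels = next_batch(task)
                sess.run(train_op, {images_ph: batch_images,
                                    labels_ph: batch_labels,
                                    is_training: True})

            # After each block, evaluate on *both* tasks to measure forgetting
            for eval_task in TASKS:
                eval_images, eval_labels = test_set(eval_task)
                acc = sess.run(accuracy, {images_ph: eval_images,
                                          labels_ph: eval_labels,
                                          is_training: False})
                print(switchover, eval_task, acc)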

Is multi-task learning improved by having a task signal?

Let’s now explore whether the network can perform better when given a signal of which task is being performed. Such a signal could be provided by another part of the network, and the human brain likely has something similar.

(In the charts, “channelplex” is the new task-signal scheme; “combined” re-uses the same channels for both tasks.)
  • Having a task signal improves Dog breed training
  • However, having a task signal seems to lead to faster forgetting of dog breed test accuracy.
  • Overall, with a task signal the combined test accuracy is higher early on, but no better later in training. [todo: make this more rigorous]
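
The channelplex scheme itself isn’t spelled out above, so the sketch below shows just one plausible way of feeding the network a task signal: appending a one-hot task indicator as extra input channels. It illustrates the idea rather than the exact scheme benchmarked here, and add_task_signal is my own name:

    import tensorflow as tf

    def add_task_signal(images, task_id, num_tasks=2):
        # images: [batch, height, width, channels]; task_id: scalar int tensor
        batch = tf.shape(images)[0]
        height = tf.shape(images)[1]
        width = tf.shape(images)[2]

        # Broadcast the one-hot task indicator into constant planes, one per task
        one_hot = tf.one_hot(task_id, num_tasks)                     # [num_tasks]
        planes = tf.tile(tf.reshape(one_hot, [1, 1, 1, num_tasks]),
                         [batch, height, width, 1])                  # [B, H, W, num_tasks]

        # Concatenate onto the existing image channels
        return tf.concat([images, planes], axis=-1)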

Multi-task learning vs optimizer

Let’s see how the Adam optimizer performs versus Stochastic Gradient Descent on our two-task experiment setup:

  • Adam over-fits the current task, to the detriment of per-task test accuracy and overall test accuracy
  • SGD forgets previous tasks less than Adam, and consequently achieves higher accuracy on all of the task tests
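
For reference, swapping the optimiser is a one-line change in TensorFlow 1.x; the Adam learning rate shown below is the library default rather than a value taken from these experiments:

    # Same loss, two optimisers
    sgd_train_op  = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
    adam_train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)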

Multi-task learning vs training time

It could be that the multi-task learning shown in the previous experiments occurs simply because the network is not trained long enough on each task to forget the previous one. That is, thanks to the slow learning rate, the network averages the training effect from each task, which effectively trains it on a dataset that includes both tasks.

  • The network does not forget more when it is trained longer on each task
  • Therefore, the amount of learning (i.e. test accuracy) depends on the task-switchover frequency rather than on the number of training steps
  • Therefore, the network inherently performs some multi-task learning

Multi-task learning versus size of hidden layer

[This is an older experiment setup; it needs to be re-tested, and I may redact this section]

  • Narrower networks forget previous tasks more. This is intuitive, since there is less room to remember each task’s specific computations, and therefore training must over-write the previous task to remember the current one


