Learning multiple tasks with gradient descent

David Mack
Jan 17, 2018 · 7 min read

Learning multiple tasks is an important requirement for artificial general intelligence. In this post I apply a basic convolutional network to two different tasks and show that it can exhibit successful multi-task learning. The frequency of task switchover turns out to be a crucial factor in how quickly the network learns, Stochastic Gradient Descent proves better at multi-task learning than Adam, and narrower networks are more forgetful.

Robot struggles to learn multiple tasks

Introduction

It’s often remarked that machine learning struggles to learn multiple tasks simultaneously (“continual learning”):

The lack of algorithms to support continual learning thus remains a key barrier to the development of artificial general intelligence. [1]

I’m curious to see these problems first-hand, so in this blog post I’m going to run a series of multi-task learning experiments to demonstrate the limits and properties of today’s machine learning.

I’ll explore:

- how well a simple network learns two tasks at once
- whether a task signal improves multi-task learning
- how the choice of optimizer (SGD vs. Adam) affects it
- how the time spent training each task affects forgetting
- how the size of the hidden layer affects forgetting

Experimental setup

I’m using a very simple convolutional network for these experiments:
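To make the setup concrete, here is a minimal sketch of the kind of small convolutional classifier used here, written with tf.keras. The layer sizes, input resolution and class count below are my assumptions for illustration, not the exact architecture from the FloydHub/GitHub code.

import tensorflow as tf

# A minimal sketch of a small convolutional classifier. The layer sizes,
# the 64x64 input resolution and the 10-way output are illustrative assumptions.
def build_network(input_channels=3, num_classes=10, image_size=64):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(image_size, image_size, input_channels)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])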

The TensorFlow code used to generate these graphs and experiments is available as a FloydHub project and a GitHub repository you can run yourself.

First of all, let’s look at how this network performs on two simple tasks:

Baseline: MNIST

MNIST is a popular toy dataset of hand-written digits. There are 60,000 labelled training images and 10,000 labelled test images. Given its popularity and familiarity, it’s a good task for this investigation.

Here’s how the network performs:

Our simple network performs adequately, achieving a test accuracy of 96% after 2,000 training cycles and 97% after 25k cycles. State of the art is close to 100%, but for the purposes of this investigation the network performs well enough.

Baseline: ImageNet Dogs

I’ve extracted ten dog breeds from the Stanford Dogs dataset (a subset of ImageNet) to train the network on. [# of test and train examples]
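For reference, here is roughly how one might pull ten breeds out of the dataset. The directory layout and the choice of the first ten breed folders are assumptions for illustration, not necessarily what the experiments used.

from pathlib import Path

# Hypothetical sketch: take the first ten breed directories from the
# Stanford Dogs "Images" folder and collect their image paths.
DOGS_ROOT = Path("stanford_dogs/Images")
breeds = sorted(p.name for p in DOGS_ROOT.iterdir() if p.is_dir())[:10]
image_paths = {breed: sorted((DOGS_ROOT / breed).glob("*.jpg")) for breed in breeds}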

The network quickly manages to overfit the data (i.e. the training accuracy hits almost 100%). The training does not generalise well: the test accuracy reaches 33% after 24k training cycles. This is better than random guessing (1/10 classes = 10%), but nowhere near state of the art (e.g. 96% on CIFAR-10).

For this investigation of multi-task learning, the network’s performance relative to this baseline is more interesting than its absolute performance. Therefore this performance is adequate for our purposes.


Learning multiple tasks

Now let’s train and test the network against both tasks.

The network receives each MNIST greyscale image as a three-channel image, two channels of which are all zeros. Dog breed images are fed in as standard three-channel RGB images.
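One way to implement this padding, as a minimal sketch assuming NumPy arrays with a trailing channel dimension (not necessarily the exact code from the repo):

import numpy as np

def mnist_to_three_channels(batch):
    # batch: greyscale MNIST images of shape (n, height, width, 1).
    # The digit goes into channel 0; channels 1 and 2 are all zeros.
    zeros = np.zeros_like(batch)
    return np.concatenate([batch, zeros, zeros], axis=-1)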

“id” is used to refer to the MNIST task (the task dictionary contains many variations of MNIST)

Let’s compare the multi-task training with the previous baseline single-task training:

[Chart: test accuracy for the combined runs, TensorBoard run filter /combined/GradientDescent/.*/tt2000/.*/test]

Is multi-task learning improved by having a task signal?

Let’s now explore whether the network can perform better given a signal indicating which task is being performed (such a signal could come from another part of the network, and the human brain likely has something similar).

The network is adjusted to have four input channels. The first channel receives greyscale MNIST images (or zeros when not training that task) and channels two through four receive dog images (or zeros when not training that task).
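Here is a sketch of this “channelplex” input packing. The helper name, and the assumption that both datasets are resized to a common resolution first, are mine rather than taken from the original code.

import numpy as np

def channelplex(image, task):
    # image: (h, w, 1) greyscale MNIST digit or (h, w, 3) RGB dog photo,
    # both already resized to the same resolution (an assumption here).
    h, w = image.shape[:2]
    out = np.zeros((h, w, 4), dtype=image.dtype)
    if task == "mnist":
        out[..., 0:1] = image      # channel 0 carries the digit
    else:
        out[..., 1:4] = image      # channels 1-3 carry the dog image
    return out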

Let’s perform the same training as earlier, for a task time of 5k steps, and compare with and without this task signal:

channelplex = our new task signal scheme, combined = re-using the same channels for both tasks

(Future work: experiment with task-specific dropout, zeroed nodes, or a one-hot task signal fed in as a residual; also investigate why dropout helps, e.g. whether it factors into the gradients and whether it requires a uniform distribution.)

Multi-task learning vs optimizer

Let’s see how the Adam optimizer performs versus Stochastic Gradient Descent on our two-task experiment setup:
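Switching between the two optimizers is a one-line change. In the sketch below, build_network is the earlier sketch, the learning rates are illustrative assumptions rather than the values from the original runs, and the 20-way output assumes one shared softmax over the 10 digit classes plus the 10 dog breeds.

import tensorflow as tf

# The only difference between the two runs is the optimizer.
model = build_network(input_channels=4, num_classes=20)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # swap in for the Adam run

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])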

Multi-task learning vs training time

It could be that the multi-task learning shown in the previous experiments happens simply because the network does not have enough time to forget the previous task. That is, thanks to the slow learning rate, the network averages the training effect from each task, which effectively trains it on a dataset that includes both tasks.

To test this suspicion, I’ll increase the time spent on each task by up to 20x. This experiment uses stochastic gradient descent and the task signal input format described earlier.
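The task-switchover schedule itself is simple. Here is a sketch of the training loop; next_batch is a hypothetical helper that yields a batch for the active task, not a real API.

# Alternate between tasks every `task_time` steps. `tasks` is a list of
# dataset objects; `next_batch` is a hypothetical helper for illustration.
def train_alternating(model, tasks, task_time=5000, total_steps=100000,
                      batch_size=32):
    for step in range(total_steps):
        task = tasks[(step // task_time) % len(tasks)]  # which task is active
        x, y = task.next_batch(batch_size)
        model.train_on_batch(x, y)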

How time spent training each task affects test accuracy:

Putting the runs onto a timeline relative to task switchover (each vertical gridline is one task training period) lets us compare whether a longer training period per task leads to more forgetting:

Multi-task learning versus size of hidden layer

[Note: this uses an older experimental setup and needs to be re-tested; it may be removed in a future revision.]

Let’s test a two-layer fully-connected network on the same tasks to see how the width of the first layer affects task forgetting.
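As a sketch, the network under test looks something like this; everything other than the first-layer width is an illustrative assumption.

import tensorflow as tf

# Two-layer fully-connected classifier. `width` (the first layer) is the
# variable under test; the other sizes are assumptions.
def build_fc_network(width, image_size=64, input_channels=4, num_classes=20):
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(image_size, image_size, input_channels)),
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])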

Bibliography

[1] Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks”, 2017.

Octavian

Research into machine learning and reasoning
