image credit: Luis Miguel Bugallo Sánchez, https://commons.wikimedia.org/wiki/File:Dog._Galiza.jpg

Teaching an old dog new tricks: Transfer learning in deep neural networks

Peter van der Putten
4 min read · Nov 25, 2016

When you need to learn a new task, it can help if you already have experience with a different but similar task. The same holds for deep neural networks, a class of machine learning methods that recently became famous for beating the best human Go players, producing hallucinatory paintings, recognizing doodles and generating Trump-style tweets. In machine learning, this idea of transferring knowledge from one domain to another is called transfer learning.

Now surely, if you have enough time to learn (and have thus seen sufficient examples), the value of transfer learning should diminish: you could just as well have learned from scratch. In our recent research we have investigated whether this actually holds for deep learning [Soekhoe et al., 2016]. Building on some prior work, we also investigated whether it would be best to let the entire network adapt, or perhaps to lock down the lower layers [Yosinski et al., 2014].

We tested this as follows. We used two different data sets, one with different categories of objects (a, b, c in the image below, taken from the Tiny-Imagenet data set). Networks were first fully trained and tested on one set of object classes. For example, after training networks on examples of lighthouses, butterflies, etc. (100 different image classes with 500 examples each in total), the networks were tested by letting them classify new test images from these categories. Next, the networks were given another task: they had to learn to recognize new classes of objects from a more limited number of examples (testing different conditions, for example only 50, 100 or 200 umbrellas; 100 new image classes in total). This experiment was repeated with images depicting certain scenes or settings (d, e, f, taken from the Mini-Places2 data set).
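To make the limited-data conditions concrete, here is a minimal sketch (in PyTorch, purely illustrative and not the code or framework used in the study) of how a labelled data set could be subsampled to a fixed number of examples per class:

```python
# Illustrative sketch only, not the study's code: build a subset of a
# labelled data set with at most n_per_class examples per class,
# mimicking the 50/100/200-examples conditions described above.
import random
from collections import defaultdict
from torch.utils.data import Subset

def subsample_per_class(dataset, n_per_class, seed=0):
    """Return a Subset with at most n_per_class examples for each label."""
    indices_by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        indices_by_class[label].append(idx)

    rng = random.Random(seed)
    kept = []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        kept.extend(indices[:n_per_class])
    return Subset(dataset, kept)

# e.g. small_train = subsample_per_class(new_task_train_set, n_per_class=50)
```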

Following Yosinski et al. [2014], we also wanted to know whether all elements of the network should be updated, or just the higher-level layers. Despite the recent deep learning hype, neural networks have existed for decades, and are remotely inspired by how the brain is organized. Information feeds in at the lowest level (left in the image below), and the activation of the various ‘neurons’ feeds forward through the network, depending on how strongly the neurons in the various layers are connected. Based on the training feedback, the weights of the network are adapted from the top layer back down again, to tune the network towards a better outcome. As a result, the various layers learn to detect features of increasing abstraction: lower layers pick up generic, low-level features, while higher layers respond to more complex, task-specific patterns. You might think that when the network needs to learn a new task, for instance recognizing umbrellas, the optimal approach would be to adapt all layers of the network. But it could also be that fixing the weights of the lower layers (i.e. keeping the lower-level features the same) makes it easier for the higher-level layers of the network to adapt, because the lower-level features are generally more reusable.
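As a rough illustration of what fixing the lower layers means in practice, the sketch below reuses a pretrained network, freezes its first blocks and fine-tunes the rest on the new classes. The VGG16 backbone, the ImageNet weights and the block-counting heuristic are stand-ins of my own, not the set-up used in the paper:

```python
# Illustrative sketch (PyTorch): reuse a network trained on an old task,
# freeze the first k blocks, and fine-tune the remaining layers on new classes.
# Backbone and pretrained weights are stand-ins, not the paper's configuration.
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_new_classes, n_frozen_blocks=2):
    model = models.vgg16(weights="IMAGENET1K_V1")  # pretrained on the "old" task

    # Freeze the parameters of the first n_frozen_blocks convolutional blocks,
    # so their lower-level features stay exactly as learned on the old task.
    block = 0
    for layer in model.features:
        if block < n_frozen_blocks:
            for param in layer.parameters():
                param.requires_grad = False
        if isinstance(layer, nn.MaxPool2d):  # a max-pool marks the end of a block
            block += 1

    # Replace the output layer so the network can predict the new classes;
    # the higher layers (and this new head) are updated during fine-tuning.
    model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_new_classes)
    return model
```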

Below you can see some of the results for the Tiny-Imagenet data set. The baseline (‘base’ on the x-axis) indicates networks that were learned completely from scratch, i.e. no transfer learning and no reuse of what was learned on the other task. FTall stands for reusing the network that was learned on the old task and fine-tuning all layers in the network. Finally, SxT stands for reusing the network from the old task, but fixing the first layers of the network (layer 1, layers 1 and 2, etc.). The number stands for the overall number of images that were available for the new task. Accuracy rates are plotted on the y-axis (the percentage of test images correctly classified).
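For reference, the conditions on the x-axis can be summarized as follows (my own shorthand encoding of the description above, not code or naming taken from the paper):

```python
# Shorthand for the experimental conditions described above (illustrative only).
conditions = {
    "base":  {"reuse_old_network": False, "n_fixed_layers": 0},  # learn from scratch
    "FTall": {"reuse_old_network": True,  "n_fixed_layers": 0},  # fine-tune all layers
    "S1T":   {"reuse_old_network": True,  "n_fixed_layers": 1},  # fix layer 1
    "S2T":   {"reuse_old_network": True,  "n_fixed_layers": 2},  # fix layers 1 and 2
    "S3T":   {"reuse_old_network": True,  "n_fixed_layers": 3},  # fix layers 1 to 3
}
# Each condition is run for every amount of available training data
# for the new task, and test accuracy is recorded.
```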

What do these results show? Firstly, the less data is available for the new task, the higher the benefit of reusing the networks learned on the other task as a starting point (i.e. ‘base’ gets beaten more often for the lower data set sizes). We also see a pattern that when transfer learning makes sense, i.e. when fewer than 500 examples for the new classes are available, the results are best if we fix the first two or three layers of the network.

For detailed results see the paper, but what these experiments have demonstrated is that you can indeed teach an old dog some new tricks: as long as there is not much data available for the new classes, it makes sense to start from what was learned on the other tasks. It also makes sense to adapt only the deeper, higher-level structures in your old dog brain, and to leave the more fundamental lower levels alone.

This post was also published as a LinkedIn article.

References

Deepak Soekhoe, Peter van der Putten and Aske Plaat. On the Impact of Data Set Size in Transfer Learning using Deep Neural Networks. In: Fifteenth International Symposium on Intelligent Data Analysis (IDA), 2016.

Camera ready preprint here.

Note: Deepak Soekhoe was the principal researcher for this work.

Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems, pp. 3320–3328, 2014
