Transfer Learning Overview

The journey of Transfer Learning started long ago, but one of its most interesting milestones is the following paper.

(Interesting) Facts

The great strength of CNN architectures is their capability to automatically learn a hierarchy of feature detectors in order to solve a given task.
A highly semantic task, like a classification problem, can be thought of as a mapping

  • from the high dimensional input space (image space)
  • to the low dimensional semantic space (finite discrete space of possible labels)
A Feature Detector can be considered a mapping from the Input Space (typically high dimensional) to a Semantic Space (typically lower dimensional). Because of the Feature Locality Principle, the spatial dimension is progressively reduced in favor of the semantic dimension.

The CNN architecture leverages some inductive biases, like progressively reducing the spatial dimension (Spatial Pooling) as depth increases. This makes sense considering the final representation should not depend on the spatial dimension; in fact, we would like it to be as invariant as possible to spatial transformations (and not only to those).
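As a minimal sketch of this idea (PyTorch, with an illustrative 32x32 RGB input and 10 labels, neither taken from the paper), the tiny CNN below maps a point of the high dimensional image space to a point of the low dimensional label space, with pooling shrinking the spatial dimension at every stage.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x32x32 -> 16x16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x8x8
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # collapse the spatial dimension entirely
            nn.Flatten(),                                 # -> 32-dimensional semantic vector
            nn.Linear(32, num_classes),                   # -> finite discrete label space
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 32, 32)       # a point in the ~3072-dimensional image space
logits = TinyCNN()(x)               # a point in the 10-dimensional label space
print(x.shape, "->", logits.shape)  # torch.Size([1, 3, 32, 32]) -> torch.Size([1, 10])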

What has been observed is that, regardless of the architecture, the dataset and the target semantic space (and, of course, of the initialization, assuming it is random), the first layers always seem to converge to a specific kind of feature detector: Gabor Filters. They have also been observed in the Human Brain at the first stage of visual processing (V1).

Gabor Filters (source: https://openi.nlm.nih.gov/detailedresult.php?img=PMC3224537_1475-925X-10-55-4&req=4)

This is actually a very interesting and important phenomenon, as it seems to suggest Gabor Filters are the most efficient way to start the semantic extraction process from an image.
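To make the idea concrete, here is a minimal NumPy sketch of a Gabor filter bank: an oriented sinusoid modulated by a Gaussian envelope. The parameter values are purely illustrative and not taken from any trained network.

import numpy as np

def gabor_kernel(size=11, sigma=3.0, theta=0.0, wavelength=6.0, psi=0.0, gamma=0.5):
    """Real part of a 2D Gabor filter with orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate the coordinates so the sinusoid runs along orientation theta
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

# A small bank of oriented, edge-like detectors
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
print(len(bank), bank[0].shape)  # 4 (11, 11)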


This would mean the Gabor Filters block is a sort of “generic building block” which could be used to design NNs aimed at solving computer vision problems.

This is one of the main goals of Transfer Learning: finding “building blocks” which can be composed into a NN and fine-tuned on the target Dataset, instead of trained from scratch.
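As a hedged sketch of what fine-tuning such a building block looks like in practice (using PyTorch / torchvision with a ResNet-18 backbone purely as an example, and a hypothetical 5-class target task), the idea is to freeze the pretrained feature extractor and train only a new task-specific head:

import torch.nn as nn
from torchvision import models

num_target_classes = 5  # assumption: the target task has 5 labels

backbone = models.resnet18(weights="IMAGENET1K_V1")  # pretrained "building block" (recent torchvision API)
for param in backbone.parameters():
    param.requires_grad = False                      # keep the generic features frozen

# Replace the head with a new layer trained from scratch on the target Dataset
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

# Only the head's parameters are handed to the optimizer during fine-tuning
trainable_params = [p for p in backbone.parameters() if p.requires_grad]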

A Theory and New Questions

Looking at the whole processing pipeline, we now know the first layers perform some kind of context-agnostic processing, while the last layers have to perform context-specific processing (with context including task, dataset, …). This theory raises the following questions (quoting from the paper):

  • Can we quantify the degree to which a particular layer is general or specific?
  • Does the transition occur suddenly at a single layer, or is it spread out over several layers?
  • Where does this transition take place: near the first, middle, or last layer of the network?

Essentially the rest of the paper proposes a methodology to answer these questions and presents some results

Why are these questions important?

Being able to properly understand how the CNN specializes during training is important to get to Transfer Learning: “transferring” the network’s “capability of solving a problem”, which basically means adapting its weights properly, to another similar problem in a Data Efficient Way.

Data efficiency is in fact one of the most important aspects of transfer learning: it is well known that Supervised Learning is an effective way to make a certain, typically big, NN able to solve a problem, but it scales badly in terms of data, as it typically requires A LOT OF supervision signal which, in the case of manual annotation, is expensive to collect as it relies on humans to provide it.

Furthermore, the more difficult the task, the more the annotations need to be provided by human experts instead of laypeople, and the former’s time is more expensive than the latter’s.

Methodology

The strategy used to quantify the generality of a feature detector relies on transplanting the trained feature detector into another pre-trained network, substituting a certain block

If no relevant performance drop is observed, then it is possible to assume the transplanted block is quite generic

Let’s assume 2 DNNs (NN_{A} and NN_{B}) have been trained on 2 different Datasets, the A Dataset and the B Dataset respectively. In order to measure how generic the subnetwork f_{1}^{A} is, let’s transplant it into NN_{B} in the proper position (as it is an Input Processing Subnetwork, it has to become the initial layers) and measure the performance drop of the “transgenic” A-B NN on the B Dataset.

It is worth observing that an important hyperparameter of this methodology is where to perform the A/B split.
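A minimal sketch of the transplant experiment follows (PyTorch, illustrative only: the networks are assumed to be nn.Sequential so the split point is simply a layer index, and transplant, evaluate_on_B, net_A, net_B and n_split are hypothetical names, not the paper’s code):

import copy
import torch.nn as nn

def transplant(net_A: nn.Sequential, net_B: nn.Sequential, n_split: int,
               freeze: bool = True) -> nn.Sequential:
    """Return a copy of net_B whose first n_split layers are taken from net_A."""
    hybrid = copy.deepcopy(net_B)
    for i in range(n_split):             # the A/B split point is the key hyperparameter
        hybrid[i] = copy.deepcopy(net_A[i])
        if freeze:
            for p in hybrid[i].parameters():
                p.requires_grad = False  # keep the transplanted block fixed
    return hybrid

# acc_drop = evaluate_on_B(net_B) - evaluate_on_B(transplant(net_A, net_B, n_split=3))
# A small drop would suggest the first 3 layers of NN_A are fairly generic.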
