What is Transfer Learning all about?
Transfer Learning is one of the most widely researched and widely used techniques in deep learning today. We all know the value of Convolutional Networks, but we also know that the most powerful networks contain a vast number of layers that we can't computationally afford to train from scratch again and again. Lucky for you, you don't have to! Transfer learning lets you take those existing models and fit them to your own data.
We don't learn everything from scratch, and neither do computers. We tend to apply pre-existing knowledge to new concepts and paradigms, and transfer learning follows exactly this idea. Some researchers even argue it is an important step toward Artificial General Intelligence. The core concept is simple: take a state-of-the-art network with hundreds of layers, along with the weights and biases it has already learned, and use it as the starting point for our own dataset. The 'knowledge' these models have acquired is then applied to a novel dataset. Where do we get these models? Every year the ImageNet challenge invites top researchers to classify over a million images into 1,000 categories and come up with the most accurate models possible. The challenge has given rise to architectures like ResNet, Inception, and VGG, which remain state-of-the-art today. Transfer learning does not mean using these networks as-is to classify our objects, because in highly class-specific cases they will fail spectacularly. Instead, we use the patterns their hidden layers have learned as building blocks and fit them to our own training data.
Now, this was a high-level overview of transfer learning. Let's get into what it means formally and mathematically. Please note that this MIGHT BE BORING but it is SUPER IMPORTANT!
In the paper A Survey on Transfer Learning, perhaps one of the best-documented papers out there, the authors use the notions of 'domain', 'task', and 'marginal probability' to define a clean framework for transfer learning.
Some definitions they clarified:
Domain — a two-component set consisting of a feature space X and a marginal probability distribution P(X).
Task — again two components: a label space Y and a predictive function f(·) learned from feature/label pairs.
The paper uses these components to define a task formally:
“Given a domain D = {X, P(X)}, a task T consists of a label space Y and a conditional probability distribution P(Y|X) that is typically learned from training data consisting of pairs (xᵢ, yᵢ), where xᵢ ∈ X and yᵢ ∈ Y.”
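To complete the picture, the survey then defines transfer learning itself in terms of a source and a target. Paraphrasing its notation: given a source domain D_S with learning task T_S and a target domain D_T with learning task T_T, transfer learning aims to improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. In other words, either the domains differ (different feature spaces or different marginal distributions) or the tasks differ (different label spaces or different conditional distributions), and we still want the source to help the target.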
With this, we have enough background on transfer learning. So why do we do it? In a typical supervised learning scenario, quality labeled data is scarce, and training from scratch often leads to poor generalization. Transfer learning is designed for exactly this situation: we reuse the weights and biases from a task on which a network already performs well to extract features for our new task. How can that work? Since images share many low-level features such as edges and curves, transfer learning makes sense up to a point; there is ongoing research into when it is applicable, in particular how to measure the 'similarity' between the original dataset and the dataset we want to transfer to.
How do we do it? In practice, transfer learning is surprisingly simple. We take the weights of a state-of-the-art network, 'freeze' its layers, cut off the last layer or two, and add new layers that correspond to our own dataset (a minimal sketch of this pattern follows below). This is necessary because, if you remember correctly, the final Dense layer must match the number of classes we have. Libraries like Keras and TensorFlow also allow us to 'fine-tune' the frozen weights afterwards with a small learning rate. Effectively, you get an excellent starting point for your data and benefit from the best architectures out there without writing thousands of lines of code.
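To make this recipe concrete before we dive into the full tutorial, here is a minimal sketch of the freeze-and-replace-the-head pattern. The backbone (ResNet50) and the num_classes variable are chosen purely for illustration; the rest of this lecture walks through the same steps in detail with Inception_V3.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

num_classes = 6  # illustrative: set this to however many classes your dataset has

# load a pre-trained backbone without its classification head
backbone = ResNet50(weights='imagenet', include_top=False)

# freeze the pre-trained layers so only the new head gets trained
for layer in backbone.layers:
    layer.trainable = False

# add a new head sized for our own classes
x = GlobalAveragePooling2D()(backbone.output)
outputs = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=backbone.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])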
With that, we have enough knowledge to start coding. In this lecture we will look at Inception_V3, but feel free to research the many other architectures you can use. To get a sense of how to work with your own data, we will follow along in a Python file, which will give you a feel for building proper projects. So without further ado, let's get into it. This time we will build a trash-type classifier using transfer learning, based on an excellent dataset provided by Gary Thung on GitHub. Please find the link to the folder here; fortunately, the training and testing split has already been done for you, so there is no need to worry about that.
Before starting, take a look at the architecture of the Inception_V3 network:
Take a moment to scan the network. The Inception module introduces several new concepts, such as auxiliary classifiers and efficient grid-size reduction; you can find a wonderful introduction to these topics in this article, but for now all you need to understand is that Inception_V3 is a very powerful neural network. Let's start with some basic imports and by defining the base pre-trained Inception model:
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)
Be sure to use either tensorflow.keras or plain keras throughout; mixing the two in the same project can lead to conflicts. Here we have defined the model so that the 'top' of the network is not included, which gives us room to adapt the model to our requirements. If you remember correctly, the last layers are specific to the number of classes, and Inception was pre-trained on a different dataset, so we leave them out. The following code adjusts the model accordingly:
# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- we have 6 classes
predictions = Dense(6, activation='softmax')(x)
# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Now that's quite a bit of code, so let's unpack it. We take the base model, add a global spatial average pooling layer, another fully-connected Dense layer for good measure, and finally a logistic layer (a fancy name for the final Dense layer) as the output, with as many neurons as there are classes. We then define the model with the Model() object Keras provides. In this tutorial we follow a two-pronged training approach; if you would rather train in one go, a sketch of that is included just below. Finally, we compile the model with the RMSProp optimizer and the categorical cross-entropy loss. The only thing remaining is fitting the data! You may be wondering what that small for loop in the code is doing: we first 'freeze' all convolutional layers and train only the layers we added, so that they pick up sensible values, and in the second half we will unfreeze some convolutional layers and fine-tune them to perfect the model. That is what I meant by a two-pronged approach: first train our new layers, then train part of the rest.
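For reference, here is a minimal sketch of the single-stage alternative mentioned above: unfreeze everything and train the whole network at once with a small learning rate. It assumes the base_model and model defined above and the training_set/test_set generators defined in the next code block, so treat it as an illustration rather than the path we follow in this tutorial.
from tensorflow.keras.optimizers import Adam

# unfreeze every layer so the whole network is trained in one go
for layer in base_model.layers:
    layer.trainable = True

# a small learning rate keeps the pre-trained weights from being wrecked early on
model.compile(optimizer=Adam(learning_rate=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])

# training_set and test_set come from the data-generator block below
model.fit(training_set, validation_data=test_set, epochs=5)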
This is also a good time to introduce data generators. Researchers found an intuitive way to deal with scarce data: data augmentation. The technique takes the existing images and creates modified variants of them; common modifications include rotation, shifting, shearing, zooming, and flipping. If we applied three such transformations to every image, we would get three new images per original, quadrupling our data, pretty amazing right? In Keras the transformations are applied randomly on the fly, so each epoch the network sees slightly different versions of the same images. Let's see it in action:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory('data/train',
                                                 target_size=(224, 224),
                                                 batch_size=32,
                                                 class_mode='categorical')
test_set = test_datagen.flow_from_directory('data/test',
                                            target_size=(224, 224),
                                            batch_size=32,
                                            class_mode='categorical')

# train the model on the new data for a few epochs
model.fit(training_set,
          validation_data=test_set,
          epochs=1,
          steps_per_epoch=len(training_set),
          validation_steps=len(test_set))
Running this trains only the newly added top layers. As you can see, the ImageDataGenerator we defined augments our data by applying the requested transformations; you can read more about it right here. Data augmentation is a great way to deal with scarce data: the generators Keras provides create the augmented images at training time, giving us enough variety to train on.
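If you want to sanity-check what the generator produces, a quick, purely illustrative snippet like this pulls one augmented batch and prints its shape along with the class-to-index mapping:
# grab a single augmented batch from the training generator
images, labels = next(training_set)
print(images.shape)                # e.g. (32, 224, 224, 3): 32 rescaled, augmented images
print(labels.shape)                # e.g. (32, 6): one-hot labels for the 6 classes
print(training_set.class_indices)  # mapping from folder/class name to label index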
Go ahead and run this code for as many epochs as you like. If you would like to improve the model further, let's take a look at how that is done (the second part of our two-pronged approach):
# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.
# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)
# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 249 layers and unfreeze the rest:
for layer in model.layers[:249]:
    layer.trainable = False
for layer in model.layers[249:]:
    layer.trainable = True
# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.0001, momentum=0.9), loss='categorical_crossentropy')
# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit(training_set,
          validation_data=test_set,
          epochs=10,
          steps_per_epoch=len(training_set),
          validation_steps=len(test_set))

model.save('trash_classification_model.h5')
We first print all the layers in the model and inspect which ones to unfreeze; we chose some for you in this tutorial :). Every layer has a 'trainable' attribute; we set it to False for the whole base model earlier, and here we flip it back to True for the top two Inception blocks. You could set every layer to True, but the training time would increase drastically. After that, we recompile, train the model again with the same generators, and use model.save() to store it for future use. And that was it! Believe it or not, TensorFlow makes it easy to use such complicated architectures.
So what you just did is use transfer learning to create a trash classifier and save the model for later use. That's a lot of concepts, so take time to process them and try more problems on transfer learning.
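As a final touch, here is a minimal, illustrative sketch of how you could load the saved model back and classify a single image. The image path and the class-name list are placeholders I'm assuming here, so adjust them to your own setup (and check training_set.class_indices for the true ordering):
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

# load the model we saved above
model = load_model('trash_classification_model.h5')

# load and preprocess one image the same way the generators did (resize + rescale)
img = image.load_img('some_trash_photo.jpg', target_size=(224, 224))  # placeholder path
x = image.img_to_array(img) / 255.0
x = np.expand_dims(x, axis=0)  # add the batch dimension

# predict and map the winning index back to a class name
class_names = ['cardboard', 'glass', 'metal', 'paper', 'plastic', 'trash']  # assumed alphabetical order
pred = model.predict(x)
print(class_names[int(np.argmax(pred))])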
Applications of Transfer Learning
Real-World Simulations — Believe it or not, with transfer learning companies are trying to develop self-driving technology using video data from games like GTA-5! Crazy, right? Transfer learning makes this possible because the transferred features carry over important knowledge, whereas a vanilla network trained from scratch has a high chance of losing or never learning it. That said, it is not always successful, given the unpredictable nature of the real world.
Transferring Knowledge Across Domains — Transfer learning can be revolutionary in Natural Language Processing. Unlike computer vision, where visual features generalize well across the world, NLP has traditionally required building separate models for individual languages, and the data needed for a state-of-the-art NLP model usually runs into the billions of examples.
With that, we come to the end of a long tutorial. Transfer learning has countless applications and is very much part of the future of AI. I hope you learned something new today. Follow me for more content on AI and more!
To learn more about AI concepts, check out my online tutorial and book at https://aimpact.in
Additional Resources (must reads):