
A guide to transfer learning with Keras using ResNet50

Kenneth Cortés Aguas
11 min read · Jul 4, 2020

Abstract

In this blog post we provide a walkthrough of transfer learning, covering the main aspects to take into account, some tips, and an example implementation in Keras using ResNet50 as the pre-trained model. The task is to transfer the learning of a ResNet50 trained on ImageNet to a model that identifies images from the CIFAR-10 dataset. Several methods were tested to reach a higher accuracy, and we present them to show the variety of options available during training. With the final model of this blog post we obtain an accuracy of 94% on the training set and about 90% on the test set used for validation.

Introduction

Learning something new takes time and practice, but we find it easy to do similar tasks. This is thanks to the associations humans make while learning: we have the ability to identify patterns from previous knowledge and apply them to new learning.

When we meet a person who is faster or better than us at something, like a video game or coding, it is almost certain that they have done it before or that there is an association with a previous, similar activity.

If we know how to ride a bike, we don’t need to learn from scratch how to ride a motorbike. If we know how to play football, we don’t need to learn from scratch how to play futsal. If we know how to play the piano, we don’t need to learn from scratch how to play another instrument.

The same applies to machines: if we train a model on one dataset, it is not necessary to retrain the whole model from scratch to adapt it to a new, similar dataset. Both ImageNet and CIFAR-10 contain images that can be used to train a model to classify images. It is therefore very appealing to save training time (which can be very long) by starting from the weights of a previously trained model. We walk through this concept of transfer learning with everything you need to build a model of your own.

Materials and Methods

Setting up our environment

We are going to use Keras, an open-source Python library for neural networks. We run it on top of TensorFlow in Google Colab, a Jupyter notebook environment that runs in the cloud.

The first thing we do is import the libraries we need with the lines of code below. Pinning the version to 1.x is optional; without that first line, Colab will run its latest version of TensorFlow. We also use NumPy and a TensorFlow function, but depending on how you build your own model it may not be necessary to import them.

%tensorflow_version 1.x
import tensorflow as tf        # used later for resizing images inside a Lambda layer
import tensorflow.keras as K
import numpy as np             # optional, depending on how you build your model

Training a model uses a lot of resources, so we recommend using a GPU runtime in Colab. This will speed up the process and allow more testing. We will talk about some other ways to improve computation soon.

Database

CIFAR-10 is a dataset of 60,000 32x32 colour images grouped into 10 classes, that is, 6,000 images per class. It is split into 50,000 training images and 10,000 test images, labeled over the 10 categories.

The categories are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. We can take advantage of the fact that these categories, and many more, are present in the ImageNet collection.

To load the dataset with Keras, we use:

tf.keras.datasets.cifar10.load_data()
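
Since we imported tensorflow.keras as K, an equivalent call that unpacks the result directly into training and test arrays would be:

# Keras downloads CIFAR-10 on first use and caches it locally
(x_train, y_train), (x_test, y_test) = K.datasets.cifar10.load_data()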

Preprocess

Now that the data is loaded, we are going to build a preprocess function for it. We have X as a numpy array of shape (m, 32, 32, 3), where m is the number of images, 32 and 32 are the dimensions, and 3 is because we use color (RGB) images. We have one set of X for training and one for validation. Y is a numpy array of shape (m, ) that holds our labels. Since we work with 10 different categories, we use one-hot encoding with a Keras function that turns Y into shape (m, 10). The same applies to the validation set.

As we said before, we are going to use ResNet50, but there are also many other models available with pre-trained weights, such as VGG16, ResNet101, InceptionV3 and DenseNet121. Each one has its own preprocess function for its inputs.

Preprocess data using predefined function from Keras
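
The original gist is not reproduced here, but a minimal sketch of such a function, assuming the ResNet50 preprocessing utility from keras.applications and one-hot encoding with to_categorical, looks like this:

def preprocess_data(X, Y):
    """Preprocess CIFAR-10 data for ResNet50.
    X: numpy array of shape (m, 32, 32, 3) with pixel values in [0, 255]
    Y: numpy array of shape (m,) or (m, 1) with class indices
    Returns the preprocessed images and the one-hot encoded labels."""
    # Apply the same per-channel preprocessing ResNet50 saw during its ImageNet training
    X_p = K.applications.resnet50.preprocess_input(X.astype('float32'))
    # One-hot encode the labels: (m,) -> (m, 10)
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p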

Next, we are going to call our function with the data loaded from the CIFAR-10 dataset. It’s important to get to know your data in order to monitor each step and decide how to build your model. Let’s print the shapes of x_train and y_train before and after preprocessing.

Here we unpack both the train and test variables directly and pass them to the preprocess function.
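
A sketch of that step, reusing the arrays unpacked above and the preprocess_data function sketched earlier; the prints produce the shapes shown below:

print((x_train.shape, y_train.shape))   # before preprocessing
x_train, y_train = preprocess_data(x_train, y_train)
x_test, y_test = preprocess_data(x_test, y_test)
print((x_train.shape, y_train.shape))   # after preprocessing
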
((50000, 32, 32, 3), (50000, 1))
((50000, 32, 32, 3), (50000, 10))

Using weights of a trained neural network

A pretrained model from Keras Applications has the advantage of letting you use weights that are already calibrated for making predictions. In this case, we use the weights from ImageNet and the network is a ResNet50. The option include_top=False allows feature extraction by removing the last dense layers. This lets us control the input and output of the model.

Using weights of a trained ResNet50
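
A sketch of this step (the variable name res_model and the 224 x 224 input size are choices made here to match the rest of the post, since the original gist is not shown):

# Load ResNet50 with the ImageNet weights and without its classification head.
# input_shape matches the resolution ResNet50 was trained on; the CIFAR-10
# images are resized to this shape inside the model later on.
res_model = K.applications.ResNet50(include_top=False,
                                    weights='imagenet',
                                    input_shape=(224, 224, 3))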

From this point on it all comes down to testing and a bit of creativity. The starting point is very advantageous since we have weights that already serve for image classification, but since we are using them on a completely new dataset, some adjustments are needed. Our objective is to build a model with high classification accuracy: if an image of a dog is presented, it should identify it as a dog and not as a truck, for example.

Let’s say we want to achieve an accuracy of more than 88% on the training data, but we also want to avoid overfitting. How do we get this? Well, at this point our models may diverge; this is where we test which tools serve that objective. The important thing here is to learn about transfer learning and building robust models. We follow one example, but we also discuss other approaches you could take.

The two approaches you can take in transfer learning are:

  • Feature extraction
  • Fine tuning

This refers to how you use the layers of your pretrained model. We already have a huge number of parameters because of the number of layers in ResNet50, but we also have calibrated weights. We can choose to ‘freeze’ those layers (as many as you want) so their values don’t change, saving time and computational cost. However, since the new dataset is quite different, it is not a bad idea to train the whole model.

In this case, we ‘freeze’ all layers except for the last block of the ResNet50. The way to do this in Keras is with:

Set some layers as non-trainable
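
One possible way to do this (a sketch; the exact condition in the original code is not shown, but the printed output below is consistent with freezing everything before the conv5 block):

# Freeze every layer that does not belong to the last (conv5) block,
# so only conv5 keeps adjusting its weights on the new dataset.
for layer in res_model.layers:
    if not layer.name.startswith('conv5'):
        layer.trainable = False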

We can check that we did it correctly with:

Print the layers to check which are trainable
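
A short loop like the following prints each layer with its index, name and trainable flag:

for i, layer in enumerate(res_model.layers):
    print(i, layer.name, '-', layer.trainable)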

The output looks something like this (there are more layers, which we omit). False means the layer is ‘frozen’, i.e. not trainable; True means its weights will be adjusted when we train our model.

133 conv4_block6_1_conv - False
134 conv4_block6_1_bn - False
135 conv4_block6_1_relu - False
136 conv4_block6_2_conv - False
137 conv4_block6_2_bn - False
138 conv4_block6_2_relu - False
139 conv4_block6_3_conv - False
140 conv4_block6_3_bn - False
141 conv4_block6_add - False
142 conv4_block6_out - False
143 conv5_block1_1_conv - True
144 conv5_block1_1_bn - True
145 conv5_block1_1_relu - True
146 conv5_block1_2_conv - True
147 conv5_block1_2_bn - True
148 conv5_block1_2_relu - True
149 conv5_block1_0_conv - True
150 conv5_block1_3_conv - True
151 conv5_block1_0_bn - True
152 conv5_block1_3_bn - True
153 conv5_block1_add - True

Later, we need to connect our pretrained model with the new layers of our model. We can use global pooling or a flatten layer to connect the dimensions of the previous layers with the new layers. With just a flatten layer and a dense layer with softmax we can close the model and start making classifications.

model = K.models.Sequential()
model.add(res_model)
model.add(K.layers.Flatten())
model.add(K.layers.Dense(10, activation='softmax'))

The final layers are shown below; you can see the complete code here. We also explain some more aspects that improve the model and help it classify well, presenting the main points taken into account to build it.

The complete model using a Sequential structure. Note that the variable res_model is the pretrained ResNet50
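
The gist itself is not reproduced here; the sketch below is a reconstruction consistent with the model summary shown further down (a Lambda resize to 224 x 224, the partially frozen ResNet50, then alternating batch normalization, dense and dropout layers). The exact dense sizes, dropout rate and activation functions are assumptions.

model = K.models.Sequential()
# Resize the 32x32 CIFAR-10 images to the 224x224 input ResNet50 expects.
# (With TensorFlow 1.x, tf.image.resize_images plays the same role.)
model.add(K.layers.Lambda(lambda x: tf.image.resize(x, (224, 224)),
                          input_shape=(32, 32, 3)))
model.add(res_model)                      # pretrained ResNet50, partially frozen
model.add(K.layers.Flatten())
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(256, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(128, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(64, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(10, activation='softmax'))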

We have regularizers to help us avoid overfitting and optimizers to get faster results. Each of them can also affect our accuracy, so we present what to take into account. The most important are:

  • Batch size: It is recommended to use a batch size that is a power of 2 (8, 16, 32, 64, 128, …) because it fits well with the memory of the computer.
  • Learning rate: For transfer learning a very low learning rate is recommended, because we don’t want to change too much of what was previously learned.
  • Number of layers: This depends on how much you rely on the layers of the pretrained model. We found that if we leave the whole model trainable, just a flatten layer and a dense layer with softmax is enough, but once we switched to feature extraction more layers were required at the end.
  • Optimization methods: We tested SGD and RMSprop. SGD with a very low learning rate required more epochs (30) to complete a reasonable training run. We used RMSprop with 5 epochs to get our result.
  • Regularization methods: To avoid overfitting we used batch normalization and dropout in between the dense layers.
  • Callbacks: In Keras, we can use callbacks in our model to perform certain actions during training, such as saving weights (see the sketch after this list).
This callback saves the weights obtained in the training
We save the model in a file called “cifar10.h5”
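
A sketch tying together the checkpoint callback and the optimizer and learning-rate choices discussed above (the learning rate, batch size and number of epochs here are illustrative, not the exact values used for the results below):

# Save the best weights seen during training to "cifar10.h5".
checkpoint = K.callbacks.ModelCheckpoint('cifar10.h5',
                                         monitor='val_acc',
                                         save_best_only=True)
# A very low learning rate so the pretrained weights are only nudged.
model.compile(optimizer=K.optimizers.RMSprop(lr=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])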

Results

We obtained an accuracy of 94% on the training set and 90% on validation after 10 epochs. By the 8th epoch the values are very similar, and it is interesting to note that in the first epochs the validation accuracy is higher than the training accuracy. This is because of dropout, which in Keras behaves differently during training and testing. At test time all units are active and dropout is turned off, resulting in better accuracy. This readjusts in the last epochs, since the model keeps changing as training continues.

Train on 50000 samples, validate on 10000 samples
Epoch 1/10
50000/50000 [==============================] - 209s 4ms/sample - loss: 2.0074 - acc: 0.3243 - val_loss: 0.7314 - val_acc: 0.8155
Epoch 2/10
50000/50000 [==============================] - 174s 3ms/sample - loss: 1.3455 - acc: 0.5641 - val_loss: 0.5675 - val_acc: 0.8477
Epoch 3/10
50000/50000 [==============================] - 173s 3ms/sample - loss: 1.0404 - acc: 0.6878 - val_loss: 0.5090 - val_acc: 0.8662
Epoch 4/10
50000/50000 [==============================] - 171s 3ms/sample - loss: 0.8473 - acc: 0.7660 - val_loss: 0.4338 - val_acc: 0.8818
Epoch 5/10
50000/50000 [==============================] - 170s 3ms/sample - loss: 0.6959 - acc: 0.8179 - val_loss: 0.4097 - val_acc: 0.8858
Epoch 6/10
50000/50000 [==============================] - 170s 3ms/sample - loss: 0.5788 - acc: 0.8571 - val_loss: 0.3768 - val_acc: 0.8968
Epoch 7/10
50000/50000 [==============================] - 169s 3ms/sample - loss: 0.4806 - acc: 0.8858 - val_loss: 0.3430 - val_acc: 0.9035
Epoch 8/10
50000/50000 [==============================] - 169s 3ms/sample - loss: 0.3988 - acc: 0.9104 - val_loss: 0.3474 - val_acc: 0.9022
Epoch 9/10
50000/50000 [==============================] - 170s 3ms/sample - loss: 0.3345 - acc: 0.9289 - val_loss: 0.3339 - val_acc: 0.9055
Epoch 10/10
50000/50000 [==============================] - 168s 3ms/sample - loss: 0.2858 - acc: 0.9404 - val_loss: 0.3463 - val_acc: 0.9002
Model: "sequential_8"

The summary of the model is below. We found that batch normalization and dropout greatly reduce overfitting and help achieve better accuracy on the validation set. The method of ‘freezing’ layers allows faster computation but hurts accuracy, so it was necessary to add dense layers at the end. The shape of the layers keeps part of the structure of the original ResNet50, as if it were a continuation of it, but with the features we mentioned.

Model: "sequential_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lambda_7 (Lambda) multiple 0
_________________________________________________________________
resnet50 (Model) (None, 1, 1, 2048) 23587712
_________________________________________________________________
flatten_7 (Flatten) multiple 0
_________________________________________________________________
batch_normalization_24 (Batc multiple 401408
_________________________________________________________________
dense_24 (Dense) multiple 25690368
_________________________________________________________________
dropout_17 (Dropout) multiple 0
_________________________________________________________________
batch_normalization_25 (Batc multiple 1024
_________________________________________________________________
dense_25 (Dense) multiple 32896
_________________________________________________________________
dropout_18 (Dropout) multiple 0
_________________________________________________________________
batch_normalization_26 (Batc multiple 512
_________________________________________________________________
dense_26 (Dense) multiple 8256
_________________________________________________________________
dropout_19 (Dropout) multiple 0
_________________________________________________________________
batch_normalization_27 (Batc multiple 256
_________________________________________________________________
dense_27 (Dense) multiple 650
=================================================================
Total params: 49,723,082
Trainable params: 40,909,770
Non-trainable params: 8,813,312
_________________________________________________________________

For ResNet50, what helped most to achieve a high accuracy was resizing the input from 32 x 32 to 224 x 224. This is because of how the model was constructed; in that sense it was not compatible with the dataset, but that was easy to solve by fitting the images to the original size of the architecture. There was the option of using UpSampling to do this task, but we found that using a Keras Lambda layer was much faster.
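
For comparison, the UpSampling alternative would replace the Lambda layer at the start of the Sequential model with something like this (32 x 7 = 224, so a single nearest-neighbour upsampling step reaches the expected size):

# Nearest-neighbour upsampling: each 32x32 image becomes 224x224.
model.add(K.layers.UpSampling2D(size=(7, 7), input_shape=(32, 32, 3)))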

Training and validation accuracy. We can see the behavior of the dropout technique, which adjusts as more epochs pass.
Training and validation loss. There is a point after which more training doesn’t change the results much.

Discussion

We confirmed that ResNet50 works best with input images of 224 x 224. As CIFAR-10 has 32 x 32 images, it was necessary to resize them. With this adjustment alone the model can achieve a high accuracy; I think it was the most important change for ResNet50.

A good recommendation when building a model using transfer learning is to first test optimizers to get a low bias and good results on the training set, and then look for regularizers if you see overfitting on the validation set.

The discussion over whether to freeze layers of the pretrained model continues. Freezing reduces computation time and reduces overfitting, but lowers accuracy. When the new dataset is very different from the dataset used for pretraining, it may be necessary to use more layers for adjustment.

On the selection of hyperparameters, it is important for transfer learning to use a low learning rate to take advantage of the weights of the pretrained model. This choice, like the choice of optimizer (SGD, Adam, RMSprop), will affect the number of epochs needed to get a successfully trained model.

References

https://www.cs.toronto.edu/~kriz/cifar.html
