Transfer Learning Using ResNet50 and CIFAR-10

Andrew Dabydeen
May 7, 2019

How can we utilize a pre-trained network to help us classify a new dataset?

Transfer Learning

Introduction

We often run into situations where we want to leverage the power of machine learning but don’t have enough data to build an accurate model. This is common with deep learning, where we want to use Convolutional Neural Networks (CNNs) to classify images. CNNs are useful because they treat an image as a matrix of pixel values and learn to capture spatial structures in it, which they use to make a classification. The heavy computation this requires can then be handed off to a graphics processing unit (GPU).

Transfer Learning lets us benefit from a large dataset without having to train a new model from scratch. Some training is still needed, but mainly to adapt the model to our new dataset. The idea is to take a network that has been pre-trained on an image dataset large enough for it to act as a generic model of the visual world. We can then apply this network to the images we want to classify, tweak the model, and train the new architecture to get the classification results we are looking for. There are two ways to use pre-trained networks:

  1. Feature Extraction
  2. Fine-Tuning

The inspiration for both options came from François Chollet, who created strong examples of this idea; his work is referenced at the end of this article.
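The difference between the two options comes down to which pre-trained layers are allowed to change during training. Here is a minimal sketch of that idea in Keras. The helper names and the tiny stand-in network (with its layer names "early" and "late") are hypothetical and used only for illustration; they are not taken from the original code.

```python
from tensorflow.keras import layers, models


def use_as_feature_extractor(base):
    """Feature extraction: freeze every pre-trained layer."""
    base.trainable = False
    return base


def prepare_for_fine_tuning(base, first_trainable_layer):
    """Fine-tuning: unfreeze the layers from `first_trainable_layer` onward."""
    base.trainable = True
    unfreeze = False
    for layer in base.layers:
        if layer.name == first_trainable_layer:
            unfreeze = True
        layer.trainable = unfreeze
    return base


# Tiny stand-in for a pre-trained network (hypothetical layer names).
inputs = layers.Input(shape=(8,))
hidden = layers.Dense(4, name="early")(inputs)
outputs = layers.Dense(4, name="late")(hidden)
toy_base = models.Model(inputs, outputs)

use_as_feature_extractor(toy_base)
frozen_count = len(toy_base.trainable_weights)      # nothing left to train

prepare_for_fine_tuning(toy_base, "late")
fine_tune_count = len(toy_base.trainable_weights)   # kernel and bias of "late"
print(frozen_count, fine_tune_count)
```

With a real pre-trained base, the same two helpers would be applied to the convolutional layers instead of the toy Dense layers.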

This experiment focuses on the first option, feature extraction, and uses ResNet50 pre-trained on ImageNet as our model. Many other pre-trained architectures could have been chosen, such as VGG16, VGG19, or MobileNet, and each has its pros and cons. For example, MobileNet is designed to be small and fast, which suits mobile devices. These models are trained on the ImageNet dataset, which contains 1.4 million labeled images across 1000 classes. This is valuable because our small dataset alone might not contain enough data to learn certain spatial features. The ResNet50 architecture that was trained on ImageNet is shown in Image 1.

Image 1 — Example of ResNet50 Architecture

Dataset

The power of Transfer Learning is best shown on a dataset that the network hasn’t been trained on. CIFAR-10 was chosen for this purpose, and because it contains a large number of images spanning 10 classes (10 possible outcomes).

More information about CIFAR-10 can be found at the following link —

https://www.cs.toronto.edu/~kriz/cifar.html

From the CIFAR-10 documentation, “The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images”, which we can use to train and test our model. A very convenient feature of CIFAR-10 is that there is no need to download the images separately and load them from a directory: we can import the dataset directly from the Keras datasets package. Some pre-processing is still needed, because ResNet50’s default input size is 224x224 pixels and, in this setup, it expects inputs of at least roughly 200x200 pixels, while CIFAR-10 images are only 32x32 pixels. We can either resize the images beforehand or upscale them just before they enter the convolutional layers. Image 2 shows a few examples of the pictures contained in the CIFAR-10 dataset.
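As a rough illustration of this pre-processing, the sketch below one-hot encodes the labels and enlarges 32x32 images by repeating pixels, which is what nearest-neighbour upscaling does. The helper names, the scaling of pixel values to [0, 1], and the factor of 8 (32 to 256 pixels) are assumptions for illustration, not details confirmed by the original code. A random stand-in batch is used so the example runs without downloading anything.

```python
import numpy as np


def load_cifar10():
    """Load CIFAR-10 through Keras (downloads the data on first use)."""
    from tensorflow.keras.datasets import cifar10
    return cifar10.load_data()  # ((x_train, y_train), (x_test, y_test))


def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (N, 1) or (N,) into one-hot rows."""
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[labels]


def upsample_nearest(images, factor):
    """Enlarge (N, H, W, C) images by repeating each pixel `factor` times."""
    return images.repeat(factor, axis=1).repeat(factor, axis=2)


# Stand-in batch with CIFAR-10's image shape; real data would come from
# load_cifar10(). Pixel values are scaled to [0, 1] (an assumption).
batch = np.random.randint(0, 256, size=(4, 32, 32, 3)).astype("float32") / 255.0
enlarged = upsample_nearest(batch, 8)   # 32 * 8 = 256 pixels per side
print(enlarged.shape)                   # (4, 256, 256, 3)
print(one_hot([[3], [0]]).shape)        # (2, 10)
```

Doing the enlargement inside the model instead (with upsampling layers) avoids holding the much larger images in memory.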

Image 2 — Example of images in CIFAR10

Model Architecture

To get the CIFAR-10 dataset to run with ResNet50, we first need to upsample our images 3 times so that they fit the ResNet50 convolutional layers, as mentioned above. Upsampling is simply a way to magnify an image to make it bigger. There are other ways to do this, such as resizing with Keras’s built-in ImageDataGenerator, but for the purposes of running the model upsampling works just as well. Once the images are ready, we pass them through the ResNet50 layers, flatten the output, and pass it to a fully connected network with two hidden layers (one with 128 neurons and the other with 64 neurons). Each hidden layer is preceded by a BatchNormalization layer and followed by a dropout layer with a probability of 0.5. Lastly, a final dense layer with 10 neurons and a softmax activation serves as the output for the 10 classes in CIFAR-10. Softmax gives us a probability for each of the 10 classes, and these probabilities sum to 1.

The architecture was changed from what Chollet originally gave to see whether the classification accuracy would increase. Batch normalization improves the speed, performance, and stability of neural networks, while dropout randomly switches off each neuron with a probability of 0.5 during training so that the model isn’t just memorizing certain patterns. The model summary is shown in Figure 1.
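The architecture described here could be sketched in Keras roughly as follows. Several details are assumptions rather than facts from the article: "upsampling 3 times" is read as three 2x UpSampling2D layers (32 to 256 pixels), the hidden layers use ReLU, and the ResNet50 base is frozen as in feature extraction. The function name `build_classifier` is hypothetical.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50


def build_classifier(weights="imagenet", num_classes=10):
    # Pre-trained convolutional base without ImageNet's 1000-class head.
    conv_base = ResNet50(weights=weights, include_top=False,
                         input_shape=(256, 256, 3))
    conv_base.trainable = False  # assumption: frozen for feature extraction

    inputs = layers.Input(shape=(32, 32, 3))
    x = inputs
    for _ in range(3):                      # 32 -> 64 -> 128 -> 256 pixels
        x = layers.UpSampling2D(size=(2, 2))(x)
    x = conv_base(x)
    x = layers.Flatten()(x)

    # Two hidden layers, each preceded by batch normalization and
    # followed by dropout with probability 0.5.
    for units in (128, 64):
        x = layers.BatchNormalization()(x)
        x = layers.Dense(units, activation="relu")(x)  # ReLU is an assumption
        x = layers.Dropout(0.5)(x)

    # One softmax probability per CIFAR-10 class.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

Calling `build_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same structure with random weights, which is handy for checking shapes with `model.summary()`.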

Figure 1 — Model Summary

Model Training

Since ResNet50 is a large architecture, it’s computationally expensive to train. The ResNet50 outputs for the CIFAR-10 images weren’t computed beforehand; instead, the images were passed through the whole network on every pass, and the model ran for 5 epochs to reach a classification accuracy of about 98%. With a larger dataset, it would be better to run the convolutional base over the entire dataset once and train the classifier on the saved outputs, which speeds up training. Both approaches lead to equivalent results, but the one used here is much slower and more computationally expensive. In exchange, it allows us to use data augmentation during training. Figure 2 shows that we can achieve a relatively high training and validation accuracy after only 5 epochs. Our loss leveled off around 0.05 for both training and validation, while the training and validation accuracy reached around 99%. Training was slow: each epoch took approximately 10 minutes, for a total of about 50 minutes.
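A training run of this kind might look like the sketch below. The optimizer, learning rate, batch size, and validation split are all assumptions chosen for illustration (the article only states that training ran for 5 epochs), and the helper name `compile_and_fit` is hypothetical. To keep the example quick, it is exercised on random data with a tiny stand-in model rather than on ResNet50.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers


def compile_and_fit(model, x_train, y_train, epochs=5, batch_size=32):
    """Compile a softmax classifier and train it end to end."""
    model.compile(
        optimizer=optimizers.RMSprop(learning_rate=2e-5),  # assumed value
        loss="categorical_crossentropy",  # pairs with one-hot labels + softmax
        metrics=["accuracy"],
    )
    return model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
                     validation_split=0.2)  # assumed 80/20 split


# Smoke test: tiny stand-in model and random data, one epoch only.
inputs = layers.Input(shape=(32, 32, 3))
flat = layers.Flatten()(inputs)
outputs = layers.Dense(10, activation="softmax")(flat)
toy_model = models.Model(inputs, outputs)

x = np.random.rand(20, 32, 32, 3).astype("float32")
y = np.eye(10, dtype="float32")[np.random.randint(0, 10, size=20)]
history = compile_and_fit(toy_model, x, y, epochs=1, batch_size=10)
print(sorted(history.history))  # names of the recorded loss/accuracy curves
```

The `history.history` dictionary holds the per-epoch training and validation values that a plot of loss and accuracy curves is drawn from.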

Figure 2 — Training/Validation Loss and Accuracy

After training, the model was evaluated on the test set and achieved an accuracy of approximately 98.70%, as shown in Image 3. In other words, for images like those in the test set, we can expect the network to correctly identify the image approximately 99% of the time.
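To make the accuracy figure concrete, here is a small sketch of how such a score is typically computed from softmax outputs: the predicted class is the one with the highest probability, and accuracy is the fraction of images where that matches the true label. The function name and the toy numbers are made up for illustration.

```python
import numpy as np


def top1_accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose most probable class matches the true class."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))


# Four made-up predictions over three classes; the last one is wrong.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.eye(3)[[0, 1, 2, 1]]
score = top1_accuracy(probs, labels)
print(score)  # 0.75
```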

Image 3 — Test Accuracy for Model

Visualizing Intermediate Activations

After finalizing our model and confirming that the classification accuracy is high enough, we can go inside the CNN and visualize the intermediate activations of the convolutional or pooling layers in ResNet50. Activations are the values a layer outputs for a given input image; high values mark the regions where the layer found the pattern it responds to. As the network’s weights are adjusted during training, these responses change for the images that get passed through. This gives us a way to see what each layer is doing and what it focuses on within an image passed in for classification. The whole network can be thought of as a map of how an image is decomposed into important features, which are then used for classification once they reach the fully connected layers at the end of the model.

As an example, let’s take a picture of an airplane. Since CIFAR-10 images are only 32x32 pixels, we won’t get the most detailed activation visualization. The experiment can be rerun with higher-resolution images to get a better picture of how the activations look for a particular image. When an image is fed into the model, every layer produces its own set of values, so we have several options to choose from. Let’s focus on the first convolutional layer in ResNet50 and look at its 5th channel. Each channel represents a different feature that the network looks for in the image. The result is shown in Figure 3 (right) next to the original image (left). As you can see, the activation image does resemble the airplane.
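One way to extract such an activation, following the general approach in Chollet's examples, is to build a second model that shares the network's input but outputs the first convolutional layer instead of the final prediction. The sketch below is an assumption about how this could be done, not the article's exact code; the helper name is hypothetical, and a tiny stand-in network with 8 channels replaces ResNet50 so the example runs quickly.

```python
import numpy as np
from tensorflow.keras import layers, models


def first_conv_activations(network, images):
    """Return the output of the first Conv2D layer for a batch of images."""
    first_conv = next(layer for layer in network.layers
                      if isinstance(layer, layers.Conv2D))
    activation_model = models.Model(inputs=network.input,
                                    outputs=first_conv.output)
    return activation_model.predict(images, verbose=0)


# Tiny stand-in network; with the real model, `network` would be the
# ResNet50 base and `image` the upsampled airplane picture.
inputs = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
toy_network = models.Model(inputs, outputs)

image = np.random.rand(1, 32, 32, 3).astype("float32")
activations = first_conv_activations(toy_network, image)
fifth_channel = activations[0, :, :, 4]   # channels are 0-indexed
print(fifth_channel.shape)                # (32, 32)
```

The resulting 2D array can be displayed as an image, for example with matplotlib's `matshow`, to produce a picture like the one on the right of Figure 3.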

Figure 3 — Image of Airplane and Image of Activated Neurons

Conclusion

Transfer Learning is a great technique when we’re running an experiment without a large enough dataset. Quite a few models can be leveraged, most of which have been trained on the ImageNet dataset of over 1.4 million images. Such a pre-trained network captures a lot of spatial hierarchies and, as we saw above, does a great job on a dataset that it hasn’t been trained on. We need to tweak the fully connected network at the end of the model, but we end up with high classification accuracy: in this example, we obtained a 98% test accuracy.

Once we have a finalized model, we can open up the Convolutional Neural Network and look at the activations of its different layers. Each layer focuses on different features of an image (edges, eyes, etc.) and can tell us a lot about what the neural network is trying to do.

References

The inspiration for the work above came from François Chollet, who breaks down many of these deep learning concepts on his GitHub. Please see his notebook on Transfer Learning:

Please see his notebook on CNN visualization:

GitHub Repository

For the full analysis and code used to run the model above, please visit the following link —

https://github.com/frlim/data2040_final

Andrew Dabydeen

Data Scientist | Brown University M.S. ’19 | Cornell University B.S. ’16