How to take advantage of previous CNN training and apply transfer learning?

Fernando Pereira dos Santos
Published in birdie.ai
Jun 29, 2021 · 4 min read

In the last topic (see the previous post at https://medium.com/birdie-ai/how-to-use-cnns-as-feature-extractors-54c69c1d4bdf), I presented introductory concepts about Convolutional Neural Networks (CNNs) and the process of feature extraction with these architectures. Feature extraction is a widely used approach when we want to take advantage of the potential of a deep network without having enough examples to train it directly, and it shows that Convolutional Networks have a high descriptive capacity. On the other hand, when we have a substantial amount of training examples, we can use a second resource: network fine-tuning. This approach also reuses a pre-trained network, but adapts the previous training to the current task. We know that the initial layers of the network learn low-level features, such as colors and shapes, which are independent of the dataset used. So, regardless of the target context, these layers provide very similar features, i.e., they are more generic. In contrast, the final layers of the network are more specific: their learning depends entirely on the examples provided during training [Yosinski, 2014]. Therefore, we can refine the previous training of the network for a new task, readjusting its weights, especially in the final layers, as sketched below.
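A minimal sketch of this idea in Keras: one common option is to freeze the early, generic layers and leave only the later, task-specific layers trainable. The cut-off point (here, the last 20 layers) is an illustrative assumption, not a rule from this post.

```python
from tensorflow.keras.applications import ResNet50

# Load a model pre-trained on ImageNet, without its classification head
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Freeze the early, generic layers; keep only the final ones trainable.
# The cut-off (last 20 layers) is a hyperparameter, chosen here for illustration.
for layer in base.layers[:-20]:
    layer.trainable = False
for layer in base.layers[-20:]:
    layer.trainable = True
```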

General structure to fine-tune a Convolutional Neural Network: (a) the original model; (b) we can change only the prediction layer to n classes; (c) we can also modify some layers before the prediction one. From: Santos, F. P., "Features transfer learning between domains for image and video recognition tasks," PhD Thesis, University of São Paulo, 2020.

So, let’s go straight to a practical example. As in the feature extraction procedure, we load the pre-trained model (ResNet50 [He, 2016]) and decide up to which layer we want to keep. For example, we consider everything from the first layer (input) to the penultimate one (-2). The original model was prepared to predict the 1000 classes of its training set (ImageNet [Russakovsky, 2015]). In our example, the dataset contains only 10 classes. Consequently, we modify the final prediction layer, adding a Dense layer with 10 neurons (numberClasses = 10). Note that the activation function chosen was softmax, which transforms the received values into probabilities. Our new architecture is ready; we now need to compile the structure with the training settings. Every CNN requires a loss function and an optimization method, as well as an evaluation metric. For details and the variety of possibilities for these items, I recommend reading the more detailed paper [Ponti, 2017] and/or the documentation of the Keras library (https://keras.io/api/). For our example, the loss function chosen was “categorical_crossentropy” with the “adam” optimizer.
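A sketch of these steps in Keras (the linked notebook is the authoritative version; the variable names here are mine):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

numberClasses = 10  # our target dataset has 10 classes

# Load ResNet50 pre-trained on ImageNet (1000-class head included)
model = ResNet50(weights="imagenet")

# Reuse everything from the input up to the penultimate layer (-2),
# then attach a new 10-neuron softmax prediction layer
penultimate = model.layers[-2].output
predictions = Dense(numberClasses, activation="softmax")(penultimate)
newModel = Model(inputs=model.input, outputs=predictions)

# Compile with the training settings described in the text
newModel.compile(loss="categorical_crossentropy",
                 optimizer="adam",
                 metrics=["accuracy"])
```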

Now we need to retrain the network so that we can test its performance on our test set. To shuffle the training set, we can apply a permutation and then train the network batch by batch. The Keras library offers other training functions, but I chose this one to make the role of the batch more evident. When we train a CNN, we want all images to be passed through the architecture. However, the available RAM may not be sufficient to load all examples at once. The batch size therefore defines how many images are fed to the CNN at a time, and when all images have been presented to the network, one epoch is completed. We repeat this process for “n” epochs (numberEpochs in the code). In the literature, there are several heuristics to determine the ideal batch size, but we will leave that for a future topic.
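A sketch of this training loop, assuming xTrain and yTrain are already loaded as NumPy arrays of preprocessed images and one-hot labels (the batch size and epoch count are illustrative values):

```python
import numpy as np

batchSize = 32       # images fed to the CNN at a time (illustrative value)
numberEpochs = 10    # illustrative number of epochs

for epoch in range(numberEpochs):
    # Shuffle the training set at the start of each epoch
    permutation = np.random.permutation(len(xTrain))
    xTrain, yTrain = xTrain[permutation], yTrain[permutation]

    # Feed the network one batch at a time; once all images
    # have been seen, one epoch is completed
    for start in range(0, len(xTrain), batchSize):
        xBatch = xTrain[start:start + batchSize]
        yBatch = yTrain[start:start + batchSize]
        newModel.train_on_batch(xBatch, yBatch)
```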

Once training has finished, we use the test set to verify the performance achieved. For that, we just call the evaluate function on the trained model. See the full example at https://github.com/fernandopersan/medium/blob/main/CNNFineTuning.ipynb
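A minimal sketch of the evaluation step, assuming xTest and yTest are held-out arrays prepared like the training data:

```python
# Evaluate the fine-tuned model on the held-out test set
loss, accuracy = newModel.evaluate(xTest, yTest, batch_size=batchSize)
print(f"Test loss: {loss:.4f} - Test accuracy: {accuracy:.4f}")
```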

To conclude, network fine-tuning is recommended when we have a good number of examples, but not enough to train the network from scratch. We just need to change the final layers of the structure according to our objective and retrain the network. This concept of reusing previous weights and adapting them to a new task is called transfer learning. This area of study is actively investigated as a way to reduce processing costs and take advantage of previously acquired knowledge.

References:

[He, 2016] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[Ponti, 2017] M. Ponti, L. S. Ribeiro, T. S. Nazare, T. Bui, and J. Collomosse, “Everything you wanted to know about deep learning for computer vision but were afraid to ask,” in 30th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T 2017), 2017, pp. 17–41.

[Russakovsky, 2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.

[Santos, 2020] F. P. Santos, “Features transfer learning between domains for image and video recognition tasks,” PhD Thesis, University of São Paulo, 2020.

[Yosinski, 2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
