How to use CNNs as feature extractors?

Fernando Pereira dos Santos
Published in birdie.ai
May 12, 2021

Convolutional Neural Networks, or CNNs, are deep supervised architectures whose main purpose is to classify images into a number of predefined classes. As a brief introduction, let's first cover the concept of supervised networks. Suppose we have images of five different animals and these images are already labeled, i.e., we know what each image contains: cats, dogs, kangaroos, birdies, or spiders. When we supply the data and their respective labels to a CNN, the model learns from the pairs (image, label). Thus, when the network observes a kangaroo image, it has the target label available to guide its learning. Consequently, its learning is supervised by prior knowledge.

And how do Convolutional Networks differ from conventional neural networks? CNNs are networks developed specifically to handle data in the form of images, built around stacks of convolutional layers. These layers evolve the filter operations of classical Image Processing into learnable operations, with all neurons in a layer sharing a common processing mask (the convolution kernel). For a more detailed look at how the layers of a CNN work, I recommend reading the paper [Ponti, 2017].

General structure of a Convolutional Neural Network. In this example, the model indicates that class A is the most suitable for the input image. From: [Santos, 2020].

CNNs are known for providing good performance and high generalizability in classification tasks. The concept of generalization is directly linked to the performance achieved on data not seen by the network. Thus, we say the network generalizes well if it performs well both on known data (used during training) and on unknown data (used for testing). However, for a CNN to learn the concepts intrinsic to a task, a lot of training data is necessary, and we will rarely have a sufficient amount. So an alternative presents itself as a possible solution: using a previously trained CNN as a feature extractor. With this approach, we avoid the need to train the network or adjust its weights.

Among the many existing pre-trained architectures we can use as feature extractors are ResNet50 [He, 2016] and MobileNet [Howard, 2017]. ResNet50 incorporates the concept of residual blocks: for every three convolutional layers, the input of the first is combined with the output of the third. This combination propagates attributes that would otherwise be lost due to the depth of the network, and the architecture is widely used for feature extraction in computer vision. Another widely used network is MobileNet, characterized by much lower complexity than other architectures in exchange for slightly worse performance. Thus, in scenarios with few computational resources, MobileNet can be applied to problems that do not require as much precision in the prediction, as the sketch below illustrates.
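As a quick illustration of that complexity gap, we can compare parameter counts; a minimal sketch assuming the Keras implementations of both networks:

```python
from tensorflow.keras.applications import ResNet50, MobileNet

# Both networks ship with ImageNet weights; MobileNet is far smaller
resnet = ResNet50(weights='imagenet')
mobile = MobileNet(weights='imagenet')

print(f'ResNet50:  {resnet.count_params():,} parameters')   # ~25.6 million
print(f'MobileNet: {mobile.count_params():,} parameters')   # ~4.3 million
```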

When using pre-trained convolutional networks, we have to select which layer will act as the extractor. Looking inside the network, the last layer provides the class probabilities for the input image. If we are using a network pre-trained on the ImageNet dataset [Russakovsky, 2015] (commonly used for Convolutional Networks and containing 1000 classes), the prediction layer (the last one) outputs 1000 values corresponding to the probabilities of the input image belonging to each of the original 1000 classes. Consequently, this layer should not be used as a feature extractor. But then, which layer should we choose? Several studies have observed that the initial layers of the network provide low-level features, comprising information about shapes and colors, and, as the network deepens towards its end, information about texture and semantics is incorporated into the features [Yosinski, 2014]. In this context, it has also been observed that each layer is a combination of the layers prior to it. Thus, the pre-prediction (penultimate) layer is commonly used as a feature extractor.
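To see this concretely, we can inspect the tail of the network; a minimal sketch assuming the Keras implementation of ResNet50:

```python
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights='imagenet')

# The final 'predictions' layer outputs the 1000 ImageNet class
# probabilities; the penultimate 'avg_pool' layer outputs a
# 2048-dimensional feature vector, which is what we want to extract
for layer in model.layers[-3:]:
    print(layer.name, layer.output.shape)
```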

In our practical example, we will adopt ResNet50 as the feature extractor; however, the process is the same regardless of the chosen architecture. The attributes obtained will then form the input of our classifier. Here, we adopt a Support Vector Machine (SVM) to distinguish the classes, but it could be any classifier. Also, I am assuming that all the image preparation was performed beforehand (loading the images and resizing them to match the CNN input).
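For completeness, a minimal sketch of that preparation, assuming Keras utilities and a hypothetical load_images helper that receives a list of image file paths:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.resnet50 import preprocess_input

def load_images(paths):
    # Load each image, resize it to ResNet50's 224x224 input resolution,
    # and apply the network's channel-wise preprocessing
    images = [img_to_array(load_img(p, target_size=(224, 224))) for p in paths]
    return preprocess_input(np.array(images))
```

With this helper, something like train_images = load_images(train_paths) produces the arrays fed to the extractor below.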

In this first code snippet, we load the ResNet50 model. Note that the parameters indicate that we are keeping all layers (include_top=True), using the weights obtained from training on ImageNet (weights='imagenet'), the input format (input_shape=(224, 224, 3)), and the number of classes from the original training (classes=1000). However, we do not want to use the entire network, only the first layer through the penultimate layer. Thus, in the second line, we restrict the model according to the desired output layer. Next, we use the training and test data as input to this resulting model. When applying the predict function, we get the features for these data at the layer set as the output.
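A minimal Keras sketch of this step; train_images and test_images are assumed to be the preprocessed image arrays:

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Load ResNet50 with all layers and the weights learned on ImageNet
base = ResNet50(include_top=True, weights='imagenet',
                input_shape=(224, 224, 3), classes=1000)

# Restrict the model to end at the penultimate layer ('avg_pool'),
# discarding the 1000-class prediction layer
model = Model(inputs=base.input, outputs=base.layers[-2].output)

# predict now returns the features at the chosen output layer
Xtrain = model.predict(train_images)
Xtest = model.predict(test_images)
```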

With the features in hand (Xtrain and Xtest), it only remains to train the SVM classifier and evaluate it according to some classification metric, as sketched below. In this example, accuracy was the chosen metric, achieving 78.8%. I want to emphasize that this is just a didactic example and does not imply an optimal solution. To see the complete code example, visit: https://github.com/fernandopersan/medium/blob/main/CNN_features.ipynb
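A minimal sketch of this classification step with scikit-learn; ytrain and ytest are assumed to be the corresponding label vectors:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train the SVM on the features extracted from the training images
svm = SVC(kernel='linear')
svm.fit(Xtrain, ytrain)

# Evaluate on the test features using accuracy as the metric
ypred = svm.predict(Xtest)
print(f'Accuracy: {accuracy_score(ytest, ypred):.1%}')
```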

To conclude, we observed that when our image dataset is small, training a CNN from scratch is not the best path to follow: the CNN will simply memorize the training set and will not perform well on new examples to be classified. Thus, a good approach is to use a pre-trained CNN as a feature extractor. But how can we adjust a pre-trained network to our own computer vision problem (network fine-tuning)? Ah, that is a topic for a new post. See you next time!

References:

[He, 2016] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[Howard, 2017] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[Ponti, 2017] M. Ponti, L. S. Ribeiro, T. S. Nazare, T. Bui, and J. Collomosse, “Everything you wanted to know about deep learning for computer vision but were afraid to ask,” in 30th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T 2017), 2017, pp. 17–41.

[Russakovsky, 2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.

[Santos, 2020] F. P. Santos, “Features transfer learning between domains for image and video recognition tasks,” Ph.D. Thesis, University of São Paulo, 2020.

[Yosinski, 2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
