Computer vision techniques with Python

Joangoal · Published in eDreams ODIGEO · Jul 29, 2021

Introduction

Computer vision consists of converting images and videos into signals that can be understood and processed. Some of the biggest and most fundamental tasks in computer vision are object recognition and image classification. Furthermore, Deep Learning has played an important role in machine learning and AI in recent years. It is based on artificial neural networks, which were first proposed in the fifties, but it has grown dramatically since 2010 thanks to the mathematical improvements provided by Geoffrey Hinton, the availability of big data and the use of GPUs.

This project is focused on a multiclass classification model built with convolutional neural networks, a class of Deep Learning model. Convolutional neural networks provide one of the best fits for image processing and natural language processing. Object recognition and image classification can provide great functionality in the near future, facilitating people's lives: recognizing objects to improve productivity in industrial processes, medical analysis, etc.

Artificial Neural Network (ANN)

The densely connected layer, or artificial neural network, analyses the input in order to produce an output for the classification. It is composed of a set of neurons, each responsible for different parts of the object. When the number of activated neurons is higher than the threshold, the image is classified as that object.

Figure 1: Artificial neural network representation. Image extracted from [11].

The hidden layers, of which there can be N layers with M neurons each, are where the responsibility for every part of the image is stored, each neuron with its own weight or influence on the output. After every iteration, the computed value and the expected value are compared. Then backpropagation is applied: the influence of a given neuron on the neurons in the next layer is calculated and its weights are adjusted. This process is repeated over and over, and it is how the network learns the relevant features of an object from the data.
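To make the idea concrete, here is a minimal sketch of a densely connected network declared with Keras; the input size, layer widths and number of classes are illustrative assumptions, not the exact network from the figure:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Two hidden layers of M=64 neurons each; the input is a flattened image.
ann = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')  # one output neuron per class
])

# compile() wires up the loss that compares computed vs expected values
# and the optimizer that adjusts the weights through backpropagation.
ann.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])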

Convolutional Neural Network (CNN)

The convolutional layer extracts features from an image by processing all of its pixels, generating a feature map.

Figure 2: Example of feature map generation from a pixel information. Image extracted from [11].

So basically, it forms a representation of part of an image. A filter is used to build this representation, and the filter size determines how much of the image (how many pixels) is examined at a time. A common filter size is 3, which covers 3 x 3 pixels in height and width.

However, images have height, width and depth. The depth is the number of channels through which the images are analysed: grayscale images have only 1 colour channel, while colour images have 3. For colour images, the depth is determined by the RGB value of each pixel, so the filter dimension becomes 3 x 3 x 3. The filter slides across the entire image building the representation, moving according to a parameter called the stride, which defines how many pixels are skipped between each computation of new representation values.
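In Keras, the filter size and the stride map directly to the kernel_size and strides arguments of a convolutional layer. A minimal sketch with the values discussed above (the number of filters is an illustrative assumption):

from tensorflow.keras.layers import Conv2D

# 32 filters of 3 x 3 (x 3 channels implicitly) sliding over a colour image,
# moving one pixel at a time (stride of 1).
conv = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
              input_shape=(224, 224, 3))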

Figure 3: Example of how convolutional neural filters work. Image extracted from [28].

After all of that, a feature map is created and passed through an activation function, or activation layer. Basically, the activation function determines the output of a node from a set of inputs, for instance an “ON” state (1) or an “OFF” state (0).

Given that images have a certain level of nonlinearity, the activation layer takes the values obtained from the convolutional layers, which are linear, and converts them into a non-linear form. Then the data passes through a pooling layer. The pooling layer compresses the image representation down to the features that really matter. It makes the representation smaller while keeping the relevant features, so the network learns the features that truly characterise the object being analysed. In addition, it helps prevent overfitting, because the goal is to represent the object rather than memorise the entire image.

Several ways of pooling exist; Figure 4 shows max pooling, which is the most frequently used. Basically, it takes the maximum value within each window of pixels.

Figure 4: Max pooling example. Image extracted from [11].

Finally, the last layers are connected to a fully connected layer (an artificial neural network), which needs the data in vector format. Because of that requirement, the last layer of the convolutional part flattens the data: it converts the compressed image representation it receives into a vector representation.
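Putting the pieces together, a minimal sketch of the whole pipeline: convolution, activation, pooling, flattening and the fully connected classifier. The layer counts and sizes are illustrative assumptions:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),  # feature maps + non-linearity
    MaxPooling2D(pool_size=(2, 2)),  # keep the strongest responses, halving the resolution
    Flatten(),                       # compressed image representation -> vector
    Dense(64, activation='relu'),    # fully connected layer
    Dense(2, activation='softmax')   # output classification
])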

Convolutional Neural Network Benchmark

Previous studies have defined a great benchmark using different convolutional deep neural networks (a short sketch of loading several of them with Keras follows the list):

· AlexNet [1] showed that supervised learning with deep convolutional neural networks can achieve incredible results on a huge and complex dataset, and highlighted that the depth of the convolutional neural network is really important.

· VGG [2] (very deep convolutional networks) demonstrated that a network similar to AlexNet but deeper, with 19 convolutional layers instead of 5, beats the previous convolutional networks. Furthermore, VGG also demonstrated that the larger convolution filters used by AlexNet (11x11 or 7x7) can be replaced by a few small 3x3 convolution filters, improving performance while reducing the computational cost.

· GoogLeNet [3] innovates with an asymmetric network composed of modules called Inception modules, and the network is designed to reduce the computational cost and be practical. In fact, in order to reduce the computational cost caused by 3x3 and 5x5 convolutions, a 1x1 convolution is used for dimensionality reduction. Moreover, the Inception module uses max-pooling layers with stride 2 to halve the resolution. Basically, GoogLeNet places those Inception modules in the higher layers, where they reduce the computational cost, so the network can become deeper while keeping traditional convolutional lower layers.

· ResNet [4] solves one of the biggest issues of the VGG network: with deeper convolutional networks the gradient vanishes, and the signal back-propagated through the deeper layers becomes insignificant. The ResNet configuration uses a smart way to propagate information from earlier layers to deeper layers through shortcut connections. Basically, it is composed of 34 convolutional layers, mostly with 3x3 filters and with downsampling performed by convolutions with a stride of 2, and it ends with an average pooling layer and a 1000-way fully connected layer with softmax.

· Xception [5] proposes a convolutional neural network architecture derived from the Inception paradigm. Basically, this new architecture relies on depthwise separable convolution layers, which are organised into an entry flow, a middle flow and an exit flow. The data first goes through the entry flow, then the middle flow and finally the exit flow. The Xception architecture has 36 convolutional layers structured in 14 modules: 4 modules for the entry flow, 1 module repeated 8 times for the middle flow and 2 modules for the exit flow. All of these modules have linear residual connections around them, except the first one in the entry flow and the last one in the exit flow, and batch normalization [6] is applied. In summary, this convolutional neural network computes the convolution for each channel independently and then merges the results to get the output. Therefore, Xception reduces the number of connections between layers, giving fewer parameters.

· MobileNet [7] is built on depthwise separable convolutions, except the first layer, which is a full convolution layer. Each convolutional layer is followed by batch normalization and a ReLU nonlinearity, except the final one, which is followed by a softmax layer for classification. MobileNet is composed of 28 layers counting the depthwise and pointwise convolutions; the pointwise layers are 1x1 convolutions (followed by batch normalization and ReLU) that combine the depthwise outputs before the following layer.

· NASNet [8] is based on a recurrent network [9] that generates convolutional architectures. In neural architecture search, a controller is used to produce the architectural hyperparameters of the networks; the controller recurrent neural network optimizes those hyperparameters, iterating over the network in order to maximize the expected validation accuracy. NASNet is focused on building a scalable architecture, so it is composed of two types of convolutional cells: normal convolutional cells, which return a feature map with the same dimensions as the input, and reduction convolutional cells, which return a feature map whose height and width are reduced by a factor of 2. On top of this architecture, the recurrent network applies the algorithm to optimize it.

· EfficientNet [10] is built on top of NASNet, but it defines three parameters, alpha, beta and rho, to manage the depth, width and resolution of the convolutional network respectively. Thereby, those parameters can tune the network to different requirements, even without a huge computational cost (a large GPU is not needed at all). It also provides different variants for those requirements: for instance, if high accuracy is needed, EfficientNet-B7 with a 600x600 input and 66M parameters can be a great option, whereas if low latency and a smaller model are needed, EfficientNet-B0 with a 224x224 input and 5.3M parameters is an option.
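As a quick reference, most of these architectures ship pretrained in tf.keras.applications (a sketch, assuming TensorFlow 2.3+, where the EfficientNet family became available):

from tensorflow.keras import applications

# ImageNet-pretrained backbones, without their final classification heads.
resnet = applications.ResNet50(weights='imagenet', include_top=False)
xception = applications.Xception(weights='imagenet', include_top=False)
mobilenet = applications.MobileNet(weights='imagenet', include_top=False)
efficientnet = applications.EfficientNetB0(weights='imagenet', include_top=False)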

Keras & TensorFlow 2.0

TensorFlow 2.0 is an end-to-end, open-source machine learning platform for Python, created by the Google Brain team, that facilitates building and deploying machine learning models [23]. It provides four key abilities: 1) great performance for low-level tensor operations on CPU and GPU; 2) computing the gradient of arbitrary differentiable expressions; 3) scaling computation to many devices; 4) exporting models to external runtime programs or devices. Furthermore, it bundles several state-of-the-art algorithms and models that enable the implementation of deep neural networks for image recognition/classification and natural language processing (NLP).

Keras is the high-level API of TensorFlow 2.0, developed for human beings. Keras is an accessible, highly productive interface for solving machine learning problems, focused on modern deep learning. Moreover, it provides great abstractions and building blocks that facilitate the development and shipping of machine learning solutions, enabling high iteration velocity. It allows the entire research community to use all the power and benefits of TensorFlow 2.0.

Computer vision techniques implementation examples

Introduction

We will work through two examples in order to consolidate the knowledge and demonstrate that the same model can be used with different datasets to provide a classification system. The core model is going to be identical, but different datasets will be used. The first dataset is composed of brain tumor MRI images and the second one of dog breed images.

1. Brain tumor detection

Dataset

The dataset [12] is composed of MRI images. These MRI images represent two different classes: the first one a healthy brain and the second one a brain with a tumor.

Data preparation

Initially, the collected data, described in the previous section, must be preprocessed. In this case, the images have been resized to 224 x 224 pixels due to the prerequisites of the convolutional neural network (EfficientNet-B0).

import cv2

# Target width and height required by EfficientNet-B0.
IMG_SIZE = 224

new_image_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))

In order to increase the size of the dataset, a great approach is to apply data augmentation, which consists of duplicating the training images with some degree of variation such as rotation, light exposure, zoom, among others.

In Python, data augmentation can be done with the ImageDataGenerator class:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

It receives as parameters the variations you want to apply, for example:

ImageDataGenerator(width_shift_range=[-200,200])

ImageDataGenerator(height_shift_range=0.5)

ImageDataGenerator(horizontal_flip=True)

ImageDataGenerator(vertical_flip=True)

ImageDataGenerator(rotation_range=90)

ImageDataGenerator(zoom_range=[0.5,1.0])

ImageDataGenerator(brightness_range=[0.2,1.0])
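Several variations can also be combined in a single generator and applied on the fly to the training batches. A minimal sketch, assuming X_train and y_train_onehot are the prepared training arrays (see the split and encoding below):

datagen = ImageDataGenerator(rotation_range=90,
                             horizontal_flip=True,
                             zoom_range=[0.5, 1.0])

# flow() yields batches with random variations applied at training time.
train_iterator = datagen.flow(X_train, y_train_onehot, batch_size=32)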

The dataset is split into training data and test data. In addition, the training data is split again to obtain the validation data. The validation data is used during each iteration/epoch of training to monitor the performance and tune the parameters and hyperparameters of the model.

Figure 5: Basic split for the evaluation protocol.
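A common way to produce this split is scikit-learn's train_test_split applied twice; a sketch, where X and y are the full image and label arrays and the 80/10/10 proportions are an assumption:

from sklearn.model_selection import train_test_split

# First carve out the test set, then split the validation set off the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1)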

Furthermore, the pixel values of the training and test images have been scaled to the RGB range (255 max value), and the ground-truth labels have been one-hot encoded into vector format:

import numpy as np
import tensorflow as tf

# Scale pixel values from [0, 255] to [0, 1].
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255

# One-hot encode the labels.
num_classes = len(np.unique(y_train))
y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes)
y_test_onehot = tf.keras.utils.to_categorical(y_test, num_classes)

Model creation

First of all, the convolutional base that is going to be used is EfficientNet-B0 [10], which, as explained above, builds on NASNet and defines three parameters (alpha, beta, rho) to manage the depth, width and resolution of the convolutional network respectively. For this first iteration, EfficientNet-B0 with a 224x224 input and 5.3M parameters is the chosen option.

from keras_efficientnets import EfficientNetB0

conv_base = EfficientNetB0(weights='imagenet', include_top=False, input_shape=X_train[0].shape)

Then, the model is defined with that convolutional neural network, a flatten layer and finally a fully connected layer to get the output classification.

In every block of the hidden layers of the artificial neural network, there is also an activation layer, a dropout layer and batch normalization. The dropout layer prevents overfitting, and batch normalization normalizes the inputs for the next layer.

Finally, the last dense layer has a softmax activation function in order to classify the image.
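A sketch of how that architecture could be assembled on top of conv_base; the hidden layer width and dropout rate are illustrative assumptions:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization

model = Sequential([
    conv_base,                     # EfficientNet-B0 convolutional base
    Flatten(),                     # image representation -> vector
    Dense(128, activation='relu'), # hidden block
    Dropout(0.3),                  # prevents overfitting
    BatchNormalization(),          # normalizes the inputs for the next layer
    Dense(num_classes, activation='softmax')  # output classification
])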

The following image shows the architecture of the neural network:

Figure 6: Neural network architecture.

Model training

The number of epochs and the optimizer must be defined. An epoch refers to one cycle through the whole training dataset, while the optimizer is responsible for tuning the weights of the neurons in order to achieve the lowest loss. In this case, ten epochs have been chosen, with the Adam algorithm providing an adaptive learning rate for the loss and for back-propagation in gradient descent.

from tensorflow.keras import optimizers

model.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])

NOTE: It is a great strategy to use ModelCheckpoint in order to keep the best-performing version of the model. Imagine you configured too many epochs and at some point you start overfitting. Then, in order to avoid repeating the whole training, you can use the model checkpoint:

# Hypothetical path where the best weights will be stored.
checkpoint_filepath = 'best_model_weights.h5'

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

The batch size is defined as 32, which means the training data is randomly split into groups of 32 images. Each batch goes through the neural network, and the loss and gradient are calculated in order to train the network and its parameters; in every epoch the whole training set is processed in these chunks.
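A sketch of the training call with those settings; X_val and y_val_onehot are assumed to come from the validation split above, with the labels one-hot encoded like the training ones:

history = model.fit(X_train, y_train_onehot,
                    validation_data=(X_val, y_val_onehot),
                    epochs=10,
                    batch_size=32,
                    callbacks=[model_checkpoint_callback])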

The model has reached an accuracy of over 96% on the validation data. The learning curves of the training can be seen in Figure 7, where the accuracy and the loss are shown.

Figure 7: Loss and accuracy from the brain tumor model.

Model evaluation

Finally, the accuracy must be validated with the test data, which has not been involved in training the model. Indeed, this dataset demonstrates how the model would perform on a population of images unrelated to our own dataset. The model achieves the following accuracy:

Figure 8: Final model evaluation of brain tumor detection.

The model achieved 98.44% accuracy for the brain tumor classification. In order to inspect the performance of the model, the confusion matrix can be calculated to check the predictions and the errors the model makes; it compares the predictions with the ground truth of the test dataset.
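A minimal sketch of both steps, assuming scikit-learn for the confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

# Accuracy on the held-out test set.
test_loss, test_acc = model.evaluate(X_test, y_test_onehot)

# Compare the predicted classes against the ground truth of the test set.
y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test, y_pred))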

2. Dog breed classification

Dataset

The dataset [13] consists of 20,580 natural images of dog breeds. The images are classified into 120 categories, i.e. different breeds of dog.

NOTE: The same methodology is applied for data preparation and model creation.

Model training

The training is similar to what we did in the previous example; however, due to the size of the dataset, it takes more time and compute to process the larger number of samples. In this case, fifteen epochs were used in order to achieve the best results.

The model has reached an accuracy of over 86% on the validation data. The learning curves of the training can be seen in Figure 9, where the accuracy and the loss are shown.

Figure 9: Loss and accuracy from the dog breeds study.

Model evaluation

Finally, the accuracy must be validated with the test data that has not been involved in training, as we did in the previous example. The model achieves the following accuracy:

Figure 10: Final model evaluation of dog breed classification.

The model got 86.83% accuracy for the dog breed classification. The confusion matrix would be super interesting here to check the wrong predictions and the confusion between different breeds.

Conclusions

The model can be trained with different datasets, obtaining great results and providing an assistive classification or recognition system, for example in the medical sector. Given the great results (98.44% accuracy in brain tumor detection), it could support doctors in diagnosis and be a trustworthy tool in their daily work. In addition, it could increase their productivity and provide a high level of automation.

Regarding the dog breed classification, it could also be helpful for vets to recognize their patients, since each breed has its own features. However, the accuracy (86.83%) should be improved in order to reach a better classification. There are several approaches to achieve that improvement: increase the dataset with more samples, apply more data augmentation, or even use a hierarchical model architecture.

Next steps

We will load the model and build a Python REST API using Flask, where we can pass an image in the request and get back its classification.

NOTE: The code of the model is located at https://github.com/joangoal8/deep-learning-conv-tutorial

Email: joangoal8@gmail.com

LinkedIn: www.linkedin.com/in/joangomezalvarez

Instagram: @joangoal8

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[3] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357v2, 2016.

[6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[7] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M.Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[8] Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018.

[9] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

[10] M. Tan, Q. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Learning Representations, 2019.

[11] Wikipedia commons. https://commons.wikimedia.org/. Accessed on 2021–06–20

[12] Kaggle.com Brain MRI images for brain tumor detection https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection

[13] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao and Li Fei-Fei. Novel dataset for Fine-Grained Image Categorization. First Workshop on Fine-Grained Visual Categorization (FGVC), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
