An overview of the architectures of convolutional neural networks: History and description of convolutional neural networks (CNN)

Mike Vas · Published in Nerd For Tech · 9 min read · Apr 16, 2021


Photo by Patrick Schneider on Unsplash

History of convolutional neural networks

The use of computers for so-called computer vision (CV) is widely considered to have taken off in 2012, when the AlexNet architecture won the ImageNet competition; however, attempts to use machines to recognize shapes in images have existed for decades.

The fifties and sixties of the twentieth century are considered a turning point in the use of computer vision. In 1959, two neurophysiologists, David Hubel and Torsten Wiesel, published a paper entitled “Receptive fields of single neurons in the cat’s striate cortex”, in which they concluded that visual processing always starts with simple structures such as oriented edges. [4]

Larry Roberts, considered the father of this field, described in his 1963 work “Block World” the possibility of extracting information about the 3D geometry of objects from different 2D views. This work is important because it corresponds to human perception: human beings are able to recognize a shape regardless of its orientation or changes in the light source, and his work rests on the importance of recognizing the edges of objects. After this, around 1970, the first algorithm for visual recognition, the generalized cylinder model, was published by the Artificial Intelligence Laboratory at Stanford University. At the core of this idea is that the world consists of simple shapes and that every object in the world is a combination of these simple shapes. A subsequent algorithm, the pictorial structure model, built on the generalized cylinder model but introduced spring-like connections between parts, and with them the concept of deformability. [5]

Description of convolutional neural networks

Convolutional neural networks are a special type of multilayer neural networks whose primary function is to recognize visual patterns in images with minimal processing. Like the previously described types of neural networks, this type of network also consists of neurons with appropriate weights. [5]

As a reminder, the previously described neural networks take a single input vector that propagates through hidden layers. The characteristic of these networks is that each hidden layer consists of neurons connected to all neurons of the previous layer (fully connected). Within a layer, each neuron is independent and has no connections to other neurons in the same layer. Finally, there is an output layer which, in the case of images, most often represents the classes. [5]

The problem of classification within an object recognition system can be reduced to one of the following two ideas:

Figure 1: Clearly separated feature extractor and classifier

Here “feature extraction” means a parameterized function that can be computed efficiently and enables efficient learning of features. In this first idea, the feature extractor and the classifier are clearly separated.

The second idea is that there is no clear boundary between these elements: everything is a single nonlinear system trained end to end, from raw pixels to the final classification.

The question that arises is how to achieve this. The solution is to combine simple functions to form more complex ones. For example, using functions such as sin(x), cos(x), x³, and exp(x), we can form more complex functions such as sin(exp(log(x³))). This composition of functions is the basis of deep learning. [6]
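To make this concrete, here is a small Python sketch (an illustration added here, not code from the cited tutorial) that builds the composite function above out of its simple pieces:

```python
import math

# A simple building-block function.
def cube(x):
    return x ** 3

# Compose functions right-to-left: compose(f, g)(x) == f(g(x)).
def compose(*funcs):
    def composed(x):
        for f in reversed(funcs):
            x = f(x)
        return x
    return composed

# sin(exp(log(x^3))) built from simple pieces; for x > 0 this simplifies
# to sin(x^3), but the point is the layered, composed structure.
deep_function = compose(math.sin, math.exp, math.log, cube)

print(deep_function(1.2))  # same as math.sin(1.2 ** 3)
```

A deep network works the same way, except that each building block also carries trainable parameters.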

The intuition behind deep learning is easiest to show graphically. In the following illustration, each black rectangle can hold some of the trainable parameters; the composition of these rectangles, each a simple function, forms a complex nonlinear system. The output of each rectangle is the current representation of what is being classified, that is, a set of features. [6]

Figure 2: Deep learning feature extraction

Convolutional neural networks usually take a third-order tensor as input. A tensor is a generalization of vectors and matrices: a zero-order tensor is a single number (a scalar), a first-order tensor is a vector, a second-order tensor is a matrix, and higher-order tensors are multidimensional arrays that are harder to visualize.
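As a quick numpy illustration of tensor order (the shapes below are chosen arbitrarily):

```python
import numpy as np

scalar = np.array(3.0)              # order 0: a single number
vector = np.array([1.0, 2.0, 3.0])  # order 1: shape (3,)
matrix = np.ones((4, 5))            # order 2: shape (4, 5)
image  = np.zeros((32, 32, 3))      # order 3: height x width x 3 color channels

for t in (scalar, vector, matrix, image):
    print(t.ndim, t.shape)
```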

An example of such an input is an image with height H, width W, and 3 color channels (red, green, and blue). The input to this type of network can also be some other multidimensional tensor; the learning process does not change in essence. The input passes through a series of processing steps, each of which is called a layer. A layer can be a convolutional layer, a pooling layer (which reduces the spatial size of the representation), a normalization layer, a fully connected layer, a loss layer, and so on. [7]
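As a rough sketch of how such layers are typically stacked, here is a minimal model written with the Keras API [3]; the input size, number of filters, and layer sizes are illustrative assumptions, not values from this article:

```python
from tensorflow.keras import layers, models

# A minimal CNN for 32x32 RGB images and 10 classes (illustrative values).
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                      # 3rd-order input tensor: H x W x channels
    layers.Conv2D(6, kernel_size=3, activation="relu"),   # convolutional layer + ReLU
    layers.MaxPooling2D(pool_size=2),                     # pooling layer reduces spatial size
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                                     # form a vector for the dense layers
    layers.Dense(64, activation="relu"),                  # fully connected layer
    layers.Dense(10, activation="softmax"),               # output layer: class probabilities
])

model.summary()
```

Each of these layer types is discussed in the sections that follow.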

Convolutional layer

Examples of image representation and feature detectors can be seen in Figure 3.

Figure 3: On the left is an example of a black and white image, in the middle is an example of a feature detector, on the right is the resulting image

In the example, the feature detector (also called a filter or kernel) is 3x3. Filter sizes vary from architecture to architecture; the following chapters will show different filter sizes.

This filter is applied to the image by, for example, taking the first nine pixels of the original image (marked in red), multiplying each covered pixel by its corresponding value in the filter, and summing the products. The result is entered as the pixel value in the resulting image. The stride with which the filter moves across the image determines the size of the resulting image. For example, if we move 1 pixel at a time, the filter fits six times in one row of the initial image, so the resulting image will have 6 pixels in its first row, compared to eight in the original. If the filter is applied to the pixels in the green rectangle, there is one match, so the value of that pixel in the output will reflect it. The procedure is repeated until the whole image has been covered.

The value of the stride depends on the needs; the result is an image of smaller dimensions. With a stride of two, for example, the image will be roughly half as tall and half as wide, which has a positive effect on the consumption of computing resources. The question that arises is: do we lose information? Of course. The number of pixels in the output image is smaller than in the input, so a certain amount of information is lost. However, the purpose of the filter is to find certain characteristics in the original image, so what is lost is probably not what the filter was meant to extract; what remains are exactly the characteristics the filter was supposed to reveal.

It should also be noted that when the filter moves, it may exceed the image boundaries. For example, in Figure 4, if we move the filter with a stride of two, the third column of the filter will not cover any pixels of the original image. This problem is solved in two ways. The first is simply to skip those positions, if losing that information is acceptable; this is usually referred to as valid padding. The other is to expand the original image with the missing pixels (most often zeros), which is usually referred to simply as padding (or same padding).
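The sliding-window operation described above can be sketched in a few lines of numpy; this is a naive illustration with valid padding and an adjustable stride, not an optimized implementation:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Naive 'valid' convolution: the kernel never leaves the image."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Multiply element-wise and sum the products.
            out[i, j] = np.sum(window * kernel)
    return out

image = np.random.randint(0, 2, size=(8, 8))          # an 8x8 black-and-white image
kernel = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])  # detects vertical lines
print(convolve2d(image, kernel, stride=1).shape)      # (6, 6), as in the example above
print(convolve2d(image, kernel, stride=2).shape)      # (3, 3): roughly halved
```

Same padding could be obtained by first surrounding the image with zeros (for example with np.pad) before calling the function.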

If we look at the convolutional layer in Figure 4, we can see that there are several filters, six in this particular example. The existence of several filters follows directly from the previous part of the text: each filter has the task of finding certain characteristics. For example, the first filter looks only for horizontal lines, the second only for vertical ones, and so on. In this way, a large part of the information from the original image is preserved and the loss is reduced, i.e. the representation is optimized to a certain extent, since only the information that matters most for solving the problem is stored.

Figure 4: General architecture of convolutional neural networks

Figure 4 is missing one step: ReLU (rectified linear unit):

Figure 5: ReLU

ReLU is an activation function whose task is to increase nonlinearity. It is introduced because images are nonlinear in nature, especially when there are many overlapping objects in the image. Images contain a large number of nonlinear elements, and the transitions between adjacent pixels are in many cases nonlinear. By applying linear filters, however, there is a possibility that we lose this nonlinearity, i.e. that we create linear transitions that do not correspond to the original image. The goal is therefore to reintroduce as much nonlinearity as possible, that is, to “correct” the introduced linearity. [10]
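In practice, ReLU computes f(x) = max(0, x) element-wise, which a one-line sketch makes clear:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative activations become 0, positive ones pass through.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```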

Max-pooling

In Figure 4, a max-pooling layer can also be observed. This type of layer has the task of reducing the spatial size of the representation. As mentioned above, the convolutional layer actually produces a stack of feature maps (original images to which the appropriate filters have been applied). Increasing the number of filters increases the dimensionality, which increases the number of parameters that need to be adjusted when training the network. The pooling layer helps to control overfitting by reducing the spatial size of the representation, and thus the number of parameters. The input to this layer is most commonly the output of a convolutional layer. Max pooling is used most often, but average pooling has also found its application in a large number of cases. [5]

Pooling involves selecting the appropriate operation and a filter that is usually smaller than the feature map to which it is applied; a commonly used choice is a 2x2 filter with a stride of 2, which halves the height and width of the feature map. Max pooling works like applying the filter from Figure 4: the filter is slid over the feature map with the appropriate stride (and, if necessary, padding of the missing pixels), and the maximum value inside the window is selected and entered into the resulting feature map of this layer. Average pooling works the same way, except that instead of the maximum, the average of the pixels currently covered by the filter is taken.

Figure 6: Example of max and average pooling
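A minimal numpy sketch of 2x2 pooling with a stride of 2, mirroring what Figure 6 shows (for simplicity it assumes the feature map's height and width are divisible by two; the values below are arbitrary):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """2x2 pooling with stride 2; assumes the dimensions divide evenly."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 8, 9, 4],
                 [3, 1, 2, 6]], dtype=float)
print(pool2d(fmap, mode="max"))      # [[6. 5.] [8. 9.]]
print(pool2d(fmap, mode="average"))  # [[3.5  2.5 ] [4.75 5.25]]
```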

The next step is flattening. For everything computed so far to be passed on to the rest of the neural network for training, it has to be turned into a single vector, and that is the purpose of this step. [10]

Figure 7: Forming a vector from a matrix
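In numpy terms, flattening is just a reshape of the stacked feature maps into one vector (the shape below is an arbitrary example):

```python
import numpy as np

pooled = np.random.rand(5, 5, 16)  # e.g. sixteen 5x5 pooled feature maps
flattened = pooled.reshape(-1)     # a single vector of 5 * 5 * 16 = 400 values
print(flattened.shape)             # (400,)
```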

After this, the rest of the network usually matches an MLP: fully connected layers are formed and the appropriate classification is performed.

Dropout

The fully connected layers generally contain the largest number of parameters, and the neurons in these layers therefore build up interdependencies during training that affect the contribution of each individual neuron, which can lead to overfitting. Dropout is a technique intended to help with this problem. The idea is to exclude a certain number of nodes from a given training step, i.e. they do not participate in the forward propagation phase of that step. In this way, the weights are prevented from converging towards the same values. In addition to excluding individual neurons, it is also possible to exclude a complete layer.
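In Keras, for example, dropout is added as a separate layer between the fully connected layers; the rate of 0.5 and the layer sizes below are illustrative values only:

```python
from tensorflow.keras import layers, models

# A classification head with dropout between the fully connected layers.
head = models.Sequential([
    layers.Input(shape=(5, 5, 16)),          # e.g. sixteen pooled 5x5 feature maps
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                     # randomly zeroes 50% of activations during training
    layers.Dense(10, activation="softmax"),
])
```

During inference, dropout is automatically disabled and all neurons participate.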

Part I: Deep neural networks
Coming up next: Image augmentation and transfer learning

Resources

[1] https://www.mathworks.com/discovery/deep-learning.html
[2] https://skymind.com/wiki/multilayer-perceptron
[3] https://keras.io/
[4] https://hackernoon.com/a-brief-history-of-computer-vision-and-convolutional-neural-networks-8fe8aacc79f3
[5] Practical Convolutional Neural Networks, Mohit Sewak, Md. Rezaul Karim, Pradeep Pujari, ISBN 978-1-78839-230-3, Packt Publishing Ltd.
[6] https://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/tutorial_p2_nnets_ranzato_short.pdf
[7] https://cs.nju.edu.cn/wujx/teaching/15_CNN.pdf
[8] https://www.mdpi.com/2079-9292/8/3/292/htm
[9] https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
[10] https://www.superdatascience.com/pages/deep-learning
[11] https://engmrk.com/lenet-5-a-classic-cnn-architecture/
[12] https://medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637
[13] https://medium.com/@pechyonkin/key-deep-learning-architectures-alexnet-30bf607595f1
[14] https://prateekvjoshi.com/2016/04/05/what-is-local-response-normalization-in-convolutional-neural-networks/
[15] https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[16] https://medium.com/coinmonks/paper-review-of-vggnet-1st-runner-up-of-ilsvlc-2014-image-classification-d02355543a11
[17] https://medium.com/coinmonks/paper-review-of-googlenet-inception-v1-winner-of-ilsvlc-2014-image-classification-c2b3565a64e7
[18] https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/
[19] https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000
[20] https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a

