From ANN to Convolutional Neural Network (CNN): a new architecture for Computer Vision tasks

Salvi Elisa
7 min read · Nov 25, 2022


INTRO AND HISTORY:

In recent years, Computer Vision researchers have begun to use Artificial Neural Networks inspired by human vision to process visual data and to understand how pixels relate to one another. Artificial Neural Networks (ANNs) are mathematical models composed of artificial neurons, based on the biological functioning of the human brain. The human brain consists of networks of interconnected biological neurons. "Neurons" are the basic units of an ANN: statistical functions able to process a huge amount of information, as the human brain does. Biological neurons enable each individual to reason and to perform tasks in parallel, such as recognizing sounds, images and faces, or learning. Neural Networks are a valuable paradigm for solving Machine Learning engineering problems and require advanced hardware to support them.

The first idea of an Artificial Neural Network was developed by the neurophysiologist W. McCulloch and the logician W. Pitts in 1943. In 1958 the psychologist F. Rosenblatt came up with the concept of the "perceptron", an algorithm for learning a binary classifier; in other words, a threshold function that maps a real-valued vector to a single binary value. In 1967 the mathematician A. G. Ivakhnenko and V. G. Lapa published, for the first time, a work about functional networks with many layers. In those years research proceeded slowly, but after the first AI winter the American Institute of Physics established the annual "Neural Networks in Computing" meeting, starting in 1985. However, it was only after the second AI winter that ANNs met expectations, and they remain a hot topic in the Computer Vision research field.

ANN ARCHITECTURE:

In traditional Machine Learning, programmers design feature extractors by hand to create learning algorithms. In contrast, Neural Networks are composed of multiple sequential layers that are fed directly with raw data and can develop on their own the representations needed for pattern recognition. As a result, Neural Networks achieve strong performance in classifying the objects contained in an image. Specifically, ANNs are composed of 3 types of layers, each an assembly of neurons, as shown in the figure below:

  • The input layer: is made of nodes which receive the input vector's values and feed them into the dense, hidden layers.
  • The hidden layers: are fully connected layers, meaning that each node in one layer is connected to all nodes in the next layer via a series of channels. Input values are transmitted forward until they reach the output layer.
  • The output layer: comprises nodes which represent the target classes for the classification task; for each class, a membership probability is predicted (a numerical sketch of this forward pass follows the figure below).
Neural Network structure.
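To make this layered forward flow concrete, here is a minimal sketch in NumPy. The 4-dimensional input, the hidden layer of 8 neurons and the 3 target classes are toy choices for illustration only, and the ReLU and SoftMax functions used here are introduced later in this article:

import numpy as np

# Minimal forward pass of a fully connected network: input -> hidden -> output.
# Weights are random here purely for illustration; training would tune them.
rng = np.random.default_rng(0)

x = rng.random(4)                          # input vector (4 features)
W1, b1 = rng.random((8, 4)), np.zeros(8)   # input layer -> hidden layer
W2, b2 = rng.random((3, 8)), np.zeros(3)   # hidden layer -> output layer

h = np.maximum(0, W1 @ x + b1)             # hidden activations (ReLU)
logits = W2 @ h + b2                       # raw scores, one per class

# SoftMax turns the raw scores into class membership probabilities
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs, probs.sum())                  # three probabilities summing to 1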

CONVOLUTIONAL NEURAL NETWORK:

A "Convolutional Neural Network", or "CNN", is a class of Artificial Neural Network that can process and classify images using multiple layers. CNNs are mainly used to analyze visual imagery; in fact, the first layer takes as input a 3-dimensional volume that corresponds to an RGB tensor. The structure of a CNN can be described as follows:

Convolutional Neural Network structure.
  • The input layer: its role in a CNN is to pass the input images on in a form that is easier to process, without losing the features which are critical for achieving a good prediction. This layer has the same number of neurons as the number of pixels in the image, for each of the RGB channels.
  • The Convolutional layers: are made of a set of weighted matrices (often 32 in early layers), called Filters or Kernels, which are defined by their width, height, and depth. Convolutional layers apply these filters to the original image in order to extract features. The first convolutional layer captures the shape of the edges, the colors, the gradient orientations, and so on; features of this type are called "low-level features". Moving forward through additional layers, increasingly higher-level features are found, so that the model builds a complete understanding of the input images. One of the main disadvantages of CNNs is the interpretability of the extracted features: many features, especially high-level ones, are not interpretable at all, as explained by Molnar in his book, and only some low-level features can be visualized during the training process. The convolutional layers produce a 3-dimensional tensor, composed of one 2-dimensional map per filter. The latter is fed into an elementwise activation function that determines whether a node will fire or not given the input data. In other words, the signal from the previous cell is processed and converted into a suitable form for the next layer, discarding the irrelevant information. An activation function is a non-linear transformation; the most used activation functions are the ReLU, the Sigmoid, the SoftMax and the hyperbolic tangent. In the case of a Convolutional layer, the ReLU, or Rectified Linear Unit, is the most widely used because it converts negative inputs to zero and is formulated as max(0, x). In this way the neurons are not all activated at the same time, so the activations are sparse and the model more efficient (see the sketch after this list).
Sigmoid and hyperbolic tangent formulas: sigmoid(x) = 1 / (1 + e^(-x)); tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
  • The Pooling layer: is a down-sampling operation for bidimensional spatial data. It is important for decreasing the dimensions of the data propagating through the network. This is done by extracting the maximum or the average of an area using a sliding window approach. Moreover, Pooling is crucial for object recognition tasks because it provides spatial invariance: small shifts of an object in the image leave the pooled output largely unchanged (see the sketch after this list).
  • Convolutional and Pooling layers are repeated to extract higher-level features.
  • The Dense layer: is a fully connected layer, meaning that all the neurons in one layer are connected to those in the next layer. In the Dense layer the information is aggregated: the results are flattened before the classification into N groups, so the output is an N-dimensional vector. This is done by performing a matrix-vector multiplication, and the matrix parameters are updated with back-propagation. An important issue is that back-propagation requires a huge amount of RAM, especially during training, because all the intermediate values computed during the forward pass are stored. Finally, the SoftMax activation function is used to convert the output of the neurons into the membership probability of each class. The SoftMax function is a generalization of the Sigmoid function that handles classification tasks, ending up with the probability of membership for each class. It is formulated as:
SoftMax activation function: softmax(z)_i = e^(z_i) / Σ_j e^(z_j), for i = 1, …, N.
  • The Output layer: is a Dense layer with a number of neurons that corresponds to the number of target classes; its output is composed of the probabilities of belonging to each category (the full pipeline is sketched after this list).
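To make the convolution, ReLU and max-pooling steps above concrete, here is a minimal NumPy sketch. The 6x6 toy image and the 3x3 vertical-edge filter are illustrative values, not part of any real network:

import numpy as np

# A tiny single-channel image: left half dark, right half bright
image = np.zeros((6, 6))
image[:, 3:] = 1.0

kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])            # a filter that responds to vertical edges

# Slide the kernel over the image (stride 1, no padding) -> 4x4 feature map
fmap = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        fmap[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

fmap = np.maximum(0, fmap)                 # ReLU: keep only positive responses

# 2x2 max pooling halves each spatial dimension -> 2x2 map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                              # strong responses where the edge lies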
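And here is how the whole pipeline described in this list could be assembled with the Keras API, assuming TensorFlow is installed; the input shape (32, 32, 3), the filter counts and the 10 target classes are illustrative choices, not values prescribed by the article:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),               # RGB input volume
    layers.Conv2D(32, (3, 3), activation="relu"),  # low-level features
    layers.MaxPooling2D((2, 2)),                   # down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),  # higher-level features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # 3-D tensor -> 1-D vector
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # class membership probabilities
])
model.summary()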

CNNs are difficult to apply to high-resolution images, but to face this problem Krizhevsky et al. developed a CNN called "AlexNet" for large-scale visual recognition in 2012, studying GPU optimizations. The dataset behind this challenge, called "ImageNet", contains about 15 million high-resolution images belonging to about 22,000 classes. The AlexNet architecture is composed of eight layers, including convolutional layers, max-pooling layers and fully connected layers characterized by a ReLU activation function, which results in better performance than other functions. This network became popular because it achieved very good accuracy, with a top-5 error of only 15.3%. The researchers explained that using GPUs was crucial for high performance at feasible computational cost.

To sum up, the core idea of a Neural Network is to tune the parameters in order to map the input, an image in the case of a CNN, to an output, such as a label. The number of parameters is usually in the order of millions, so a large amount of data is required as input to achieve good performance. In real-world scenarios, the lack of available data is a crucial issue for Deep Learning algorithms. To face this problem, Image Augmentation is often performed (see my article). With this technique, minor alterations, such as flips, translations, crops or rotations, are applied to existing data. In this way the amount of relevant data is increased and the CNN becomes more tolerant to changes in translation, size and illumination and, as such, more robust.
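As a minimal sketch of Image Augmentation, here is one way to generate such altered copies with Keras' ImageDataGenerator; the transformation ranges below are illustrative, not prescriptive:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotations up to 15 degrees
    width_shift_range=0.1,    # random horizontal translations
    height_shift_range=0.1,   # random vertical translations
    zoom_range=0.1,           # random zooming in and out
    horizontal_flip=True,     # random left-right flips
)
# augmenter.flow(images, labels, batch_size=32) then yields altered copies
# of the training images on the fly while the CNN is being trained.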

An advantage is that CNNs do not need pre-processing techniques, apart from Image Augmentation: they are able to adapt to images during the training process without human intervention. On the other hand, CNNs are often used as supervised learning algorithms, meaning that images are labeled before training. The labeling process, called image annotation, usually requires human intervention. In this way the class of membership is known for every input image, which is fundamental because it enables the model to evaluate its classification errors.

Convolutional Neural Networks are widely used in Computer Vision, especially in the medical field. For instance, CNNs are used for the segmentation of anatomical tissues in images, such as in the early detection of brain tumors. Moreover, CNNs can be applied not only to still images, but also to video processing and volumetric image processing.

BIBLIOGRAPHY

  1. Grace W. Lindsay. Convolutional neural networks as a model of the visual system: Past, present, and future. Journal of Cognitive Neuroscience, 33(10):2017–2031, 2021.
  2. Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
  3. Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
  4. Aleksei Grigorevich Ivakhnenko and Valentin Grigorevich Lapa. Cybernetics and Forecasting Techniques, volume 8. American Elsevier Publishing Company, 1967.
  5. Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. URL: https://christophm.github.io/interpretable-ml-book, 2018.
  6. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
