Essentials of Convolutional Neural Networks with LeNet, AlexNet, VGG, GoogleNet, and ResNet

mrgrhn · Published in The Startup · Jan 22, 2021

Convolutional Neural Networks (ConvNets, or CNNs for short) are a family of neural networks commonly applied to computer vision tasks, though their use cases are not limited to vision. Today, CNNs are employed in deep learning research and practical artificial intelligence applications ranging from autonomous driving to medical imaging.

Before diving into the fundamental glossary and the chronological evolution of CNNs, I will rewind the tape to the mid-1900s and briefly recount the story of human-like learning attempts in computer science.

A Very Brief History of Neural Networks

The first model to mimic the biological nervous system was developed by Frank Rosenblatt back in 1958. Consisting of a single neuron, the perceptron[1] was only capable of identifying linearly separable samples in a binary classification problem. The underlying principle was Hebbian learning[2], which can be summarized as “Cells that fire together, wire together”. This motto means that when two cells have high activations simultaneously, the connection between them (the weight that links them in artificial networks) should be strengthened. However, due to its low learning capacity, the perceptron did not draw much attention. In 1982, Hopfield published his study[3] on a type of network that serves as a content-addressable memory. Being the first popular recurrent neural network, it was used for optimization problems. Even though back-propagation was first formulated in the early 1960s[4], it was reformulated to support Multi-Layer Perceptrons (MLPs) by Rumelhart et al. in 1986[5]. The two decades in between did not witness any major improvements in the context of deep learning research. While there was renewed excitement in the community thanks to MLPs in the ’80s, many open problems remained, such as overfitting, getting stuck in local minima, and the lack of hardware capable of larger computations. To top it all off, the huge success of Support Vector Machines[6] in the ’90s undermined the promising ideas of neural networks.

Notable CNN Models

The long-awaited breakthrough came from Yann LeCun, who introduced convolutional neural networks to the literature with LeNet[7], first described in 1989 and culminating in LeNet-5 in 1998. Convolution, weight sharing, and subsampling were all new ideas at the time, and they ignited a great deal of research in computer vision.

However, CNNs required an abundance of data to be trained successfully, and this need was met in 2009 by the ImageNet[8] dataset, which contained more than 3.2 million annotated images. Yet more data demanded more computational power, and in 2012 AlexNet[9], trained on two NVIDIA GTX 580 GPUs, brought GPU training to the forefront. Other key features of AlexNet were dropout, a form of regularization applied during training, and the ReLU nonlinear activation.

In 2014, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) hosted two very competitive models. The runner-up was VGGNet[10], which showed that substantial depth could be reached simply by stacking many small 3x3 convolutions.

The winner of ILSVRC-2014 was GoogLeNet[11], developed by a group of computer scientists at Google. Thanks to its branched Inception modules, the 1x1 convolutions it uses to reduce the dimensionality of the feature maps, and auxiliary losses during training, the team managed to design and train a deeper model with greater learning capacity.

A more recent winner of the challenge was ResNet[12] in 2015, and it is absolutely worth mentioning for its residual learning mechanism built on skip connections. Its authors developed a very smart solution to the vanishing gradient problem: bypassing certain layers with identity connections reduces the number of operations the gradient is subjected to.

These five CNNs are the most fundamental models that changed the paradigm in computer vision research. Today CNNs are utilized in many domains, even on non-pixel data. UNet[14], YOLO[15], SSD[16], and R-CNN[17] are also among significant architectures and pipelines employing convolutional neural networks.

A Glossary of Essential CNN Concepts

Convolution:

Convolution is a very common mathematical operation in signal processing, particularly for linear time-invariant systems, but its use has become widespread with the advance of CNNs. In a 2-D convolution, a small mask (usually 3x3, rarely bigger than 11x11) slides over the input image and performs an element-wise multiplication between its elements (the weights) and the pixels it covers. The sum of these products is passed to the next layer.

A visualization of the convolution operation (Animation is taken from[18])
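To make the arithmetic concrete, here is a minimal sketch of the operation in plain NumPy (a naive loop version with no padding and stride 1; the function and variable names are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2-D convolution (technically cross-correlation, as in most
    deep learning frameworks): slide the kernel over the image and sum
    the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1  # output height (no padding, stride 1)
    ow = image.shape[1] - kw + 1  # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # a simple vertical-edge detector
print(conv2d(image, edge_kernel).shape)  # (3, 3)
```

Real frameworks vectorize this heavily, but the inner sum over a sliding window is exactly what the animation shows.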

Weight sharing:

In fully connected networks, all the neurons in consecutive layers are connected to each other, and every weight is updated independently. For instance, between two layers with 100 and 64 neurons, there are 6,400 weights and the model has to train all of them. In CNNs, however, masks of a fixed size slide over the image in two directions (or more, for higher-dimensional convolutions), and the weights are embedded in those masks.

A visualization of the weight sharing (Chart is taken from[19])

Weight sharing leverages the fact that these masks look for patterns that recur across the image, such as edges and corners as low-level motifs, or eyes and noses as high-level motifs in facial recognition. Since the same parameters are reused at distant locations of the image, the number of trainable parameters is significantly lower than in fully connected networks.
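A quick back-of-the-envelope comparison makes the savings concrete (the layer sizes are the ones from the example above; the single input channel is an illustrative assumption):

```python
# Fully connected: every pair of neurons gets its own weight.
fc_weights = 100 * 64           # 6400 trainable weights

# Convolutional: one shared 3x3 mask per output channel, reused at every
# spatial position, so the count is independent of the image size.
conv_weights = 3 * 3 * 1 * 64   # 576 weights for 64 filters on 1 input channel

print(fc_weights, conv_weights)  # 6400 576
```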

Stride:

Stride is the step size of the masks (also called kernels or filters) as they slide over the input image performing convolution. A smaller stride means more overlap between the receptive fields feeding the next layer, and thus more shared contextual information among neighboring pixels. A larger stride reduces the spatial size of the output, which typically happens at the later layers of a CNN.

Padding:

Padding is the procedure of enlarging the input so that the spatial size is preserved through the layer, or so that the output size works out to an integer. Padding is usually done with zeros, symmetrically in both height and width. It also lets the masks fit over pixels at the edges of the image.

A visualization of convolution with stride and padding (Animation is taken from[20])

The animation above ties together stride, padding, and the basic convolution operation.
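Stride and padding together determine the output size: for an input of width n, kernel size k, padding p on each side, and stride s, the output width is floor((n + 2p - k) / s) + 1. A small sketch of that formula (the function name is illustrative):

```python
def conv_output_size(n, k, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, 3))            # 26: 'valid' convolution shrinks the map
print(conv_output_size(28, 3, p=1))       # 28: 'same' padding preserves the size
print(conv_output_size(28, 3, s=2, p=1))  # 14: stride 2 roughly halves the map
```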

Subsampling:

CNN models learn patterns in visual data hierarchically. In the earlier layers, the masks detect small details and low-level features. As propagation continues, the masks become able to detect high-level features, but to do so they need to cover a larger area. Instead of creating larger masks, the feature maps are shrunk by subsampling. Average pooling takes the average of a designated area and conveys it to the next layer, whereas max-pooling takes the maximum, stressing the strongest activation.

A visualization of the max-pooling operation
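For reference, here is a minimal NumPy sketch of non-overlapping max-pooling (a reshape trick; framework implementations also handle strides and padding):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling: keep the largest activation in each
    size x size window, halving (for size=2) the spatial dimensions."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # crop to a multiple of the window size
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 1., 5., 6.],
              [2., 2., 7., 8.]])
print(max_pool2d(x))  # [[4. 2.] [2. 8.]]
```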

Receptive fields:

The receptive field of a neuron is the set of neurons in previous layers that it receives information from. With a 3x3 kernel, each neuron receives information from 9 neurons of the previous layer. Assuming stride = 1, the same neuron collects information from 25 neurons in the layer two steps back. You can see below why this is true:

A visualization of how the receptive field grows across stacked layers
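The growth can be computed layer by layer: each layer with kernel size k adds (k - 1) times the product of all previous strides to the receptive field. A minimal sketch of that recurrence (function and parameter names are illustrative):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, from input to output.
    Returns the receptive field of one output neuron on the input."""
    rf, jump = 1, 1  # field size and input-pixel distance between neighbors
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs see a 5x5 patch
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a stride-2 pool widens it further
```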

Dilated convolutions:

When deciding on the size of the filters in CNN layers, a trade-off emerges between capturing fine details with small filters and capturing broader texture with larger ones. Dilated convolutions offer a method to support the exponential expansion of the receptive field through the layers without losing resolution or coverage[21].

2D convolution using a 3x3 kernel with a dilation rate of 2 and no padding[22]
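The effective kernel size of a dilated convolution with kernel k and dilation rate d is k + (k - 1)(d - 1), so the case pictured above (k = 3, d = 2) covers a 5x5 area with only 9 weights. A tiny sketch:

```python
def effective_kernel_size(k, d):
    """A k x k kernel with dilation d covers k + (k - 1) * (d - 1) pixels per side."""
    return k + (k - 1) * (d - 1)

print(effective_kernel_size(3, 1))  # 3: ordinary convolution
print(effective_kernel_size(3, 2))  # 5: the 3x3, dilation-2 case shown above
print(effective_kernel_size(3, 4))  # 9: the field grows without extra weights
```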

Vanishing gradients:

During backpropagation in CNNs and other mainstream neural networks, the gradient flowing backward is multiplied by the layer weights until it reaches the layer we want to update. On its journey from the output layer of a deep network, it gets multiplied by many weights, most of them very small numbers. As the multiplications accumulate, the gradient gets smaller and eventually vanishes through underflow. This is a common problem of deep networks, encountered especially in recurrent neural networks. As a result of vanishing gradients, learning becomes very slow in the initial layers. ResNet alleviated this problem by routing the gradient through skip connections.
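A toy calculation shows how quickly this happens: multiplying a gradient by a per-layer factor below one shrinks it geometrically (the factor 0.25, roughly the maximum derivative of the sigmoid, is illustrative):

```python
grad = 1.0
for layer in range(50):
    grad *= 0.25  # each layer scales the gradient by a small factor
print(grad)       # ~7.9e-31: effectively zero after 50 layers
```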

The MNIST Dataset:

The MNIST dataset consists of 60,000 training and 10,000 test images, each containing a handwritten digit, size-normalized and centered in a fixed-size image. Every image has a label from 0 to 9 indicating the digit it depicts. Images are 28 pixels wide and 28 pixels high. It is a very practical dataset thanks to its small memory footprint and easy preprocessing, and many models are benchmarked on MNIST as a first step of development.

MNIST digit samples
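For reference, a minimal way to pull the dataset (a sketch assuming PyTorch and torchvision are installed; any framework's built-in loader works similarly):

```python
from torchvision import datasets, transforms

# Downloads the dataset to ./data on first run; each sample is a 1x28x28
# tensor with a label in 0..9.
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())
print(len(train_set), len(test_set))  # 60000 10000
```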

ImageNet:

ImageNet is one of the earliest examples of a large-scale image dataset for computer vision tasks such as object recognition and image classification. Initially containing around 3.2 million images, ImageNet now has more than 14 million images with 20 thousand distinct labels. It has been a milestone in computer vision research, providing the data needed to train deep models, and many research groups still benchmark their networks on it. Moreover, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was organized between 2010 and 2017 in different categories. The models explained in the other posts of this blog and mentioned above were significant entrants in this competition.

References:

  1. Rosenblatt, F. “The perceptron: a probabilistic model for information storage and organization in the brain”. (1958). Psychological Review 65 6: 386–408.
  2. Hebb, Donald O. (1949). The Organization of Behavior. New York: Wiley, pg. 62.
  3. Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. Proceedings of the National Academy of Sciences. 79 (8): 2554–2558.
  4. Bryson, Arthur E. (1962). “A gradient method for optimizing multi-stage allocation processes”. Proceedings of the Harvard Univ. Symposium on digital computers and their applications, 3–6 April 1961. Cambridge: Harvard University Press. OCLC 498866871.
  5. Rumelhart, D., Hinton, G. & Williams, R. (1986). “Learning representations by back-propagating errors”. Nature 323, 533–536.
  6. Cortes, Corinna & Vapnik, Vladimir N. (1995). “Support-vector networks”. Machine Learning. 20 (3): 273–297.
  7. Lecun, Yann (June 1989). “Generalization and network design strategies”. Technical Report CRG-TR-89-4. Department of Computer Science, University of Toronto.
  8. Deng, Jia & Dong, Wei & Socher, Richard & Li, Li-Jia & Li, Kai & Li, Fei Fei. (2009). “ImageNet: a Large-Scale Hierarchical Image Database”. IEEE Conference on Computer Vision and Pattern Recognition. 248–255. 10.1109/CVPR.2009.5206848.
  9. Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). “ImageNet Classification with Deep Convolutional Neural Networks”. Neural Information Processing Systems. 25. 10.1145/3065386.
  10. Simonyan, Karen & Zisserman, Andrew. (2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv 1409.1556.
  11. Szegedy, Christian & Liu, Wei & Jia, Yangqing & Sermanet, Pierre & Reed, Scott & Anguelov, Dragomir & Erhan, Dumitru & Vanhoucke, Vincent & Rabinovich, Andrew. (2014). “Going Deeper with Convolutions”.
  12. He, Kaiming & Zhang, Xiangyu & Ren, Shaoqing & Sun, Jian. (2016). “Deep Residual Learning for Image Recognition”. 770–778. 10.1109/CVPR.2016.90.
  13. http://www.cs.cmu.edu/~10701/slides/Perceptron_Reading_Material.pdf
  14. Ronneberger, Olaf & Fischer, Philipp & Brox, Thomas. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation”. LNCS. 9351. 234–241. 10.1007/978-3-319-24574-4_28.
  15. Redmon, Joseph & Divvala, Santosh & Girshick, Ross & Farhadi, Ali. (2016). “You Only Look Once: Unified, Real-Time Object Detection”. 779–788. 10.1109/CVPR.2016.91.
  16. Liu, Wei & Anguelov, Dragomir & Erhan, Dumitru & Szegedy, Christian & Reed, Scott & Fu, Cheng-Yang & Berg, Alexander. (2016). “SSD: Single Shot MultiBox Detector”. 9905. 21–37. 10.1007/978-3-319-46448-0_2.
  17. Girshick, Ross & Donahue, Jeff & Darrell, Trevor & Malik, Jitendra. (2013). “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 10.1109/CVPR.2014.81.
  18. https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1
  19. Abdel-Hamid, Ossama & Deng, Li & Yu, Dong. (2013). “Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition”.
  20. https://cs231n.github.io/assets/conv-demo/index.html
  21. Yu, Fisher & Koltun, Vladlen. (2015). “Multi-Scale Context Aggregation by Dilated Convolutions”.
  22. https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
