Essentials of Convolutional Neural Networks with LeNet, AlexNet, VGG, GoogLeNet, and ResNet
Convolutional Neural Networks, also called ConvNets or simply CNNs, are a family of neural networks most commonly used in computer vision tasks, although their use cases are not limited to vision. Today, CNNs are employed in deep learning research and in practical artificial intelligence applications ranging from autonomous driving to medical imaging.
Before diving into the fundamental glossary and the chronological evolution of CNNs, I will rewind the tape back to the mid-1900s and briefly mention the story of human-like learning attempts in computer science.
A Very Brief History of Neural Networks
The first model to mimic the biological nervous system was developed by Frank Rosenblatt back in 1958. Consisting of a single neuron named the perceptron, it was only capable of separating linearly separable samples in a binary classification problem. The underlying principle was Hebbian learning, which can be summarized as “Cells that fire together, wire together”. This motto means that when two cells have high activations simultaneously, the connection between them (the weight that links them in artificial networks) should be strengthened. However, due to its low learning capacity, the findings on the perceptron did not draw much attention. In 1982, Hopfield published his study on a type of network that serves as a content-addressable memory. It became the first popular recurrent neural network and has also been used for optimization problems. Even though the ideas behind back-propagation were first formulated in the early 1960s, it was reformulated to support Multi-Layer Perceptrons (MLPs) by Rumelhart et al. in 1986. The roughly two decades in between did not really witness any major improvements in the context of deep learning research. While MLPs created greater excitement in the community in the ’80s, there were still many open problems to be solved, such as overfitting, getting stuck in local minima, the lack of hardware capable of bigger computations, and many more. To top it all, the huge success of Support Vector Machines in the ’90s overshadowed the promising ideas of neural networks.
Notable CNN Models
The long-awaited breakthrough came in 1998, when Yann LeCun introduced LeNet and established convolutional neural networks in the literature. Convolution, weight sharing, and subsampling were all new ideas at the time, and this ignited lots of research in computer vision.
However, CNNs need an abundance of data to be trained successfully, and this need was met in 2009 by the ImageNet dataset, which contained more than 3.2 million annotated images. Yet more data required more computational power, and in 2012 AlexNet was trained on two NVIDIA GTX 580 GPUs, demonstrating that GPU training makes large CNNs practical. Other key properties of AlexNet were the dropout layer, which acts as a regularizer during training, and the ReLU nonlinear activation.
The 2014 edition of ILSVRC hosted two very competitive models. The runner-up in the classification task was VGGNet, with its deeper architecture built from stacks of small 3x3 convolutions.
The winner of ILSVRC-2014 was GoogLeNet, developed by a group of computer scientists from Google. Thanks to its branched structure and the auxiliary losses used during training, they managed to design and train a deeper model with a greater learning capacity. Inside its branches, 1x1 convolutions reduce the dimensionality of the feature maps and keep the computation affordable.
A more recent winner of the challenge was ResNet in 2015 (its paper was published in 2016), and it is absolutely worth mentioning for its residual learning mechanism with skip connections. Its authors developed a very smart solution to tackle the vanishing gradient problem: bypassing certain layers with identity connections reduces the number of operations the gradient is subjected to.
These five CNNs are the most fundamental models that changed the paradigm in computer vision research. Today CNNs are utilized in many domains, even on non-pixel data. UNet, YOLO, SSD, and R-CNN are also among significant architectures and pipelines employing convolutional neural networks.
A Glossary of Essential CNN Concepts
Convolution is a very common mathematical operation in signal processing, where it describes linear time-invariant systems, yet it has reached a much wider audience with the advance of CNNs. In 2-D convolutions, a small mask (usually 3x3, rarely bigger than 11x11) slides over the input image and multiplies its elements (the weights) element-wise with the pixels it covers. The sum of these products is passed on to the next layer.
In fully connected networks, every neuron in one layer is connected to every neuron in the next, and the corresponding weights are updated independently. For instance, between two layers with 100 and 64 neurons there are 6400 weights, and the model has to train all of them. In CNNs, however, masks of a fixed size move across the image in two directions (or more, for higher-dimensional convolutions), and the weights are embedded in those masks.
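To make the sliding-mask idea concrete, here is a minimal pure-Python sketch. It assumes stride 1 and no padding, and like most deep learning libraries it does not flip the mask (so strictly speaking it is a cross-correlation). The function name `conv2d`, the toy image, and the vertical-edge mask are my own illustrative choices.

```python
def conv2d(image, mask):
    """Slide `mask` over `image` (stride 1, no padding) and sum the products."""
    k_h, k_w = len(mask), len(mask[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiplication of the mask with the covered pixels.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i + a][j + b] * mask[a][b]
            row.append(total)
        output.append(row)
    return output


# Toy 4x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# A hand-written mask that responds to vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

print(conv2d(image, vertical_edge))  # [[0, 3, 3], [0, 3, 3]]
```

The output is zero where the mask covers a flat region and becomes positive once the mask covers the dark-to-bright transition. In a CNN the mask values are not hand-written like this but learned during training.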
Weight sharing leverages the fact that these masks are looking for common patterns in the image such as edges and corners for low-level motifs or eyes and noses in facial recognition for high-level motifs. Since the same parameters are used for distant locations of the image, the number of trainable parameters is significantly lower compared to fully connected networks.
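The saving from weight sharing can be checked with some quick arithmetic. The sketch below ignores bias terms, and the 28x28 grayscale input with 64 masks is a hypothetical setting I picked for illustration.

```python
def dense_weights(n_in, n_out):
    """Weights between two fully connected layers (biases ignored)."""
    return n_in * n_out


def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of a convolutional layer (biases ignored).

    The count does not depend on the image size, because the same
    masks are reused at every location.
    """
    return kernel_h * kernel_w * in_channels * out_channels


# The example from the text: 100 neurons connected to 64 neurons.
print(dense_weights(100, 64))      # 6400

# Hypothetical comparison on a 28x28 grayscale image:
print(dense_weights(28 * 28, 64))  # 50176 weights for 64 dense neurons
print(conv_weights(3, 3, 1, 64))   # 576 weights for 64 masks of size 3x3
```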
Stride is the step size of the masks (also called kernels or filters) as they move across the input image while performing convolution. A smaller stride means more overlap between the receptive fields of the next layer, and thus more textural information shared among neighboring pixels. A greater stride is used for reducing the image size, which typically happens as the network gets deeper.
Padding is the procedure of adding extra pixels around the border of the image so that the spatial size is preserved, or so that the output size works out to an integer, before the next layer. Padding is usually done with zeros and symmetrically in both height and width. It also lets the masks fit onto the image at the edges.
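The interplay between image size, mask size, stride, and padding follows the standard formula output = floor((input + 2 * padding - kernel) / stride) + 1 along each dimension. Here is a small sketch of it; the function names are mine, and the padding is assumed to be symmetric.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one dimension, with `padding` pixels on each side."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Padding that preserves the size for stride 1 and an odd kernel size."""
    return (kernel_size - 1) // 2


# A 5x5 mask without padding shrinks a 28-pixel side to 24.
print(conv_output_size(28, 5))                       # 24

# A 3x3 mask with one pixel of zero padding preserves the size.
print(conv_output_size(28, 3, padding=same_padding(3)))  # 28

# Stride 2 roughly halves the size.
print(conv_output_size(28, 3, stride=2, padding=1))  # 14
```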
[Animation visualizing convolution, stride, and padding.]
CNN models learn patterns in visual data in a hierarchical manner. In the earlier layers, the masks detect small details and low-level features. As the propagation continues, the masks become able to detect high-level features, but to do so they need to cover a larger area. Instead of creating larger masks, the images are shrunk by subsampling. Average pooling takes the average of a designated area and conveys it to the next layer, whereas max-pooling takes the maximum of that area in order to emphasize the strongest activation.
The receptive field is the set of neurons from which a neuron in a following layer receives information. With 3x3 kernels, each neuron receives information from 9 neurons belonging to the previous layer. If we assume stride = 1, the same neuron collects the information of 25 neurons in the layer two steps back, because each of those 9 neurons covers its own 3x3 window and the overlapping windows together span a 5x5 area.
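The two pooling variants differ only in how they summarize each window. Below is a minimal sketch that assumes non-overlapping square windows (stride equal to the window size); `pool2d`, `mean`, and the toy feature map are illustrative names and values of my own.

```python
def pool2d(feature_map, size=2, reducer=max):
    """Summarize each non-overlapping size x size window with `reducer`."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 1, 9, 8],
    [2, 1, 7, 6],
]

print(pool2d(feature_map, reducer=max))   # [[7, 4], [2, 9]]
print(pool2d(feature_map, reducer=mean))  # [[4.0, 2.0], [1.0, 7.5]]
```

Both variants halve the height and width here, so a 3x3 mask applied afterwards covers twice as much of the original image in each direction.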
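The 9 and 25 in the paragraph above come out of a simple recurrence: each layer widens the receptive field by (kernel size - 1) times the product of the strides of the layers before it. The helper below is a sketch of that arithmetic along one dimension; its name and input format are my own.

```python
def receptive_field(layers):
    """Side length of the receptive field after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between neighboring neurons
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


one_layer = receptive_field([(3, 1)])
two_layers = receptive_field([(3, 1), (3, 1)])

print(one_layer, one_layer ** 2)    # 3 9
print(two_layers, two_layers ** 2)  # 5 25
```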
When deciding on the size of the filters in CNN layers, a trade-off emerges between the ability to pick up details with small filters and the ability to capture the texture with larger filters. Dilated convolutions offer a way to expand the receptive field exponentially through the layers without losing resolution or coverage.
During backpropagation in CNNs and other mainstream neural networks, the gradient flows backward and is multiplied by the layer weights until it reaches the layer that we want to update. On its journey from the output layer of a deep network, it gets multiplied by many weights, most of them very small numbers. As these multiplications accumulate, the gradient gets smaller and smaller and eventually vanishes through underflow. This is a common problem of deep networks and is encountered especially in recurrent neural networks. As a result of the vanishing gradients, learning becomes very slow in the initial layers. ResNet largely mitigated this problem by directing the gradient through skip connections.
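The exponential growth is easy to see with a little arithmetic. A kernel with dilation d behaves like a kernel of size d * (k - 1) + 1 with gaps in it, so each layer widens the receptive field by (k - 1) * d. The sketch below assumes stride 1 everywhere and dilation rates that double per layer; the function name is hypothetical.

```python
def dilated_receptive_field(kernel_size, dilations):
    """Side length of the receptive field for stacked stride-1 layers."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Four ordinary 3x3 layers: the field grows linearly.
print(dilated_receptive_field(3, [1, 1, 1, 1]))  # 9

# Four 3x3 layers with dilations 1, 2, 4, 8: the field grows exponentially.
print(dilated_receptive_field(3, [1, 2, 4, 8]))  # 31
```

Both stacks have the same number of weights, since dilation only spreads the 3x3 mask out and does not add parameters.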
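A toy scalar calculation shows the effect. It is only a sketch: a real network multiplies by Jacobian matrices rather than single numbers, and the value 0.1 for every local derivative is an assumption made for illustration. With a skip connection a block computes y = x + f(x), so its local derivative is 1 + f'(x) instead of f'(x).

```python
def gradient_magnitude(local_derivatives, skip=False):
    """Multiply local derivatives along the backward path (scalar toy model)."""
    gradient = 1.0
    for derivative in local_derivatives:
        # With a skip connection the identity path contributes the extra 1.
        gradient *= (1.0 + derivative) if skip else derivative
    return gradient


# Assumed for illustration: 20 layers, each with a local derivative of 0.1.
local_derivatives = [0.1] * 20

plain = gradient_magnitude(local_derivatives)
residual = gradient_magnitude(local_derivatives, skip=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {residual:.3e}")
```

In the plain stack the gradient shrinks by a factor of ten per layer, while on the residual path it can never fall below what the identity connections carry through.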
The MNIST Dataset:
The MNIST Dataset consists of 60,000 training and 10,000 test images, each containing a handwritten digit that is size-normalized and centered in a fixed-size image. Every image has a label from 0 to 9, indicating the handwritten digit it contains. The images are 28 pixels wide and 28 pixels high. It is a very practical dataset thanks to its small memory footprint and easy preprocessing. Many models are benchmarked on MNIST as the first step of development.
ImageNet is one of the earliest examples of large-scale image datasets for computer vision tasks such as object recognition and image classification. Initially having around 3.2 million images, ImageNet now has more than 14 million images with around 20 thousand distinct labels. ImageNet has been a milestone in computer vision research by providing the data needed to train deep models. Many research groups still benchmark their networks on ImageNet. Moreover, the ImageNet Large Scale Visual Recognition Challenge, or ILSVRC, was organized between 2010 and 2017 in different categories. The models that are explained in the other posts of this blog and mentioned above were significant entries in this competition.
- Rosenblatt, F. (1958). “The perceptron: a probabilistic model for information storage and organization in the brain”. Psychological Review. 65 (6): 386–408.
- Hebb, Donald O. (1949). The Organization of Behavior. New York: Wiley, pg. 62.
- Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. Proceedings of the National Academy of Sciences. 79 (8): 2554–2558.
- Bryson, Arthur E. (1962). “A gradient method for optimizing multi-stage allocation processes”. Proceedings of the Harvard Univ. Symposium on digital computers and their applications, 3–6 April 1961. Cambridge: Harvard University Press. OCLC 498866871.
- Rumelhart, D., Hinton, G. & Williams, R. (1986). “Learning representations by back-propagating errors”. Nature 323, 533–536.
- Cortes, Corinna & Vapnik, Vladimir N. (1995). “Support-vector networks”. Machine Learning. 20 (3): 273–297.
- Lecun, Yann (June 1989). “Generalization and network design strategies”. Technical Report CRG-TR-89–4. Department of Computer Science, University of Toronto.
- Deng, Jia & Dong, Wei & Socher, Richard & Li, Li-Jia & Li, Kai & Li, Fei Fei. (2009). “ImageNet: a Large-Scale Hierarchical Image Database”. IEEE Conference on Computer Vision and Pattern Recognition. 248–255. 10.1109/CVPR.2009.5206848.
- Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). “ImageNet Classification with Deep Convolutional Neural Networks”. Neural Information Processing Systems. 25. 10.1145/3065386.
- Simonyan, Karen & Zisserman, Andrew. (2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv 1409.1556.
- Szegedy, Christian & Liu, Wei & Jia, Yangqing & Sermanet, Pierre & Reed, Scott & Anguelov, Dragomir & Erhan, Dumitru & Vanhoucke, Vincent & Rabinovich, Andrew. (2014). “Going Deeper with Convolutions”.
- He, Kaiming & Zhang, Xiangyu & Ren, Shaoqing & Sun, Jian. (2016). “Deep Residual Learning for Image Recognition”. 770–778. 10.1109/CVPR.2016.90.
- Ronneberger, Olaf & Fischer, Philipp & Brox, Thomas. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation”. LNCS. 9351. 234–241. 10.1007/978–3–319–24574–4_28.
- Redmon, Joseph & Divvala, Santosh & Girshick, Ross & Farhadi, Ali. (2016). “You Only Look Once: Unified, Real-Time Object Detection”. 779–788. 10.1109/CVPR.2016.91.
- Liu, Wei & Anguelov, Dragomir & Erhan, Dumitru & Szegedy, Christian & Reed, Scott & Fu, Cheng-Yang & Berg, Alexander. (2016). “SSD: Single Shot MultiBox Detector”. 9905. 21–37. 10.1007/978–3–319–46448–0_2.
- Girshick, Ross & Donahue, Jeff & Darrell, Trevor & Malik, Jitendra. (2013). “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 10.1109/CVPR.2014.81.
- Abdel-Hamid, Ossama & Deng, Li & Yu, Dong. (2013). “Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition”.
- Yu, Fisher & Koltun, Vladlen. (2015). “Multi-Scale Context Aggregation by Dilated Convolutions”.