An Oversimplified History of Machine Learning: Part 2
Link to Part 1
The VGG architecture is a deeper convolutional neural network (CNN) than AlexNet.
In the diagram above, dimensions are given in the format [H x W x C], where H and W are the height and width of the image in pixels and C is the number of channels, i.e. the number of filters in that layer. The input layer, the original picture, starts at 224 by 224 pixels with the three channels that give it color: R, G, and B. The hundreds of filters applied at each layer create hundreds of channels in the intermediate representations. Each max-pooling layer halves the height and width of the previous layer. Once the network reaches the fully connected dense layers, the last layer produces an array of 1,000 numbers, each between zero and one, which together sum to 1.00. You can think of each number as the probability that the image belongs to the corresponding class; thus, the model produces 1,000 probabilities corresponding to the 1,000 classes.
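The final layer’s property of producing numbers between zero and one that sum to 1.00 comes from the softmax function. Here is a minimal pure-Python sketch (the random logits stand in for the network’s real final-layer scores):

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for a 1,000-class ImageNet-style output.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(1000)]
probs = softmax(logits)

print(len(probs))            # 1000 class probabilities
print(round(sum(probs), 6))  # they sum to 1.0
```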
VGG’s many layers surpassed all previous architectures in the 2014 ImageNet competition (ILSVRC), achieving top-5 accuracy greater than 90% on the image classification task.
With the creation of VGG, data scientists realized that the deeper the network, the greater the accuracy, which led them to create the VGG-19 network. With 19 weight layers (plus interleaved max-pooling layers), the authors built the deepest functioning network of its time, much deeper than AlexNet’s 8 layers.
The greatest innovation introduced by VGG is the equivalent receptive field. Large receptive fields (e.g., AlexNet’s 11x11 convolutions) require many parameters, while a stack of small 3x3 convolutions covering the same receptive field can process the same information with far fewer parameters. It is better to do more with less: more parameters mean more opportunities to overfit.
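We can check the arithmetic behind this. Five stacked 3x3 convolutions (stride 1) cover the same 11x11 receptive field as one 11x11 convolution, with far fewer weights. The channel count below is illustrative, not taken from either paper:

```python
def conv_params(kernel, channels):
    """Weights in one conv layer with `channels` in and out (bias ignored)."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked k x k convolutions at stride 1."""
    return kernel + (kernel - 1) * (n_layers - 1)

C = 96  # illustrative channel count

# One large 11x11 convolution (AlexNet-style) ...
big = conv_params(11, C)
# ... versus five stacked 3x3 convolutions covering the same 11x11 field.
small = 5 * conv_params(3, C)

print(stacked_receptive_field(3, 5))  # 11 -- same receptive field
print(big, small)                     # 1115136 vs 414720
```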
The next neural network in the timeline to break state-of-the-art records was ResNet. This network solved the major problem data scientists faced when deepening VGG-style structures: past a certain depth, accuracy degraded rather than improved. ResNet improves upon the VGG design by adding skip connections. Simply put, skip connections smooth out the loss landscape, which makes it easier for the network to converge to a good solution.
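A minimal sketch of a skip connection, with the conv layers replaced by a stand-in transform and spatial dimensions elided: the block adds its input back to the transformed output, so when the transform is near zero the block passes the input through unchanged, making identity mappings easy to learn.

```python
def transform(x, scale=0.1):
    """Stand-in for a couple of conv layers (weights near zero at init)."""
    return [scale * v for v in x]

def residual_block(x):
    # Skip connection: output = F(x) + x. If F(x) is near zero, the
    # block behaves like an identity mapping.
    fx = transform(x)
    return [f + v for f, v in zip(fx, x)]

x = [1.0, -2.0, 3.0]
print([round(v, 6) for v in residual_block(x)])  # [1.1, -2.2, 3.3]
```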
In addition to skip connections, the authors found that bottlenecking further improved the model. Each block was replaced with a 1x1 convolution followed by a 3x3 convolution followed by another 1x1 convolution, making deeper networks practical. The first 1x1 convolution reduces the number of channels, which in turn reduces the computational work the 3x3 convolution has to do.
If we do the math on the parameters in two plain 3x3 convolutions versus a bottlenecked 1x1-3x3-1x1 block, we can see how the bottleneck reduces the total number of parameters.
Bottlenecking significantly reduced the number of parameters compared to the original amount. Remember, the fewer parameters, the lower the chance the model will overfit.
ResNet ends with global average pooling instead of a flattening layer to prevent overfitting to location-based activations. For example, if a model with a flattening layer were trained on images of dogs that only ever sat on the left side of the frame, a dog on the right side of the frame would not be recognized.
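Global average pooling collapses each channel’s entire spatial grid to a single mean value, so *where* in the grid the activation occurred no longer matters. A tiny sketch with 2x2 channels:

```python
def global_average_pool(feature_maps):
    """Collapse each H x W channel to its mean: [C][H][W] -> [C]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]

# Two 2x2 channels; channel 0 responds only on the left side.
fmap = [
    [[4.0, 0.0],
     [4.0, 0.0]],   # strong response confined to the left column
    [[1.0, 1.0],
     [1.0, 1.0]],
]
print(global_average_pool(fmap))  # [2.0, 1.0] -- location is averaged away
```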
DenseNet utilizes the skip connections introduced by ResNet more aggressively: within each block, every layer connects to every previous layer. These skip connections also differ in how feature maps are combined, through concatenation along the channel axis rather than element-wise addition.
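The difference between the two merge styles is easy to see in miniature (spatial dimensions elided, one value per channel):

```python
def add_skip(a, b):
    """ResNet-style merge: element-wise addition, channel count unchanged."""
    return [x + y for x, y in zip(a, b)]

def concat_skip(a, b):
    """DenseNet-style merge: concatenate along the channel axis."""
    return a + b

# Per-channel features from two layers (4 channels each).
layer1 = [1.0, 2.0, 3.0, 4.0]
layer2 = [5.0, 6.0, 7.0, 8.0]

print(add_skip(layer1, layer2))     # 4 channels -- inputs are summed
print(concat_skip(layer1, layer2))  # 8 channels -- both inputs preserved
```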
DenseNet also uses a global average pooling layer instead of a flattening layer to prevent overfitting to location-based activations, and it adopts ResNet’s bottleneck blocks to further improve its structure. The numerous skip connections allow the model to match ResNet’s accuracy with roughly 3x fewer parameters, easing the computational work during training.
Inception was developed to reduce computational cost while retaining accuracy. Using width (increasing the number of units per layer), Inception was able to reduce the number of parameters in the network by going wider instead of deeper.
Width is measured by the number of parallel paths the network takes before concatenating in the next layer. The 1x1 convolutions placed before the larger convolutions have the same effect as they do in bottleneck layers: by preceding each 3x3 or 5x5 with a 1x1, the number of feature maps those large kernels must process is reduced, which cuts down the number of operations.
It should be noted that because width increases the number of parameters per layer, training a wide model can take longer than training a traditional “narrow and deep” model. For datasets smaller than the intended ImageNet dataset, reducing the number of Inception blocks and filters can help your model run at a reasonable pace. More importantly, the extra parameters make this model prone to overfitting, which Inception counters with a dropout layer near the end.
Looking at the flowchart diagram of the model, you will note that it contains multiple softmax layers, which are usually reserved for the last layer of a model. The reason is that deep architectures inevitably face the ‘vanishing gradients’ problem, the same problem that limits plain VGG-style networks as they grow deeper: gradients lose more information the further back they propagate. GoogLeNet’s authors realized that intermediate layers hold important information in addition to the final layers, so auxiliary classifiers are attached to them and their losses are combined with the final loss during training to improve accuracy, hence the multiple softmax layers. However, the Inception model’s results still lagged behind ResNet and needed to be improved in the Inception-v3 model.
In GoogLeNet’s Inception-v3 model, they take a page from VGG and replace their larger 5x5 convolutions with a series of 3x3 convolutions covering the equivalent receptive field, a change that reduces parameters while still retaining information. They then take the concept one step further and split each 3x3 convolution into a 1x3 and a 3x1 convolution applied in sequence, creating a receptive field equal to 3x3 while reducing operations.
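The savings from that factorization are easy to count. With an illustrative channel count, a 1x3 followed by a 3x1 uses a third fewer weights than a full 3x3:

```python
def conv_params(kh, kw, channels):
    """Weights in a kh x kw convolution with equal in/out channels."""
    return kh * kw * channels * channels

C = 64  # illustrative channel count

square = conv_params(3, 3, C)                           # one full 3x3
factored = conv_params(1, 3, C) + conv_params(3, 1, C)  # 1x3 then 3x1

print(square, factored)  # 36864 vs 24576 -- a third fewer weights
```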
The paper also mentions ‘model regularization via label smoothing,’ which I won’t cover in detail because it involves a lot of Greek symbols, but the idea is that overconfidence in the model’s predictions can cause overfitting and hinder training. They solve this by defining the loss so that the model is less confident in its answers, forcing it to be more adaptable.
Inception-v4 combines inception concatenations and residual connections from ResNet to create a rather complex solution to image classification.
The left diagram shows an Inception-ResNet module. The right diagram shows a downsampling module. These modules have different hyperparameters, paths, and convolutions. These paths hold many parameters similar to previous Inception models, so dropout layers are necessary to prevent overfitting.
The model, in addition to its residual connections, is considerably deeper than Inception-v3, placing its accuracy above ResNet’s.
Although the model proves more powerful than Inception-v3 and ResNet, it has drawbacks beyond its complex mix of addition and concatenation: it must be very specific about its input and output sizes, and feature maps and other hyperparameters cannot be changed carelessly. This makes the model difficult to adjust or build from scratch.
The ResNeXt model introduced the concept of ‘cardinality’, similar to the width of a model, except applied to residual additions rather than concatenations. A ResNeXt block takes in an input and splits it across many parallel bottleneck transforms.
The many paths the input takes allow the number of operations and parameters in a 50-layer ResNeXt model to be nearly equal to those of a 50-layer ResNet model. This means more information is shared across the network without increasing the number of parameters that could cause overfitting. Through their experiments with cardinality, the ResNeXt authors found that increasing cardinality yields significantly better results than increasing depth or width. In addition, ResNeXt retains its residual connections, which let information skip all the convolution paths entirely, so even more information flows across the network while the image is still being filtered.
Even better, the model repeats the same module with increasing numbers of feature maps, so recreating it is clean and easy. With repeating building blocks of the same dimensions, few hyperparameters need to be set.
10. The Future of Neural Network Architecture / NASNet (Jul 2018) - Original Paper
Currently, machine learning engineers have many well-understood tools at their disposal: stacks of 3x3 convolutions, bottleneck layers, concatenations, and dropout, to name a few. But with so many tools, it can be difficult to design a usable architecture that implements each one meaningfully and efficiently. So Google decided to create a model that creates other neural network models: NASNet.
The model starts with a predetermined general structure for different datasets, but what’s inside each cell is determined by the search. The only requirements are that the normal cell must return output the same size as its input, the reduction cell must downsample (reduce the dimensions of) its input, and the network must end with a softmax.
The model constructs each block by taking in two layers, applying an operation from a list of operations to each of them, and then either adding or concatenating the results.
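That sampling step can be sketched in a few lines. The operation names below are illustrative, not the exact NASNet search space, and a real search would score each sampled cell by training it:

```python
import random

# Illustrative operation menu (the real NASNet search space differs).
OPS = ["identity", "3x3_conv", "5x5_conv", "3x3_maxpool", "1x1_conv"]

def sample_block(rng, available_layers):
    """One NASNet-style block: pick two inputs, one op for each,
    and a merge method (add or concatenate)."""
    return {
        "inputs": [rng.choice(available_layers) for _ in range(2)],
        "ops": [rng.choice(OPS) for _ in range(2)],
        "merge": rng.choice(["add", "concat"]),
    }

rng = random.Random(0)
cell = [sample_block(rng, ["prev_cell", "prev_prev_cell"]) for _ in range(5)]
for block in cell:
    print(block)
```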
Since the search initializes randomly, the model needs to learn based on some metric, such as accuracy. Each candidate architecture must be trained and tested on the entire dataset, so the search can take an extremely long time, even on multiple GPUs. Once it learns what works and what doesn’t, it outputs a state-of-the-art architecture.
The model shown was created after four days of search on 500 Nvidia P100 GPUs. And the best part: the result not only works, it beats state-of-the-art architectures, reaching accuracy as high as 96% on ImageNet, surpassing all previous models.
- Dekhtiar, Jonathan. “Why Convolutions Always Use Odd-Numbers as filter_size.” Data Science Stack Exchange, datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size.
- Dertat, Arden. “Applied Deep Learning — Part 4: Convolutional Neural Networks.” Medium, Towards Data Science, 13 Nov. 2017, towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2#9a7a.
- Despois, Julien. “Memorizing Is Not Learning! — 6 Tricks to Prevent Overfitting in Machine Learning.” Hackernoon, 20 Mar. 2018, hackernoon.com/memorizing-is-not-learning-6-tricks-to-prevent-overfitting-in-machine-learning-820b091dc42.
- He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” arXiv:1512.03385, Cornell University. 10 Dec. 2015, https://arxiv.org/abs/1512.03385
- Huang, Gao, et al. “Densely Connected Convolutional Networks.” arXiv:1608.06993, Cornell University. 25 Aug. 2016, https://arxiv.org/abs/1608.06993
- Krizhevsky, Alex, et al. “ImageNet Classification with Deep Convolutional Neural Networks.” 2012 NIPS Proceedings Beta, https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
- Simonyan, Karen, et al. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv:1409.1556, Cornell University. 4 Sep. 2014, https://arxiv.org/abs/1409.1556
- Szegedy, Christian, et al. “Going Deeper with Convolutions.” arXiv:1409.4842, Cornell University. 17 Sep. 2014, https://arxiv.org/abs/1409.4842
- Szegedy, Christian, et al. “Rethinking the Inception Architecture for Computer Vision.” arXiv:1512.00567, Cornell University. 2 Dec. 2015, https://arxiv.org/abs/1512.00567
- Szegedy, Christian, et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.” arXiv:1602.07261, Cornell University. https://arxiv.org/abs/1602.07261
- Xie, Saining, et al. “Aggregated Residual Transformations for Deep Neural Networks.” arXiv:1611.05431, Cornell University. https://arxiv.org/abs/1611.05431
- Zoph, Barret, et al. “Learning Transferable Architectures for Scalable Image Recognition.” arXiv:1707.07012, Cornell University. https://arxiv.org/abs/1707.07012