CNN Architectures — AlexNet

The first deep learning paper to show state-of-the-art performance on a real, large-scale computer vision task.

Gary (Chang, Chih-Chun) · Jun 3, 2018

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

At the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012 challenge, AlexNet clearly outperformed the other methods on image classification. After the success of AlexNet, CNNs spread rapidly through the computer vision community. It is worth mentioning that AlexNet popularized several now-standard recipes in deep learning, such as ReLU units and dropout, even though they were first introduced in other papers.

Highlights of the Paper

  • First large-scale use of ReLU
  • Used (local response) normalization layers
  • Dropout 0.5
  • Batch size 128
  • SGD with momentum 0.9 (the training setup is sketched after this list)
  • Learning rate 0.01, reduced by a factor of 10 manually when validation accuracy plateaus
  • L2 weight decay 5e-4
  • 7-CNN ensemble: 18.2% -> 15.4% error
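
A minimal PyTorch sketch of these training settings, assuming a placeholder model and dummy data (this is an illustration of the listed hyperparameters, not the authors' original implementation):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Tiny placeholder model; the actual AlexNet architecture is sketched later in this post.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 227 * 227, 1000))

# SGD with momentum 0.9, learning rate 0.01, and L2 weight decay 5e-4, as listed above.
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

# The paper reduces the learning rate by a factor of 10 manually when validation accuracy
# plateaus; ReduceLROnPlateau is a convenient modern stand-in for that manual schedule.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)

criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch of 128 images (random data for illustration only).
images = torch.randn(128, 3, 227, 227)
labels = torch.randint(0, 1000, (128,))

optimizer.zero_grad()
loss = criterion(net(images), labels)
loss.backward()
optimizer.step()

# After each validation pass: scheduler.step(val_accuracy)
```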

ImageNet Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

The input size of AlexNet is fixed, so the authors downsampled the images to 256×256: each image was first rescaled so that its shorter side had length 256, and then the central 256×256 patch was cropped out of the result.
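
A rough equivalent of this preprocessing using torchvision (not the authors' original pipeline; the file path is a placeholder):

```python
from PIL import Image
from torchvision import transforms

# Resize(256) with a single int rescales the *shorter* side to 256 while keeping the
# aspect ratio; CenterCrop(256) then takes the central 256x256 patch, as described above.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
])

img = Image.open("some_image.jpg")   # placeholder path
img_256 = preprocess(img)            # PIL image of size 256x256
```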

ReLU Non-Linearity

We can see from the picture below that when the magnitude of the input is large, the gradient of the tanh or sigmoid function becomes very small (vanishing gradients), which slows down training.

Figure source: http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=n01

The paper reports that a small CNN trained with ReLUs reached a 25% training error rate on CIFAR-10 about six times faster than an equivalent network with tanh units.

Besides, with tanh or sigmoid units we have to normalize the inputs (e.g. to the range -1 to 1) to keep them from entering the saturation region.
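
A quick NumPy illustration of the saturation problem: for inputs of large magnitude, the tanh and sigmoid derivatives are close to zero, while the ReLU derivative stays at 1 for any positive input (the printed values in the comments are approximate).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 2.0, 5.0, 10.0])

tanh_grad = 1.0 - np.tanh(x) ** 2            # derivative of tanh
sigm_grad = sigmoid(x) * (1.0 - sigmoid(x))  # derivative of sigmoid
relu_grad = (x > 0).astype(float)            # derivative of ReLU (1 for x > 0)

print(tanh_grad)  # ~ [0.79, 0.071, 1.8e-4, 8.2e-9] -> vanishes for large inputs
print(sigm_grad)  # ~ [0.24, 0.105, 6.6e-3, 4.5e-5] -> vanishes as well
print(relu_grad)  # [1. 1. 1. 1.]                   -> does not saturate
```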


Local Response Normalization

Although ReLUs have the desirable property that they do not require input normalization to prevent saturation, AlexNet still applies a normalization step after the ReLU nonlinearity in certain layers.

They found that response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
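
For reference, PyTorch ships a local response normalization layer; with the hyperparameters reported in the paper (n = 5, k = 2, α = 1e-4, β = 0.75) it can be used roughly like this:

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the AlexNet paper: size n=5, k=2, alpha=1e-4, beta=0.75.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

# Applied after the ReLU nonlinearity, e.g. on the 96-channel output of conv1.
x = torch.randn(1, 96, 55, 55)
out = lrn(torch.relu(x))
print(out.shape)  # torch.Size([1, 96, 55, 55]) -- normalization preserves the shape
```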

Overlapping Pooling

In AlexNet, they used overlapping pooling to reduce dimensions. They found that with s = 2 and z = 3, this scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared with the non-overlapping scheme with s = 2 and z = 2, which produces output of equivalent dimensions. They also observed during training that models with overlapping pooling were slightly more difficult to overfit.
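
In PyTorch terms, the two pooling schemes can be compared like this; for a 55×55 feature map (e.g. the conv1 output) both produce a 27×27 output, so the overlapping scheme changes only how the windows overlap, not the output size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)  # e.g. the conv1 feature maps

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # z=3, s=2 (AlexNet)
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)   # z=2, s=2 (traditional)

print(overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
print(non_overlap(x).shape)  # torch.Size([1, 96, 27, 27]) -- same output size
```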

Overall Architecture

Here are the details of the layers in AlexNet.

According to the table, the network has 62.3 million parameters and needs about 1.1 billion computation units in a forward pass. It was trained on GTX 580 GPUs with only 3 GB of memory each, so the network was spread across two GPUs, with half of the neurons (feature maps) on each GPU.
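
A single-GPU PyTorch sketch of the architecture (ignoring the original two-GPU split and its grouped connections, so it is an approximation of the paper's network rather than an exact reproduction):

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Single-branch sketch of AlexNet; the paper splits most layers across two GPUs."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),           # 227x227x3 -> 55x55x96
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                 # -> 27x27x96
            nn.Conv2d(96, 256, kernel_size=5, padding=2),          # -> 27x27x256
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                 # -> 13x13x256
            nn.Conv2d(256, 384, kernel_size=3, padding=1),         # -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),         # -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),         # -> 13x13x256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                 # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNet()
print(sum(p.numel() for p in model.parameters()))  # roughly 62 million parameters
print(model(torch.randn(1, 3, 227, 227)).shape)    # torch.Size([1, 1000])
```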

Training

During training, they used data augmentation and added dropout to the first two fully-connected layers to avoid overfitting.

1. Data Augmentation

First, they extracted random 227×227 patches and their horizontal reflections from the 256×256 images and trained the network on these extracted patches. This increased the size of the training set by a factor of 2048.
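
A rough torchvision sketch of this crop-and-flip augmentation (assuming the images have already been resized to 256×256 as described earlier):

```python
from torchvision import transforms

# Random 227x227 crops plus random horizontal flips of the 256x256 training images.
train_augment = transforms.Compose([
    transforms.RandomCrop(227),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```

At test time the paper averages the network's predictions over ten such patches per image: the four corner crops and the center crop, plus their horizontal reflections.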

Second, they altered the intensities of the RGB channels in training images by performing PCA on the set of RGB pixel values and adding random multiples of the principal components to each image. PCA finds the main directions of variation in the RGB values, so perturbing images along those directions changes the intensity and color of the illumination without changing what the image depicts. Because object identity is invariant to such illumination changes, this augmentation reduced the top-1 error rate by over 1%.

PCA in python using skimage
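
Below is a minimal NumPy/skimage sketch of this "fancy PCA" color augmentation; the standard deviation of 0.1 for the random coefficients follows the paper, while the sample image and the per-image covariance are illustrative choices.

```python
import numpy as np
from skimage import data

img = data.astronaut().astype(np.float64) / 255.0    # stand-in RGB image, values in [0, 1]

# PCA on the set of RGB pixel values (3x3 covariance of the color channels).
pixels = img.reshape(-1, 3)
cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues/eigenvectors of RGB covariance

# Add multiples of the principal components, with magnitudes proportional to the
# eigenvalues times a Gaussian random variable with std 0.1, as in the paper.
alphas = np.random.normal(0.0, 0.1, size=3)
rgb_shift = eigvecs @ (alphas * eigvals)             # one 3-vector added to every pixel

augmented = np.clip(img + rgb_shift, 0.0, 1.0)
print(augmented.shape)                               # same shape as the input image
```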

2. Dropout

In the network, they added dropout (with probability 0.5) to the first two fully-connected layers. Without dropout the network exhibited substantial overfitting, while with dropout roughly twice as many iterations were required to converge.
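
A small NumPy sketch of the dropout behaviour described in the paper: at training time each hidden unit is zeroed with probability 0.5, and at test time all units are kept but their outputs are multiplied by 0.5 (modern frameworks usually apply the equivalent "inverted" scaling at training time instead).

```python
import numpy as np

p = 0.5                                   # dropout probability used in the two FC layers
activations = np.random.rand(4096)        # pretend output of a fully-connected layer

# Training: each neuron's output is set to zero with probability 0.5.
mask = np.random.rand(4096) > p
train_out = activations * mask

# Test: use all neurons but multiply their outputs by 0.5, as described in the paper.
test_out = activations * 0.5
```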

Results

1. Quantitative Results

The table above shows the results of AlexNet on the ILSVRC-2010 test set; AlexNet clearly outperformed the best results achieved by other methods.

The table above gives the error rates on the ILSVRC-2012 validation and test sets. Entries in italics are the best results achieved by others. Models marked with * were pre-trained on ImageNet 2011.

2. Qualitative Results

Right: Eight ILSVRC-2010 test images and the five labels considered most probable by the model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar.

Left: Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector of the test image.

Some passages are taken from the source paper and from the Stanford CS231n lecture slides.

If you like this article and find it useful, please support it with 👏.

If you have any questions, please feel free to let me know!
