# Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification)

In this story, **AlexNet** and **CaffeNet** are reviewed. AlexNet is the **winner of ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2012**, which is an image classification competition.

This is a **2012 NIPS** paper from Prof. Hinton’s group with **about 28000 citations when I was writing this story.** It is an **essential breakthrough in deep learning that substantially reduced the error rate in ILSVRC 2012**. Thus, this is a must-read paper!! (SH Tsang @ Medium)

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 150,000 testing images.

A. For **AlexNet**, we will cover:

1. **Architecture**
2. **ReLU (Rectified Linear Unit)**
3. **Multiple GPUs**
4. **Local Response Normalization**
5. **Overlapping Pooling**
6. **Data Augmentation**
7. **Dropout**
8. **Other Details of Learning Parameters**
9. **Results**

B. **CaffeNet** is just a **single-GPU version of AlexNet**. Since people normally have only one GPU, CaffeNet is a single-GPU network that simulates AlexNet. We will cover it as well at the end of this story.

By going through each component, we can see how important each one is. Some of them are not so useful by now, but they did inspire the invention of other networks.

### A. AlexNet

#### 1. Architecture

AlexNet contains **eight layers**:

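Input: 224×224×3 input images (as stated in the paper; note that an unpadded 11×11 convolution with stride 4 only yields 55×55 from a 227×227 input, since (227 − 11)/4 + 1 = 55)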

**1st: Convolutional Layer: 96 kernels of size 11×11×3 (stride: 4, pad: 0)**

55×55×96 feature maps

Then

**3×3 Overlapping Max Pooling (stride: 2)**

27×27×96 feature maps

Then

**Local Response Normalization**

27×27×96 feature maps

**2nd: Convolutional Layer: 256 kernels of size 5×5×48 (stride: 1, pad: 2)**

27×27×256 feature maps

Then

**3×3 Overlapping Max Pooling (stride: 2)**

13×13×256 feature maps

Then

**Local Response Normalization**

13×13×256 feature maps

**3rd: Convolutional Layer: 384 kernels of size 3×3×256 (stride: 1, pad: 1)**

13×13×384 feature maps

**4th: Convolutional Layer: 384 kernels of size 3×3×192 (stride: 1, pad: 1)**

13×13×384 feature maps

**5th: Convolutional Layer: 256 kernels of size 3×3×192 (stride: 1, pad: 1)**

13×13×256 feature maps

Then

**3×3 Overlapping Max Pooling (stride: 2)**

6×6×256 feature maps

**6th: Fully Connected (Dense) Layer** of 4096 neurons

**7th: Fully Connected (Dense) Layer** of 4096 neurons

**8th: Fully Connected (Dense) Layer** of 1000 output neurons (since there are 1000 classes)

**Softmax** is used for calculating the loss.

In total, there are about 60 million parameters to be trained!!!
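To check the numbers above, here is a minimal Python sketch that traces the feature map sizes and counts weights plus biases, using the kernel sizes listed in this section. Two things are assumptions of the sketch rather than statements from the paper: the helper names (`conv_out`, `trace`) are made up for illustration, and the input size is set to 227 rather than 224, because (224 − 11)/4 + 1 is not an integer while 227 gives exactly 55.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernels, kernel size, kernel depth, stride, pad, followed by 3x3/2 pooling?)
# Kernel depths (3, 48, 256, 192, 192) are copied from the layer list above.
CONV_LAYERS = [
    ("conv1", 96, 11, 3, 4, 0, True),
    ("conv2", 256, 5, 48, 1, 2, True),
    ("conv3", 384, 3, 256, 1, 1, False),
    ("conv4", 384, 3, 192, 1, 1, False),
    ("conv5", 256, 3, 192, 1, 1, True),
]
FC_LAYERS = [("fc6", 4096), ("fc7", 4096), ("fc8", 1000)]

def trace(input_size=227):  # 227 is an assumption, see the text above
    """Return ({layer: output shape after pooling}, total parameter count)."""
    size, channels, total_params = input_size, 3, 0
    shapes = {}
    for name, n_kernels, k, depth, stride, pad, pooled in CONV_LAYERS:
        size = conv_out(size, k, stride, pad)
        if pooled:
            size = conv_out(size, 3, 2)  # 3x3 overlapping max pooling, stride 2
        channels = n_kernels
        shapes[name] = (size, size, channels)
        total_params += n_kernels * (k * k * depth) + n_kernels  # weights + biases
    fan_in = size * size * channels  # 6 * 6 * 256 = 9216
    for name, n_out in FC_LAYERS:
        total_params += fan_in * n_out + n_out
        shapes[name] = (n_out,)
        fan_in = n_out
    return shapes, total_params

shapes, total = trace()
print(shapes["conv1"], shapes["conv2"], shapes["conv5"])
print(total)  # 60965224, i.e. the "about 60 million" quoted above
```

Local response normalization does not change the shape and has no learned parameters, so it is left out of the trace. Most of the parameters sit in the first fully connected layer (9216 × 4096 weights).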

#### 2. ReLU

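Before AlexNet, Tanh was commonly used. **AlexNet uses ReLU instead**, following earlier work by Nair and Hinton.

In the paper's experiment, a four-layer CNN with **ReLU is six times faster than the same network with Tanh** at reaching a 25% training error rate on CIFAR-10.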
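A common explanation for the speed-up is that Tanh saturates: its gradient shrinks toward zero for large inputs, while the gradient of ReLU stays at 1 for any positive input. The small sketch below only illustrates that difference in gradients; it is not the paper's experiment, and the function names are made up for this example.

```python
import math

def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```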

#### 3. Multiple GPUs

At that time, the NVIDIA GTX 580 GPU was used, which has only 3GB of memory. Thus, we can see in the architecture that the network is split into two paths, with the convolutions spread over 2 GPUs. The GPUs communicate only at certain layers: the third convolutional layer takes input from both paths (as do the fully connected layers), while the second, fourth and fifth convolutional layers only see the kernel maps on their own GPU.

**Thus, 2 GPUs are used because of the memory limitation, NOT to speed up the training process.**

Compared with **a net with only half as many kernels in each convolutional layer** (only one path, trained on one GPU), the whole network **reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively.**
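The two-path split explains the kernel depths in the architecture list (48, 256, 192, 192). As a rough sketch, we can model a layer that stays on its own GPU as a convolution with 2 "groups", and a layer that reads from both GPUs as having 1 group. The `groups` notion is borrowed from how single-machine frameworks usually express such a split; it is an assumption of this sketch, not the paper's terminology.

```python
def kernel_depth(in_channels, groups):
    """Number of input channels each kernel actually sees."""
    if in_channels % groups != 0:
        raise ValueError("in_channels must be divisible by groups")
    return in_channels // groups

# (layer, total input channels, groups)
# groups=2: each GPU only reads the kernel maps that live on the same GPU.
# groups=1: the layer reads all kernel maps (the input image, or both GPUs).
LAYERS = [
    ("conv1", 3, 1),
    ("conv2", 96, 2),
    ("conv3", 256, 1),
    ("conv4", 384, 2),
    ("conv5", 384, 2),
]

depths = {name: kernel_depth(channels, groups) for name, channels, groups in LAYERS}
print(depths)  # {'conv1': 3, 'conv2': 48, 'conv3': 256, 'conv4': 192, 'conv5': 192}
```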

#### 4. Local Response Normalization

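**In AlexNet, local response normalization is used**. It is different from batch normalization: local response normalization divides each activation by a term computed from the neighbouring kernel maps at the same spatial position, whereas batch normalization normalizes with statistics computed over the mini-batch. Normalization helps to speed up convergence. Nowadays, batch normalization is used instead of local response normalization.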

**With local response normalization, Top-1 and top-5 error rates are reduced by 1.4% and 1.2% respectively.**
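For reference, the normalization in the paper is

$$b^i_{x,y} = a^i_{x,y} \Big/ \Big(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2\Big)^{\beta}$$

where $a^i_{x,y}$ is the activation of kernel $i$ at position $(x, y)$ and $N$ is the number of kernels in the layer. The hyper-parameter values $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$ come from the paper, not from this story. Below is a minimal pure-Python sketch for a single spatial position; the function name is made up for illustration.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activations of N kernel maps at one spatial position.

    a[i] is the activation of kernel i; each one is divided by a term built
    from the squared activations of its n neighbouring kernel maps.
    """
    num_kernels = len(a)
    out = []
    for i in range(num_kernels):
        lo = max(0, i - n // 2)
        hi = min(num_kernels - 1, i + n // 2)
        squared_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * squared_sum) ** beta)
    return out

print(local_response_norm([1.0, 50.0, 2.0, 0.0, 3.0, 1.0]))
```

Note that nothing here depends on the other images in the batch, which is the key difference from batch normalization.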

#### 5. Overlapping Pooling

Overlapping pooling is pooling with a stride smaller than the kernel size, while non-overlapping pooling is pooling with a stride equal to or larger than the kernel size. AlexNet uses 3×3 windows with a stride of 2, so neighbouring windows share one row or column.

**With overlapping pooling, Top-1 and top-5 error rates are reduced by 0.4% and 0.3% respectively.**
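To make the difference concrete, here is a small 1-D sketch comparing a window of 3 with stride 2 (overlapping, as in AlexNet) against a window of 2 with stride 2 (non-overlapping). Both give the same number of outputs, but the overlapping windows see more of the input. The function and the input values are made up for illustration.

```python
def max_pool_1d(values, size, stride):
    """Max pooling over a 1-D list, without padding."""
    last_start = len(values) - size
    return [max(values[i:i + size]) for i in range(0, last_start + 1, stride)]

signal = [5, 1, 1, 1, 9, 1, 1]

print(max_pool_1d(signal, size=3, stride=2))  # overlapping:     [5, 9, 9]
print(max_pool_1d(signal, size=2, stride=2))  # non-overlapping: [5, 1, 9]
```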

#### 6. Data Augmentation

Two forms of data augmentation are used.

**First: Image translation and horizontal reflection (mirroring)**

A random 224×224 patch is extracted from a 256×256 image, together with its horizontal reflection. The size of the training set is increased by a factor of 2048. This can be calculated as follows:

By image translation: (256 − 224)² = 32² = 1024

By horizontal reflection: 1024 × 2 = 2048

At test time, the four corner patches plus the centre patch, as well as their horizontal reflections (10 patches in total), are used for prediction, and the 10 results are averaged to obtain the final classification result.
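The cropping scheme can be sketched in plain Python as below. Images are represented as nested lists of rows purely for illustration, and all function names are hypothetical. One detail: offsets 0 through 32 are all valid crop positions, so the random crop below samples from 33 positions per axis; the factor of 2048 follows the paper, which counts 32 per axis.

```python
import random

def random_crop_params(rng, image_size=256, crop_size=224):
    """Training: pick a random top-left corner and a random horizontal flip."""
    max_offset = image_size - crop_size
    top = rng.randint(0, max_offset)   # inclusive on both ends
    left = rng.randint(0, max_offset)
    flip = rng.random() < 0.5
    return top, left, flip

def ten_crop_params(image_size=256, crop_size=224):
    """Test: four corners plus the centre, each with and without a flip."""
    m = image_size - crop_size
    offsets = [(0, 0), (0, m), (m, 0), (m, m), (m // 2, m // 2)]
    return [(top, left, flip) for flip in (False, True) for top, left in offsets]

def crop(image, top, left, size, flip):
    """Cut a size x size patch out of a nested-list image, optionally mirrored."""
    rows = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in rows] if flip else rows

rng = random.Random(0)
print(random_crop_params(rng))
print(len(ten_crop_params()))  # 10
```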

**Second: Altering the intensity**

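PCA is performed on the RGB pixel values of the training set. To each RGB pixel of each training image, the following quantity is added:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a Gaussian random variable with mean 0 and standard deviation 0.1.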
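As a rough sketch, the added quantity is the sum over i of αi · λi · pi, which gives one offset per colour channel. The code below assumes the eigenvectors and eigenvalues have already been computed elsewhere; the values used in the example call are placeholders made up for illustration, not the real ImageNet statistics, and the function names are hypothetical.

```python
import random

def pca_color_offset(eigvecs, eigvals, rng, sigma=0.1, alphas=None):
    """Return the RGB offset sum_i alpha_i * lambda_i * p_i.

    eigvecs[i] is the i-th eigenvector p_i (length 3) and eigvals[i] is
    lambda_i. The alphas are drawn once per image, not once per pixel.
    """
    if alphas is None:
        alphas = [rng.gauss(0.0, sigma) for _ in range(3)]
    return [
        sum(eigvecs[i][c] * alphas[i] * eigvals[i] for i in range(3))
        for c in range(3)
    ]

def add_offset(pixel, offset):
    """Add the same offset to one RGB pixel."""
    return [p + o for p, o in zip(pixel, offset)]

# Placeholder eigen-decomposition (identity eigenvectors), for illustration only.
example_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
example_vals = [1.0, 2.0, 3.0]

offset = pca_color_offset(example_vecs, example_vals, random.Random(0))
print(add_offset([0.5, 0.5, 0.5], offset))
```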

**This intensity-altering scheme reduces the top-1 error rate by over 1%.**

#### 7. Dropout

In a layer that uses dropout, each neuron is, with some probability, set to zero during training, so that it neither contributes to the forward pass nor participates in backpropagation. Thus, the network cannot depend too much on a few very “strong” neurons, and every neuron has a better chance to learn useful features.

At test time, there is no dropout: all neurons are used, and their outputs are multiplied by 0.5 to compensate.

**In AlexNet, probability of 0.5 is used at the first two fully-connected layers. Dropout is a kind of regularization technique to reduce the overfitting.**
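Here is a minimal sketch of dropout in the original (non-inverted) form, where units are zeroed during training and the outputs are scaled at test time, as the paper does with its factor of 0.5. Many modern libraries instead rescale during training, so take this as an illustration of the idea rather than of any particular library. The function names are made up.

```python
import random

def dropout_train(activations, rng, p_drop=0.5):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron but scale its output by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(x, rng))  # some entries zeroed at random
print(dropout_test(x))        # [0.5, 1.0, 1.5, 2.0]
```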

#### 8. Other Details of Learning Parameters

- Batch size: 128
- Momentum: 0.9
- Weight decay: 0.0005
- Learning rate ϵ: 0.01 initially, divided by 10 manually whenever the validation error rate stopped improving; this was done 3 times in total.
- Training set of 1.2 million images.
- The network is trained for roughly 90 cycles (epochs) through the training set.
- Training takes five to six days on two NVIDIA GTX 580 3GB GPUs.
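Putting these numbers together, the update rule in the paper is v ← 0.9·v − 0.0005·ϵ·w − ϵ·g followed by w ← w + v, where g is the gradient averaged over the batch. Below is a minimal sketch for a single scalar weight; the function names are made up, and the learning rate schedule simply lists the values implied by dividing by 10 three times.

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One momentum SGD update with weight decay for a scalar weight."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_next, v_next

def lr_schedule(initial=0.01, reductions=3, factor=10.0):
    """Learning rates used over training when dividing by `factor` each time."""
    return [initial / factor ** i for i in range(reductions + 1)]

w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_step(w, v, grad=1.0, lr=0.01)
print(w, v)
print(lr_schedule())  # roughly 0.01, 0.001, 0.0001, 0.00001
```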

#### 9. Results

**For ILSVRC 2010, AlexNet got the Top-1 and top-5 error rates of 37.5% and 17.0% respectively, which outperforms other approaches.**

**Without averaging 10 predictions** over ten patches by data augmentation, **AlexNet only got the Top-1 and top-5 error rates of 39.0% and 18.3% respectively.**

For ILSVRC 2012, by **1 AlexNet (1 CNN)**, the top-5 validation error rate is **18.2%**.

By **averaging the predictions from 5 AlexNets (5 CNNs)**, the error rate is reduced to **16.4%**. This is **a kind of ensemble technique**, similar in spirit to the boosting already used in LeNet for digit classification.

By **adding one more convolutional layer to AlexNet (1 CNN\*)**, the validation error rate is reduced to **16.6%**.

By **averaging the predictions from 2 modified AlexNets and 5 original AlexNets (7 CNNs\*)**, the validation error rate is reduced to **15.4%**.
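For readers new to these metrics, the sketch below shows how the predictions of several networks can be averaged and how a top-k error rate is computed: a prediction counts as correct if the true label is among the k most probable classes. The probabilities in the example are toy numbers made up for illustration, not results from the paper, and the function names are hypothetical.

```python
def average_predictions(per_model_probs):
    """Average the class probabilities predicted by several models for one image."""
    n = len(per_model_probs)
    return [sum(column) / n for column in zip(*per_model_probs)]

def top_k(probs, k):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]

def top_k_error(all_probs, labels, k):
    """Fraction of images whose true label is not among the top-k classes."""
    wrong = sum(1 for probs, y in zip(all_probs, labels) if y not in top_k(probs, k))
    return wrong / len(labels)

# Toy example: two models, three classes, one image.
model_a = [0.6, 0.3, 0.1]
model_b = [0.2, 0.5, 0.3]
print(average_predictions([model_a, model_b]))

# Toy example: two images with true labels 1 and 2.
probs = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(top_k_error(probs, labels=[1, 2], k=1))  # 0.5
```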

### B. CaffeNet

**CaffeNet is a 1-GPU version of AlexNet**.

In its architecture, the 2 paths in AlexNet are combined into one path.

It is noted that in an early version of CaffeNet, the order of the pooling and normalization layers was reversed by accident. The current version provided by Caffe already has the pooling and normalization layers in the correct order.

By investigating each component one by one, we can know the effectiveness of each component. : )

If interested, there is also a tutorial about CaffeNet quick setup using Nvidia-Docker and Caffe [3].

### References

1. [2012 NIPS] [AlexNet] ImageNet Classification with Deep Convolutional Neural Networks
2. [2014 ACM MM] [CaffeNet] Caffe: Convolutional Architecture for Fast Feature Embedding
3. VERY QUICK SETUP of CaffeNet (AlexNet) for Image Classification Using Nvidia-Docker 2.0 + CUDA + CuDNN + Jupyter Notebook + Caffe
4. [ILSVRC] ImageNet Large Scale Visual Recognition Competition