ImageNet Classification with Deep Convolutional Neural Networks

Philippe Beaudoin
Academic Origami
Jun 13, 2016

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. NIPS 2012.

Link to full paper
Part of my Academic Origami series. How do I read this?

This is the seminal paper that introduces AlexNet, the deep convolutional neural network (CNN) that really kicked ass on the ImageNet contest. That contest’s dataset has 1.2M training images, each belonging to one of 1000 classes. This paper launched the current craze over CNNs. On the query “Convolutional Neural Network” Google Scholar returns ~1400 papers between 2000 and 2011 and ~5500 papers since 2012.

Convolutional neural networks are not that complex to understand, but there is a fair amount of specialized nomenclature that I will not cover here. If you want a quick introduction you can read Chapter 6 of Michael Nielsen’s Neural Networks and Deep Learning (although you should read the whole book!).

When it was published, this paper managed to train one of the largest CNNs to date, which the authors did through a number of key improvements including:

  • a very efficient GPU implementation and an architecture that allows training to happen on multiple GPUs,
  • the demonstration that the now ubiquitous ReLU non-linearity significantly decreases training time.

And then, to reduce the very real danger of overfitting, they used a number of techniques including the dropout algorithm which had recently been introduced.

Remember that each neuron in a neural network computes the weighted sum of its inputs, applies an offset, and then runs the result through a non-linear function.
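In case it helps, here’s a minimal numpy sketch (with made-up numbers, not from the paper) of what a single neuron computes:

```python
import numpy as np

def neuron(x, w, b, nonlinearity):
    # Weighted sum of the inputs, plus an offset (bias),
    # passed through a non-linear function.
    return nonlinearity(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.1,  0.4, -0.2])  # learned weights
b = 0.05                         # learned offset (bias)

print(neuron(x, w, b, np.tanh))                       # tanh neuron
print(neuron(x, w, b, lambda z: np.maximum(0.0, z)))  # ReLU neuron
```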

Image from http://deepdish.io/public/images/activation-functions.svg

Historically, the most popular non-linear functions were the sigmoid or tanh. This paper makes a strong demonstration that the ReLU non-linearity significantly reduces training time on deep networks. The reason is likely that, for positive activations, ReLU’s derivative does not go towards 0. This is an important benefit since a derivative close to 0 means the gradient descent will converge very slowly.

Naturally, with negative activations the ReLU’s derivative is 0, which means the neuron won’t learn, but as long as some neurons have positive activations, part of the network will be learning. It appears that it’s better to have only a subset of the neurons learning very efficiently than to have most neurons learning very slowly.
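To see why this matters numerically, here’s a small sketch (my own illustration, not from the paper) comparing the derivatives of tanh and ReLU as the pre-activation grows:

```python
import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # saturates towards 0 for large |z|

def relu_grad(z):
    return 1.0 if z > 0 else 0.0   # exactly 1 for any positive z, 0 otherwise

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  tanh'={tanh_grad(z):.5f}  relu'={relu_grad(z):.1f}")
```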

This figure shows how much faster a network with a ReLU non-linearity (the solid line) learns compared to the same network with a tanh non-linearity (the dashed line). The magnitude of this effect naturally depends on the network and the task, but since this paper has been published, the advantage of ReLU has been confirmed in many other applications.

The proposed network is too large to fit on one GPU, so the authors exploit the fact that modern GPUs can read from and write to one another’s memory directly. Their final architecture is a column network, a term I’ve heard used a number of times since this paper. In a column network each GPU operates more-or-less independently except on some layers where they cross over.

Using two GPUs reduces their error rate by 1.2% - 1.7% compared to the single GPU network. Naturally, the single GPU network has roughly half as many convolution kernels as the two GPU implementation.

The goal of that slightly scary equation is to encourage different convolution kernels to learn different features. Remember that a kernel is a “matrix” applied on a square area of pixels. First, each entry in the matrix multiplies the pixel under it, then the neuron sums up everything, applies the offset, and passes it through the non-linearity. A problem occurs when two kernels have very similar “matrices”, which causes them to detect similar features, which in turn reduces the expressiveness of the network. This is what the equation here sets out to solve.
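For reference, here is the equation in question (equation 1 in the paper), where a^i_{x,y} is the response of kernel i at position (x,y) after the ReLU, b^i_{x,y} is the inhibited (normalized) response, N is the total number of kernels in the layer, and the sum runs over n adjacent kernels. If I read the paper right, they pick k = 2, n = 5, α = 10⁻⁴ and β = 0.75 using a validation set.

```latex
b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}
```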

The equation first imposes an order on the kernels and looks at n adjacent kernels. The chosen order is totally arbitrary: if two kernels are adjacent it doesn’t mean they’re close together on the screen, it just means that the arbitrary ordering picked them to be neighbours. For each kernel, the equation then inhibits its response at a given position if the adjacent kernels also respond strongly at that same position.

In the above example, the response at entry A of Kernel 3 will not be inhibited too much because the adjacent kernels’ responses at the * entries are low, whereas the response at entry B will be more strongly inhibited because the responses at the † entries are higher.

Using this technique reduces the implementation’s error rate by 1.2% - 1.4%.

A quick note: I use quotes around “matrix” because a kernel is not really a matrix. For one, convolution is done through cell-wise multiplication followed by a sum, which is not how standard matrix multiplication works. For another, the kernels are really tensors with three dimensions. For example, in this CNN, the kernels on the first layer have size 11×11×3 since the source image has three color channels.
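To make the “not really a matrix” point concrete, here’s a toy sketch (mine, with made-up values) of what happens at a single position of the scan: every cell of the kernel multiplies the pixel value under it, and everything gets summed into a single response.

```python
import numpy as np

rng = np.random.default_rng(0)
patch  = rng.random((3, 3, 3))   # a 3x3 window of pixels, 3 color channels
kernel = rng.random((3, 3, 3))   # a kernel of the same shape (a 3D tensor)

# Cell-wise multiply then sum everything: one number per position of the scan.
response = np.sum(patch * kernel)
print(response)
```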

Another improvement is the use of overlapping pooling. A pooling layer is typically applied after a convolutional layer. Its role is to downsize the image by summarizing a small window of pixels. In prior work, a given pixel was used by only one pooling neuron. In this paper they propose to overlap the pooling windows so that the same pixel can be used by two different pooling neurons, as shown in the sketch below. This reduces the error rate by 0.3% - 0.4%.
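Here’s a minimal 1D sketch (mine, not the authors’ code) of the difference: with a window of 3 and a stride of 2, the same input value can feed two different pooling neurons.

```python
import numpy as np

def max_pool_1d(x, window=3, stride=2):
    # Slide a window over x; with stride < window the windows overlap,
    # so the same input value can feed two different pooling neurons.
    n_out = (len(x) - window) // stride + 1
    return np.array([x[i * stride : i * stride + window].max() for i in range(n_out)])

x = np.array([1, 5, 2, 8, 3, 7, 4])
print(max_pool_1d(x, window=3, stride=2))  # overlapping pooling (as in the paper)
print(max_pool_1d(x, window=2, stride=2))  # traditional non-overlapping pooling
```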

Fig. 2 illustrates the full network architecture, with two clearly visible horizontal columns which cross over at layer 3. There are five convolutional layers and three fully connected layers. Here’s how to read that figure:

The input is a 224×224 image with 3 color channels. On the first layer, 96 kernels (48 per GPU) are applied to this image; each such kernel has size 11×11×3 and they are scanned over the image with a stride of 4 pixels in the X and Y directions. This results in 96 images of size 55×55 (48 such images per GPU) which are then run through a max pooling stage.

The figure is a bit misleading regarding the max pooling stage. It makes it look like this stage happens after the 5×5 convolution kernels. This is incorrect: max pooling happens before the 5×5 kernels are applied. Max pooling is applied in a sliding window of 3×3 pixels within the 55×55 images. This window is scanned over each of the 96 images with a stride of 2 pixels, leading to the overlapping pooling mentioned earlier. This downsizes the images by a factor of two, resulting in 96 images of 27×27.
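If you want to check the numbers, the output size of a pooling (or convolution) stage follows a simple formula. Here’s a small sketch applying it to the three max pooling stages of this network:

```python
def output_size(input_size, window, stride):
    # Number of window positions that fit when scanning with the given stride.
    return (input_size - window) // stride + 1

# The three max pooling stages (window 3, stride 2), per the paper:
print(output_size(55, 3, 2))  # after the 1st convolution layer: 55 -> 27
print(output_size(27, 3, 2))  # after the 2nd convolution layer: 27 -> 13
print(output_size(13, 3, 2))  # after the 5th convolution layer: 13 -> 6
```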

Next, 256 convolution kernels (128 per GPU) of size 5×5×48 are applied on each of these 27×27 images, yielding 256 images (128 per GPU) of size 27×27. That stage clearly shows the impact of having two separate columns: each kernel is applied on only 48 of the 96 available images.

There are five stages of similar convolutions, typically happening independently on each GPU, except at stage 3 where a cross-over happens (the fully connected layers at the end also connect across both GPUs). After the convolution layers are three fully connected layers, the first two having 4096 neurons and the third one having 1000 neurons. The output of that last layer is fed to a 1000-way softmax.

Something that is not directly obvious from the figure is how many free parameters appear at each layer. Let’s look at a couple of examples. The first convolution layer has 11×11×3×96 ≈ 35k weights and 96 biases, the second one has 5×5×48×256 ≈ 307k weights. Between the last convolution layer and the first fully connected layer there are 6×6×256×4096 ≈ 37.7M independent weights (notice that the max pooling reduces the 13×13 images to 6×6, which is not shown in the figure). Finally, between the first two fully connected layers we count 4096×4096 ≈ 16.8M weights.
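Here’s a quick sketch that redoes that arithmetic, using the layer shapes from the figure:

```python
layers = {
    "conv1 (11x11x3, 96 kernels)":        11 * 11 * 3   * 96,
    "conv2 (5x5x48, 256 kernels)":        5  * 5  * 48  * 256,
    "fc6   (6x6x256 inputs, 4096 units)": 6  * 6  * 256 * 4096,
    "fc7   (4096 inputs, 4096 units)":    4096 * 4096,
}
for name, weights in layers.items():
    print(f"{name}: {weights:,} weights")
```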

A good way to make the most out of a dataset is often to augment it by distorting each sample in a way that ensures the label stays valid. The authors do this here by applying translations and horizontal reflections, increasing the number of training images by a factor of 2048 in the process. This data augmentation step is critical to limit overfitting. Another distortion they apply is to change the RGB color value of all the pixels every time the image is encountered in training. The proposed distortion captures the fact that natural images are invariant to changes in the intensity and color of the illumination. If you’re lost about how a PCA in RGB space works, just ask about it in a comment and I’ll help you figure it out.
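Since I offered to help with the PCA part, here’s a rough numpy sketch of the idea as I understand it (the array shapes and random data are mine): run PCA once on the RGB values of all training pixels, then, for each image, add a small random multiple of the principal components; the paper draws the random factors from a Gaussian with standard deviation 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is every RGB pixel of the training set, flattened to (n_pixels, 3).
pixels = rng.random((10_000, 3))

# PCA on RGB values: eigenvectors/eigenvalues of the 3x3 covariance matrix.
cov = np.cov(pixels, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

def color_jitter(image, eigvals, eigvecs, sigma=0.1):
    # Draw one random factor per principal component (per image, per epoch)
    # and add the resulting RGB offset to every pixel of the image.
    alphas = rng.normal(0.0, sigma, size=3)
    offset = eigvecs @ (alphas * eigvals)
    return image + offset  # broadcasts over all pixels

image = rng.random((224, 224, 3))
print(color_jitter(image, eigvals, eigvecs).shape)  # (224, 224, 3)
```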

The paragraph on the dropout algorithm is very clear, so I’ll let you read it. It’s interesting to know that, even with every mechanism described before, the network had a tendency to overfit which dropout fixes. Also interesting is the fact that dropout roughly doubles the required training time.
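For the curious, here’s a minimal sketch of the mechanism (my own illustration, using p = 0.5 as in the paper): during training each neuron’s output is zeroed with probability 0.5, and at test time all outputs are kept but scaled down to compensate.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if training:
        # Zero each neuron's output with probability p; the dropped neurons
        # take no part in the forward or backward pass for this sample.
        mask = rng.random(activations.shape) >= p
        return activations * mask
    # At test time the paper keeps all neurons and scales their outputs
    # by 1 - p (0.5 here) to compensate.
    return activations * (1.0 - p)

h = rng.random(8)                  # pretend output of a fully connected layer
print(dropout(h))                  # training: roughly half the values are zeroed
print(dropout(h, training=False))  # test: all values, scaled down
```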

The paper has some more details on the parameters and mechanisms that were used during learning. There is nothing groundbreaking here, but you should read the full paper if you want to implement the algorithm. One important detail is that the authors initialize the weights in every layer using the same zero-mean Gaussian distribution. However, Nielsen convinced me that initialization is too important to be done with an identical Gaussian regardless of layer size, so I’m guessing the learning speed could be improved with more careful initialization. In case you’re curious, training on the full dataset took 6 days on two NVIDIA GTX 580 3GB GPUs.
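If you want to experiment with that, here’s a tiny sketch contrasting the paper’s fixed Gaussian initialization (zero mean, standard deviation 0.01) with the fan-in-scaled initialization Nielsen advocates; the scaled variant is my addition, not something the paper does.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fixed(fan_in, fan_out, std=0.01):
    # What the paper does: the same Gaussian for every layer.
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def init_scaled(fan_in, fan_out):
    # Nielsen-style: shrink the spread as the number of inputs grows,
    # so the weighted sums don't blow up or all saturate.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

for fan_in in [363, 4096, 9216]:   # e.g. 11*11*3, an fc layer, 6*6*256
    w = init_scaled(fan_in, 4096)
    print(fan_in, w.std().round(4))
```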
