COMPUTER VISION

AlexNet Architecture Explained

The convolutional neural network (CNN) architecture known as AlexNet was created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky’s PhD advisor.

Siddhesh Bangar

--

Billionaire investor and entrepreneur Peter Thiel’s favourite contrarian question is

What important truth do very few people agree with you on?

If you had asked this question of Prof. Geoffrey Hinton in 2010, he would have said, “Convolutional Neural Networks (CNNs) have the potential to generate a seismic shift in tackling the problem of image classification.” Researchers in the field at the time would not have given such a remark a second thought. Deep Learning really wasn’t cool!

That was the year the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched.

He and a few other researchers were proven correct within two years, with the publication of the paper “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. The earthquake broke the Richter scale! By obliterating outdated concepts in a single, brilliant motion, the paper created a new environment for computer vision.

The study employed a CNN to obtain a top-5 error rate of 15.3 per cent (the percentage of images whose true label does not appear among the model’s top five guesses). The second-best entry lagged far behind at 26.2 per cent. Deep Learning became popular once more after the dust settled.

Over the following few years, several teams would develop CNN architectures that surpassed human-level accuracy. The architecture used in the 2012 study is known as AlexNet, after its first author, Alex Krizhevsky.

~ introduction from the blog “Understanding AlexNet” by Sunita Nayak.

AlexNet Architecture


This was one of the first architectures to use GPUs to boost training performance. AlexNet consists of 5 convolution layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers and 1 SoftMax layer. Each convolution layer consists of a convolution filter and the non-linear activation function “ReLU”. The pooling layers perform max pooling, and the input size is fixed because of the fully connected layers. The input size is usually quoted as 224x224x3, but because of the padding involved it effectively works out to 227x227x3. In total, AlexNet has over 60 million parameters.
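As a quick sanity check on these numbers, the standard output-size formula, out = (in − kernel + 2·padding) / stride + 1, can be traced through the convolution and pooling layers. A minimal Python sketch, with kernel sizes, strides and padding taken from the original paper:

```python
# Trace AlexNet's spatial dimensions with the standard output-size formula:
# out = (in - kernel + 2 * padding) // stride + 1
def conv_out(size, kernel, stride, padding=0):
    return (size - kernel + 2 * padding) // stride + 1

size = 227                        # effective input: 227x227x3
size = conv_out(size, 11, 4)      # conv1, 11x11 stride 4   -> 55
size = conv_out(size, 3, 2)       # max pool, 3x3 stride 2  -> 27
size = conv_out(size, 5, 1, 2)    # conv2, 5x5 pad 2        -> 27
size = conv_out(size, 3, 2)       # max pool                -> 13
size = conv_out(size, 3, 1, 1)    # conv3, 3x3 pad 1        -> 13
size = conv_out(size, 3, 1, 1)    # conv4                   -> 13
size = conv_out(size, 3, 1, 1)    # conv5                   -> 13
size = conv_out(size, 3, 2)       # max pool                -> 6
print(size)                       # 6, so the flattened feature vector is 6*6*256 = 9216
```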

Key Features:

  • ‘ReLU’ is used as an activation function rather than ‘tanh’
  • Batch size of 128
  • SGD Momentum is used as a learning algorithm
  • Data augmentation is carried out, e.g. flipping, jittering, cropping, colour normalization, etc.

AlexNet was trained on GTX 580 GPUs with only 3 GB of memory each, which couldn’t fit the entire network. So the network was split across 2 GPUs, with half of the neurons (feature maps) on each GPU.

Max Pooling

Max pooling is a feature commonly built into convolutional neural network (CNN) architectures. The main idea behind a pooling layer is to “accumulate” features from the maps generated by convolving a filter over an image. Formally, its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network. The most common form of pooling is max pooling.

Max Pooling

Max pooling is done in part to help prevent over-fitting by providing an abstracted form of the representation. It also reduces the computational cost by reducing the number of parameters to learn, and it provides basic translation invariance to the internal representation. Max pooling is performed by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.

The authors of AlexNet used pooling windows sized 3×3 with a stride of 2 between adjacent windows. Compared with non-overlapping pooling windows of size 2×2 with a stride of 2, which would give the same output dimensions, this overlapping pooling reduced the top-1 error rate by 0.4% and the top-5 error rate by 0.3%.
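A minimal Keras sketch (assuming a 55×55×96 feature map, i.e. the size of AlexNet’s first convolution output) shows that the overlapping and non-overlapping configurations do indeed produce the same output dimensions:

```python
import tensorflow as tf

x = tf.random.normal((1, 55, 55, 96))  # stand-in for the output of AlexNet's first conv layer

# Overlapping pooling as in AlexNet: 3x3 window, stride 2
overlapping = tf.keras.layers.MaxPooling2D(pool_size=3, strides=2)(x)
# Non-overlapping pooling: 2x2 window, stride 2
non_overlapping = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)

print(overlapping.shape)      # (1, 27, 27, 96)
print(non_overlapping.shape)  # (1, 27, 27, 96) -- same size, but the 3x3 windows overlap
```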

ReLU Non-Linearity

AlexNet demonstrates that deep CNNs can be trained much more quickly with the non-saturating ReLU activation than with saturating activation functions like tanh or sigmoid. The image below shows that, with the aid of ReLUs (solid curve), AlexNet reaches a 25% training error rate six times faster than an equivalent network using tanh (dotted curve). This was evaluated on the CIFAR-10 dataset.

ReLU
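As a minimal illustration of the difference, ReLU simply zeroes out negative inputs and does not saturate for large positive ones, whereas tanh squashes all inputs into (−1, 1) and its gradient vanishes at the extremes:

```python
import numpy as np

def relu(x):
    # Non-saturating: the gradient is 1 for every positive input
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))     # [0.  0.  0.  0.5 3. ]
print(np.tanh(x))  # saturates near +/-1 for large |x|, which slows gradient descent
```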

Data Augmentation

Overfitting can be reduced by showing the neural net several variations of the same image. This effectively produces more training data and compels the network to learn an image’s main qualities rather than memorise individual examples.

  • Augmentation by Mirroring

Consider that our training set contains a picture of a cat. Its mirror image is also a cat. This means that simply by flipping the image about the vertical axis, we can double the size of the training dataset.

Data Augmentation by Mirroring
  • Augmentation by Random Cropping of Images

Randomly cropping the original image will also produce additional data that is simply the original data shifted.

For the network’s inputs, the creators of AlexNet selected random crops with dimensions of 227 by 227 from within the 256 by 256 image boundary. Using this technique, they multiplied the size of the training data by a factor of 2048 (a sketch of both augmentations follows the figure below).

Data Augmentation by Random Cropping
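A minimal sketch of both augmentations using TensorFlow’s image utilities (the random 256×256 tensor is only a stand-in for a real training image):

```python
import tensorflow as tf

def augment(image):
    # Mirroring: flip about the vertical axis with 50% probability
    image = tf.image.random_flip_left_right(image)
    # Random cropping: take a 227x227 crop from the 256x256 training image
    image = tf.image.random_crop(image, size=(227, 227, 3))
    return image

image = tf.random.uniform((256, 256, 3))  # stand-in for a 256x256 training image
print(augment(image).shape)               # (227, 227, 3)
```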

Dropout

During dropout, each neuron is removed from the neural network with a probability of 0.5. A dropped neuron makes no contribution to either forward or backward propagation, so every training input is effectively processed by a different network architecture. The learned weight parameters are therefore more robust and less prone to overfitting.
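In Keras this corresponds to a Dropout layer with a rate of 0.5; a minimal sketch (note that dropout is only active during training):

```python
import tensorflow as tf

# Each unit is dropped with probability 0.5 at training time and always kept at inference time.
dropout = tf.keras.layers.Dropout(rate=0.5)

x = tf.ones((1, 8))
print(dropout(x, training=True))   # roughly half the units zeroed, the rest scaled up by 2
print(dropout(x, training=False))  # unchanged at inference
```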

AlexNet Summary

Architecture Implementation

Import Libraries and Load the Dataset

For the implementation, we will build a small ImageNet-like dataset by scraping images from the internet using the Python library ‘Beautiful Soup’, and we will pass this dataset to our model to check how the AlexNet architecture performs.
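A minimal, hypothetical scraping sketch (the URLs and folder names below are placeholders, not the actual sources used):

```python
import os
import requests
from bs4 import BeautifulSoup

def scrape_images(page_url, out_dir, limit=50):
    """Download up to `limit` images found in <img> tags on a web page."""
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src or not src.startswith("http"):
            continue
        data = requests.get(src, timeout=10).content
        with open(os.path.join(out_dir, f"{count}.jpg"), "wb") as f:
            f.write(data)
        count += 1
        if count >= limit:
            break

# Hypothetical usage: one output folder per class label
scrape_images("https://example.com/cats", "data/cat")
scrape_images("https://example.com/dogs", "data/dog")
```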

Pre-processing

Once we have scraped the images, we will store them according to their labels and pre-process the data.
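A minimal pre-processing sketch, assuming the scraped images were saved into one folder per label (data/&lt;label&gt;/...):

```python
import tensorflow as tf

IMG_SIZE = (227, 227)

# Keras builds a labelled dataset directly from the data/<label>/ folder structure,
# resizing every image to the network's input size.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    image_size=IMG_SIZE,
    batch_size=128,
    label_mode="categorical",
)

# Scale pixel values from [0, 255] to [0, 1]
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
```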

Define the Model

We will be creating the AlexNet architecture from scratch in Keras; AlexNet is not among Keras’ pre-defined applications, although pre-trained implementations are available in other libraries (for example, torchvision in PyTorch).
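A from-scratch Keras sketch of the architecture (BatchNormalization is used here as a modern stand-in for the paper’s local response normalisation layers):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_alexnet(num_classes, input_shape=(227, 227, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        # Conv block 1
        layers.Conv2D(96, kernel_size=11, strides=4, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),
        # Conv block 2
        layers.Conv2D(256, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),
        # Conv blocks 3-5
        layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
        layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
        layers.Conv2D(256, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        # Classifier: two fully connected layers with dropout, then softmax
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_alexnet(num_classes=2)  # two classes in the scraped toy dataset
model.summary()                       # close to 60 million parameters
```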

Initialize the training parameters
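Continuing the sketch above, the model can be compiled with SGD plus momentum, as listed in the key features (the paper additionally used a weight decay of 0.0005 and a manually adjusted learning-rate schedule, omitted here for brevity):

```python
import tensorflow as tf

# SGD with momentum, as used to train the original AlexNet
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

model.compile(
    optimizer=optimizer,
    loss="categorical_crossentropy",   # matches the one-hot labels from the dataset
    metrics=["accuracy"],
)
```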

Train the model
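Training then reduces to a single fit call on the pre-processed dataset (train_ds comes from the pre-processing sketch; the epoch count is purely illustrative):

```python
# Fit the model on the scraped dataset; increase epochs for a real run
history = model.fit(train_ds, epochs=10)
```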

Prediction
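Finally, a single image can be classified by applying the same pre-processing and taking the arg-max of the softmax output (the file path is a placeholder carried over from the earlier sketches):

```python
import numpy as np
import tensorflow as tf

# Load one image and apply the same resizing/scaling as at training time
img = tf.keras.utils.load_img("data/cat/0.jpg", target_size=(227, 227))
x = tf.keras.utils.img_to_array(img) / 255.0

probs = model.predict(x[np.newaxis, ...])          # shape (1, num_classes)
print("predicted class index:", int(np.argmax(probs, axis=-1)[0]))
```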

~ ‘AlexNet Architecture: A Complete Guide’ by Paras Varshney

Drawing on various blogs, articles and tutorial videos, I have tried to present a collective overview of the AlexNet architecture. I would like to thank all the authors and creators who have done amazing research on this architecture.

And with that, we have covered the AlexNet architecture. Anyone who would like to understand it more deeply can check out the published paper (the link has been added in the introduction part of this blog). Keep learning.
