Breakthrough in Computer Vision

Atharva Barve
3 min read · Jan 25, 2020


ImageNet Challenge 2012 and AlexNet

Alex Krizhevsky, the man behind AlexNet

Starting with the ImageNet Challenge…

The concepts of Machine Learning and Deep Learning existed long before models like these could be put into practice; the main reasons for the delay were the lack of computational power and the lack of training data. Dr. Fei-Fei Li recognized the need for a large, labeled dataset that could serve as training data for Computer Vision. After the successful creation of this huge dataset of images along with their labels, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) came into existence as an annual competition, running from 2010 to 2017.

Godfather and the Breakthrough

The Original paper

Geoffrey Hinton is regarded as the Godfather of Deep Learning because of his noteworthy contributions to the field of Artificial Intelligence. He is a professor at the University of Toronto and also works at Google Brain. One of his Ph.D. students, Alex Krizhevsky, designed a model for image classification together with his colleague Ilya Sutskever. In 2012, the three of them published a paper on it. The network was later named AlexNet by Google in recognition of Alex Krizhevsky's contribution.

Turning point…

The main reason AlexNet gained so much popularity, and is considered the turning point in the field of Computer Vision, is its performance in the ImageNet Challenge.

The top-5 error rate of this model was 15.3%, while that of the runner-up was 26.2%, which clearly shows the dominance of this architecture over the others. AlexNet had 8 layers; in the following years, tech giants also entered the competition with ever deeper and more complex networks, and by 2015 the best models had surpassed human-level accuracy. Finally, in 2017, the ImageNet challenge ended.
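To make the metric concrete, here is a minimal sketch (in Python with NumPy, my choice rather than anything from the original post) of how a top-5 error rate is computed: a prediction counts as correct if the true label appears anywhere among the model's five highest-scoring classes. The `scores` and `labels` arrays below are hypothetical placeholders for model outputs and ground truth.

```python
import numpy as np

def top5_error(scores, labels):
    # indices of the 5 highest-scoring classes for each sample
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # a sample is a "hit" if its true label is among those 5
    hits = np.any(top5 == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# hypothetical example: 10 samples, 1000 ImageNet classes
scores = np.random.rand(10, 1000)
labels = np.random.randint(0, 1000, size=10)
print(f"top-5 error: {top5_error(scores, labels):.3f}")
```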

Architecture and Intuition

AlexNet architecture

The architecture consists of 8 learned layers, as follows:

The 1st and 2nd layers are convolutional layers, each followed by max pooling. These are followed by 3 consecutive convolutional layers. Finally, there are 2 dense layers and an output layer of size 1000, since ImageNet has 1000 classes. A code sketch of this stack follows below.
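As a sketch, here is the layer stack in PyTorch, using the filter counts from the original paper (96/256/384/384/256 convolutional filters and 4096-unit dense layers). PyTorch itself and the exact padding and stride choices are my assumptions, borrowed from common single-GPU re-implementations rather than the paper's original two-GPU setup.

```python
import torch
import torch.nn as nn

# A minimal sketch of the AlexNet layer stack (not the exact original).
alexnet = nn.Sequential(
    # conv layers 1 and 2, each followed by max pooling
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # three consecutive conv layers
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # dense layers and the 1000-way output
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)
```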

We can observe that as the image passes through each convolutional layer, the number of channels (the width of the output) increases while the spatial size of the image shrinks, which signifies that we are extracting higher-level features at each layer. The 3 RGB color channels of the input image are the 3 features of the first layer; as we move ahead, the number of features increases until the final convolutional layer produces 256 feature maps, which are flattened and passed to the dense layers.
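Assuming the PyTorch sketch above, we can trace a dummy image through the stack and watch exactly this pattern: the spatial size shrinks while the channel count grows.

```python
# Trace a single 227x227 RGB image through the sketch above,
# printing the output shape after each conv and pooling layer.
x = torch.randn(1, 3, 227, 227)
for layer in alexnet:
    x = layer(x)
    if isinstance(layer, (nn.Conv2d, nn.MaxPool2d)):
        print(layer.__class__.__name__, tuple(x.shape))
```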

I'll be writing a detailed blog on the working of CNNs and AlexNet soon.
