Various Computer Vision Architectures — What’s the difference

Computer vision emerged as a field in the late 1960s. Its aim is to mimic human vision and understanding, and its foundation is image processing. Today, image processing is combined with AI algorithms to mimic how the human brain interprets images. Unlike numerical data, images have to be processed and interpreted in a different way before a computer can produce any meaningful output. Convolutional Neural Networks (CNNs) have been popular in machine learning for a long time, but traditional CNNs have several disadvantages at large scale. Many architectures have been designed to overcome them. In this article, we will look at several popular computer vision architectures, their designs and what makes each one special.


AlexNet was designed in 2012 for large-scale image classification, 1000 classes to be exact. The design starts with 11x11 convolution kernels, moves to 5x5 and narrows down to 3x3. AlexNet has 8 learned layers in total: 5 convolutional layers followed by 3 fully connected layers, with max pooling after some of the convolutional layers.

AlexNet achieves good accuracy largely because it was trained on a large dataset (ImageNet) using GPUs.
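To make the 11x11, 5x5, 3x3 progression concrete, here is a minimal sketch that traces the feature-map size through an AlexNet-style stack. The kernel sizes follow the description above; the 227-pixel input, the strides, the paddings and the pooling layers are assumptions based on the commonly cited configuration, and the names `conv_out` and `trace_sizes` are purely illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding). Kernel sizes follow the article;
# strides, paddings and pooling layers are assumed, not taken from it.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace_sizes(input_size=227):
    """Return (layer name, spatial size) after each layer in the stack."""
    sizes = []
    size = input_size
    for name, kernel, stride, padding in ALEXNET_LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


if __name__ == "__main__":
    for name, size in trace_sizes():
        print(f"{name}: {size}x{size}")
```

Under these assumptions the large first kernel with stride 4 shrinks the image quickly, and the later 3x3 layers keep the spatial size unchanged while refining the features.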


Very deep networks might provide good results, but at some point they can run into the vanishing gradient problem. Backpropagation starts from the output and applies the chain rule layer by layer, so the gradient that reaches an early layer is a product of many factors. When those factors are small, the gradient keeps shrinking as it travels toward the earlier layers of the network. ResNet is a large-scale deep neural network that uses skip connections to counter vanishing gradients and improve the overall performance of the network. Each skip connection jumps over a few layers, generally 2 or 3, and adds the input of those layers to their output. Because of this identity path, the gradient has a direct route back to the earlier layers and does not drop to 0 there during training.
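The effect of the skip connection on the gradient can be illustrated with a toy calculation. This is only a sketch under strong assumptions: every layer is treated as a scalar function with the same local derivative, and each residual block wraps a single layer rather than 2 or 3. The function names are hypothetical.

```python
def plain_gradient(local_derivative, depth):
    """Gradient reaching the first layer of a plain stack.

    By the chain rule it is the product of every layer's local derivative.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad


def residual_gradient(local_derivative, depth):
    """Gradient reaching the first layer when every layer has a skip connection.

    A residual block computes y = x + F(x), so dy/dx = 1 + F'(x).
    The constant 1 comes from the identity path.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + local_derivative
    return grad


if __name__ == "__main__":
    depth = 20
    print("plain:   ", plain_gradient(0.05, depth))
    print("residual:", residual_gradient(0.05, depth))
```

With a small local derivative, the plain product collapses toward 0 as depth grows, while the residual product stays at or above 1 because of the identity term.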


Similar to AlexNet, VGG is a large-scale image classification network based on the traditional CNN architecture. VGG comes in several variants; VGG-16 and VGG-19, with 16 and 19 layers respectively, are the most widely used. The difference between VGG and AlexNet is that VGG uses smaller kernels and strides and has a deeper architecture. Because it is deeper, VGG also has more ReLU units than AlexNet, which makes the mapping function more discriminative and improves performance.
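One way to see why small kernels pay off is to compare a stack of 3x3 convolutions with a single larger kernel that covers the same area. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the helper names and the 64-channel example are illustrative choices, not values from the article.

```python
def conv_params(kernel, channels, num_layers=1):
    """Weights in a stack of kernel x kernel convolutions (biases ignored).

    Assumes every layer has `channels` input and output channels.
    """
    return num_layers * kernel * kernel * channels * channels


def stacked_receptive_field(kernel, num_layers):
    """Receptive field of stride-1 convolutions stacked on top of each other."""
    return 1 + num_layers * (kernel - 1)


if __name__ == "__main__":
    channels = 64
    # Two 3x3 layers see the same 5x5 area as one 5x5 layer.
    print(stacked_receptive_field(3, 2), conv_params(3, channels, 2))
    print(stacked_receptive_field(5, 1), conv_params(5, channels, 1))
```

Two 3x3 layers cover a 5x5 area with fewer weights than one 5x5 layer, and they apply a ReLU twice instead of once, which fits the point about VGG having more ReLU units.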


GoogLeNet is a CNN architecture developed by Google researchers that uses a modified version of the Inception module. The Inception module applies multiple filters of different sizes side by side and combines their outputs into a single layer.

The advantage of Inception is that the network is less affected by where information sits within an image, since several filters run in parallel and their outputs are merged into one. This reduces the need for very deep networks and the vanishing gradients that come with them, while preserving accuracy. Also, wider networks are easier to train than deeper networks.
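The "combine their outputs into a single layer" step is a concatenation along the channel axis. The following sketch works on shapes only, assuming every branch keeps the same height and width. The branch widths in the example (64, 128, 32, 32 on a 28x28 map) are an assumption modelled on an early GoogLeNet Inception block, and `merged_shape` is a hypothetical helper.

```python
def merged_shape(branch_shapes):
    """Shape of an Inception module's output after channel-wise concatenation.

    Each branch shape is (channels, height, width). All branches must agree
    on height and width, otherwise they cannot be concatenated.
    """
    _, height, width = branch_shapes[0]
    total_channels = 0
    for channels, h, w in branch_shapes:
        if (h, w) != (height, width):
            raise ValueError("branches must share the same spatial size")
        total_channels += channels
    return (total_channels, height, width)


if __name__ == "__main__":
    branches = [
        (64, 28, 28),   # 1x1 convolution branch
        (128, 28, 28),  # 3x3 convolution branch
        (32, 28, 28),   # 5x5 convolution branch
        (32, 28, 28),   # pooling branch followed by a 1x1 projection
    ]
    print(merged_shape(branches))
```

The module gets wider, not deeper: each branch contributes its own channels and the next layer sees all of them at once.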

If you look at the architecture closely, you can see two additional softmax layers in the middle part of the network. These auxiliary classifiers are used only during training: their losses are added to the main loss to guard against vanishing gradients. This is proposed to come into play especially toward the end of training, when the loss and accuracy saturate.
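Combining the losses can be sketched as a weighted sum. The 0.3 weight on the auxiliary losses is an assumption taken from the original GoogLeNet paper rather than from this article, and `total_training_loss` is a hypothetical name.

```python
def total_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus down-weighted losses from the auxiliary softmax heads.

    aux_weight=0.3 is an assumed value; at inference time the auxiliary
    heads are dropped and only the main output is used.
    """
    return main_loss + aux_weight * sum(aux_losses)


if __name__ == "__main__":
    # Example values only, not measurements from a real training run.
    print(total_training_loss(1.0, [2.0, 3.0]))
```

Because the auxiliary heads sit in the middle of the network, their losses inject gradient closer to the early layers than the main loss does.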


The networks we have seen so far send the output of one layer to the next in a sequential manner. DenseNet works quite differently. DenseNet is a densely connected neural network in which, within a dense block, the output of each layer is sent to every later layer to implement feature reuse. This way, the network can stay compact and still learn features effectively because of its dense connectivity.

DenseNet is mainly made up of two units: the DenseBlock and the Transition Layer. The DenseBlock has Convolution, Batch Normalization and ReLU with dense connections, while the Transition Layer reduces the complexity of the model using 1x1 convolutions and reduces the height and width of the output using 2x2 average pooling with stride 2.
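The interplay between the two units is easiest to follow by tracking the channel count. In this sketch each layer of a dense block adds a fixed number of channels (the growth rate) to the running concatenation, and each transition layer shrinks the channel count with its 1x1 convolution. The block sizes (6, 12, 24, 16), the growth rate of 32, the 64 initial channels and the 0.5 compression factor are assumptions modelled on the DenseNet-121 configuration; none of them come from this article.

```python
def densenet_channels(block_sizes, growth_rate=32, initial_channels=64,
                      compression=0.5):
    """Trace the channel count after each dense block and transition layer.

    Every layer in a dense block appends `growth_rate` channels, because its
    output is concatenated with everything produced before it. A transition
    layer follows every block except the last and keeps a `compression`
    fraction of the channels.
    """
    channels = initial_channels
    trace = []
    for index, num_layers in enumerate(block_sizes, start=1):
        channels += num_layers * growth_rate
        trace.append((f"dense_block_{index}", channels))
        if index < len(block_sizes):
            channels = int(channels * compression)
            trace.append((f"transition_{index}", channels))
    return trace


if __name__ == "__main__":
    for name, channels in densenet_channels([6, 12, 24, 16]):
        print(f"{name}: {channels} channels")
```

Without the transition layers, the concatenated channels would keep growing from block to block, which is why the 1x1 convolution is described as reducing the model's complexity.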


EfficientNet addresses the scaling of CNNs. Experimental studies show that as the image resolution increases, the depth and width of the network also need to increase to reach good accuracy. Scaling any one dimension alone brings only limited benefits, hence the concept of compound scaling. Compound scaling is the principle that depth, width and resolution have to be scaled together in a balanced manner. A grid search over the parameters alpha, beta and gamma, which define how much depth, width and resolution should scale, found alpha = 1.2, beta = 1.1 and gamma = 1.15. In other words, for each scaling step, depth increases by 20%, width by 10% and resolution by 15%. These values keep the scaling balanced and give efficient results.
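As a rough sketch of compound scaling, the multipliers can be computed from a single coefficient. The values alpha = 1.2, beta = 1.1 and gamma = 1.15 correspond to the 20%, 10% and 15% above. The compound coefficient `phi` and the cost estimate alpha * beta^2 * gamma^2 are assumptions drawn from the EfficientNet paper, not from this article, and the function names are illustrative.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width and resolution multipliers for compound coefficient phi.

    phi = 0 leaves the baseline network unchanged; each increment of phi
    multiplies depth by alpha, width by beta and resolution by gamma.
    """
    return {
        "depth": alpha ** phi,
        "width": beta ** phi,
        "resolution": gamma ** phi,
    }


def cost_factor(alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate growth in compute per unit of phi.

    Assumes compute scales linearly with depth and quadratically with
    width and resolution.
    """
    return alpha * beta ** 2 * gamma ** 2


if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi))
    print("cost factor per step:", cost_factor())
```

With these values the cost factor comes out slightly below 2, so each step of `phi` roughly doubles the compute budget while keeping the three dimensions in balance.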

Thank you for reading!




Vishnu U

I am a Computer Science Student undergraduate from Dayananda Sagar University, Bangalore.
