Computer Vision: A Study On Different CNN Architectures and their Applications

Yash Upadhyay
Jan 4, 2019
Computer Vision

Humans depend heavily on five senses to interpret the ongoing activity in the world around us. Though each sense is important, vision is by far the one we use most in our daily tasks. Our eyes help us see the path we walk, follow the road we drive on, and check for possible collisions. Vision is so important that it came naturally to scientists and engineers to try to recreate it in machines. The aim is to train machines to see and act accordingly while minimizing human error and intervention.

What is ‘Computer Vision’?

Computer Vision is an interdisciplinary field of science that aims to make computers process and analyze images and videos, and extract details from them in the same way a human mind does.

Earlier, Computer Vision was meant only to mimic the human visual system, until we realized how AI can augment its applications and vice versa. We may not realize it every day, but we are constantly assisted by applications of Computer Vision in automotive, retail, banking and financial services, healthcare, and more.

Deep Neural Networks (DNNs) have great capabilities for image pattern recognition and are widely used in Computer Vision algorithms. The Convolutional Neural Network (CNN, or ConvNet) is a class of DNN most commonly applied to analyzing visual imagery. It is used not only in Computer Vision but also for text classification in Natural Language Processing (NLP).

How Do CNNs Work?

Most Computer Vision tasks are built around CNN architectures, as the basis of most of these problems is to classify an image into known labels. Object detection algorithms like SSD (single shot multi-box detector) and YOLO (You Only Look Once) are also built around CNNs.

CNN Architecture

Artificial neural networks (ANNs) were great for tasks that weren't possible with conventional machine learning algorithms, but when processing images with fully connected hidden layers, an ANN takes a very long time to train. CNNs address this by first reducing the size of images using convolutional and pooling layers, and then feeding the reduced data to fully connected layers.

Now, let’s talk about the layers of a CNN. Below, we describe the architecture in detail:

Convolution layer

Convolution Operation

To perform the convolution operation, a filter (a smaller matrix) of a specified size is used. This filter moves across the image matrix, multiplying its values by the original pixel values. All these multiplications are summed up to a single number. The filter then shifts to the right by n units (the stride, which can vary) and performs the same operation. After passing over all positions, the result is a matrix much smaller in size than the input matrix.
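As an illustration, here is a minimal NumPy sketch of the operation described above. The 4×4 image and 3×3 filter are hypothetical values chosen for demonstration, not from any real network:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, summing elementwise products at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))           # a simple all-ones 3x3 filter
result = convolve2d(image, kernel)
print(result.shape)                # (2, 2): the output is smaller than the input
```

Note how a 4×4 input shrinks to 2×2 after a 3×3 filter with stride 1, exactly the size reduction described above.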

Nonlinear layer

This layer is added after each convolution layer. It uses an activation function to bring non-linearity to the data. Non-linearity means that the change in the output is not proportional to the change in the input. We require this non-linearity because if the network were linear, there would be no point in adding multiple layers (multiple linear layers are equivalent to a single layer). By increasing the non-linearity, a complex network is created that can find new patterns in the images. The activation function here can be the Rectified Linear Unit (ReLU), tanh, or any other nonlinear activation function.
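For example, the widely used ReLU activation simply clamps negative values to zero, which is trivially nonlinear. A one-line NumPy sketch (the sample feature map is illustrative):

```python
import numpy as np

def relu(x):
    """ReLU keeps positive activations unchanged and sets negatives to zero."""
    return np.maximum(0, x)

feature_map = np.array([[-1.5, 2.0],
                        [ 0.3, -0.7]])
print(relu(feature_map))   # [[0.  2. ]
                           #  [0.3 0. ]]
```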

Pooling Layer

Max Pooling

The pooling layer is used to further downsize the matrix. Its most common form uses filters of size 2×2 applied with a stride of 2, which downsamples every depth slice in the input along both width and height, discarding 75 per cent of the activations. The pooling layer generally selects the most important pixels, for instance through the max pooling function, which keeps only the highest-value pixel within the filter window. This reduces the amount of computation required for training, significantly reducing the time taken to train the neural network.

Pooling can be done in various ways:

  1. Max pooling: The largest element in the window is selected.
  2. Min pooling: The smallest element in the window is selected.
  3. Average (mean) pooling: The mean of the elements in the window is taken.

Fully Connected (FC) Layers

Multilayer perceptron neural network

In fully connected layers, every neuron in one layer is connected to every neuron in the next. In principle, the FC stage acts like a traditional neural network, the Multi-Layer Perceptron (MLP). The only difference is that its inputs are in the shape and form created by the previous stages of the CNN.
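A minimal sketch of this stage: the feature maps from earlier layers are flattened into a single vector, then multiplied by a weight matrix in which every input connects to every output neuron. All shapes here are hypothetical, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stack of pooled feature maps (channels, height, width) from earlier layers...
feature_maps = rng.standard_normal((8, 4, 4))

# ...is flattened into a single vector before entering the FC layer.
x = feature_maps.reshape(-1)             # shape (128,)

# One fully connected layer: a weight for every (input, output) pair.
W = rng.standard_normal((10, x.size))    # e.g. 10 output classes
b = np.zeros(10)
logits = W @ x + b
print(logits.shape)                      # (10,)
```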

Deep Learning for Computer Vision

CNN-Based Architectures

Many CNN-based architectures have been used to maximize performance in image classification. The most famous of these architectures are discussed below:

AlexNet (2012)

AlexNet, designed by the SuperVision group, including Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever from the University of Toronto, won the ImageNet LSVRC-2012 competition (a yearly competition focused on image classification) with a top-5 error rate of 15.3 per cent. AlexNet uses the ReLU activation function instead of tanh to add non-linearity, which accelerated training (by about six times) and increased accuracy. It also uses dropout regularisation (a technique that prevents complex co-adaptations on training data in order to reduce overfitting). Another feature of AlexNet is overlapping pooling, which reduces the size of the network; it lowered the top-1 and top-5 error rates by 0.4 per cent and 0.3 per cent, respectively.

AlexNet Architecture

AlexNet has five convolution layers and three fully connected layers. The ReLU function is applied after every convolutional and fully connected layer. Dropout is applied only before the first and second fully connected layers.

The AlexNet architecture has 62.3 million parameters and needs 1.1 billion computation units in a forward pass. The AlexNet paper specifies that the network took 90 epochs over five to six days to train on two GTX 580 GPUs. Stochastic Gradient Descent (SGD) was used with learning rate 0.01, momentum 0.9 and weight decay 0.0005. The learning rate was divided by 10 whenever accuracy plateaued, which happened three times over the course of training.
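The update rule implied by these hyperparameters can be sketched as follows. This is a simplified illustration of SGD with momentum and weight decay on a toy two-parameter vector, not AlexNet's actual training code:

```python
import numpy as np

def sgd_step(w, grad, v, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay (AlexNet-style hyperparameters)."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])       # toy parameters
v = np.zeros_like(w)            # momentum buffer starts at zero
grad = np.array([0.5, -0.5])    # toy gradient
w, v = sgd_step(w, grad, v)
print(w)                        # parameters nudged against the gradient
```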


GoogLeNet (2014)

Architecture for GoogLeNet

GoogLeNet is the winner of ILSVRC 2014; it achieved a top-5 error rate of 6.67 per cent. The network uses a CNN inspired by LeNet. Its architecture uses 1×1 convolutions in the middle of the network, and global average pooling at the end instead of fully connected layers. It also introduces the Inception module, which applies convolutions of different sizes/types to the same input and stacks all the outputs. It also uses batch normalization, image distortions, and RMSprop.

In GoogLeNet, the 1×1 convolution is used as a dimension-reduction module to reduce computation. By reducing this computation bottleneck, depth and width can be increased. GoogLeNet's architecture is a 22-layer-deep CNN, yet it reduces the number of parameters from 60 million (AlexNet) to 4 million.
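To see why a 1×1 convolution saves computation, here is a back-of-the-envelope weight count. The channel sizes (192 in, 32 out, 16 in the bottleneck) are illustrative values, not taken from the article:

```python
# Direct 5x5 convolution: 192 input channels -> 32 output channels.
direct = 5 * 5 * 192 * 32

# Same mapping via a 1x1 bottleneck that first shrinks 192 channels to 16,
# then applies the 5x5 convolution on the reduced volume.
bottleneck = 1 * 1 * 192 * 16 + 5 * 5 * 16 * 32

print(direct, bottleneck)   # 153600 vs 15872: roughly a 10x reduction in weights
```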

VGGNet (2014)

Architecture for VGG Net

VGGNet was invented by the Visual Geometry Group (VGG) at the University of Oxford. Though VGGNet was the first runner-up, not the winner, of the ILSVRC 2014 classification task, it still showed a significant improvement over previous networks. VGGNet consists of 16 convolutional layers and is very appealing because of its very uniform architecture: similar in spirit to AlexNet, but built from only 3×3 convolutions with many filters. It is widely used for extracting features from images, and VGG-16, without its fully connected layers, serves as the base network for the object detection algorithm SSD.


ResNet (2015)

Architecture for ResNet

Residual Neural Network (ResNet) by Kaiming He et al. won ILSVRC 2015. It achieved a top-5 error rate of 3.57 per cent, beating human-level performance on this dataset. It introduces an architecture consisting of 152 layers with skip connections and heavy batch normalization. The whole idea of ResNet is to counter the problem of vanishing gradients. This problem occurs in networks with a high number of layers: the weights of the early layers cannot be updated correctly through backpropagation of the error gradient, because the chain rule multiplies gradient values lower than one, so by the time the error gradient reaches the first layers its value has shrunk toward zero. Skip connections counter this by letting gradients flow through identity paths, preserving them across many layers.
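A toy NumPy sketch of the idea (dimensions and weights are illustrative): the input skips over two weighted layers and is added back, so even if the weighted path contributes little, information and gradients can pass through the identity connection unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """A minimal residual block: two weighted layers plus a skip connection."""
    out = relu(W1 @ x)
    out = W2 @ out
    return relu(out + x)   # skip connection: add the original input back

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
y = residual_block(x, W1, W2)
print(y.shape)             # (4,): same shape as the input, as required for the sum
```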

Applications of Computer Vision

In modern times, Computer Vision has found many areas where it can be utilized. It automates processes in ways that not only reduce human effort but also provide solutions to tasks that could never be solved within the limitations of human vision.


Healthcare

Computer Vision is widely used in the diagnosis of diseases by processing X-rays, MRIs and other medical images. It has proved to be as precise as human doctors. Health problems like pneumonia, brain tumours, diabetes, Parkinson's disease, breast cancer, and many others are being diagnosed successfully every day with the help of Computer Vision.

With state-of-the-art image processing techniques and Computer Vision, early diagnosis of possible diseases becomes feasible, allowing treatment at a premature stage of the disease or even preventing it from occurring at all.

Computer Vision not only helps in diagnosis but also plays a key role in surgery, by analyzing damage to tissues and monitoring the patient's blood loss. It has also helped researchers monitor patients' adherence to their prescribed treatments, reducing attrition in clinical trials.


Automotive

With the rise of self-driving cars, automobile industries depend heavily on Computer Vision to understand the driving environment: detecting obstacles, pedestrians, lanes, and possible collision paths.

Computer Vision also powers driver-assistance systems that notify the driver of certain situations. It can monitor the driver for negligent driving by analysing their behaviour and driving patterns, reducing the chances of any misfortune: it checks whether they are driving rashly, are under the influence of alcohol or drugs, or are drowsy.

Computer Vision is also embedded in the automated production of cars, where it rejects defective components on the assembly line.

Security and Surveillance


These days, houses, metro stations, roads, schools, hospitals and, in fact, every building demand constant surveillance against theft, damage and other security threats, so we equip them with networks of closed-circuit cameras. However, these cameras are mostly used as evidence after a crime rather than as a tool for averting it. This is where Computer Vision pitches in.

Security systems with Computer Vision capabilities not only detect crimes like violence, theft and trespassing, but can also use face recognition to locate criminals in crowded areas like airports and train stations.


Astronomy

Almost all our knowledge about the universe is derived from measurements of photons, which are mostly images. This opens up the application of Computer Vision in astronomy: our universe is so vast that it is only natural the data collected is also enormous.

Studying this data manually is not possible for any astronomer, or any human. Using Computer Vision, we can analyse all of it at a much faster rate. Computer Vision is already being used for discovering new planets and celestial bodies, in applications like exoplanet imaging and star and galaxy classification.


Agriculture

Machine Learning and Computer Vision in farming

In agriculture, Computer Vision is used to determine the health of seeds before sowing. Using hyperspectral or multispectral sensors, the health of crops can also be determined. The technology can also help identify areas with fertile soil and the presence of water bodies, hence identifying land suitable for agriculture.

Computer Vision is also enabling robots to carry out processes such as harvesting, planting, and weeding. Autonomous tractors use machine vision to do the heavy and time-consuming tasks in a field, reducing the stress on farmers. With this technology, farmers can track their livestock and even monitor growth over the course of an animal's lifetime to obtain useful information about it.


Manufacturing

Computer Vision aiding in manufacturing processes

In industry, Computer Vision is used on assembly lines for counting batches, detecting damaged components, and inspecting finished goods. Machine vision tools can find microscopic defects in products that simply cannot be identified by human vision. In manufacturing, reading barcodes or QR codes is essential, as they give each product a unique identification. Reading thousands of barcodes in a day is not an easy task for humans, but it can be done in minutes with Computer Vision.

Satellite Imagery


Computer Vision is applied to satellite images to detect natural hazards like floods, tsunamis, hurricanes, and landslides. Satellite images are also used to analyze pollution and air quality index of areas of focus.

Manual mining just to check for the presence of ore can be costly and may lead to huge waste of money. Recently, mining industries have started using Computer Vision to detect areas with a high probability of containing crude oil or minerals.

Challenges for Computer Vision

Computer Vision is heavily dependent on the quality of images; the relevant factors include which camera was used, what time of day the image or video was taken, and whether the camera was stable.

Applications like facial recognition and video analysis often struggle because of the low quality of the CCTV footage used to distinguish people. In object detection, the size of the objects and the model's accuracy play an important role: small objects aren't easily detected, and even when they are, the detection is unstable. Detection is also affected by deformation of the objects, the background of the image, and the extent of occlusion.

Another factor that hinders Computer Vision is the knowledge of the model. If an object or image was not present in the training set, the model will give incorrect results. For example, if a weapons detection system deployed at a railway station is only trained on guns and knives, terrorists could bring in bombs that pass undetected through the system, putting lives in danger.

Despite these challenges, which we are already overcoming, Computer Vision offers wonderful research and innovation opportunities to every tech enthusiast. It is a boon both to AI/ML developers, who get to identify new patterns, and to users, who are on the receiving end of tailored, user-friendly products.

AlumnAI Academy
