An analysis of computer vision problems

Shravan Murali
Sep 13, 2017 · 10 min read

For about a decade now, there have been drastic improvements in the techniques used to solve problems in the domain of computer vision, some of the notable problems being image classification, object detection, image segmentation, image generation, image captioning and so on. In this blog post, I’ll briefly explain some of these problems, compare and contrast these techniques with how humans interpret images, and steer the article towards AGI (Artificial General Intelligence) to pitch in some of my thoughts on that.


Ok then, I hope you’ve got enough motivation by now. There are definitely a ton of other technologies that I’m missing. In fact, there are so many similar technologies that one blog post isn’t enough to fit them all in.

Let’s now check out some of those computer vision problems!

Computer vision

Image Classification

[The above picture was taken from Google Images]

Image classification is the task of assigning one of a fixed set of labels to an image. There are usually lots of image classification contests happening around the world, and Kaggle is a very nice place to find such hosted competitions. One of the most famous is the ImageNet challenge. ImageNet is basically a humongous repository of images (about 14 million of them at the time this article was written) with over 20,000 image tags, maintained by the computer vision lab at Stanford University. The ImageNet challenge, or the Large Scale Visual Recognition Challenge (LSVRC), is an annual contest with various sub-challenges such as object classification, object detection and object localization. The LSVRC, especially the object classification challenge, started gaining a lot of attention from 2012, when Alex Krizhevsky implemented the famous AlexNet, which stole the show by reducing the error rate on images to 15.7% (unheard of at that time). Looking at the latest results, Microsoft’s ResNet has achieved an error rate of 3.57%, Google’s Inception-v3 has achieved 3.46%, and Inception-v4 has gone even further.
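The error rates quoted above are top-5 error rates: a prediction counts as correct if the true label appears among the model’s five highest-scoring classes. Here’s a minimal sketch of that metric in NumPy (the toy scores and labels below are made up for illustration, not real model outputs):

```python
import numpy as np

def top5_error(logits, labels):
    """Fraction of examples whose true label is NOT among the top-5 predictions.

    logits: (n_examples, n_classes) array of class scores
    labels: (n_examples,) array of true class indices
    """
    # indices of the 5 highest-scoring classes for each example
    top5 = np.argsort(logits, axis=1)[:, -5:]
    hits = np.any(top5 == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# toy example: 4 examples, 10 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = logits.argmax(axis=1)     # true label is each example's top-1 class
print(top5_error(logits, labels))  # 0.0 -- the top-1 class is always in the top 5
```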

[The above image is from a 2017 paper by Alfredo Canziani, Adam Paszke and Eugenio Culurciello]

Object detection

[The above image was taken from Google Images]

This is slightly more complicated to solve than classification, since you have to play around with the image coordinates a lot more. The best-known way to do detection right now is called Faster R-CNN. R-CNN stands for Region-based Convolutional Neural Network: the original R-CNN classified candidate regions produced by an external proposal method, and was later tweaked and made more efficient, eventually becoming Faster R-CNN. Faster R-CNN introduces a Region Proposal Network, a small convolutional network that is responsible for localizing the regions in the image that need to be processed and classified. The most recent ImageNet challenge (LSVRC 2017) had an object detection challenge, which was won by a team named “BDAT”, consisting of folks from Nanjing University of Information Science & Technology and Imperial College London.
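Playing around with image coordinates mostly means comparing candidate boxes against ground-truth boxes, and the standard yardstick for that is intersection-over-union (IoU): proposals with high IoU against an object are treated as positives. A small self-contained sketch (the `[x1, y1, x2, y2]` box format is an assumption for this example):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# two 10x10 boxes overlapping in a 5x5 patch: 25 / (100 + 100 - 25)
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143
```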

Image Segmentation

Image segmentation comes in 2 flavors: semantic segmentation and instance segmentation. In semantic segmentation, you label each pixel with an object class. Essentially, every object belonging to the same class (say, every cat) is colored the same. In instance segmentation, every object instance is labeled separately, which means that every cat in a picture would be colored differently.

Semantic segmentation in which cars are colored in dark blue
This is a classic example for Instance segmentation

[The above image was taken from Google Images]

It can also be seen that an instance segmentation carries strictly more information than a semantic one: collapsing all instances of a class recovers the semantic labels. So, we’ll see how to solve instance segmentation.
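The distinction is easy to see with two tiny, made-up label maps: in the semantic map both “cat” blobs share class id 1, while the instance map gives each blob its own id:

```python
import numpy as np

# 4x4 image containing two cat blobs; 0 = background
semantic = np.array([[1, 1, 0, 0],   # semantic: every cat pixel gets class 1
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])

instance = np.array([[1, 1, 0, 0],   # instance: each cat gets its own id
                     [1, 1, 0, 0],
                     [0, 0, 2, 2],
                     [0, 0, 2, 2]])

print(len(np.unique(semantic)) - 1)  # 1 -- one class ("cat")
print(len(np.unique(instance)) - 1)  # 2 -- two separate cats
```

Merging instance ids 1 and 2 back into a single label reproduces the semantic map, which is the sense in which the instance version carries more information.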

The latest known technique to solve this is called Mask R-CNN, which basically adds a small branch of convolutional layers on top of the Faster R-CNN technique we saw earlier. Microsoft, Facebook and Mighty AI have jointly released a dataset named COCO. It is similar to ImageNet, but is aimed mainly at segmentation and detection.

Image Captioning

[The above image was taken from Google Images]

Image captioning is basically image detection + captioning. Detection is done with the same Faster R-CNN method we saw earlier. Captioning is done using an RNN (Recurrent Neural Network); to be precise, an LSTM (Long Short-Term Memory) network, an advanced version of the RNN, is used. RNNs are quite similar to our regular deep neural networks, except that an RNN’s output depends on the previous state of the network. You can think of it as a neural net with neurons building up over time as well as space.

Usually, these RNNs are used for problems where your data is time-dependent. For example, if you want to predict the next word in a sentence, the new word depends on all the words that showed up in the previous time steps.
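That time dependence is visible directly in the update rule of a vanilla RNN cell: the new hidden state is a function of the current input and the previous hidden state, h_t = tanh(W_xh·x_t + W_hh·h_(t-1) + b). A minimal NumPy sketch, with made-up random weights and dimensions (a real LSTM adds gating on top of this recurrence, and real weights are learned):

```python
import numpy as np

rng = np.random.default_rng(42)
input_dim, hidden_dim = 3, 4

# randomly initialised weights, purely for illustration
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrent update: the output depends on the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# process a short "sentence" of 5 word vectors, one time step at a time
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h)

print(h.shape)  # (4,) -- a summary of every word seen so far
```

Because `h` is fed back in at every step, the final state depends on the whole sequence, which is exactly what next-word prediction (and caption generation) needs.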

Let’s now switch gears a little bit and look at human visual understanding.

Why are humans better at visual understanding?

Although Deep Neural Nets seem wonderful and magical, they can unfortunately be fooled easily. Take a look at this:

[The above image was taken from Andrej Karpathy’s Blog]

As the image shows, each image is superimposed with a noise image that visually doesn’t change the original image at all, and yet the result gets misclassified as an ostrich!

Such attacks on deep neural nets are called adversarial attacks. They were initially brought up by Szegedy et al. in 2013 and were then further investigated by Goodfellow et al. in 2014. It was basically found that we can compute a minimal noise signal, by optimizing over the pixel intensities of the image, that makes the deep neural network give priority to a different class instead of the correct one. This engendered the growth of generative models. There are 3 well-known kinds of generative models as of today: PixelRNN / PixelCNN, variational auto-encoders and generative adversarial networks.
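The one-step version of this optimization, popularized by Goodfellow et al. as the “fast gradient sign method”, takes the gradient of the loss with respect to the input pixels and nudges each pixel a tiny amount in the direction that increases the loss. A toy sketch on a hand-picked linear classifier (the weights, input and epsilon below are made up; a real attack backpropagates through a deep network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a fixed toy classifier: p(class 1 | x) = sigmoid(w . x + b)
w = np.array([1.0, -2.0, 0.5])
b = 0.1

x = np.array([2.0, -1.0, 0.5])  # an input confidently scored as class 1
eps = 0.25                       # max per-pixel perturbation

# gradient of the log-loss for true label y=1 w.r.t. the input:
# d/dx [-log sigmoid(w.x + b)] = (sigmoid(w.x + b) - 1) * w
grad_x = (sigmoid(w @ x + b) - 1.0) * w

# FGSM: move each "pixel" by eps in the sign of the gradient
x_adv = x + eps * np.sign(grad_x)

print(sigmoid(w @ x + b))      # high confidence on the clean input
print(sigmoid(w @ x_adv + b))  # confidence drops after an imperceptible change
```

On a deep network with thousands of pixels, the same per-pixel nudge accumulates enough to flip the predicted class entirely, which is how the ostrich misclassifications above are produced.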

Human Visual Understanding

Besides, humans are continuously collecting data (for example, images through vision) at every point in their lives, unlike machines. Let’s take an example: most of us see dogs almost every day, which means we would have seen dogs in different postures and from different angles. So, given a picture that has dogs in it, there is a very high probability that we’d recognize them. This isn’t true for machines: a machine might have been trained on only a certain set of dog images, and hence can be fooled easily. If you feed in an image of the same dog in a slightly different posture, it might get misclassified.

Can A.I. really compete with the “human brain”?

In a talk, Jeff Dean mentioned the number of parameters that make up most of the deep neural networks published since 2011. And if you noticed, for humans he mentioned “100 trillion”. Although he seemed to treat that somewhat like a joke, it seems quite plausible given the amount of complex stuff the human brain can handle. Assuming our brain really is that complex, is it even practical to design a system with so many parameters?

Well, there definitely have been some major breakthroughs in the field of Artificial Intelligence, like AlphaGo beating a world champion at the game of Go, OpenAI’s Dota 2 bot beating experts at the game, and many more. Yet these achievements seem very niche, in the sense that a Dota 2 bot is very specific to Dota 2 and nothing else. The human brain, on the contrary, is very generic: you use it for almost all of your day-to-day activities. What I’d infer from this is that, in order to compete with the mammalian brain, we’ll need Artificial General Intelligence!

Some random thoughts

This is currently an active area of research, involving giants like DeepMind and OpenAI. In fact, DeepMind’s main motto is to “solve general artificial intelligence”!

Shravan’s Blog

My tech blog, mainly focused on computer science domains like A.I. and distributed systems. Check it out if you wanna know more about me!
