For about a decade now, there have been drastic improvements in the techniques used for solving problems in computer vision, some notable ones being image classification, object detection, image segmentation, image generation and image captioning. In this blog post, I'll briefly explain some of these problems and also try to compare and contrast these techniques with how humans interpret images. I'll also steer the article towards AGI (Artificial General Intelligence) and pitch in some of my thoughts on that.
Before we dive deeper, let's get some motivation from how a few companies have creatively used computer vision. One of the coolest startups, in my opinion, is clarifai.com. Clarifai was founded by Matthew Zeiler, who, with his team, went on to win the ImageNet challenge in 2013. His model cut the image classification error rate by almost 4% relative to the previous year's best. Clarifai is essentially an A.I. company that provides APIs for visual recognition tasks like image and video labelling. Clarifai has a demo here. The company is very promising, and its image and video recognition technology is insanely accurate. Let's now move on to Facebook's automatic image tagging. The next time you log in to your Facebook account, right click on any image and click on "Inspect Element" (this is for Chrome; other browsers have an equivalent). Check out the alt attribute in the img tag (it should look something like this: <img src="…" alt="…" />). You'd find that the alt attribute has text prefixed like this: "Image may contain: …". This technology is quite accurate too. It recognizes people, text, mountains, sky, trees, plants, outdoor scenes, nature and much more. Another cool technology is Google's. Go to photos.google.com and type something into the search bar. Say you've typed "mountains"; you'll accurately get all of your photos containing mountains in the search results. The same is true for Google Image Search. The best part about image search is that the reverse also works, i.e., you can upload an image, get the best possible description of it, and also get images similar to the one you uploaded. This technology is quite spot on too.
Okay then, I hope you've got enough motivation by now. There are definitely a ton of other technologies I'm missing. In fact, there are so many similar technologies that one blog post isn't enough to fit them all in.
Let’s now check out some of those computer vision problems!
Image classification involves labelling an image based on its content. There is generally a fixed set of labels, and your model has to predict the label that best fits the image. This problem is definitely hard for a machine: all it sees in an image is a stream of numbers.
[The above picture was taken from Google Images]
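To make that "stream of numbers" point concrete, here is a toy sketch of classification in NumPy. The labels, weights and fake image below are all invented for illustration; a real classifier would learn `W` and `b` from training data rather than use random values.

```python
import numpy as np

rng = np.random.default_rng(0)

labels = ["cat", "dog", "bird"]
# A fake 8x8 grayscale "image" -- to the machine, just an array of intensities.
image = rng.integers(0, 256, size=(8, 8)).astype(np.float64)

x = image.flatten() / 255.0        # the stream of numbers the model sees
W = rng.normal(size=(len(labels), x.size)) * 0.01  # toy (untrained) weights
b = np.zeros(len(labels))

scores = W @ x + b                 # one score per label
probs = np.exp(scores - scores.max())
probs /= probs.sum()               # softmax: scores -> probabilities

predicted = labels[int(np.argmax(probs))]
print(predicted, probs.round(3))
```

With random weights the prediction is of course meaningless; the point is only the pipeline shape: pixels in, a probability per label out.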
And there are usually lots of image classification contests happening around the world. Kaggle is a very nice place to find such hosted competitions. One of the most famous is the ImageNet challenge. ImageNet is a humongous repository of images (about 14 million at the time this article was written) with over 20,000 image tags, maintained by the computer vision lab at Stanford University. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual contest with various sub-challenges such as object classification, object detection and object localization. The ILSVRC, especially the object classification challenge, started gaining a lot of attention in 2012, when Alex Krizhevsky's famous AlexNet stole the show by bringing the error rate down to 15.7% (a level never achieved before). Looking at the latest results, Microsoft's ResNet has achieved an error rate of 3.57%, Google's Inception-v3 has achieved 3.46%, and Inception-v4 has gone even further.
[The above image is from a 2017 paper by Alfredo Canziani, Adam Paszke and Eugenio Culurciello]
Object detection in an image involves recognizing various objects and drawing a bounding box around each recognized one. Here's an example:
[The above image was taken from Google Images]
This is slightly more complicated to solve than classification; you have to work with image coordinates a lot more here. The best known detection method right now is Faster R-CNN. R-CNN stands for Region-based Convolutional Neural Network: it classifies proposed regions of the image. R-CNN was later tweaked and made more efficient, giving us Faster R-CNN, which introduces a Region Proposal Network — a convolutional network responsible for localizing the regions in the image that need to be processed and classified. The most recent ImageNet challenge (ILSVRC 2017) included an object detection challenge, which was bagged by a team named "BDAT" consisting of folks from Nanjing University of Information Science & Technology and Imperial College London.
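Since detection is all about bounding boxes, it helps to know the standard metric used to score a predicted box against a ground-truth box: intersection-over-union (IoU). Here is a minimal sketch; the boxes are made up, and `(x1, y1, x2, y2)` corners is just one common box convention.

```python
# Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlap rectangle (which may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union 7 -> 1/7
```

A detection typically counts as correct when its IoU with a ground-truth box exceeds some threshold (0.5 is a common choice in benchmarks).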
Image segmentation involves partitioning an image based on the objects present, with accurate boundaries.
Image segmentation comes in two flavours: semantic segmentation and instance segmentation. In semantic segmentation, you label each pixel with an object class; every object belonging to the same class (say, every cat) is coloured the same. In instance segmentation, every object is labelled separately, so every cat in a picture would be coloured differently.
[The above image was taken from Google Images]
Semantic segmentation can also be seen as a special case of instance segmentation (merge all instances of a class and you recover the semantic labels), so we'll look at how to solve instance segmentation.
The latest known technique for this is Mask R-CNN, which essentially adds a small convolutional mask branch on top of the Faster R-CNN architecture we saw earlier. Microsoft, Facebook and Mighty AI have jointly released a dataset named COCO. It is similar to ImageNet, but is aimed mainly at segmentation and detection.
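The semantic-vs-instance distinction can be illustrated on a toy 2-D mask. In the semantic mask below, both "cats" share the label 1; connected-component labelling then splits them into separate instances. This uses SciPy's `ndimage.label` (assumed available) purely as an illustration — real instance segmentation, e.g. Mask R-CNN, is far more involved than connected components.

```python
import numpy as np
from scipy import ndimage

# A tiny semantic mask: 1 = "cat" pixels, 0 = background.
# Both blobs carry the SAME semantic label.
semantic = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
])

# Connected-component labelling assigns each blob its own instance id.
instances, num_instances = ndimage.label(semantic)
print(num_instances)   # two separate blobs -> two instances
print(instances)
```

After labelling, pixels of the first blob carry id 1 and the second id 2, which is exactly the "every cat coloured differently" behaviour described above.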
This is one of the coolest computer vision problems, with a tinge of natural language processing, I'd say. It involves generating the caption that is most appropriate for your image.
[The above image was taken from Google Images]
Image captioning is basically image detection + caption generation. Detection is done by the same Faster R-CNN method we saw earlier, and captioning is done with an RNN (Recurrent Neural Network). To be precise, an LSTM (Long Short-Term Memory) network, a more advanced variant of the RNN, is used. RNNs are quite similar to regular deep neural networks, except that they also depend on the previous state of the network. You can think of an RNN as a neural net with neurons building up over time as well as space. Structurally, RNNs look like this:
Usually, RNNs are used for problems in which your data is time dependent. For example, if you want to predict the next word in a sentence, the new word depends on all the words that showed up in the previous time steps.
Let’s now switch gears a little bit and look at human visual understanding.
Why are humans better at visual understanding?
Before really going into the details about the majestic human brain, I’d like to discuss a downside of these deep neural nets.
Although Deep Neural Nets seem wonderful and magical, they can unfortunately be fooled easily. Take a look at this:
[The above image was taken from Andrej Karpathy’s Blog]
As the image shows, each image is superimposed with a noise image which visually doesn't change the original at all, and yet the result gets misclassified as an ostrich!
Such attacks on deep neural nets are called adversarial attacks. They were first brought up by Szegedy et al. in 2013 and were then investigated further by Goodfellow et al. in 2014. It was found that, by optimizing the pixel intensities of an image, we can find a minimal noise signal that makes the network give priority to a different class instead of the correct one. This line of work also engendered the growth of generative models. There are three well-known families of generative models as of today, namely Pixel RNN / Pixel CNN, variational auto-encoders and generative adversarial networks (GANs).
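One of the simplest attacks from this line of work is the fast gradient sign method (FGSM) of Goodfellow et al.: nudge every pixel a tiny step in the direction of the loss gradient's sign. Here is a minimal sketch on a toy logistic classifier; the weights and input are random stand-ins, not a real trained model or image.

```python
import numpy as np

rng = np.random.default_rng(2)

w = rng.normal(size=64)          # fixed "trained" weights (toy stand-in)
x = rng.normal(size=64)          # the clean input ("image" pixels)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p_clean = sigmoid(w @ x)         # model's confidence that x is the true class

# Gradient of the true-class log-loss w.r.t. the input is (p - 1) * w
# for this logistic model; FGSM perturbs by the SIGN of that gradient.
grad = (p_clean - 1.0) * w
eps = 0.1
x_adv = x + eps * np.sign(grad)  # small, visually negligible perturbation

p_adv = sigmoid(w @ x_adv)
print(p_clean, p_adv)            # confidence in the true class drops
```

Even though each pixel moves by at most `eps`, the shifts all push the score the same way, so the confidence drop can be large — the same effect that turns the images above into "ostriches".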
Human Visual Understanding
Although we've come a long way in developing cool computer vision technology, in the long run humans are much better at image understanding than any machine. This is because machines are quite narrow-sighted, in the sense that they just learn from a fixed category of images. Although they might have learned from a massive number of images (typically about a million for the ImageNet challenges), that isn't anywhere close to what humans can do. I'd mainly attribute this to the human brain, more precisely to the neocortex. The neocortex is the part of the brain responsible for recognizing patterns, cognition and other higher-order functions such as perception. Our brain is so intricately designed that it helps us remember things without directly dumping the raw data into memory, as a hard disk would. The brain instead stores patterns of what we witness and later retrieves them when necessary.
Besides, humans are continuously collecting data (e.g., collecting images through vision) at every point in their lives, unlike machines. Let's take an example. Most of us see dogs almost every day, which means we've seen dogs in different postures and from different angles. So, given an image containing dogs, there is a very high probability that we'd recognize a dog in the picture. This isn't true for machines. A machine might have been trained on only a limited set of dog images and can hence be fooled easily: feed in an image of the same dog in a slightly different posture, and it might get misclassified.
Can A.I. really compete against the human brain?
Well, this has been a very contentious topic in the past. Let's analyze it!
In a talk, Jeff Dean mentioned the number of parameters that make up most of the published deep neural networks since 2011. And if you noticed, for humans he mentioned "100 trillion". Although he seemed to treat that somewhat like a joke, it seems quite plausible given the amount of complex stuff the human brain can handle. Assuming our brain really is that complex, is it even practical to design a system with so many parameters?
Well, there certainly have been some major breakthroughs in artificial intelligence, like AlphaGo beating a world champion at the game of Go, OpenAI's Dota 2 bot beating experts at the game, and many more. Yet these feats seem very niche, in the sense that a Dota 2 bot is specific to Dota 2 and nothing else. The human brain, on the contrary, is very general: you use it for almost all of your day-to-day activities. What I'd infer from this is that, in order to compete with the mammalian brain, we'll need an artificial general intelligence!
Some random thoughts
I'd say using reinforcement learning (RL), specifically deep reinforcement learning (DRL), takes us a step closer to solving general intelligence. In RL, an agent discovers optimal ways to act by fiddling around with its environment. This also seems analogous to how humans learn: we learn to do things by finding out whether our actions were correct or not. In the same way, a reinforcement learning agent performs actions, each of which has an associated reward, and it learns from those rewards, i.e., the agent picks actions so as to maximize the total future reward it receives.
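That action-reward loop can be shown with tabular Q-learning on a toy problem. The environment below (a 5-state corridor where only the rightmost state pays a reward) and all hyperparameters are invented for illustration, but the update rule is the standard Q-learning rule: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a)).

```python
import random

random.seed(0)

n_states, actions = 5, [0, 1]           # action 0 = step left, 1 = step right
Q = [[0.0, 0.0] for _ in range(n_states)]
lr, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

for _ in range(2000):                    # episodes
    s = 0                                # always start at the left end
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[s][act])
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Q-learning update: move Q(s,a) toward reward + discounted best future value.
        Q[s][a] += lr * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# The greedy policy the agent has learned in each non-terminal state.
policy = [max(actions, key=lambda act: Q[s][act]) for s in range(n_states - 1)]
print(policy)
```

The agent starts knowing nothing, stumbles onto the reward through exploration, and the reward signal then propagates backwards through the Q-table until "go right" becomes the greedy choice everywhere — learning purely from trial, error and reward, as described above.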
This is currently an active area of research involving giants like DeepMind and OpenAI. In fact, DeepMind's stated mission is to solve general artificial intelligence!