An intuitive journey to Object Detection through Human Vision, Computer Vision and CNNs

Jim James
Published in CodeX · 11 min read · Aug 15, 2021

Why Computer Vision (CV)?

The next time you take that beautiful selfie, ride in your brand new Tesla, or marvel at robots dancing to ‘Do You Love Me?’, CV is at the heart of it. The applications extend further to face detection and recognition, and to identifying breast cancer, skin cancer and Covid-19 from scans and X-rays better than a trained physician.

Human Vision

Courtesy: https://www.pexels.com/

Have a quick glance at this photo. What do you see? Possibly a group of friends from college, in a new country, at a train or bus station, trying to figure out the schedule to their next destination. Our brain takes less than a split second to provide us with so much information that is not explicit here.

How is this even possible? Thanks to millions of years of evolution, all of us are blessed with one of the most complex systems on earth: the human brain, with ~80 billion neurons. Of the five senses that our brain constantly processes and integrates, more than half of its processing power is devoted to just one sense, vision.

How Do Humans See?

Source

It’s all about light. Light reflects off an object, and if that object is in your field of vision, it enters the eye (1, 2, 3).

When light hits the retina (a light-sensitive layer of tissue at the back of the eye), special cells called photoreceptors turn the light into electrical signals (4, 5). These electrical signals travel from the retina through the optic nerve to the brain. Then the brain turns the signals into the images you see (6).

Hierarchical Human Vision

One of the most influential papers in human/animal vision that inspired computer vision as well was published by two neurophysiologists — David Hubel and Torsten Wiesel — in 1959 (video here). Their publication, entitled “Receptive fields of single neurons in the cat’s striate cortex”, focused on how the visual neurons in a cat’s brain responded to various shapes.

Source: https://goodpsychology.wordpress.com/2013/03/13/235/

They placed electrodes into the primary visual cortex area of an anesthetized cat’s brain and observed, or at least tried to, the neuronal activity in that region while showing the animal various images. The researchers established, through their experimentation, that there are simple and complex neurons in the primary visual cortex and that visual processing always starts with simple structures such as oriented edges.

In 1982, David Marr, a British neuroscientist, published another influential work — the book “Vision: A computational investigation into the human representation and processing of visual information”.
Building on the ideas of Hubel and Wiesel, Marr gave us the next important insight: he established that vision is hierarchical. The vision system’s main function, he argued, is to create 3D representations of the environment so we can interact with it.

Courtesy: http://cs231n.stanford.edu/

He introduced a framework for vision where low-level algorithms that detect edges, curves, corners, etc., are used as stepping stones towards a high-level understanding of visual data.

David Marr’s representational framework for vision includes:

  • A Primal Sketch of an image, where edges, bars, boundaries etc., are represented (this is clearly inspired by Hubel and Wiesel’s research).
  • A 2½D sketch representation where surfaces, information about depth and discontinuities on an image are pieced together.
  • A 3D model that is hierarchically organized in terms of surface and volumetric primitives.

Mapping this theory to the actual brain, the primary visual cortex (denoted V1, the green area at the back of the brain in the figure below) is the first layer that performs image processing, starting with edge detection. The subsequent layers aggregate the information from V1 and perform progressively more complex tasks to achieve the final goal of vision.

Source: https://figshare.com/articles/dataset/Ventral_visual_stream/106794

Computer Vision

Let’s switch gears and move to our topic of the day, ‘Computer Vision’. Our focus is on the ability of a computer to recognise the object in an image, which is commonly known as a ‘classification problem’, i.e. is the given image a cat or a dog?

Source

While the problem seems like a basic image classification exercise, it is one of the most fundamental building blocks of the more complex tasks in computer vision shown below.

Courtesy: http://cs231n.stanford.edu/

How Do Computers See?

While humans see objects as light reflected from the objects, computers see images as numbers.

Below is a simple illustration of the grayscale image buffer which stores our image of Abraham Lincoln. Each pixel’s brightness is represented by a single 8-bit number, whose range is from 0 (black) to 255 (white).

Courtesy: http://introtodeeplearning.com/ (MIT)
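To make this concrete, here is a minimal sketch (assuming NumPy and Pillow are installed; “lincoln.png” is a placeholder filename) of how an image becomes nothing more than a grid of 8-bit numbers:

```python
# Minimal sketch: an image is just a 2D grid of 8-bit brightness values.
# Assumes NumPy and Pillow are installed; "lincoln.png" is a placeholder file.
import numpy as np
from PIL import Image

img = Image.open("lincoln.png").convert("L")   # "L" = 8-bit grayscale
pixels = np.asarray(img)                       # 2D array of values in 0..255

print(pixels.shape)                # (height, width)
print(pixels.min(), pixels.max())  # somewhere between 0 (black) and 255 (white)
print(pixels[:5, :5])              # the top-left corner of the image as raw numbers
```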

This should give you a very good idea of how we can solve the problem of image classification. If you have this numeric equivalent of a cat image (Cat A), then for a new image (Image A) we can compare its numeric representation to Cat A; if the numbers are found to be similar, then Image A should also contain a cat, right?

This is exactly how an early CV algorithm, the ‘k-Nearest Neighbour’ algorithm, worked. It was found to be 38.6% accurate on a standardised image classification test dataset called the CIFAR-10 dataset. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labelled with one of 10 classes (for example “airplane, automobile, bird, etc.”). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images.
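As an illustration of the core idea (not the full CIFAR-10 pipeline), here is a toy nearest-neighbour sketch in NumPy: measure the pixel-wise (L1) difference between a new image and every stored training image, and copy the label of the closest one. The random arrays below stand in for real images.

```python
# Toy sketch of the nearest-neighbour idea: compare a new image to every stored
# training image using the pixel-wise (L1) difference and copy the label of the
# closest match. Random data stands in for real images here.
import numpy as np

def nearest_neighbour_predict(train_images, train_labels, new_image):
    # train_images: (N, H, W) array; train_labels: length-N list of class names
    distances = np.abs(train_images - new_image).sum(axis=(1, 2))  # pixel-wise difference per image
    return train_labels[int(np.argmin(distances))]                 # label of the closest image

train_images = np.random.randint(0, 256, size=(100, 32, 32))        # 100 fake 32x32 "training images"
train_labels = ["cat" if i % 2 == 0 else "dog" for i in range(100)]
new_image = np.random.randint(0, 256, size=(32, 32))                 # a fake "new image"
print(nearest_neighbour_predict(train_images, train_labels, new_image))
```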

In the image below, on the left, you can see 10 random example images from each of the 10 classes:

Courtesy: http://cs231n.stanford.edu/

In the following image, the first column shows a few test images, and next to each are the top 10 nearest neighbours in the training set according to pixel-wise difference.

Courtesy: http://cs231n.stanford.edu/

However, there is something REALLY WRONG here. The first image is a boat, but its first match comes up as a bird. Also, the third image is of a frog but is matched to a cat first!

YES, this is the major flaw with this algorithm. In the first case, the algorithm would have got confused because the outline of the boat looks like the bird in the matched image, and in the second case the colour and posture match. This further highlights one of the key challenges in computer vision: similar objects appear under widely varying settings.

Courtesy: http://cs231n.stanford.edu/

To work around this, we have to find a way to extract the features of a human face from an image, such as the nose, lips, eyes, ears, etc. Then, by combining these features, we can identify the object as a human irrespective of the setting it is in, which, as we saw, is very similar to how the human brain/human vision works. This is where the main topic of today’s discussion comes into the picture: ‘Convolutional Neural Networks (CNNs)’.

Convolutional Neural Networks (CNNs)

At a very high level, CNNs focus on two key areas:

1) Extract the low-level features (lines, edges, etc.) from the image.
2) Build up high-level features (nose, lips, eyes, ears, etc.) from these low-level features.

1) Extract low-level features

Source:https://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/

A filter (with the red outline) slides over the input image (the convolution operation) to produce a set of features (a feature map). Another filter (with the green outline) slides over the same image and gives a different feature map, as shown. It is important to note that the convolution operation captures the local dependencies/low-level features in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices, as we discussed above.

For more mathematical details, refer to this great article: https://mlnotebook.github.io/post/CNN1/
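If you prefer code to diagrams, here is a bare-bones NumPy sketch of the convolution operation described above: a small filter slides over the image (no padding, stride 1) and produces one feature-map value at each position. The ‘vertical edge’ filter is just an illustrative choice.

```python
# Bare-bones sketch of a convolution: a small filter slides over the image
# (no padding, stride 1) and produces one feature-map value at each position.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]           # region currently under the filter
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

image = np.random.rand(7, 7)                   # a tiny fake grayscale image
vertical_edge_filter = np.array([[1, 0, -1],
                                 [1, 0, -1],
                                 [1, 0, -1]])  # responds strongly to vertical edges
print(convolve2d(image, vertical_edge_filter).shape)  # (5, 5) feature map
```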

This could be a good time to introduce what an actual Convolutional Neural Network (CNN) looks like.

Fig 1.1 — source

CNNs are multiple ‘Convolution Modules’ stacked together, with the final classification layer determining the actual content of the image, e.g. a cat or a dog.
Conv. Module #1 can be considered the one that extracts the low-level features.
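As a rough illustration of this structure (not the exact network in Fig 1.1), a tiny PyTorch sketch with two stacked convolution modules followed by a classification layer could look like this; the layer sizes and the 32×32 input are arbitrary choices for the sketch:

```python
# Illustrative PyTorch sketch of the Fig 1.1 structure; sizes are arbitrary.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # Conv. Module #1: low-level features
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # Conv. Module #2: higher-level features
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 2),                                                 # Classification: e.g. cat vs dog
)

dummy_batch = torch.randn(4, 3, 32, 32)   # four fake 32x32 RGB images
print(model(dummy_batch).shape)           # torch.Size([4, 2]) -> one score per class
```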

2) Build up high-level features (nose, lips, eyes, ears, etc.) from these low-level features.
The low-level features are composed together to build the mid-level features, and then, in later layers, higher-level features, resembling the human vision hierarchy.

Courtesy: http://cs231n.stanford.edu/

The obvious question here is: how does the CNN know which template (filter) to apply, since this could vary for different types of images, and how does the CNN come up with the mid/low-level features that so accurately represent the input image? This is where the learning part of machine learning comes into play. As you would have rightly guessed, this does not happen automatically as soon as the CNN sees an image.

The CNN goes through a training (supervised learning) process where it is shown a bunch of different cat images, one at a time. Before the process starts, the templates are blank slates, and as the first image passes through the different layers of the CNN,
Conv. Module #1 → Conv. Module #2 → Classification (as in fig 1.1)
almost no low/mid/high-level feature identification happens, and the confidence of the CNN in predicting this as a cat image will be very low.

However, feedback is sent back through the layers as,
Classification → Conv. Module #2 → Conv. Module #1 (as in fig 1.1)
based on how good the ‘prediction’ is compared to the ‘actual’ label. This feedback adjusts the filters across the layers and makes the CNN a bit smarter. The process is repeated for all the images, and the batch of images is processed multiple times, enhancing the learning every time.
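Sketched in PyTorch, this feedback loop is just the standard training loop: a forward pass through the modules, a loss that measures how far the prediction is from the actual label, and a backward pass that nudges the filters. The tiny model and random data below are stand-ins, not a real cat classifier.

```python
# Hedged sketch of the feedback loop: forward pass, compare prediction with the
# actual label, send feedback backwards (backpropagation), adjust the filters.
import torch
import torch.nn as nn

model = nn.Sequential(                     # Conv. Module #1 -> Conv. Module #2 -> Classification
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 8 * 8, 2),
)
loss_fn = nn.CrossEntropyLoss()                            # "how wrong was the prediction?"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)          # stand-in for a batch of cat/dog photos
labels = torch.randint(0, 2, (8,))          # stand-in for the actual labels

for epoch in range(5):                      # repeated passes enhance the learning every time
    predictions = model(images)             # forward: Conv #1 -> Conv #2 -> Classification
    loss = loss_fn(predictions, labels)     # 'prediction' vs 'actual'
    optimizer.zero_grad()
    loss.backward()                         # feedback flows Classification -> Conv #2 -> Conv #1
    optimizer.step()                        # filters adjusted; the CNN gets a bit smarter
    print(epoch, loss.item())
```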

Consider this as teaching a kid what an apple looks like. The first time we show her an image of an apple, she might call it a ball. But then we correct her, and she might capture one feature in that image, let’s say that it is round. The next time we show her the same image she might call it an orange; close, but not quite there. We then reinforce that it is an apple, so she might pick up an additional feature: that it is red in colour. The same process, repeated over time, helps her identify the most relevant features of an apple, and the knowledge embeds in her forever.

Check out the amazing visualizations of a Convolutional Neural Network(CNN) trained on the MNIST Database of handwritten digits by Adam Harley.

If you are interested in a deep dive into CNNs, refer to this article.

Object Detection

Now that we have a good intuition for image classification, let’s extend the concept to the more advanced problem of object detection.

Object detection is a computer vision technique that works to identify and locate objects within an image or video. Specifically, object detection draws bounding boxes around these detected objects, which allow us to locate where said objects are in (or how they move through) a given scene.

Courtesy: http://cs231n.stanford.edu/

Based on what we have seen so far, if there is a way to identify the boundaries of every object in the image, then we can crop each part and use an image classification algorithm to identify the object in that part.

At a very high level, two approaches are used to identify object boundaries in an image.

1. Sliding Window Detectors — Windows of varied sizes and aspect ratios are slid over an image from top to bottom and left to right. We cut out patches from the picture according to the sliding windows, and the patches are warped since many classifiers take fixed-size images only (a code sketch follows this list).

Source

2. Selective Search — Instead of a brute-force approach as in ‘Sliding Window Detectors’, we use a region proposal method to create regions of interest (ROIs) for object detection. In selective search (SS), we start with each individual pixel as its own group. Next, we calculate the texture for each group and combine the two that are the closest. But to avoid a single region gobbling up the others, we prefer to group smaller regions first. We continue merging regions until everything is combined. In the first row below, we show how the regions grow, and the blue rectangles in the second row show all the possible ROIs created during the merging.

Source: van de Sande et al. ICCV’11
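To make the sliding-window idea from point 1 concrete, here is a rough Python sketch that cuts fixed-stride patches out of an image and warps each one to the fixed size a classifier would expect. The window size, stride and target size are placeholder values, and OpenCV is assumed only for the resize.

```python
# Rough sketch of a sliding-window detector front end: cut patches, warp them
# to a fixed size, then hand each patch to an image classifier.
import numpy as np
import cv2  # assumes opencv-python is installed

def sliding_windows(image, window_size=(64, 64), stride=32, warp_to=(32, 32)):
    h, w = image.shape[:2]
    for top in range(0, h - window_size[1] + 1, stride):
        for left in range(0, w - window_size[0] + 1, stride):
            patch = image[top:top + window_size[1], left:left + window_size[0]]
            yield (left, top), cv2.resize(patch, warp_to)  # warp: classifiers want fixed-size input

image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image
for (x, y), patch in sliding_windows(image):
    pass  # run an image classifier on each warped patch here
```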

There are mainly two types of object detection algorithms:

1. Region-based object detectors (Faster R-CNN, R-FCN, FPN) — First they identify the regions likely to contain objects and later classify them. Relatively slower but more accurate.
2. Single-shot object detectors (SSD, YOLO) — They identify the objects and classify them in parallel. Faster but relatively less accurate.

If you are interested in learning about object detection algorithms in detail, check out this article.

Hands-On

Detectron2 by Facebook is a quick and easy way to get started with your object detection journey (Colab, Git). Check out this article if you want to apply object detection to your custom datasets.
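For reference, the inference side of that journey is only a few lines. The snippet below is condensed from Detectron2’s getting-started tutorial: load a pre-trained Faster R-CNN from the model zoo and run it on a single image (“input.jpg” is a placeholder path; Detectron2 and OpenCV need to be installed, e.g. in the Colab linked above).

```python
# Condensed from Detectron2's getting-started tutorial; "input.jpg" is a placeholder.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # keep detections above 50% confidence
predictor = DefaultPredictor(cfg)

image = cv2.imread("input.jpg")
outputs = predictor(image)
print(outputs["instances"].pred_classes)      # detected class ids
print(outputs["instances"].pred_boxes)        # bounding boxes around the detected objects
```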

Applications

Autonomous Driving

Source:Unsplash

Medical Imaging

DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning

Robotics

Robots that clean up after us

Recent Trends

3D Object Detection: The detection of 3D objects has its own requirements. For instance, these objects do not follow any specific orientation, and this poses considerable challenges. Certain advances have been made in recent years, but a lot remains to be done to consistently achieve high performance.

Real-time, high-speed Detection: Object detection is resource-intensive, both in terms of human intervention and model processing (compute). As a result, high-speed detection in real-time, particularly for mobile devices, is an important development area.

Small-Object Detection: Most detectors struggle with small objects. The inaccuracies in small-object detection are considerably higher than those for medium or large-sized objects.

Video-based Object Detection: Modern object detection is primarily designed for images, not explicitly for videos. As a result, videos need to be chunked into individual frames before detection can happen. This creates inefficiencies such as delays in detection, overheads in converting the videos into frames, and non-consideration of frame-level relationships. Addressing these aspects is a critical area of current and future development.

Deep-Dive References

MIT 6.S191: Introduction to Deep Learning
https://github.com/jim-j-james/introtodeeplearning

Stanford CS231n: Convolutional Neural Networks for Visual Recognition
https://github.com/jim-j-james/cs231n

Padhai, IIT Madras
https://github.com/jim-j-james/PadhAI_Deep_Learning

** These views are my own and do not represent the views of my employer **
