Let's Understand Object Recognition

Published in

TheCyPhy

5 min read · May 8, 2020

This post gives a flavor of the progress made over the years in object recognition using deep learning. It covers image classification, object detection, semantic segmentation, and instance segmentation.

Object recognition basically boils down to identifying objects in digital photographs. Over time, it has been tackled as four significant classes of tasks:

Image classification: determine whether an object is present in an image or not. CNNs have brought significant success here.
Object localization/detection: determine the location of an object of interest, or of all the objects in an image, usually as bounding boxes. This is an area of active research.
Object segmentation/semantic segmentation: determine the location of classes of objects at the pixel level. This is an area of active research.
Instance segmentation: determine the pixel-level location of every individual object, separating objects of different classes as well as different objects of the same class.

The order of complexity of the above-mentioned tasks is:

Image classification < Object detection < Semantic segmentation < Instance segmentation
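To make the difference between the four tasks concrete, here is a minimal sketch of what each one outputs for the same picture. The 4x4 toy "image", the "cat" label, and the helper `bounding_box` are assumptions made up for illustration; real models produce these outputs from pixels, not from a ready-made label map.

```python
# Toy 4x4 label map standing in for an image with two cats (hypothetical data).
# 0 = background, 1 = first cat, 2 = second cat.
instance_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]


def bounding_box(label_map, instance_id):
    """Return (row_min, col_min, row_max, col_max), inclusive, for one object."""
    rows = [r for r, row in enumerate(label_map) for v in row if v == instance_id]
    cols = [c for row in label_map for c, v in enumerate(row) if v == instance_id]
    return (min(rows), min(cols), max(rows), max(cols))


instance_ids = sorted({v for row in instance_map for v in row if v != 0})

# 1. Image classification: a single label for the whole image.
classification = "cat" if instance_ids else "no cat"

# 2. Object detection: one bounding box per object.
detection = [bounding_box(instance_map, i) for i in instance_ids]

# 3. Semantic segmentation: one class per pixel; both cats share class 1.
semantic_mask = [[1 if v != 0 else 0 for v in row] for row in instance_map]

# 4. Instance segmentation: a separate pixel mask for every individual object.
instance_masks = {
    i: [[1 if v == i else 0 for v in row] for row in instance_map]
    for i in instance_ids
}
```

Each step down the list carries more information than the one before it, which is one way to read the ordering of complexity.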

Image classification

Given an image as input, the output is whether or not an object is present in the input.

Image classification was long considered a difficult task in the research community. The 2012 paper by A. Krizhevsky et al., titled ImageNet Classification with Deep Convolutional Neural Networks, beat earlier approaches on the ImageNet benchmark by a wide margin. This is when the deep learning revolution began. The paper laid the foundations and popularized concepts such as the ReLU activation function, max pooling, and strides.

The architecture of the network from the ImageNet classification paper.

We have many pre-trained deep models at our disposal, such as AlexNet, VGG16, VGG19, ResNet50, DenseNet, and InceptionV3. They have been trained on huge datasets and work well with real-world data.
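The building blocks named above are easy to try by hand. Below is a minimal pure-Python sketch of ReLU and of max pooling with a stride; the function names `relu` and `max_pool2d` and the 4x4 feature map are assumptions for illustration, not code from the paper.

```python
def relu(feature_map):
    """ReLU: keep positive values, replace negative values with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, kernel_size, stride):
    """Slide a kernel_size x kernel_size window, moving `stride` pixels each
    step, and keep the largest value in every window (no padding)."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - kernel_size + 1, stride):
        pooled_row = []
        for left in range(0, width - kernel_size + 1, stride):
            window = [
                feature_map[top + i][left + j]
                for i in range(kernel_size)
                for j in range(kernel_size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.0, 0.0],
    [0.0, 1.5, -1.0, 2.5],
    [-0.5, 3.0, 0.0, -4.0],
]

activated = relu(feature_map)
pooled = max_pool2d(activated, kernel_size=2, stride=2)
print(pooled)  # [[4.0, 1.0], [3.0, 2.5]]
```

When the stride is smaller than the kernel size, neighbouring windows overlap; when the two are equal, as in the call above, each pixel falls into exactly one window.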

Object Detection

Given an image as input, the output is where an object is present in the input, usually in the form of bounding boxes.

Over the past decade, a lot of research has gone into object detection, and it has advanced considerably. Object detection techniques are used in applications such as OCR and number plate readers. A few notable techniques and papers are:

a. YOLOv4: Optimal Speed and Accuracy of Object Detection: YOLOv4, by Alexey Bochkovskiy et al., is the latest version of the YOLO models at the time of writing. YOLO stands for You Only Look Once. The model is fast and accurate: it produces its output in a single pass over the image, hence the name. Recent YOLO versions, including YOLOv3, also rely on the interesting concept of anchor boxes.

YOLOv4 in action on Mr. Bond.

b. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: Faster R-CNN follows R-CNN and Fast R-CNN in the R-CNN line of detectors. It uses a Region Proposal Network (RPN) to predict candidate regions, which are then passed to a Fast R-CNN detector for classification and box refinement. Region-based models tend to have good mean average precision but are slow at test time.

Faster R-CNN with ResNet-101 architecture in action.
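Metrics such as mean average precision need a way to decide whether a predicted box matches a ground-truth box. The usual measure is intersection over union (IoU). Here is a minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)`; the function name `iou` and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Width and height of the overlapping rectangle (zero if boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)      # hypothetical detector output
ground_truth = (5, 0, 15, 10)   # hypothetical annotation

overlap = iou(predicted, ground_truth)
print(round(overlap, 3))  # 0.333
```

A detection is commonly counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, although the exact threshold depends on the benchmark.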

Semantic Segmentation

Given an image as input, the output is a mask with a different shade for each class of objects in the input.

Semantic segmentation can be thought of as classifying each pixel in the input into its corresponding class label. Since every class has a corresponding color, coloring each pixel by its class gives us a mask. As we are predicting for every pixel, the task is also referred to as dense prediction. Semantic segmentation is heavily studied for self-driving cars. It is computationally very expensive, which makes real-time prediction on non-specialized hardware very hard.
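The per-pixel classification described above can be sketched in a few lines: pick the highest-scoring class for every pixel, then look up that class's color. The class names, the palette, and the 2x2 score maps below are assumptions made up for illustration; a real network would produce the scores.

```python
# Hypothetical classes and one RGB color per class.
CLASS_NAMES = ["background", "road", "car"]
PALETTE = [(0, 0, 0), (128, 64, 128), (0, 0, 142)]


def predict_labels(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x).
    Returns a label map holding the best-scoring class index per pixel."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def colorize(labels):
    """Turn a label map into a color mask using the palette."""
    return [[PALETTE[label] for label in row] for row in labels]


# Hypothetical per-class score maps for a 2x2 image.
scores = [
    [[0.7, 0.1], [0.2, 0.1]],  # background
    [[0.2, 0.8], [0.7, 0.3]],  # road
    [[0.1, 0.1], [0.1, 0.6]],  # car
]

labels = predict_labels(scores)
mask = colorize(labels)
print(labels)  # [[0, 1], [1, 2]]
```

Because a score is needed for every class at every pixel, the cost grows with both image resolution and the number of classes, which is part of why dense prediction is expensive.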

a. U-Net: Convolutional Networks for Biomedical Image Segmentation: U-Net is a very popular model for image segmentation. Olaf Ronneberger et al. proposed it for medical image segmentation. The paper uses data augmentation techniques and a distinctive contracting-expanding architecture to accomplish the task. The main aim of the paper was to build a model that performs well with scarce data, which is usually the case with medical data.

b. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs: The DeepLab line of work has produced phenomenal research in deep learning over the years. The paper by Liang-Chieh Chen et al. brought a lot of structural changes to how semantic segmentation is done. It discusses atrous convolutions, atrous spatial pyramid pooling, and Conditional Random Fields. It reports state-of-the-art results on PASCAL VOC 2012 and improved results on other datasets, including Cityscapes.

DeepLab V3 in action on a road scene.
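Atrous (dilated) convolution is easiest to see in one dimension: the kernel taps are spread `rate` samples apart, so the same number of weights covers a wider stretch of the input. The sketch below is a simplified illustration, not DeepLab's implementation; the function name `atrous_conv1d`, the signal, and the kernel are assumptions. Like most deep learning code, it does not flip the kernel.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D atrous convolution without padding.
    rate=1 is an ordinary convolution; rate=2 skips every other sample."""
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    return [
        sum(kernel[k] * signal[start + k * rate] for k in range(len(kernel)))
        for start in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]   # hypothetical input
kernel = [1, 0, -1]              # hypothetical 3-tap kernel

dense = atrous_conv1d(signal, kernel, rate=1)   # each output sees 3 samples
atrous = atrous_conv1d(signal, kernel, rate=2)  # each output sees 5 samples

print(dense)   # [-2, -2, -2, -2, -2]
print(atrous)  # [-4, -4, -4]
```

Both calls use the same three weights, but with `rate=2` each output depends on a window of five input samples instead of three. A larger field of view with no extra parameters is what makes atrous convolution attractive for segmentation.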

Instance Segmentation

Given an image as input, the output is a mask with different shades for every individual object in the picture.

Instance segmentation can be thought of as classifying each pixel in the input into its corresponding object label. Every individual object in the picture therefore gets its own mask. Instance segmentation can be thought of as object detection + semantic segmentation.
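The "object detection + semantic segmentation" view can be sketched naively: take each detected box and keep only the pixels of the right class that fall inside it. This is an illustration of the idea only, not how any particular model works, and it breaks down when objects of the same class overlap. The function name, the boxes, and the mask below are all made up.

```python
def instance_masks_from(boxes, semantic_mask, class_id):
    """For each (x_min, y_min, x_max, y_max) box (inclusive pixel coordinates),
    keep the pixels of `class_id` that lie inside the box."""
    masks = []
    for x_min, y_min, x_max, y_max in boxes:
        mask = [
            [
                1 if (x_min <= x <= x_max and y_min <= y <= y_max and value == class_id) else 0
                for x, value in enumerate(row)
            ]
            for y, row in enumerate(semantic_mask)
        ]
        masks.append(mask)
    return masks


# Hypothetical semantic mask: class 1 appears as two separate blobs.
semantic_mask = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# Hypothetical detector output: one box around each blob.
boxes = [(0, 0, 1, 1), (4, 0, 5, 1)]

masks = instance_masks_from(boxes, semantic_mask, class_id=1)
print(masks[0][0])  # [1, 1, 0, 0, 0, 0]
print(masks[1][0])  # [0, 0, 0, 0, 1, 1]
```

The semantic mask alone cannot tell the two blobs apart because they share a class; the boxes supply the missing "which object" information.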

“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.” — Joseph Redmon, author of YOLO object detection models.

A lot of research is actively taking place in instance segmentation. A few notable papers are:

a. YOLACT++: Better Real-time Instance Segmentation: YOLACT++ is the successor of YOLACT: Real-time Instance Segmentation by Daniel Bolya et al. At the time of writing, YOLACT++ is the latest advancement, producing segmentations at >30 fps. This makes it suitable for real-time applications.

YOLACT real-time instance segmentation video submitted at ICCV by authors.

b. Mask R-CNN: Mask R-CNN by Kaiming He et al. is one of the most influential early deep learning papers on instance segmentation. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

Mask R-CNN in action for Instance segmentation.

Let's sum it all up…

Object detection vs. semantic segmentation vs. instance segmentation.

We have seen tremendous progress in dealing with images. We have come from just classifying whether an image contains an object to pixel-level localization of every object in the image.
We are now at real-time instance segmentation at >30 fps. But all these algorithms are resource hungry. Even running inference with them requires expensive GPUs.
The standards change every 2 years. We have seen many structural changes in architecture over the decade. What seems to work now is discarded 2 years down the line.

We are still far from models and algorithms that scale to general use cases and general devices. Almost all of these models are run on one or more GPUs and are too slow for real-time use even on high-end CPUs.
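The fps figures quoted in this post translate directly into a time budget per frame. Here is a small sketch of that arithmetic, treating 30 fps as the real-time target; the helper names are assumptions for illustration.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def is_real_time(latency_ms, target_fps=30.0):
    """True if a model that takes `latency_ms` per frame keeps up with target_fps."""
    return latency_ms <= frame_budget_ms(target_fps)


print(round(frame_budget_ms(30), 1))  # 33.3 ms per frame at 30 fps
print(frame_budget_ms(5))             # 200.0 ms per frame at 5 fps
print(is_real_time(200.0))            # False: 5 fps is well short of 30 fps
```

A model running at 5 fps spends 200 ms on each frame, about six times the roughly 33 ms available at 30 fps.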

If you find any mistakes, please correct me. You can email me at mustaffahussain4734@gmail.com

Happy reading 🙂

Written by Mustaffa Hussain