Object detection vs. Image segmentation

Simay · Inovako · Jul 5, 2020

In this article, I aim to compare and contrast object detection and image segmentation, and help you decide which technique to use based on the needs of the application you want to build.

Before diving into comparing these two techniques, I want to first explain what these techniques actually are and what we mean when we talk about object detection or image segmentation.

This article is made up of three parts.

Part 1: object detection

Part 2: image segmentation

Part 3: comparing object detection and image segmentation

If you are more interested in certain parts than others, feel free to skip ahead to the part you want to read.

For those who want to follow the whole journey, let’s dive into object detection without further ado.

Part 1: Object detection

We will divide the object detection section into three parts: definitions, approaches, and applications.

A. Definitions

Object detection has two main purposes:

  1. Identify the object within an image
  2. Locate the object within an image

More specifically, object detection draws a bounding box around each detected object together with a class tag (the model’s prediction of what that object is). This gives both the location of the detected object and its identity.
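To make this concrete, here is a minimal sketch in Python of what a single detection typically looks like: a bounding box, a class tag, and a confidence score. The field names and values are illustrative, not from any particular library.

```python
# A single detection: a bounding box, a class tag, and a confidence score.
# Coordinates follow the common (x_min, y_min, x_max, y_max) pixel convention.
detection = {
    "box": (48, 120, 210, 330),  # (x_min, y_min, x_max, y_max) in pixels
    "label": "dog",              # the model's class tag for this object
    "score": 0.93,               # the model's confidence in the prediction
}

# Object detection returns one such record per object found in the image.
detections = [detection]
for d in detections:
    print(f"{d['label']} ({d['score']:.0%}) at {d['box']}")
```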

Here’s a picture that differentiates object detection from classification + localization:

[Image: object detection vs. classification + localization (source)]

The main difference between object detection and classification + localization is that classification + localization works for only one object per image, while object detection can detect multiple objects within the same image.

So, how does object detection work? Let’s look into how we can approach it.

B. Approaches

CNN

Perhaps the first thing that comes to mind when dealing with object detection is to apply a CNN (convolutional neural network) all over our images to detect objects. Here’s an example of what that might look like on a real image:

[Image: a CNN applied at many locations and scales (source)]

The problem with this approach is that we need to apply the CNN to a huge number of locations, scales, and aspect ratios. This is computationally expensive, and the choice of locations where the model attempts to find objects is essentially arbitrary.
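Here is a rough sketch of this brute-force idea, assuming a generic `classifier` callable that returns a label and score for a crop; the nested loops over scales and positions show why the number of forward passes grows so quickly:

```python
import itertools

def sliding_window_detect(image, classifier, window=224, stride=32,
                          scales=(1.0, 0.75, 0.5), threshold=0.9):
    """Naively apply a CNN classifier at many locations and scales.

    `image` is an (H, W, C) array and `classifier(crop)` is assumed to
    return a (label, score) pair for a crop. Every extra scale, stride,
    or aspect ratio multiplies the number of forward passes.
    """
    h, w = image.shape[:2]
    detections = []
    for scale in scales:
        win = int(window / scale)  # larger windows cover larger objects
        for y, x in itertools.product(range(0, h - win + 1, stride),
                                      range(0, w - win + 1, stride)):
            label, score = classifier(image[y:y + win, x:x + win])
            if score >= threshold:
                detections.append((x, y, x + win, y + win, label, score))
    return detections
```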

Region Proposals / Selective Search

A more elegant way to approach object detection is not to apply the CNN to a huge number of arbitrary locations, but to be more purposeful about where we apply it. This is exactly what region proposals do: find image regions that are likely to contain objects. Here’s an example of what that might look like on a real image:

[Image: region proposals on a real image (source)]

Region proposal approaches are much faster to run than the brute-force CNN approach above. In region proposal approaches, Regions of Interest (RoIs), the regions that might contain objects, are extracted first.
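If you want to experiment with selective search itself, OpenCV’s contrib modules ship an implementation. A minimal sketch, assuming `opencv-contrib-python` is installed and an `example.jpg` exists:

```python
import cv2  # requires opencv-contrib-python for the ximgproc module

# Build the selective-search engine and point it at an image.
image = cv2.imread("example.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # trade some recall for speed

# Each proposal is an (x, y, w, h) region that might contain an object.
proposals = ss.process()
print(f"{len(proposals)} regions of interest proposed")

# Downstream detectors typically keep only the first ~2000 proposals.
rois = proposals[:2000]
```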

Let’s now look into types of region proposals.

1. R-CNN

In R-CNN, regions of interest are extracted with selective search, producing approximately 2000 regions per image. These regions are then warped and passed into a CNN for feature extraction. The regions are classified using an SVM (support vector machine), and the bounding boxes are refined using regression. You can see this architecture in the image below.

[Image: the R-CNN architecture (source)]

The fact that RoIs are proposed in a more sophisticated way in R-CNN than in the brute-force CNN approach makes R-CNN faster. Yet, there are a couple of limitations with this model. First of all, it uses selective search, which is an exhaustive algorithm. Secondly, selective search involves no learning, so its accuracy is limited. Thirdly, since selective search produces approximately 2000 regions per image, the CNN runs on 2000 × N regions for N images. This is a lot, especially since all these features are cached and therefore end up occupying a large amount of disk space¹.

Indeed, three models are used for each region of interest: a CNN for feature extraction, a linear SVM for identifying objects, and a regressor for tightening the bounding boxes. Applying these three models per region of interest in each image makes R-CNN very slow to train. Detection (testing) is also slow, approximately 47 seconds per image. This is especially problematic if you have huge datasets.
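Putting the pieces together, the per-image R-CNN pipeline looks roughly like the sketch below. The four callables (`selective_search`, `cnn_features`, `svm_classify`, `regress_box`) are stand-ins for the real components; the point is the structure, with every proposal passing through the CNN separately:

```python
import cv2

def rcnn_detect(image, selective_search, cnn_features, svm_classify, regress_box):
    """Sketch of R-CNN: three separate models applied to each region of interest."""
    detections = []
    for (x, y, w, h) in selective_search(image)[:2000]:   # ~2000 proposals
        crop = cv2.resize(image[y:y + h, x:x + w], (227, 227))  # warp RoI to CNN input size
        features = cnn_features(crop)               # model 1: CNN feature extractor
        label, score = svm_classify(features)       # model 2: per-class linear SVMs
        box = regress_box(features, (x, y, w, h))   # model 3: bounding-box regression
        if label != "background":
            detections.append((box, label, score))
    return detections
```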

2. Fast R-CNN

Fast R-CNN uses a similar approach to R-CNN but at the same time overcomes some of the limitations that R-CNN has.

Instead of passing each region of interest into the CNN as R-CNN does, Fast R-CNN takes the whole image as input, processes it with several convolutions, and generates a convolutional feature map². An RoI pooling layer then extracts fixed-length feature vectors from the feature map. These feature vectors are fed into fully connected layers, from which a softmax layer estimates probabilities over the K object classes and a regression layer refines the bounding-box positions. The two outputs are trained jointly with a multi-task loss. You can see the architecture in the image below.

[Image: the Fast R-CNN architecture (source)]

We can summarize the contributions of Fast R-CNN as follows. The biggest contribution is that Fast R-CNN jointly learns to classify object proposals and refine their spatial locations. In Fast R-CNN, an image does not pass through the CNN once for each of its roughly 2000 region proposals, as in R-CNN. Instead, the image passes through the CNN once, producing a feature map that feeds the next steps. Thanks to the multi-task loss, training becomes single-stage, which is a massive improvement over R-CNN, and training can update all network layers. Furthermore, no disk storage is needed for feature caching, as R-CNN requires. Detection takes about 2 seconds per image, another massive improvement over R-CNN.
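The key new component here is RoI pooling, which cuts a fixed-size feature out of the shared feature map for each proposal. torchvision provides an implementation; a minimal sketch with a dummy feature map (the image and feature-map sizes below are made up for illustration):

```python
import torch
from torchvision.ops import roi_pool

# A dummy feature map: 1 image, 256 channels, 50x50 spatial resolution,
# standing in for the output of the shared convolutional backbone.
feature_map = torch.randn(1, 256, 50, 50)

# Two RoIs in the coordinates of the *original* image; each row is
# (batch_index, x1, y1, x2, y2).
rois = torch.tensor([
    [0,  40.0,  40.0, 300.0, 300.0],
    [0, 120.0,  60.0, 400.0, 240.0],
])

# spatial_scale maps image coordinates onto the feature map (here the
# backbone downsampled an 800x800 image by 16, so 50 = 800 / 16).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- fixed size, one per RoI
```

Because every RoI comes out the same size, a single set of fully connected layers can process all of them.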

Yet, there is still one great limitation with Fast R-CNN: the time it takes to propose regions. Fast R-CNN still uses selective search, so the limitations described in the R-CNN section remain. The fact that the runtime is dominated by region proposals is the biggest bottleneck.

3. Faster R-CNN

Faster R-CNN solves the bottleneck problem that Fast R-CNN has. Faster R-CNN still uses convolutional feature maps, but builds a Region Proposal Network (RPN) on top of them by adding a few extra convolutional layers. The RPN is a fully convolutional network that can be trained end-to-end to efficiently generate detection proposals, which are then passed into the RoI pooling layer. At each location, the RPN simultaneously regresses bounding boxes and classifies objectness³.

[Image: the Faster R-CNN architecture with the RPN (source)]

As the RPN replaces the exhaustive selective search, it removes the bottleneck of slow and inefficient region proposals. Indeed, the marginal cost of computing proposals becomes very small (about 10 ms per image).

[Image: test-time speed comparison of R-CNN, Fast R-CNN, and Faster R-CNN (source)]

You can see in the graph above how the test time per image drops from R-CNN to Fast R-CNN, and again to Faster R-CNN.

Faster R-CNN certainly overcomes the limitations of the brute-force CNN approach, R-CNN, and Fast R-CNN. Yet, there are still a couple of areas where Faster R-CNN falls short. One drawback is that it makes many passes over a single image to extract all the features and objects, which can be inefficient.
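For practical use, a pretrained Faster R-CNN is a few lines away in torchvision. A minimal inference sketch (the exact loading flag depends on your torchvision version):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a Faster R-CNN pretrained on COCO (newer torchvision releases
# use `weights=...` instead of `pretrained=True`).
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# The model takes a list of 3xHxW tensors with values in [0, 1].
image = torch.rand(3, 480, 640)  # stand-in for a real, normalized image
with torch.no_grad():
    predictions = model([image])

# Each prediction holds bounding boxes, COCO class labels, and scores.
boxes = predictions[0]["boxes"]    # (N, 4) tensor: x1, y1, x2, y2
labels = predictions[0]["labels"]  # (N,) integer class ids
scores = predictions[0]["scores"]  # (N,) confidences, sorted descending
```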

YOLO

The region proposal methods above first generate regions of interest in an image and then run a classifier on those regions. After classification, further post-processing eliminates duplicate detections, regresses the boxes, and rescores them. These pipelines are complex, require multiple passes over each image, and are hard to optimize.

YOLO, on the other hand, differs from the region proposal methods in one huge aspect. In YOLO, you only look once (hence the name) and predict objects while locating them. YOLO treats object detection as a regression problem: a single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.⁴ These predictions are encoded as a tensor, from which the final detections are produced, as shown below.

[Image: the YOLO detection pipeline (source)]

YOLO trains on the full image and directly optimizes detection performance. Its biggest contribution is that it is extremely fast: the lack of a complex pipeline and the framing of detection as a regression problem increase the speed immensely. Its fast version can run at more than 150 frames per second, meaning it can process streaming video in real time with a latency of less than 25 ms.⁴ This makes YOLO a strong candidate for real-time object detection problems. The network is also small, making YOLO preferable for on-device environments.⁵
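There are many YOLO implementations; one convenient option is YOLOv5, a later descendant of the model in the paper cited above, loadable through torch.hub. A minimal inference sketch:

```python
import torch

# Load a small pretrained YOLOv5 model from the ultralytics hub entry
# point (downloads weights on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

# A single forward pass returns boxes, classes, and scores for the image.
results = model("https://ultralytics.com/images/zidane.jpg")
results.print()               # summary: counts per class, inference time
detections = results.xyxy[0]  # (N, 6) tensor: x1, y1, x2, y2, score, class
```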


C. Applications

Here are some applications of object detection.

  • Face detection
  • Surveillance and security: Face recognition, vehicle tracking, activity recognition
  • Anomaly detection
  • Medical image processing
  • Manufacturing industry
  • Image retrieval
  • Self-driving cars
  • Crowd counting

As we’ve seen in the models above, object detection aims to draw bounding boxes at the locations where objects are found. So, if you want to know more about the shape or curvature of an object, object detection is not enough. Object detection also falls short at measuring the object: it gives no information about its area or perimeter.

Part 2: Image segmentation

A. Definitions

We can consider image segmentation a further extension of object detection: segmentation also detects objects, but it does so with a pixel-wise mask for each object. Image segmentation gives us a more detailed understanding of the shapes and curves of objects and tells us which class each pixel in the image belongs to.

B. Approaches

We will now go over semantic segmentation and instance segmentation.

1. Semantic Segmentation

Semantic segmentation labels each pixel with its class label and does not differentiate between instances. For example, in the image below, two cow instances are both classified as the class “cow”.

[Image: semantic segmentation labeling both cows as class “cow” (source)]

For semantic segmentation, we want our model to produce a high-resolution segmentation map. One way to achieve this is an encoder/decoder structure that combines downsampling and upsampling. Downsampling (pooling, strided convolutions) produces lower-resolution feature maps that help differentiate the classes; upsampling (unpooling or strided transpose convolutions) produces a higher-resolution segmented image.¹

[Image: encoder/decoder structure with downsampling and upsampling (source)]
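To make the encoder/decoder idea concrete, here is a toy PyTorch network (a sketch, not a production architecture): strided convolutions downsample, transpose convolutions upsample back to one class score per pixel.

```python
import torch
from torch import nn

class TinySegNet(nn.Module):
    """Toy encoder/decoder: downsample with strided convs, upsample back."""

    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(            # downsampling path
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(            # upsampling path
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))     # per-pixel class logits

# One class score per pixel; argmax over channels gives the segmentation map.
logits = TinySegNet()(torch.rand(1, 3, 128, 128))
print(logits.shape)                # torch.Size([1, 21, 128, 128])
print(logits.argmax(dim=1).shape)  # torch.Size([1, 128, 128])
```

Real architectures such as U-Net and SegNet follow this same pattern with more layers and skip connections.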

2. Instance Segmentation

Like semantic segmentation, instance segmentation uses pixel-wise segmentation masks, but it additionally distinguishes individual instances of each object. You can see an example below.

[Image: an instance segmentation example (source)]

One popular model for instance segmentation is Mask R-CNN.

Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding a branch that predicts a segmentation mask for each region of interest. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask pixel to pixel. It runs in parallel with the existing branches for classification and bounding-box regression.⁶

[Image: the Mask R-CNN architecture (source)]

The mask branch adds only a small computational overhead to Faster R-CNN, so Mask R-CNN is a little slower than Faster R-CNN. When Mask R-CNN was released, it surpassed all previous state-of-the-art single-model results on the COCO instance segmentation task. Below are Mask R-CNN’s results on the COCO test set.

[Image: Mask R-CNN results on the COCO test set (source)]

One limitation of Mask R-CNN is that it is too slow for real-time detection. Furthermore, the masks have a fixed resolution, so it can be hard to segment large objects with complex shapes precisely.
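As with Faster R-CNN, torchvision provides a pretrained Mask R-CNN. A minimal inference sketch showing the extra mask output:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load Mask R-CNN pretrained on COCO (as before, newer torchvision
# versions use `weights=...` instead of `pretrained=True`).
model = maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real, normalized image
with torch.no_grad():
    pred = model([image])[0]

# Alongside the Faster R-CNN outputs, the extra branch returns masks.
masks = pred["masks"]          # (N, 1, H, W) soft masks with values in [0, 1]
binary_masks = masks > 0.5     # threshold to get per-pixel instance masks
print(pred["boxes"].shape, masks.shape)
```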

C. Applications

  • Medical imaging: 3D rendered scans
  • Object detection: pedestrian, face, and brake-light detection
  • Recognition: face, fingerprint, iris
  • Content-based retrieval
  • Social media camera filters

Image segmentation is especially helpful if you want more information about each segment of an object. For example, when you are reconstructing a 3D image or a 3D model, it greatly helps to understand each segment of the object you are basing your model on.


Part 3: Compare and Contrast: Object detection vs Image segmentation

Now that we know more about object detection and image segmentation, let’s compare the two techniques.

Whether you should use object detection or image segmentation really depends on your purpose. For example, if you simply want to detect the presence of an object in an image, then both object detection and image segmentation would work. You would then ask further questions, such as: would you like more information about the shape or size of the object, or does it not matter? If you need more information, look into image segmentation models; if not, object detection works for you too.

Here are a couple of questions that can help you differentiate object detection and image segmentation further:

  1. Do you want or need to know the shape, curvature, measurements, or size of the object you are detecting in an image?
  2. Do you want or need pixel-by-pixel information about the image? For example, do you want to know which class label each pixel belongs to? Does this pixel belong to the grass in the background or the cow in the foreground?
  3. Does knowing about specific segments of the object help you? For applications such as content-based retrieval or 3D reconstruction of an image, knowing each segment of the object is very useful. In such cases, it greatly helps to apply image segmentation.
  4. Once you figure out whether you will be using object detection or image segmentation, you can ask further questions, such as whether your application needs to run in real time, and research the available models based on your needs.

I hope you enjoyed this article, and if you did, feel free to give a round of applause!

Hoping to see you in our upcoming articles.
