Fruit and Vegetable Detection and Feature Extraction using Instance Segmentation - Part 1

Prakruti Chandak
Published in Codalyze
Jun 4, 2019 · 5 min read

About the Series:

The goal of the project is to build a system that can identify fruits and vegetables. Along with the identification, it should also be able to extract the features of each particular category/class. We are assuming that we have a conveyor belt with fruits and vegetables on it. The input can be an image or a video, but in this case we'll be using images. The series will have the following parts:

PART I — Choosing the model and the dataset
PART II — Retraining the model according to the dataset

Task Description:

The task is to classify the objects in the frame and count them. Specialised image-capture techniques, such as thermography, hyperspectral imaging or tomography, can also be used to acquire the input.

If the input image is of this type:

Our output might look like the image below:

Choosing the dataset:

Initially, we used the Fruits 360 dataset from Kaggle, which has 95 fruit classes and around 103 images per class. It seemed convincing at first, but as we went ahead with the project we realised that the dataset could be used only for the classification task and not for counting: each image contains a single, centred fruit and there are no instance annotations. We therefore shifted to the MS COCO dataset, which annotates multiple object instances per image, as illustrated below.
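As a quick check that COCO actually supports the counting task, here is a minimal sketch using the pycocotools API to list how many annotated instances exist for the produce classes COCO covers (the annotation-file path is a placeholder for a local COCO 2017 download):

```python
from pycocotools.coco import COCO

# Placeholder path: COCO 2017 instance annotations downloaded locally.
coco = COCO("annotations/instances_val2017.json")

# Produce classes that COCO annotates with per-instance masks.
produce = ["banana", "apple", "orange", "broccoli", "carrot"]

for name in produce:
    cat_ids = coco.getCatIds(catNms=[name])
    img_ids = coco.getImgIds(catIds=cat_ids)
    ann_ids = coco.getAnnIds(catIds=cat_ids)
    # Each annotation is one object instance, so this is a per-class count.
    print(f"{name}: {len(img_ids)} images, {len(ann_ids)} instances")
```

Because every instance is annotated separately, with its own segmentation mask, COCO supports detection and counting, not just whole-image classification.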

Choosing the model:

We tried various algorithms by trial and error to see which method works best.

  • YOLO (You Only Look Once)
    The technique divides the image into a grid and assigns classes to cells in which objects are detected; as a result, we get overlapping bounding boxes. The resulting accuracy was lower than expected (the issue can be mitigated by altering a few stages of the pipeline). Another issue was that YOLO's main strength is real-time video, where it runs at roughly 40–90 fps, rather than still images.
  • Image Segmentation
    This technique segments the image into blobs according to pixel values; the blobs are then merged and, as a result, we get a semantically segmented image. The technique doesn't work well with low image contrast: we found that it was unable to segregate the foreground from the background. It was also unable to separate overlapping objects of similar colour.
  • Edge Detection (followed by closed-loop identification)
    We tried to build a technique that would identify the edges of objects and then count the closed loops formed by the detected edges (a minimal sketch of this approach follows the list). While refining the methodology, we found that it had the same issue as image segmentation, i.e. objects in low-contrast images couldn't be detected. Another problem was maintaining a reliable count of the closed loops, as loops could also come from branches, leaves or shadows, which gave us an inaccurate count.
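To make the last approach concrete, here is a minimal sketch of the edge-detection pipeline using OpenCV 4 (the file name, Canny thresholds and area filter are illustrative choices, not the exact values we used):

```python
import cv2

# Illustrative input: produce on a conveyor belt or plain background.
image = cv2.imread("fruits.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Canny thresholds are hand-tuned; low-contrast images need retuning,
# which is exactly the weakness we ran into.
edges = cv2.Canny(blurred, 50, 150)

# Morphological closing joins broken edge fragments into closed loops.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)

# Each external contour is treated as one closed loop / candidate object.
contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Area filter to drop loops caused by leaves, stems and shadows;
# in practice this filter was never reliable enough.
objects = [c for c in contours if cv2.contourArea(c) > 1000]
print(f"Detected {len(objects)} objects")
```

The count from such a sketch is only as good as its thresholds: values that work on a bright, high-contrast image overcount or undercount elsewhere, which is why we moved on.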

We wanted a method that works well with low-contrast or low-resolution images and that can classify each object in the detected area. That combination of semantic segmentation and object detection is exactly instance segmentation, so we need a model that performs instance segmentation well. Time and accuracy are the two factors that play the major role in the choice.

Mask R-CNN
Mask R-CNN outperformed all existing single-model entries on every COCO task, and it was one of the best-performing approaches on the MS COCO benchmark in 2017. We'll be making a few minor changes to make the algorithm suitable for our task.

Mask R-CNN is a simple, flexible and general framework for object instance segmentation. The approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN. Briefly, there are two functions performed by the technique:

  1. Object detection
    Locating instances of objects of a certain class. This is done using Faster R-CNN (Faster Region-based Convolutional Neural Network).
  2. Semantic segmentation
    Detecting the boundary and features of each object pixel by pixel; in other words, associating every pixel with a class label. This is done using an FCN (Fully Convolutional Network).
Workflow of Mask R-CNN

The image is first passed through an FPN (Feature Pyramid Network) backbone, and a Region Proposal Network running on those features proposes RoIs (Regions of Interest). Each RoI is then pooled to a fixed size using RoIAlign, which avoids the coarse quantisation of RoIPool and thereby reduces data loss. Classification/bounding-box labelling and mask prediction (the instance-segmentation part) run in parallel, saving some time. The generated masks are binary, one per RoI, and mask prediction is decoupled from class prediction.
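For a feel of what Mask R-CNN gives us out of the box, here is a sketch using the torchvision implementation pre-trained on COCO (the image path, score threshold and produce-class list are our illustrative choices; newer torchvision versions take a weights= argument instead of pretrained=True):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Mask R-CNN with a ResNet-50 + FPN backbone, pre-trained on MS COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# COCO category ids for the produce classes relevant to our task.
PRODUCE = {52: "banana", 53: "apple", 55: "orange", 56: "broccoli", 57: "carrot"}

image = Image.open("fruits.jpg").convert("RGB")  # illustrative input

with torch.no_grad():
    # Per image, the model returns boxes, labels, scores and soft masks.
    output = model([to_tensor(image)])[0]

# Count instances per produce class above a hand-picked confidence threshold.
counts = {}
for label, score in zip(output["labels"], output["scores"]):
    name = PRODUCE.get(label.item())
    if name and score.item() > 0.7:
        counts[name] = counts.get(name, 0) + 1

print(counts)  # e.g. {'banana': 3, 'apple': 2}
```

Each detection also comes with a soft mask in output["masks"], which can be thresholded (e.g. at 0.5) into the binary masks described above.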

Summary
Mask R-CNN has the following properties, which make it the best-suited technique for our images:

  • RoIAlign: reduces data loss during pooling
  • Binary masks decoupled from class prediction: the mask is stored in binary form, so the result takes less space
  • Parallel detection and mask generation: reduces the time taken for each image
