Segmentation for Creating Maps

yodayoda · Published in Map for Robots · 15 min read · Feb 3, 2021

Image segmentation is one of the fundamental steps toward scene understanding in machines. Through image segmentation, machines move from abstract, image-level categorization toward more grounded pixel-level classification, in which each pixel is labeled by considering its local neighborhood, image context, and scene composition, as well as available low-level knowledge (pixel relations) and high-level knowledge (e.g., ontologies, object relations). There are numerous applications for image segmentation, including medical image analysis (e.g., chest X-ray inflammation segmentation), autonomous vehicles, video surveillance, and augmented reality.

Image segmentation works with many kinds of input: 2D segmentation mostly deals with images; 3D segmentation works with volumetric images (e.g., MRI scans of the brain) and point clouds (e.g., obtained by a Kinect or a lidar); and video segmentation processes streams of 2D or 3D snapshots of the environment taken at fixed time intervals, which allow for high-level assumptions such as motion permanence. Segmentation can also be extended by detecting and delineating each object of interest in the image (e.g., partitioning individual persons) in a task called instance segmentation. In semantic segmentation, the goal is to classify each pixel into one of the given classes, whereas in instance segmentation we care about segmenting each object instance separately. Panoptic segmentation further combines semantic and instance segmentation such that all pixels are assigned a class label and all object instances are uniquely segmented.

There are different levels of scene understanding in computer vision ranging from identifying the object category in the image to the pixel-level annotation of all objects, distinguishing them from each other, and recognizing their corresponding categories.

Segmentation Datasets Useful for Map Creation

To train a data-hungry neural network to perform image segmentation, a large amount of data must be fed to the network. Plenty of datasets are available to date, differing in size, number of object categories, benchmark metrics, online leaderboards, and diversity of objects. Here we introduce a few of the most popular datasets for 2D image segmentation. We are admittedly a little bit biased toward datasets containing urban outdoor scenes!

  • PASCAL Visual Object Classes (VOC) 2012 is one of the most popular datasets in computer vision, with annotated images available for 5 tasks: classification, segmentation, detection, action recognition, and person layout. It covers 20 object classes (plus a background label for segmentation), provides training and validation sets of ~1.5k images each, and keeps a held-out test set for the challenge. mIoU is used for this challenge series.
    In the plot below, you can see how segmentation algorithms have evolved on this particular dataset. A perfect algorithm that correctly labels every pixel of every object in the dataset would reach 100% on that scale.
Semantic Segmentation on PASCAL VOC 2012 test as of October 2020
  • PASCAL Context 2014 contains pixel-wise labels for all training images of PASCAL VOC 2010. It covers more than 400 classes from three broad categories (objects, stuff, and hybrids). Because some categories are too sparse, a subset of 59 frequent classes is usually selected for training. Roughly 10k images are available across the train, dev, and test splits, and the main metrics of this challenge are mIoU and PixAcc.
Semantic Segmentation on PASCAL Context as of October 2020
  • Microsoft Common Objects in Context (MS COCO) includes images of common objects in their natural contexts and complex scenes. It contains photos of 91 object types, with 2.5 million labeled instances in 328k images, and has been used mainly for segmenting individual object instances. Early versions of MS COCO contain 200k images with 500k object instances, evaluated with Average Precision (AP) and Average Recall (AR). More recently, the dataset provides panoptic segmentation annotations with 80 “thing” categories from the detection task and a subset of the 91 “stuff” categories from the stuff task, with any overlaps manually resolved. The Panoptic Quality (PQ) metric is used for performance evaluation.
  • ADE20K /MIT Scene Parsing (SceneParse150) contains more than 20K indoor and outdoor scenes exhaustively annotated with 150 categories of objects and object parts. The benchmark is divided into 20K images for training, 2K images for validation, and another batch of images for testing.
Semantic Segmentation on ADE20K val as of October 2020
  • SiftFlow includes 2,688 annotated images of size 256x256 from a subset of the LabelMe database, covering 8 outdoor scene types and 33 semantic classes.
  • FSS-1000 is a 1000-class dataset for few-shot segmentation containing 10k images with pixel-wise segmentation labels.
Few-Shot Semantic Segmentation on FSS-1000 as of October 2020
  • Cityscapes dataset contains urban street scenes in the form of stereo video sequences recorded in 50 cities, with high-quality pixel-level annotation of 5k frames and 20k weakly annotated ones. It contains 29 classes and uses IoU and iIoU metrics for pixel-level semantic labeling tasks.
Semantic Segmentation on Cityscapes test as of October 2020
  • VIPER is a set of 2500-frame panoptic labels that temporally extend the 500 Cityscapes image-panoptic labels. There are 3000-frame panoptic labels that correspond to the 5th, 10th, 15th, 20th, 25th, and 30th frames of each of the 500 videos, where all instance IDs are associated over time.
  • KITTI is one of the most popular datasets for mobile robotics and autonomous driving. It contains hours of traffic-scenario videos recorded with a variety of sensor modalities. The KITTI semantic benchmark consists of 200 semantically annotated training images as well as 200 test images corresponding to the KITTI Stereo and Flow Benchmark 2015, and it is evaluated with IoU and iIoU on classes and categories. KITTI-360 is an extension of this dataset that provides 11 individual sequences, each corresponding to a continuous driving trajectory, with raw data as well as semantic and instance labels in both 2D and 3D.
Panoptic Segmentation on KITTI Panoptic Segmentation as of October 2020
  • UC Berkeley's 100K videos (BDD100K) are publicly available driving videos that provide a rich treasure trove to work from, covering a multitude of weather conditions from sunny and rainy to hazy. The balance between daytime and nighttime conditions has also been praised. In addition to building self-driving cars, the dataset offers the opportunity to detect pedestrians on roads and pavements: there are more than 85,000 pedestrian instances in the videos, which gives a solid basis for this exercise.
Semantic Segmentation on BDD as of October 2020 (the dataset is relatively new!)
  • Mapillary Vistas Dataset (MVD) is a diverse street-level imagery dataset with pixel-accurate and instance-specific human annotations for understanding street scenes around the world. It contains 25,000 high-resolution images and 152 object categories, of which 100 are annotated with instance-specific labels.
Panoptic Segmentation on Mapillary dataset as of October 2020
  • India Driving Dataset consists of 10,000 images finely annotated with 34 classes, collected from 182 drive sequences on Indian roads. It aims to complement conventional road-scene understanding datasets with unstructured environments, where assumptions such as well-delineated infrastructure (e.g., lanes), well-defined categories for traffic participants, low variation in object or background appearance, and strong adherence to traffic rules are largely not satisfied.
  • WildDash is a benchmark for semantic and instance segmentation. It aims to improve the expressiveness of performance evaluation for computer vision algorithms in regard to their robustness under real-world conditions. This dataset contains more than 5000 traffic scenarios from city, highway, and rural locations from more than 100 countries under a variety of weather conditions.
  • ApolloScape is a dataset that contains 146,997 video frames with corresponding pixel-level annotations and pose information, containing 25 object categories.
Semantic Segmentation on Apolloscape as of October 2020
  • nuScenes is a large-scale autonomous driving dataset with image-level 2D annotations. It features 800k foreground objects annotated with instance masks and 100k 2D semantic segmentation masks for background classes, collected in two distinct cities.
  • Argoverse 3D Tracking dataset is a collection of 100+ log segments with 3D object tracking annotations. These log segments, called “sequences”, vary in length from 15 to 30 seconds and collectively contain a total of 11k tracks. The dataset provides amodal 3D cuboids for the tracked objects (which carry information about occlusions) and can be used for segmenting target objects. The complementary Argoverse Motion Forecasting dataset is a curated collection of 320k+ scenarios, each 5 seconds long, selected from the most challenging segments of 1,000 hours of driving data, including segments that show vehicles at intersections, vehicles taking left or right turns, and vehicles changing lanes.
  • DDAD is an autonomous driving benchmark from Toyota Research Institute for long-range and dense depth estimation in challenging and diverse urban conditions. It contains 235 scenes from urban settings in the US and Japan and more than 2.5k frames with panoptic segmentation labels (i.e., semantic and instance segmentation) on top of accurate 360-degree ground truth for depth estimation.

Other datasets are also available for image segmentation, such as the Semantic Boundaries Dataset (SBD), PASCAL-Part, SYNTHIA, the Berkeley Segmentation Dataset (BSD), YouTube-Objects, and ScanNet.

Common Metrics to Evaluate Segmentation

IoU: Most benchmarks rank all methods according to the PASCAL VOC intersection-over-union metric (IoU):

IoU = TP / (TP + FP + FN)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. Some benchmarks, such as Cityscapes, also use an instance-level intersection over union:

iIoU = iTP / (iTP + iFP + iFN)

In contrast to the standard IoU measure, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class' average instance size to the size of the respective ground truth instance. A minimal computation sketch in code follows the list below.

  • IoU class: Intersection over Union for each class IoU=TP/(TP+FP+FN)
  • iIoU class: Instance Intersection over Union iIoU=iTP/(iTP+iFP+iFN)
  • IoU category: Intersection over Union for each category IoU=TP/(TP+FP+FN)
  • iIoU category: Instance Intersection over Union for each category iIoU=iTP/(iTP+iFP+iFN)
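
To make these definitions concrete, here is a minimal sketch, not any benchmark's official evaluation code, that computes per-class IoU and mIoU from two integer label maps with numpy; the function name, the toy arrays, and the `ignore_index` convention are our own choices.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore_index=255):
    """Compute per-class IoU between two integer label maps of the same shape."""
    valid = gt != ignore_index
    ious = np.full(num_classes, np.nan)      # NaN for classes absent from both maps
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        tp = np.logical_and(pred_c, gt_c).sum()
        fp = np.logical_and(pred_c, ~gt_c).sum()
        fn = np.logical_and(~pred_c, gt_c).sum()
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom             # IoU = TP / (TP + FP + FN)
    return ious

# Toy usage: 4x4 label maps with two classes (0 = background, 1 = object)
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
ious = per_class_iou(pred, gt, num_classes=2)
print(ious, "mIoU:", np.nanmean(ious))
```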

Precision (P): We will need recall and precision below. Precision is the fraction of predicted pixels or detections that are actually correct:

P = TP / (TP + FP)

Recall (R): Recall describes how much of the ground truth we could recover:

R = TP / (TP + FN)

Average Precision (AP): For the VOC metric, we first rank all detections by decreasing confidence, which yields precision-recall pairs with increasing recall. If we plot precision over recall, we obtain a zig-zag curve. The area under this PR curve is called Average Precision (AP). For prediction problems with multiple object classes, this value is then averaged over all of the classes. AP summarizes the shape of the precision-recall curve; in VOC 2007, it is defined as the mean of the precision values at a set of 11 equally spaced recall levels, whereas PASCAL VOC 2010-2012 uses the exact (all-point) area under the PR curve. MS COCO uses 101 recall points on the PR curve and averages over several IoU thresholds; there are also stricter variants such as AP at IoU=0.75.
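
As an illustration of the 11-point scheme, the following sketch computes AP for a single class under simplifying assumptions: each detection comes with a confidence score and a precomputed flag saying whether it matched a ground truth object at the chosen IoU threshold. The function name and toy numbers are ours, not part of the VOC devkit.

```python
import numpy as np

def voc11_ap(scores, is_tp, num_gt):
    """11-point interpolated AP (VOC 2007 style) for one class.

    scores: confidence of each detection.
    is_tp:  1 if the detection matched an unmatched ground truth object
            at the chosen IoU threshold, else 0.
    num_gt: total number of ground truth objects of this class.
    """
    order = np.argsort(-np.asarray(scores))                  # rank by decreasing confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                         # R = TP / (TP + FN)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)   # P = TP / (TP + FP)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):                      # 11 equally spaced recall levels
        p = precision[recall >= t].max() if np.any(recall >= t) else 0.0
        ap += p / 11.0
    return ap

# Toy example: 5 detections, 3 ground truth objects
print(voc11_ap(scores=[0.9, 0.8, 0.7, 0.6, 0.5],
               is_tp=[1, 0, 1, 0, 1], num_gt=3))
```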

Panoptic Quality (PQ): Panoptic segmentation aims to unify the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). PQ evaluates performance for all categories, including both stuff and thing categories, in a unified manner and involves two steps: (1) segment matching and (2) PQ computation given the matches. After matching, each segment falls into one of three sets: TP (matched pairs), FP (unmatched predicted segments), and FN (unmatched ground truth segments), and PQ is calculated as

PQ = ( Σ over matched pairs (p, g) of IoU(p, g) ) / ( |TP| + ½|FP| + ½|FN| )

where p and g denote a matched predicted segment and its ground truth segment, respectively.
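
The sketch below covers only the second step (computing PQ once segments have been matched); the matching itself, which pairs predicted and ground truth segments with IoU above 0.5, is assumed to have been done already. Function and variable names are our own.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute PQ, SQ, and RQ from the results of segment matching.

    matched_ious: IoU values of matched (prediction, ground truth) pairs;
                  in the standard protocol a match requires IoU > 0.5.
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground truth segments.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                      # segmentation quality: mean IoU of matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # recognition quality: an F1-like score
    return sq * rq, sq, rq                           # PQ = SQ * RQ

# Toy usage: three matched segments, one false positive, two false negatives
pq, sq, rq = panoptic_quality([0.8, 0.7, 0.9], num_fp=1, num_fn=2)
print(pq, sq, rq)
```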

Backbone Networks for Segmentation

In a dramatic transition of the computer vision field, CNNs overtook the handcrafted features that had been used for many years in many vision tasks, including image segmentation. The CNN is built around a hierarchical receptive-field model, initially proposed by Fukushima (1980) and popularized in computer vision by LeCun et al. (1998) for the famous MNIST digit recognition task. CNNs came back into the spotlight with the success of AlexNet in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in September 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up. The original paper's primary result was that the depth of the model was essential for its high performance, which was computationally expensive but made feasible by the use of GPUs during training. Numerous CNN architectures have been proposed since, increasing the depth, refining the architecture, and advancing the computational blocks of these networks. In addition to ever-increasing hardware power and the size of annotated datasets, this progress is the result of new ideas and algorithms that appeared within a short span of time in recent years. Here, we list some of the most iconic CNNs that are frequently used as the backbone for image segmentation tasks. In upcoming blog posts, we will show how they are incorporated into the image segmentation pipeline and what they offer to improve labeling accuracy.

LeNet-5

We start with LeNet-5, one of the simplest forms of the CNN as we know it today. This architecture has become the standard ‘template’ for CNNs: stacking convolution and pooling layers and ending the network with one or more fully connected layers. It has 2 convolutional and 3 fully connected layers. The average-pooling layer, as we know it now, was called a sub-sampling layer and had trainable weights in these early versions of CNNs. The network is named after Yann LeCun, the corresponding author of the paper.

LeCun et al., Gradient-Based Learning Applied to Document Recognition, http://yann.lecun.com/exdb/publis/index.html#lecun-98
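
To give a feel for how compact this ‘template’ is, here is a sketch of a LeNet-5-style network in PyTorch. It is our approximation of the described structure (2 convolutions, 3 fully connected layers, average pooling); the exact activations and connection scheme of the 1998 paper differ in some details.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5-style network: conv/pool stacks followed by fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28x6 -> 14x14x6 ("sub-sampling")
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One forward pass on a dummy 32x32 grayscale image
print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```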

AlexNet (NeurIPS 2012)

AlexNet stacked three more convolutional layers on top of LeNet's design, which allows the network to accept 224x224 images that can actually contain meaningful everyday objects. The paper was presented at NeurIPS 2012 and drew a lot of attention by winning ILSVRC 2012 with the largest CNN applied to ImageNet to that date. The authors used ReLU (Rectified Linear Unit) activations, which can be seen as a simplified stand-in for the sigmoid function, to reduce computation and facilitate the matrix manipulations offered by GPUs. The network is named after Alex Krizhevsky, the first author of the paper.

Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012, [1]

VGG-16 (ArXiv 2014)

One of the first ways to improve the performance of CNNs was to make them deeper. One way to do that is to cascade a few convolutional layers before reducing their spatial dimensionality with a pooling layer. The VGG-16 network consists of 13 convolutional and 3 fully connected layers, with ReLU activations and smaller (3×3) filters compared to AlexNet. The network is named after the Visual Geometry Group (VGG) at Oxford University.

Simonyan & Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, [1]
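
A sketch of the pattern described above, several 3×3 convolutions cascaded before a single pooling step, written as a reusable PyTorch helper; the helper name and channel widths are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Cascade `num_convs` 3x3 convolutions, then halve the resolution with max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# First two stages of a VGG-16-style feature extractor
stem = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2))
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 128, 56, 56])
```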

Inception-v1 (CVPR 2015)

This network improved the utilization of computing resources inside the network, a design obtained through research on approximating sparse structures. It takes several existing ideas and combines them into a high-performance network for classification and detection. It uses the “Network in Network” idea and reformulates CNN design as stacking modules instead of individual convolutional layers. It also popularized the idea of having parallel “towers” of convolutions with different filter sizes and concatenating their outputs. Moreover, the authors used 1×1 convolutions to reduce dimensionality, avoid computational bottlenecks, and increase the non-linearity of each block. Training the early stages of such a deep network is difficult; therefore, the authors also introduced two auxiliary classifiers to boost training in the early stages and benefit from this pseudo-multi-task learning. The network's name is inspired by the sci-fi movie “Inception.”

Szegedy et al., Going Deeper with Convolutions, CVPR 2015, [1]
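
Here is a sketch of such a module with parallel “towers” and 1×1 reductions, loosely following the Inception-v1 module layout; the channel counts are illustrative (roughly those of the network's first Inception module) and the class name is ours.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, 5x5 and pooling towers whose outputs are concatenated."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.tower1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.tower3 = nn.Sequential(                      # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.tower5 = nn.Sequential(                      # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.pool = nn.Sequential(                        # pooling tower with 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.tower1(x), self.tower3(x),
                          self.tower5(x), self.pool(x)], dim=1)

block = InceptionBlock(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```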

Inception-v3 (CVPR 2016)

The architecture of Inception-v1 was enhanced in Inception-v2, which was quickly superseded by Inception-v3. Inception-v3 includes a battery of enhancements to the optimizer and loss function and adds batch normalization to the auxiliary classifiers. This network uses factorization to avoid representational bottlenecks: n×n convolutions are factorized into asymmetric 1×n and n×1 convolutions, 5×5 convolutions are replaced by two 3×3 ones, and 7×7 convolutions are replaced by a series of 3×3 ones.

Szegedy et al., Rethinking the Inception Architecture for Computer Vision, CVPR 2016, [1]
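
A small sketch of the asymmetric factorization, assuming a 7×7 receptive field built from a 1×7 followed by a 7×1 convolution; the helper name and channel counts are ours.

```python
import torch
import torch.nn as nn

def factorized_conv(in_ch, out_ch, n=7):
    """Replace an n x n convolution with a 1 x n followed by an n x 1 convolution.

    Both cover the same n x n receptive field, but the factorized pair
    uses far fewer weights for large n (compare the counts printed below).
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)),
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 64, 17, 17)
print(factorized_conv(64, 64, n=7)(x).shape)  # torch.Size([1, 64, 17, 17])

# Weight comparison with a full 7x7 convolution (bias terms ignored)
full = 7 * 7 * 64 * 64
fact = (1 * 7 * 64 * 64) + (7 * 1 * 64 * 64)
print(full, fact)  # 200704 vs 57344
```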

ResNet-50 (CVPR 2016)
Deeper CNNs suffer from a phenomenon called vanishing gradients, in which the training signal that is backpropagated from the last layer grows weaker as it travels back to earlier layers. This causes the accuracy of the network to saturate as more layers are added, and to degrade drastically after a certain point. To address this issue, skip connections were borrowed from NLP. The authors also adopted batch normalization. Using these tricks, ResNet was able to go as deep as 152 layers without sacrificing generalization.

He et al., Deep Residual Learning for Image Recognition, CVPR 2016, adapted [1]
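
A sketch of the core idea: a residual block that adds its input back to the output of a small convolutional stack, so the gradient can flow through the identity path. This is the simple two-layer variant rather than the bottleneck block that ResNet-50 itself uses.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # skip connection: add the input back

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```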

Xception (CVPR 2017)
Replacing the Inception module with depthwise separable convolutions yields a version of the CNN with slightly different characteristics. Inception modules capture cross-channel correlations with 1×1 convolutions, while spatial correlations within a channel are captured with regular 3×3 or 5×5 convolutions. Xception (Extreme Inception) takes this to the limit: it applies a 1×1 convolution across channels and then a separate 3×3 convolution to each resulting channel.

Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, CVPR 2017, adapted [1]
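
A sketch of a depthwise separable convolution in PyTorch, using the `groups` argument so that each 3×3 filter sees a single channel. The order shown here is the common depthwise-then-pointwise variant; Xception's formulation puts the 1×1 first, a difference the paper treats as minor when many blocks are stacked.

```python
import torch
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    """Depthwise 3x3 convolution (one filter per channel) followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise: spatial only
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise: mix channels
    )

x = torch.randn(1, 128, 28, 28)
print(separable_conv(128, 256)(x).shape)  # torch.Size([1, 256, 28, 28])

# Weight comparison with a regular 3x3 convolution (biases ignored)
regular = 3 * 3 * 128 * 256
separable = 3 * 3 * 128 + 128 * 256
print(regular, separable)  # 294912 vs 33920
```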

Inception-v4 (AAAI 2017)
Inception-v4 improves on Inception-v3 by changing the stem, tweaking the Inception-C module, adding more Inception modules, and using the same number of filters for each module (making them uniform); the companion Inception-ResNet variants additionally make extensive use of residual connections, which accelerate training.

Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, AAAI 2017, adapted [1]

Inception-ResNet-v2 (AAAI 2017)
The Inception-ResNet family (v1 and v2) converts Inception modules into residual Inception blocks, adds more Inception blocks to the network, and introduces a new type of Inception block after the stem module.

Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, AAAI 2017, adapted [1]

ResNeXt-50
ResNeXt extends ResNet with 32 parallel towers (its “cardinality”) within each module, an idea borrowed from the Inception architecture.

Xie et al., Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017, adapted [1]
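
In practice, the 32 parallel towers can be implemented as a single grouped convolution. Below is a sketch of a ResNeXt-style bottleneck block under that equivalence; channel widths are illustrative.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck block whose 3x3 convolution is split into 32 parallel groups."""
    def __init__(self, channels, bottleneck=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),   # 32 parallel towers
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # residual connection as in ResNet

print(ResNeXtBlock(256)(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```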

DenseNet
To further exploit the effect of shortcut connections, DenseNet connects all layers directly with each other, such that the input of each layer consists of the feature maps of all earlier layers, and its own output is passed on to every subsequent layer. The number of new feature maps each layer adds is explicitly controlled by a hyperparameter (the growth rate), and the number of feature maps is reduced by 1×1 convolutions before the 3×3 convolutions.

Huang et al., Densely connected convolutional networks, CVPR 2017
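
A sketch of a dense block along these lines; the growth rate and number of layers are arbitrary illustrative choices, and the 1×1 bottleneck before each 3×3 follows the DenseNet-B variant described in the paper.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: 1x1 bottleneck, then 3x3 producing `growth` new maps."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth, 1, bias=False),       # 1x1 reduces the input width
            nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # Concatenate the new feature maps with everything that came before.
        return torch.cat([x, self.body(x)], dim=1)

def dense_block(in_ch, num_layers=4, growth=32):
    return nn.Sequential(*[DenseLayer(in_ch + i * growth, growth)
                           for i in range(num_layers)])

out = dense_block(64)(torch.randn(1, 64, 28, 28))
print(out.shape)  # torch.Size([1, 192, 28, 28]) -> 64 + 4*32 channels
```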

There are many variations of these popular backbones, each improving one or more of their aspects. For example, in dense residual networks, a dense residual block is proposed that enables the network to receive multi-level features from all preceding units through dense shortcuts.

Wang et al., Dense Residual Convolutional Neural Network based In-Loop Filter for HEVC, 2018

A remark on model sizes:
Such backbones are mainly designed for image classification and object detection, where accuracy (mainly on the ILSVRC benchmark) and speed are the most important factors. Depending on the application, the size of the model and its computational cost matter as well.

Top-1 one-crop accuracy versus the number of operations required for a single forward pass from https://arxiv.org/pdf/1605.07678.pdf

Conclusion:

We have looked at the details of many different neural network designs for tackling segmentation tasks. As we learned, many datasets can already be handled with reasonable performance (IoU), and we see rapid progress on the early datasets, with further improvements expected in the near future. Of particular interest to the autonomous vehicle industry is panoptic segmentation, for which a few datasets are now available; here objects are not just labeled in a pixel-wise fashion but also separated into individual instances, so they can be counted.

For map creation, segmentation algorithms can be used for a variety of tasks:

  • for the removal of dynamic objects,
  • for creating a semantic layer where signs and markers have to be detected,
  • for HD maps using lane detection, and
  • for change detection, e.g., when a construction site pops up you want to be able to detect each cone and place it on the map.

In later posts, we will discuss 3D detection and more specific applications.

Notes:

[1] Special thanks to Raimi Karim for providing the basis of the neural network visualizations.

This article was brought to you by yodayoda Inc., your expert in automotive and robot mapping systems.
If you want to join our virtual bar time on Wednesdays at 9 pm PST/PDT, please send an email to talk_at_yodayoda.co, and don’t forget to subscribe.
