Object Segmentation on SpaceNet via Multi-task Network Cascades (MNC)

Published in The DownLinQ, Apr 18, 2017

This blog post shares the results of applying the Multi-task Network Cascades (MNC) object segmentation algorithm (available here) to the SpaceNet challenge (available here).

Please note that the algorithm trained here is for research purposes only, and is not actively competing in the current SpaceNet competition.

The SpaceNet Competition

The second SpaceNet competition asks its participants to submit an algorithm that inputs satellite images (of Las Vegas, Shanghai, Khartoum, and Paris) and outputs polygons of building footprints.

An algorithm’s proposed building footprint is considered a true positive if its intersection-over-union (IoU) score with a ground truth building footprint is at least 0.5, and if no other proposal has a higher IoU with that ground truth building. For more precise information on how the SpaceNet competition is scored, see here.
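The matching rule above can be sketched in Python. This is a minimal illustration and not the official SpaceNet scorer: footprints are represented as sets of pixel coordinates instead of polygons, and the names `iou`, `count_true_positives`, and `rectangle` are hypothetical helpers introduced only for this example.

```python
def iou(a, b):
    """Intersection over union of two footprints given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_true_positives(proposals, ground_truths, threshold=0.5):
    """For each ground-truth footprint, accept the unused proposal with the
    highest IoU, provided that IoU reaches the threshold."""
    used = set()  # indices of proposals already matched to a ground truth
    true_positives = 0
    for truth in ground_truths:
        best_index, best_score = None, 0.0
        for index, proposal in enumerate(proposals):
            if index in used:
                continue
            score = iou(proposal, truth)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= threshold:
            used.add(best_index)
            true_positives += 1
    return true_positives


def rectangle(row0, col0, row1, col1):
    """Pixels of the half-open rectangle [row0, row1) x [col0, col1)."""
    return {(r, c) for r in range(row0, row1) for c in range(col0, col1)}


# Toy data (assumed): one 4x4 building and three proposals.
truth = rectangle(0, 0, 4, 4)
proposals = [
    rectangle(0, 0, 4, 2),      # covers half of the building
    rectangle(0, 0, 4, 3),      # covers three quarters of the building
    rectangle(10, 10, 12, 12),  # misses the building entirely
]
matched = count_true_positives(proposals, [truth])
```

In this toy case only the best-overlapping proposal counts as a true positive; the other two proposals would be false positives, even though the first of them also reaches the 0.5 threshold.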

A SpaceNet satellite image over Paris together with its labels visualized

Motivation

The MNC algorithm is a state-of-the-art object segmentation algorithm. MNC won 1st place in the MS COCO 2015 image segmentation competition, garnering wide popularity.

Segmentation means that in addition to detecting the location of objects in an image, the algorithm further classifies which pixels belong to an object within an image. Two examples of segmentations produced by MNC are provided in the image below.

Furthermore, at least two submissions to the first SpaceNet competition used MNC, motivating further inquiry into the application of this algorithm to SpaceNet.

A Technical Overview of MNC

In this section, we include a high-level, technical overview of MNC for the interested reader. None of the ideas in this section are my own, and further details can be found in the original MNC paper.

Shared Convolutional Feature Maps

The first step of MNC is to run the original image through a convolutional neural network (CNN). The outputs of this CNN are referred to as shared convolutional feature maps, and will be the inputs into all later steps of the MNC algorithm.

Aside: The word maps in convolutional feature maps means we are explicitly encoding the location of the feature as well as the feature itself.

A previous-generation object detection algorithm (R-CNN) generates region proposals on the original image, and subsequently runs each region proposal through a CNN. MNC (building on Fast R-CNN) takes the reverse approach: the whole original image is first run through a CNN, and region proposals are generated using the output features of this CNN. Thus, the convolutional features need only be computed once per image, which saves time.

Multi-Task Learning in MNC

Instance-aware semantic segmentation can generally be decomposed into three different but related subtasks: finding bounding boxes of objects, producing pixel-level masks within bounding boxes, and categorizing the masked object. One novel insight of MNC is a twist on multi-task learning that allows for a later stage to depend on the output of an earlier stage.

Below is an image of 5-stage learning for MNC.

Image taken from the original MNC paper.

The inputs to stage 1 are the shared convolutional features, that is, the outputs of the CNN applied to the original image, as discussed above. The outputs of stage 1 are proposed bounding boxes of image objects (B). These bounding boxes are produced from the convolutional features by a Region Proposal Network (RPN).

The bounding boxes (B) produced by stage 1, together with the shared convolutional features, are the inputs into stage 2, which produces pixel-level masks (M).

The pixel-level masks (M), together with the shared convolutional features, are the inputs into stage 3, which produces the categories of masked objects as well as updated bounding boxes (B’) from a box regression layer.

The new bounding boxes (B’) together with the shared convolutional features are used to produce new masks (M’) and then finally new category scores (C’).
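The data flow between the five stages can be sketched as plain function wiring. This is only an illustration of which stage consumes which output, not the MNC implementation: the three stage functions are hypothetical stand-ins passed in as arguments, and passing the boxes into the categorization stage alongside the masks is an assumption of this sketch.

```python
def run_cascade(features, propose_boxes, predict_masks, classify_and_refine):
    """Wire the five stages together; each later stage consumes earlier outputs."""
    boxes = propose_boxes(features)                      # stage 1: B
    masks = predict_masks(features, boxes)               # stage 2: M
    # Stage 3: categories and regressed boxes B' (the categories are discarded here).
    _, refined_boxes = classify_and_refine(features, boxes, masks)
    refined_masks = predict_masks(features, refined_boxes)  # stage 4: M'
    # Stage 5: final category scores C' (the second box regression is discarded).
    scores, _ = classify_and_refine(features, refined_boxes, refined_masks)
    return refined_boxes, refined_masks, scores


# Toy stand-ins (assumed) so the wiring can be exercised without a real network.
def toy_propose(features):
    return [(0, 0, 4, 4)]


def toy_masks(features, boxes):
    return ["mask for %s" % (box,) for box in boxes]


def toy_classify(features, boxes, masks):
    grown = [(x1, y1, x2 + 1, y2 + 1) for (x1, y1, x2, y2) in boxes]
    return ["building"] * len(boxes), grown


result = run_cascade("shared features", toy_propose, toy_masks, toy_classify)
```

Note that the shared features are passed to every stage unchanged, which mirrors the point that they are computed only once per image.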

MNC Results on SpaceNet

We trained and evaluated the MNC algorithm separately on each of the four SpaceNet cities (Las Vegas, Paris, Shanghai, and Khartoum) obtaining an average F1 score of 0.57. Our average F1 score should be compared with 0.60, the average F1 score of the current frontrunner of the second SpaceNet competition.
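For reference, the F1 score is the harmonic mean of precision and recall. A minimal sketch of the computation from true positive, false positive, and false negative counts follows; the function name `f1_score` and the counts in the usage line are made up for illustration and are not our actual results.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 6 matched buildings, 2 spurious proposals, 4 missed buildings.
example = f1_score(true_positives=6, false_positives=2, false_negatives=4)
```

Because both precision and recall enter the score, many missed buildings (false negatives) pull F1 down even when most proposals are correct.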

Our performance score is visualized below in more detail.

A Few Example Images of MNC on SpaceNet

We now include several images of MNC applied to test satellite images over Las Vegas. The color white represents a true positive for an IoU threshold of 0.5, the color yellow a false positive, and the color blue a false negative. The numbers printed inside the proposed polygons are the IoU scores for the proposed footprints.

These images are produced using Topcoder’s visualizer for the SpaceNet challenge linked here.

Warning: These are example test images of Las Vegas, our highest performing city. Satellite images over other cities look less promising, and there is clearly much work ahead for those of us working with messy satellite data.

Limitations of MNC on SpaceNet

MNC struggles with satellite images containing:

Small buildings. Indeed, many false negatives are due to small buildings, which significantly decreases the recall score of MNC.
L-shaped or concave buildings. This trained version of MNC seems to predict very smooth, convex building footprints.
Buildings with dark rooftop colors that can sometimes be confused with the surrounding area.

MNC struggles with L-shaped buildings on SpaceNet

Training MNC on SpaceNet

To train MNC on SpaceNet data, one must first convert the SpaceNet TIFF images to JPEG images, and convert the SpaceNet GeoJSON labels to Pascal VOC SBD labels. This step is discussed in a previous blog post, linked here. The code for converting SpaceNet labels to Pascal VOC SBD labels is slightly different for the second SpaceNet competition, and is linked here.

After this data conversion, one can train the 5-stage MNC model by following the instructions in MNC’s README.md. A docker container suitable for training MNC can be found here.

We trained MNC on SpaceNet data for 25,000 iterations with stochastic gradient descent and a base learning rate of 0.0001. Note that the default learning rate for MNC is 0.01. A better solution may arise from further tuning the learning rate and the number of training iterations.

Testing MNC on SpaceNet

We wish to run our test SpaceNet images through the trained MNC model to generate building footprints and evaluate our algorithm. See the above section “A Few Example Images of MNC on SpaceNet” for pictures of this output.

First we define the network using the trained model discussed above.

We run each image in the SpaceNet test directory through the net defined above. The output for each image is 600 proposed bounding boxes, proposed masks, and proposed category scores.

MNC further processes this list of 600 boxes, masks, and category scores using non-maximal suppression (NMS) to remove redundant object detections. That is, a set of bounding boxes with high overlap is most likely detecting the same ground truth object. NMS picks one bounding box to represent this set.
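The idea behind NMS can be sketched with a simple greedy procedure on boxes. This is a generic illustration and not MNC's own NMS code, which may differ in detail; the function names and the overlap threshold of 0.3 are assumptions chosen for the example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, overlap_threshold=0.3):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap heavily with a better-scoring kept box.
        if all(box_iou(boxes[i], boxes[j]) <= overlap_threshold for j in keep):
            keep.append(i)
    return keep


# Toy data (assumed): the first two boxes cover nearly the same building.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box is suppressed because it overlaps heavily with the higher-scoring first box, while the third box, which covers a different area, survives.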

We are left with a new, shorter list of bounding boxes and masks, which we take to be the output of the trained MNC algorithm.

Post-Processing SpaceNet Test Data

Finally, one can convert this new, shorter list of bounding boxes and masks to GeoJSON format using gdal.Polygonize.

One then converts the entire directory of GeoJSON proposals into a CSV file using the SpaceNet utility createCSVFromGEOJSON.py.

Our proposals are now in the correct format to submit to the SpaceNet competition!

We include a github repository for the test-time inference and post-processing here.

Going Further

We include several ideas for further research with MNC on SpaceNet:

MNC generates region proposals at three different scales and three different aspect ratios. One could add a fourth, smaller scale to possibly aid in detecting small objects, which are currently false negatives of MNC.
One could increase the number of region proposals before applying non-maximal suppression to possibly increase the recall score for MNC.
One could vary the total number of training iterations and the base learning rate to optimize both of these parameters. This is computationally very expensive.
One could include 8-band satellite imagery in the training of MNC. A priori, 8-band imagery has more information than standard RGB imagery. Hence, 8-band imagery has potentially more information to exploit during machine learning.
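To make the first idea concrete, the sketch below enumerates anchor shapes for a set of scales and aspect ratios. The scale values (64, 128, 256, 512 pixels), the aspect ratios, and the function name `anchor_shapes` are illustrative assumptions and are not read from MNC's configuration.

```python
import math


def anchor_shapes(scales, aspect_ratios):
    """Return one (width, height) pair per scale and aspect ratio.

    Each anchor has an area of roughly scale**2 (before rounding) and a
    height-to-width ratio equal to the aspect ratio.
    """
    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            shapes.append((round(width), round(height)))
    return shapes


# Three scales and three aspect ratios give nine anchor shapes.
default_anchors = anchor_shapes(scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0))

# Adding a fourth, smaller scale gives twelve, including boxes small enough for small buildings.
with_small_anchors = anchor_shapes(scales=(64, 128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0))
```

The extra scale adds three more anchor shapes at every location, so it also increases the number of candidate regions the network must score.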

Summary

In this blog post we outlined the pre-processing, training, inference, post-processing, and results of MNC applied to the second SpaceNet competition. Our results appear to be competitive with current SpaceNet competition submissions.

Acknowledgements

We thank Adam Van Etten, Patrick Hagerty, and David Lindenbaum for many helpful conversations regarding SpaceNet and object detection.