Object Detection on SpaceNet

This post presents a walkthrough of an object detection process applied to SpaceNet imagery. The strategy employed is far from optimal; rather, it is meant to illustrate a complete workflow based on a previous post. Sample Python code is included to demonstrate basic GIS functionality in Python and neural network design in TensorFlow.

Similar to many other computer vision problems, we desire to find objects (in particular, building footprints) in images (satellite imagery from SpaceNet). There are several properties of this challenge that distinguish it from other computer vision problems:

  • The classifier uses 11 bands within the satellite imagery.
  • We pre-process the labeled imagery using a distance transform to create a richer labeled data set.
  • We train a weighted fully convolutional neural network to label the output; the weights give feedback on the importance of bands and layers.
  • We post-process the output to generate non-rectangular region proposals and output the proposed regions into geoJSON files.
  • We implement the metric derived from Intersection-over-Union to compute an F1-score for the machine learning algorithm.


The code snippets in this blog are shared under the following license:

License for Gist


More information on SpaceNet is available here. For each sub-region, there are two images (GeoTIFFs) and one label (geoJSON):

  • The 3-band raster image, at roughly 0.5 m ground sampling distance, contains Red, Green, and Blue color channels with 8-bit values.
  • The 8-band raster image, at roughly 2 m ground sampling distance, contains both visible-spectrum channels and near-infrared channels with 16-bit values. The three-band image is derived from a panchromatic image and a subset of the eight channels; the details of the derivation are available from the WorldView-2 specifications.
  • The label contains a list of polygons contained within the sub-region stored in a geoJSON format. Each polygon (or multi-polygon for non-simply connected regions) represents a building footprint.


Since geoJSONs can be more descriptive than just bounding boxes, we employ a strategy that is more flexible with boundary descriptions than a rectangular box, see [Yuan2016]. This strategy transforms a label of building footprints into a raster image where each pixel has the value of the distance that pixel is to the boundary of a building — exterior pixels have a negative distance, interior pixels have a positive distance, the boundary of the footprint has a distance of zero. Provided the building footprints do not overlap, this transform is invertible. The following unoptimized Python code performs the relabeling (for an implementation that uses the more efficient GDAL library see SpaceNetChallenge):
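As a sketch of the relabeling (not the GDAL-based gist referenced above), the signed distance can be computed with SciPy once the footprints have been rasterized to a binary mask; the rasterization step is assumed here:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """Signed distance to the footprint boundary for a binary mask.

    Interior pixels get a positive distance, exterior pixels a negative
    distance, and pixels at the boundary a value near zero.
    """
    mask = np.asarray(mask, dtype=bool)
    # Distance from each exterior pixel to the nearest interior pixel.
    dist_out = distance_transform_edt(~mask)
    # Distance from each interior pixel to the nearest exterior pixel.
    dist_in = distance_transform_edt(mask)
    return dist_in - dist_out

# Toy example: a 3x3 "building" in a 7x7 tile.
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1
d = signed_distance(mask)  # d > 0 inside the footprint, d < 0 outside
```

Provided footprints do not overlap, thresholding this raster at zero recovers the original mask, which is what makes the transform invertible.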

QGIS is an open-source tool for managing and editing GeoTIFFs and geoJSON files. In Image 1, we present a 3-band image from the SpaceNet collection overlaid with the geoJSON for building footprints as red polygons.

Image 1: geoJSON Labels for buildings overlayed on 3-band GeoTIFF. The image was created using QGIS.
Image 2: Distance to the Boundary (Saturated). The image was created in Python using matplotlib. Since we are interested in regions near the boundary, we saturate the interior pixels to Red and exterior pixels to Blue — the boundary is yellow.

Applying the distance transform to the geoJSON features, we obtain a continuously valued label. The transform is useful in separating buildings that are close together — this separation improves object detection based on Intersection-over-Union described below.

Architecting the Classifier.

We build a fully convolutional neural network (FCNN), which we call CosmiQNet, to classify a pair of input images, one 3-band and one 8-band. The output of the classifier is a distance transform.

Figure 1: Blackbox Fully Convolutional Neural Network: CosmiQNet. The inputs are at two resolutions and the output distance transform matches the lower of the input resolutions. The resolution of the 8-band GeoTIFF is roughly one quarter (in each dimension) the resolution of the 3-band GeoTIFF; the difference in resolution is depicted by the scale of the GeoTIFFs.

We include bypass parameter weights in each layer of CosmiQNet to facilitate training and to provide feedback on the required depth. This architecture was developed for another application but performs reasonably here as well. If you want to hear more about these FCNNs, I plan to present them at GTC DC 2016 in the context of a different application.

The philosophy of CosmiQNet is that each additional layer refines the output of the network. The refinement is integrated into the training algorithm: each additional layer is first locally optimized, then all the previous layers are optimized together with the new layer. The network is extended by optimizing a convex combination of the previous layer's output and the current layer's output, producing a weight for the new layer that measures its contribution to the final output. As the network grows (training deeper layers), the weights decrease: each deeper layer is a slight perturbation of the identity map.
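The convex-combination step can be sketched in a few lines of NumPy (a toy stand-in for the TensorFlow implementation; `layer_fn` and `alpha` are hypothetical placeholders for a learned layer and its trained bypass weight):

```python
import numpy as np

def refine(output_prev, layer_fn, alpha):
    """One refinement step: a convex combination of the previous output
    and the new layer's output. As alpha -> 0, the step approaches the
    identity map on the previous output."""
    return alpha * layer_fn(output_prev) + (1.0 - alpha) * output_prev

# Illustration with a made-up "layer": a small alpha in a deep layer
# means the refinement is only a slight perturbation of the identity.
x = np.ones((4, 4))
layer = lambda t: t + 1.0           # stand-in for a learned convolution
out = refine(x, layer, alpha=0.1)   # 0.9 * x + 0.1 * (x + 1)
```

With `alpha = 0` the step returns `output_prev` unchanged, which is why the trained weights can be read as a measure of each layer's contribution.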

Figure 2: CosmiQNet v1.

Below is a code snippet written in Python using TensorFlow to create the FCNN we call CosmiQNet. The code includes the architecture, the cost function, and the optimizers used in training.

Code Gist 1: CosmiQNet v1 in Python using TensorFlow.

This architecture loses very little information at each layer. However, repeated perturbations of the distance transform increase the dependence on the first layer of CosmiQNet to integrate all 11 imagery bands. Alternatively, we can feed the original imagery into each additional layer along with the output of the previous layer; we call this variant CosmiQNet v2. For both architectures, the deep layers empirically (and perhaps theoretically as well) converge to perturbations of the identity map. CosmiQNet v2 performs slightly better than the original CosmiQNet at the cost of additional training time and a roughly 12-fold increase in the parameter space.

Code Gist 2: CosmiQNet v2 in Python using TensorFlow
Figure 3: CosmiQNet v2

The cost function is the square of the Euclidean distance between the output and the distance transform derived from the label. Since we care mostly about the boundary, there are likely better cost functions that focus more on the boundary.

Testing the Classifier.

CosmiQNet is constructed to be trained a layer at a time. Each successive layer refines the previous layer. To train a layer, first the parameters specific to that layer are optimized, and second all of the previous parameters are optimized as well.
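The two-phase schedule can be sketched as follows, with `optimize` standing in (hypothetically) for running a TensorFlow optimizer over a specific variable list:

```python
def train_layerwise(layers, optimize):
    """Layer-at-a-time training schedule.

    `optimize(params)` is a hypothetical stand-in for running an
    optimizer over the given parameter list. For each new layer we first
    optimize only that layer's parameters, then jointly optimize all
    parameters up to and including the new layer.
    """
    for n in range(len(layers)):
        optimize([layers[n]])        # phase 1: the new layer only
        optimize(layers[: n + 1])    # phase 2: all layers so far

# Record which parameter groups each phase would touch.
calls = []
train_layerwise(["layer1", "layer2"], lambda params: calls.append(list(params)))
```

In the actual TensorFlow code this corresponds to passing different `var_list` arguments to the optimizer for the two phases.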

Code Gist 3: Python code for training of CosmiQNet using TensorFlow
Figure 5: Layer Weights for CosmiQNet v1. The weights show the contribution of each successive layer of CosmiQNet v1.


The output of the CosmiQNet is a heatmap of predicted distances to the boundary of buildings. The post-processing converts this heatmap into a geoJSON of proposed building footprints. We employ a greedy algorithm for cluster growing (Python code):

  1. Define B to be the set of points in the heatmap that are predicted to be inside the building (positive distance from the boundary). Let b1 be the point in the heatmap with the largest value.
  2. Define the cluster, c1, as the largest connected component of B containing b1 such that there is a path of decreasing values between b1 and every point in c1. The cluster can be computationally identified with a nearest neighbor search.
  3. Remove c1 from B and repeat to find b2 and c2.

The greedy algorithm for clustering can certainly be improved. Building footprints of long, narrow buildings or non-convex buildings create erroneous output from the greedy algorithm.

We then convert the array of clusters into a geoJSON using Python GDAL commands.

Code Gist 4: Python Code using GDAL to create geoJSON.

In the creation of the vector layer, GDAL produces some extraneous polygons. These polygons can be removed with GDAL commands that select features — the feature selection can be used for further post-processing.

Code Gist 5: Python Code using GDAL to select features from geoJSON

Intersection over Union

Finally, we want to compare the output of the classifier to the labeled data using the Jaccard Index. Recall that the Jaccard Index (Intersection over Union, IoU) between two sets, A and B, is defined as:

IoU(A, B) = |A ∩ B| / |A ∪ B|
The scoring algorithm that we implement is based on Algorithm 2 from the ILSVRC documentation. To compute the metric based on the Jaccard Index, we compute true positives, false positives, and false negatives based on whether the IoU between a proposal and a label is above a threshold of 0.5. Code to implement the evaluation of the metric is available in the SpaceNetChallenge GitHub repository.
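A mask-based sketch of the metric (the actual SpaceNetChallenge code operates on geoJSON polygons, and this greedy one-to-one matching is a simplification of Algorithm 2):

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boolean masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def f1_at_threshold(proposals, labels, threshold=0.5):
    """F1-score with greedy matching at an IoU threshold: each label
    is matched at most once; unmatched proposals are false positives
    and unmatched labels are false negatives."""
    matched, tp = set(), 0
    for p in proposals:
        for i, label in enumerate(labels):
            if i not in matched and iou(p, label) > threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(proposals) - tp
    fn = len(labels) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

With one proposal perfectly matching one of two labels, this gives tp = 1, fp = 0, fn = 1, so F1 = 2/3.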


Image 3: Example image from SpaceNet

Image 3 is an example image from SpaceNet that was not included in the training imagery for CosmiQNet. Image 4 shows the same SpaceNet image with the labeled building footprints. These images demonstrate that real-world labeling is imperfect: footprints are slightly misaligned or even missing. Trees and shadows partially obstruct several footprints. These complexities increase the difficulty of hand labeling and necessitate robust machine learning algorithms.

Image 4: SpaceNet Image with Ground Truth Overlays (blue).
Image 5: Heat map example of ground truth distance transform. Red is interior, yellow is the boundary, and blue is exterior.

For the distance transform to be a good predictor of buildings, we must assume that the footprints of buildings do not overlap. The slightest overlap essentially merges the two distinct buildings into one. Also, the strategy assumes simple connectivity of buildings (no holes) — this may produce some errors for non-simply connected buildings.

Image 6: Output of CosmiQNet v1 to predict the distance transform. Red is interior, yellow is the boundary, and blue is exterior.

Since the output layer of CosmiQNet is a distance transform, training the network minimizes the Euclidean distance to the distance transform derived from the label. This cost function may not be optimal for finding footprints. Further analysis of the distance transforms provides a measure of how much noise there is in the output layer; as the network trains, this measure provides valuable feedback without requiring more expensive post-processing.

Image 7: Output of clustering algorithm applied to CosmiQNet v1. Different colors correspond to the prediction of different building footprints.

The greedy algorithm for growing clusters from the distance transform depends on a threshold; we selected a 0.5 exterior/interior threshold. The clusters can also be dilated to grow them. Since the scale is known and some assumptions can be made about the size of the building footprints, one could choose these parameters even on a per-cluster basis. We did not investigate optimizing these parameters at this time.

Image 8: CosmiQNet-v1 Proposals above IoU Threshold of 0.5 (green)

Finally, we have some IoUs. The green polygons are the true positives of the algorithm with the IoU label on the polygon.

Image 9: CosmiQNet-v1 Proposals above IoU Threshold 0.25 (yellow). The true positives with respect to the 0.5 threshold are also plotted in green.

We also show an overlay with near-misses: proposed building footprints with IoU below the threshold that could become true positives with refinement of post-processing and better training of CosmiQNet.

Image 10: CosmiQNet-v1 Proposal with Labeled IoU. The red proposal in the Northeast region that is being counted as a false positive is a result of a ground truth labeling mistake. Yellow regions are false positives that are close to the threshold and green regions are true positives.

All proposed building footprints are shown with their respective IoUs. Even the significant misses carry relevant information that could be useful for secondary testing or counting. The greedy clustering algorithm does not take into account the (mostly) rectangular shape of the labeled building footprints. Post-processing that includes this ansatz could improve the accuracy of the algorithm.


CosmiQNet v2 predicts the distance transform better than CosmiQNet v1, but does not significantly differ in performance with respect to the IoU metric at a threshold of 0.5. There are several parameters to be optimized and post-processing that can be done to improve the performance of CosmiQNet.

A bounding box is the smallest axis-aligned rectangular region that contains the building footprint. In Image 11, the bounding boxes are shown in yellow and green for the building footprints in blue. The area of the building footprint is always less than the area of its bounding box; in some cases the difference is large enough that the IoU between the bounding box and the ground truth footprint falls below the 0.5 threshold. The F1-score for using bounding boxes as proposals is 0.729.
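Since a footprint is contained in its bounding box, the intersection is the footprint itself and the union is the box, so the IoU reduces to a ratio of areas. A small illustration with made-up areas:

```python
def bbox_iou(footprint_area, bbox_area):
    """IoU between a footprint and its own bounding box. Because the
    footprint is a subset of the box, intersection = footprint area and
    union = box area."""
    return footprint_area / bbox_area

# A rectangular building fills its box exactly: IoU = 1.0.
# An L-shaped building covering 3 of the 4 quadrants of its box:
iou_L = bbox_iou(3.0, 4.0)   # 0.75, a true positive at the 0.5 threshold
# A long diagonal building covering 40% of its box:
iou_diag = bbox_iou(0.4, 1.0)  # 0.4, falls below the threshold
```

So a bounding-box proposal is a false positive exactly when the footprint fills less than half of its box, which is how the yellow rectangles in Image 11 arise.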

Image 11: Example image with ground truth derived bounding boxes. Blue polygons are ground truth building footprints. Green rectangles are bounding boxes derived from ground truth footprints and have an IoU with the corresponding footprint greater than the 0.5 threshold (true positives). Yellow rectangles are bounding boxes derived from the ground truth footprints but have an IoU less than 0.5 (false positives).

Another measure of comparison is the application of the greedy clustering algorithm to the ground truth distance transform. The F1-score for this procedure is 0.54, indicating that improving the post-processing should have a large impact on the accuracy of the classifier. The F1-scores for CosmiQNet version 1 and version 2 are 0.19 and 0.20 respectively. A future blog post will discuss the details of the metric and methods to improve performance of the classifiers.

The F1-scores above may seem low compared to the touted accuracies of the top performers in the ImageNet Large Scale Visual Recognition Challenge. But it is important to keep the following in mind:

  • The algorithms from the ILSVRC have been optimized for the ImageNet data sets.
  • ImageNet uses bounding boxes for object detection as opposed to the polygonal building footprint labels in SpaceNet.
  • Some SpaceNet images have a large number of tightly packed buildings; in these cases, it is easier to label a region as being in a building than it is to separate the buildings from each other.