A deep dive into the SpaceNet 4 winning algorithms

Augmentation approaches, loss functions, and learning objectives

Note: SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet member organizations. To learn more visit https://spacenet.ai.

The SpaceNet Challenge Round 4: Off-Nadir Building Detection Challenge hosted by TopCoder completed recently, and we’ve had an opportunity to examine the competitors’ solutions. In this post we highlight a few key differentiators that improved segmentation in the winning algorithms. For a high-level overview of the competition see our earlier post, and for a summary of the solutions see this post.

Challenges with the SpaceNet Off-Nadir Dataset

Identifying buildings in the SpaceNet 4 Off-Nadir Dataset posed a few unusual challenges. For example, building density varies dramatically across different regions of Atlanta, the city covered in the dataset. Therefore algorithms had to accommodate both dense areas and sparse areas:

Building density varies dramatically across Atlanta, and competitors needed to develop algorithms that could accurately segment buildings in both low-density (left) and high-density (right) areas. And these are just the suburban/rural areas — the urban center is very different! Imagery courtesy of DigitalGlobe.

Building appearance was very different across the images collected at different angles: buildings appeared “normal” in the images taken directly overhead, while they were very distorted in the off-nadir collections. Furthermore, the apparent resolution dropped significantly as look angle increased:

The city appears very different in nadir (left) vs. very off-nadir (right) collections, both in distortion and resolution terms. For more on look angle, see our previous post. Imagery courtesy of DigitalGlobe.

Finally, some collections looked directly into shadowed areas of buildings, whereas others had very bright reflections from sunlit areas:

The same suburban area imaged at nearly identical look angles (30 degrees off-nadir, left, vs. 32 degrees off-nadir, right), but from different directions: the left image was taken from the South side of the city, whereas the right image was taken from the North. Shadows make it harder to see buildings in the right image. Imagery courtesy of DigitalGlobe.

Most computer vision tasks do not face these challenges, and “standard” computer vision algorithms don’t necessarily address them well. For example, our baseline model was a simple implementation of a popular deep learning model for computer vision, and it did a very poor job of identifying buildings in very off-nadir imagery. The competitors’ solutions improved on this baseline by almost 300% in these very off-nadir images. How did they do it? After analyzing their solutions, we think three key details helped: augmentation strategy, loss functions, and learning objectives. Let’s look at each in turn.

Augmentation strategy

Several competitors discussed image augmentation in their solution descriptions. As in our baseline model, many competitors started out by flipping and rotating images. However, they found that these augmentations did not improve their models’ performance; in fact, they harmed their scores! After thinking more about off-nadir imagery, we think we know why. Placing building footprints correctly in off-nadir images requires not only identifying the buildings, but also accounting for distortion. In off-nadir looks, building roofs are displaced relative to their footprints on the ground:

Buildings are perfectly outlined in the nadir looks, but the footprint is offset in distorted off-nadir looks. Competitors’ solutions needed to account for this. Imagery courtesy of DigitalGlobe.

To place footprints effectively in off-nadir images, algorithms needed not only to find the building but also to adjust for this displacement. Algorithms needed to “know” which direction the footprint was displaced, but this is hard to learn if images are rotated or flipped, because each such augmentation changes the displacement direction at random.
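To make the effect of flipping concrete, here is a minimal sketch in plain Python. The tiny roof and footprint masks and the helper names (`centroid`, `displacement`, `hflip`) are hypothetical illustrations, not taken from any competitor's code. It measures the roof-to-footprint offset before and after a horizontal flip:

```python
def centroid(mask):
    """Return the (row, col) centroid of the nonzero cells of a 2D list."""
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


def displacement(roof, footprint):
    """Vector (d_row, d_col) from the footprint centroid to the roof centroid."""
    (roof_r, roof_c), (foot_r, foot_c) = centroid(roof), centroid(footprint)
    return (roof_r - foot_r, roof_c - foot_c)


def hflip(mask):
    """Mirror a mask left-to-right, as a horizontal-flip augmentation would."""
    return [row[::-1] for row in mask]


# Hypothetical 4x6 tile: the roof appears two pixels to the right of the footprint.
footprint = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
roof = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

original = displacement(roof, footprint)
flipped = displacement(hflip(roof), hflip(footprint))
print(original)  # (0.0, 2.0): roof displaced to the right
print(flipped)   # (0.0, -2.0): after the flip, roof displaced to the left
```

A model trained on a random mix of flipped and unflipped tiles would see both offsets for the same viewing geometry, which is consistent with the competitors' observation that these augmentations hurt.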

There’s an important corollary to the approach the competitors took: these algorithms are unlikely to generalize to new off-nadir looks with building roofs displaced in a different direction.

Loss functions

One major challenge for segmenting buildings in overhead imagery is their sparsity relative to the objects in natural photographs. The objects classified in the ImageNet dataset typically occupy a substantial fraction of the pixels in the image; by contrast, buildings make up less than 5% of the pixels in the SpaceNet Off-Nadir dataset. This poses a major challenge to segmentation algorithms because few loss functions can overcome the “all-zero valley”: predicting that no pixels correspond to buildings already yields more than 95% pixel accuracy. To overcome this challenge, competitors generally used a composite loss function comprising a binary cross-entropy loss variant alongside a loss function that specifically targets positive predictions: either Dice coefficient loss or Jaccard loss. The top two competitors trained their models with a composite of Dice and the relatively new Focal Loss, a binary cross-entropy variant that down-weights pixels the model already classifies confidently and correctly, so that hard, misclassified pixels dominate the loss. These loss functions, combined with the competitors’ advanced segmentation objective masks, yielded high-fidelity building footprint extraction.
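The following is a minimal sketch of a Dice plus focal composite loss, written in plain Python on flat lists so it stays dependency-free; a real pipeline would use a tensor library. The weights, the focusing parameter `gamma=2.0`, the smoothing term, and the 100-pixel tile are assumptions for illustration, not the competitors' actual settings:

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        p_t = p if t == 1 else 1.0 - p       # probability given to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(p) + sum(t) + smooth)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * overlap + smooth) / (sum(probs) + sum(targets) + smooth)


def composite_loss(probs, targets, dice_weight=1.0, focal_weight=1.0):
    """Weighted sum of Dice and focal loss (weights are illustrative assumptions)."""
    return (dice_weight * dice_loss(probs, targets)
            + focal_weight * focal_loss(probs, targets))


# Hypothetical tile of 100 pixels, 5 of which are building.
targets = [1] * 5 + [0] * 95
all_zero = [0.0] * 100              # predicts "no building" everywhere
good = [0.9] * 5 + [0.1] * 95       # mostly correct, moderately confident

accuracy_all_zero = sum(
    (p >= 0.5) == (t == 1) for p, t in zip(all_zero, targets)
) / len(targets)
print(accuracy_all_zero)                                   # 0.95
print(composite_loss(all_zero, targets) > composite_loss(good, targets))  # True
```

Pixel accuracy rewards the all-zero prediction with 95%, whereas the composite loss is clearly higher for it than for the mostly correct prediction, which is the property that lets training escape the valley.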

Objective Masks

Neural networks for segmentation are generally trained to generate “pixel masks”: per-pixel probability maps, with values between 0 and 1, denoting the likelihood that each pixel belongs to an object class (a semantic segmentation mask). However, our building detection challenges don’t stop there: we ask competitors to generate polygons labeling every building separately, making this an instance segmentation task. This is challenging because competitors must ensure that segmentation outputs for neighboring buildings don’t touch one another, or else the algorithm may fuse them into one instance prediction:

Separating nearby instances of the same object class is essential for instance segmentation. From top to bottom, an image of three buildings is passed through the instance segmentation process. First, a deep learning algorithm creates semantic segmentation masks that label pixels it believes correspond to buildings. On the left, it does so imperfectly, labeling non-building regions between adjacent buildings (red arrows). Next, contiguous mask regions are labeled as individual instances. Because the entire mask is interconnected in the left case, it gets labeled as a single large building; by contrast, the three buildings were separated in the right mask and therefore are labeled as three separate buildings. Our scoring metric would have marked the left case as a failure for all three buildings and the right case as correct for all three, with the only difference being the small arrowed mask area. Imagery courtesy of DigitalGlobe.
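The instance-labeling step described above can be sketched as a connected-component pass over a binary mask. This is a minimal plain-Python illustration with hypothetical toy masks; 4-connectivity is an assumption, and production code would typically use an image-processing library instead:

```python
from collections import deque


def label_instances(mask):
    """Label 4-connected regions of nonzero cells; returns (labels, count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                count += 1                      # found a new, unlabeled region
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Three buildings, cleanly separated by background columns.
separated = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 1],
]
# Same buildings, but stray predicted pixels bridge the gaps in the second row.
fused = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(label_instances(separated)[1])  # 3
print(label_instances(fused)[1])      # 1
```

Two stray pixels are enough to turn three correct detections into one oversized instance, which is why small errors between buildings are so costly under an instance-level metric.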

To aid their algorithms in learning building separation, many competitors added one to two additional channels to their neural net objectives:

  1. Building outline labels
  2. Contact points between very closely juxtaposed buildings

These two additional channels are roughly equivalent to providing additional classes for the algorithm to learn.

Combining the building footprint channel with these two additional channels shows the layout of the objective that the competitors trained their algorithms to predict:

A sample segmentation mask from the competition winner’s solution description. Each color corresponds to a different objective mask channel. Blue marks building footprints; pink outlines the buildings; green denotes points where buildings are closely juxtaposed.
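As a rough illustration of how such a three-channel target could be derived from instance labels, here is a minimal plain-Python sketch. The exact definitions used here are assumptions (outline: a building pixel with a 4-neighbor that is not the same building; contact: a background pixel whose 4-neighbors include two different buildings), as is the function name `objective_channels`. The competitors' own definitions may differ:

```python
def objective_channels(instances):
    """Build (footprint, outline, contact) channels from an instance-id grid.

    instances: 2D list with 0 for background and k > 0 for building k.
    """
    rows, cols = len(instances), len(instances[0])

    def neighbours(r, c):
        # Instance ids of the in-bounds 4-neighbors of (r, c).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                yield instances[nr][nc]

    footprint = [[int(v > 0) for v in row] for row in instances]
    outline = [[0] * cols for _ in range(rows)]
    contact = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = instances[r][c]
            around = set(neighbours(r, c))
            if here:
                # Outline: building pixel touching anything that is not this building.
                outline[r][c] = int(any(n != here for n in around))
            else:
                # Contact: background pixel squeezed between two different buildings.
                contact[r][c] = int(len(around - {0}) >= 2)
    return footprint, outline, contact


# Hypothetical tile: two buildings separated by a one-pixel gap.
instances = [
    [1, 1, 1, 0, 2, 2, 2],
    [1, 1, 1, 0, 2, 2, 2],
    [1, 1, 1, 0, 2, 2, 2],
]
footprint, outline, contact = objective_channels(instances)
print(footprint[1])  # [1, 1, 1, 0, 1, 1, 1]
print(outline[1])    # [0, 0, 1, 0, 1, 0, 0]
print(contact[1])    # [0, 0, 0, 1, 0, 0, 0]
```

In this sketch, pixels at the image border are not treated as outline, which is a simplification; a real implementation would also likely dilate the outline and contact regions by a few pixels.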

In post-processing, competitors subtracted the outline and contact regions and used watershed algorithms to separate very nearby buildings. Notably, this approach did not work for everyone: the 5th place competitor, XD_XD, indicated that labeling contact points did not improve his algorithm.
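The post-processing idea can be sketched as follows. This is a simplified stand-in for a true watershed, assuming already-binarized channels: seeds are footprint pixels minus outline and contact pixels, and labels are then grown breadth-first back over the remaining footprint (but not over contact pixels). The function name `separate_buildings` and the toy prediction are hypothetical:

```python
from collections import deque

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def separate_buildings(footprint, outline, contact):
    """Seeded region growing; returns (labels, number_of_buildings)."""
    rows, cols = len(footprint), len(footprint[0])
    labels = [[0] * cols for _ in range(rows)]
    frontier = deque()
    count = 0

    def is_seed(r, c):
        return footprint[r][c] and not outline[r][c] and not contact[r][c]

    # Step 1: label connected seed regions (building interiors).
    for r in range(rows):
        for c in range(cols):
            if is_seed(r, c) and not labels[r][c]:
                count += 1
                labels[r][c] = count
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    frontier.append((y, x))
                    for dy, dx in NEIGHBOURS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and is_seed(ny, nx) and not labels[ny][nx]):
                            labels[ny][nx] = count
                            stack.append((ny, nx))

    # Step 2: grow each label outward over unlabeled footprint pixels,
    # skipping predicted contact pixels so neighbors stay apart.
    while frontier:
        y, x = frontier.popleft()
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < rows and 0 <= nx < cols and footprint[ny][nx]
                    and not contact[ny][nx] and not labels[ny][nx]):
                labels[ny][nx] = labels[y][x]
                frontier.append((ny, nx))
    return labels, count


# Hypothetical fused prediction: the footprint channel covers the gap (column 4).
footprint = [[1] * 9 for _ in range(3)]
outline = [[0, 0, 0, 1, 0, 1, 0, 0, 0] for _ in range(3)]
contact = [[0, 0, 0, 0, 1, 0, 0, 0, 0] for _ in range(3)]

labels, count = separate_buildings(footprint, outline, contact)
print(count)      # 2
print(labels[0])  # [1, 1, 1, 1, 0, 2, 2, 2, 2]
```

Labeling the footprint channel alone would yield a single fused instance here; using the extra channels recovers two buildings. A real watershed would additionally use the predicted probabilities to decide where the boundary between growing regions falls.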

Conclusion and looking forward

Competitors used their loss functions, learning objectives, and augmentation strategy to address the unique challenges posed by the SpaceNet Off-Nadir Buildings Challenge task and data. They cut common image augmentations (rotation and flipping) from their pipelines so their algorithms could learn the direction in which roofs are displaced from footprints. They used loss functions suited to a low foreground-to-background class ratio to ensure algorithms learned to find the relatively uncommon building pixels. Finally, they used advanced learning objectives to effectively separate buildings for this instance segmentation task.

Though this deep dive covered many details of how competitors identified buildings in off-nadir imagery, it doesn’t cover everything. In the next post, we will look at where algorithms performed well and where they failed: did different competitors’ models miss the same buildings, or did each model have unique failures? What would have happened if we had set different thresholds for building segmentation — for example, an IoU cutoff of 0.75 instead of 0.5? What types of objects yielded false positive predictions in the competitors’ solutions? For this and more, follow us at https://medium.com/the-downlinq and on Twitter @CosmiQWorks and @NickWeir09!