A deep dive into the SpaceNet 4 winning algorithms
Augmentation approaches, loss functions, and learning objectives
Note: SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet member organizations. To learn more visit https://spacenet.ai.
The SpaceNet Challenge Round 4: Off-Nadir Building Detection Challenge hosted by TopCoder completed recently, and we’ve had an opportunity to examine the competitors’ solutions. In this post we highlight a few key differentiators that improved segmentation in the winning algorithms. For a high-level overview of the competition see our earlier post, and for a summary of the solutions see this post.
Challenges with the SpaceNet Off-Nadir Dataset
Identifying buildings in the SpaceNet 4 Off-Nadir Dataset posed a few unusual challenges. For example, building density varies dramatically across different regions of Atlanta, the city covered in the dataset. Therefore algorithms had to accommodate both dense areas and sparse areas:
Building appearance was very different across the images collected at different angles: buildings appeared “normal” in the images taken directly overhead, while they were very distorted in the off-nadir collections. Furthermore, the apparent resolution dropped significantly as look angle increased:
Finally, some collections looked directly into shadowed areas of buildings, whereas others had very bright reflections from sunlit areas:
Most computer vision tasks do not face these challenges, and “standard” computer vision algorithms don’t necessarily address them well. For example, Our baseline model was a simple implementation of a popular deep learning model for computer vision, which did a very poor job of identifying buildings in very off-nadir imagery. The competitors’ solutions improved on this baseline by almost 300% in these very off-nadir images — how did they do it? After analyzing their solutions, we think there were three key details that helped: augmentation strategy, loss functions, and learning objectives. Let’s look at how each of those helped.
Several competitors discussed image augmentation in their solution descriptions. As in our baseline model, many competitors started out by flipping and rotating images. However, they found that these augmentations did not improve their models’ performance — in fact, it harmed their scores! After thinking more about off-nadir imagery, we think we know why. Placing building footprints correctly in off-nadir images requires not only identifying the buildings, but also accounting for distortion. In off-nadir looks, the roof of the buildings is displaced relative to their footprints on the ground:
To effectively place footprints in off-nadir images, algorithms needed to not only find the building but also to adjust for this displacement. Algorithms needed to “know” which direction the footprint was displaced — but this is hard to learn if images are rotated or flipped, changing the direction at random.
There’s an important corollary to the approach the competitors took: these algorithms are unlikely to generalize to new off-nadir looks with building roofs displaced in a different direction.
One major challenge for segmenting buildings in overhead imagery is their relative sparsity compared to other object types in natural photographs. The objects classified in the ImageNet dataset comprise a substantial fraction of the pixels in the image; by contrast, buildings make up fewer than 5% of the pixels in the SpaceNet Off-Nadir dataset. This poses a major challenge to segmentation algorithms because few loss functions can overcome the “all-zero valley”, as predicting that no pixels correspond to buildings yields 95% accuracy. To overcome this challenge competitors generally used a composite loss function comprising a binary cross-entropy loss variant alongside a loss function that specifically targets positive predictions: either Dice coefficient loss or Jaccard loss. The top two competitors trained their models with a composite of Dice and the relatively new Focal Loss, a binary cross-entropy variant that penalizes low-confidence predictions more strongly. These loss functions combined with the competitors’ advanced segmentation objective masks yielded high-fidelity building footprint extraction.
Neural networks for segmentation are generally trained to generate “pixel masks”, which are 0–1 probability density maps denoting the likelihood that each pixel corresponds to an object class (a semantic segmentation mask). However, our building detection challenges don’t stop there: we ask competitors to generate polygons labeling every building separately, making this an instance segmentation task. This is challenging because competitors must ensure that segmentation outputs for buildings don’t contact one another, or else the algorithm may fuse them into one instance prediction:
To aid their algorithms in learning building separation, many competitors added one to two additional channels to their neural net objectives:
- Building outline labels,
- Contact points between very closely juxtaposed buildings
These two additional channels are roughly equivalent to providing additional classes for the algorithm to learn.
Combining these three channels shows the layout of the objective that the competitors trained their algorithms to predict:
In post-processing, competitors subtracted the outline and contact regions and used watershed algorithms to separate very nearby buildings. Notably, this approach did not work for everyone: the 5th place competitor, XD_XD, indicated that labeling contact points did not improve his algorithm.
Conclusion and looking forward
Competitors used their loss functions, learning objectives, and augmentation strategy to address the unique challenges posed by the SpaceNet Off-Nadir Buildings Challenge task and data. They cut common image augmentations (rotation and flipping) from their pipeline so their algorithms could learn offset. They used loss functions optimized for a low foreground-to-background class ratio to ensure algorithms learned to find the relatively uncommon building pixels. Finally, they used advanced learning objectives to effectively separate buildings for this instance segmentation task.
Though this deep dive covered many details of how competitors identified buildings in off-nadir imagery, it doesn’t cover everything. In the next post, we will look at where algorithms performed well and where they failed: did different competitors’ models miss the same buildings, or did each model have unique failures? What would have happened if we had set different thresholds for building segmentation — for example, an IoU cutoff of 0.75 instead of 0.5? What types of objects yielded false positive predictions in the competitors’ solutions? For this and more, follow us at https://medium.com/the-downlinq and on Twitter @CosmiQWorks and @NickWeir09!