The SpaceNet Challenge Off-Nadir Buildings: Introducing the winners
SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet LLC member organizations.
The SpaceNet Challenge: Round 4, Off-Nadir Building Footprint Extraction hosted by TopCoder is complete! Congratulations to cannab, selim_sef, MaksimovKA, number13, and XD_XD for taking home the top 5 prizes. Check out this GitHub repository for their solution algorithm code. Their solutions represented a 1.5-fold improvement over our initial baseline model’s performance. This is the first post in a series where we’ll dig into the competitors’ solutions to find out how they applied machine learning to identify buildings in overhead imagery taken at look angles from 7 to 54 degrees off-nadir.
The Challenge
In Round 4 of the SpaceNet Challenge we asked competitors to identify building footprints from 27 different satellite collects taken during a single pass of a WorldView 2 satellite over Atlanta. The data comprised both south-facing and north-facing collects, with look angles ranging from 7 degrees off-nadir to a whopping 54 degrees off-nadir. For more details about the challenge (or if you’re unfamiliar with that terminology), check out these earlier posts describing the competition:
Introducing the SpaceNet Off-Nadir Imagery Dataset
A Baseline Model for the SpaceNet Off-Nadir Building Detection Challenge
Challenges with the SpaceNet 4 off-nadir satellite imagery
This extremely challenging competition drew 238 registrants, and 21 competitors beat our baseline model.
Key takeaways
Before we get into the nitty-gritty, here are our main takeaways from running and scoring the challenge:
Building detection in “ideal” imagery is still far from perfect
In these perfectly cloudless, high-resolution WorldView 2 images over Atlanta, even when we restrict to the images looking nearly straight down (look angle of <25 degrees off-nadir, the “nadir” set), the best algorithm still missed about 21% of buildings.
Furthermore, about one out of every six buildings the winner predicted in the nadir set did not match up to a building on the ground (i.e. was a “false positive”).
There’s still a lot of room for improvement in automated analysis of overhead imagery!
It is very hard to identify buildings at very off-nadir look angles
Performance at look angles >40 degrees off-nadir was on average 30% worse than at <25 degrees off-nadir. The winners positively identified only about 50% of the buildings in these very off-nadir images, and about one in three of their predictions were false positives.
If you want to use algorithms to identify objects in very off-nadir imagery, be aware of these performance limitations.
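To put the miss rates and false-positive rates quoted above on a single scale, the sketch below combines them into precision, recall, and an F1 score. This is back-of-the-envelope arithmetic on the rounded figures in this post, not the official challenge scoring; the use of F1 here and the function name `precision_recall_f1` are assumptions made for illustration.

```python
def precision_recall_f1(recall, false_positive_fraction):
    """Combine a recall and a false-positive fraction into precision and F1.

    false_positive_fraction is the share of predictions that match no real
    building, so precision = 1 - false_positive_fraction.
    """
    precision = 1.0 - false_positive_fraction
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Nadir figures from the text: ~21% of buildings missed, ~1 in 6 predictions false.
nadir = precision_recall_f1(recall=1.0 - 0.21, false_positive_fraction=1.0 / 6.0)

# Very off-nadir figures from the text: ~50% found, ~1 in 3 predictions false.
very_off_nadir = precision_recall_f1(recall=0.50, false_positive_fraction=1.0 / 3.0)

print("nadir F1 (approx.):", round(nadir[2], 2))
print("very off-nadir F1 (approx.):", round(very_off_nadir[2], 2))
```

Because the inputs are rounded, the outputs are only rough indicators of how far apart the two regimes are.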
Big ensembles of models help, but the juice may not always be worth the squeeze
The winning algorithm, an ensemble of 28 convolutional neural networks (CNNs) with a gradient boosting machine (GBM) filter to remove bad buildings, yielded only a 5% improvement over the #5 algorithm, which comprised only 3 CNNs. This came at a substantial computing cost: the winning algorithm takes more than 10 times longer to identify buildings in new images than the 5th place competitor’s algorithm! With a difference of 68 vs. 6.5 seconds per square km of prediction, this may represent the difference between an algorithm that is only academically relevant and a deployable solution.
Let’s dive into the details!
Winners’ algorithm performance
A summary of the winners’ results alongside the baseline model is in the table below.
The results were remarkably close: cannab’s winning solution represents only a 5% improvement over XD_XD’s 5th place solution. Much of the competitors’ improvement over the baseline came in the off-nadir and very off-nadir image subsets, where the winning solutions represented 79% and 170% improvements over the baseline model, respectively. By contrast, the algorithm with the best performance in the nadir set improved on the baseline by only 32%.
So, how did the competitors achieve these improvements? This will be the topic of the remainder of this post and several deep dives to come.
A summary of the winners’ algorithms
Every top 10 competitor who submitted a solution used convolutional neural networks (CNNs) to identify building footprints. Each competitor produced a binary output image with each pixel labeled as 1 (contains a building) or 0 (does not contain a building), which they then converted to polygons outlining individual buildings. The intricacies of the competitors’ algorithms, summarized below, varied dramatically.
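The step from a binary mask to individual buildings can be sketched as follows. This is a minimal, dependency-free illustration, not any competitor's code: it groups 4-connected building pixels with a flood fill and reports a bounding box per group, which stands in for the polygon tracing a real pipeline would perform. The function name `label_buildings` and the toy mask are hypothetical.

```python
from collections import deque


def label_buildings(mask):
    """Group 4-connected building pixels (value 1) into separate footprints.

    Returns one bounding box (min_row, min_col, max_row, max_col) per footprint.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited building pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r, min_c, max_r, max_c = r, c, r, c
            while queue:
                y, x = queue.popleft()
                min_r, max_r = min(min_r, y), max(max_r, y)
                min_c, max_c = min(min_c, x), max(max_c, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes


# Toy 4x5 mask containing two separate "buildings".
toy_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
footprints = label_buildings(toy_mask)
print(footprints)  # [(0, 0, 1, 1), (1, 3, 2, 4)]
```

In practice, geospatial libraries typically handle this conversion and emit true polygon outlines rather than boxes.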
Let’s go through a couple of details of the competitors’ solutions:
Neural net architectures
The vast majority of competitors’ algorithms performed semantic segmentation using ensembles of U-Nets with a variety of state-of-the-art image classification encoders. The exception was number13, who used Mask R-CNN, a combined object detection/classification/segmentation algorithm.
Algorithm ensembles
Every competitor in the top 5 used an “ensemble” of CNNs, i.e., they trained multiple models and combined the results for their final output. Every top-5 competitor except for number13 trained every model on imagery from all of the collects combined, and then averaged the results from all of the models. By contrast, number13 trained a separate model to identify building footprints in imagery from each collect. These approaches seem to be similarly performant, though each has its downsides: large averaging ensembles take much longer for inference, whereas the one-model-per-collect approach is less likely to generalize well to new imagery.
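As a rough illustration of the averaging approach, the sketch below takes per-pixel building probabilities from several models, averages them, and thresholds the mean to get a binary mask. The post says only that results were averaged, so the use of probability maps, the 0.5 threshold, and the name `average_ensemble` are assumptions, and the three tiny "model outputs" are made up.

```python
def average_ensemble(probability_maps, threshold=0.5):
    """Average per-pixel building probabilities across models, then binarize."""
    n_models = len(probability_maps)
    rows, cols = len(probability_maps[0]), len(probability_maps[0][0])
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            mean = sum(m[r][c] for m in probability_maps) / n_models
            row.append(1 if mean >= threshold else 0)
        mask.append(row)
    return mask


# Hypothetical 2x2 outputs from three models for the same image tile.
model_a = [[0.9, 0.2], [0.6, 0.1]]
model_b = [[0.8, 0.4], [0.3, 0.2]]
model_c = [[0.7, 0.7], [0.5, 0.0]]

ensemble_mask = average_ensemble([model_a, model_b, model_c])
print(ensemble_mask)  # [[1, 0], [0, 0]]
```

Note that model_c alone would label the top-right pixel as a building, but the ensemble mean falls below the threshold. The inference cost also scales with the number of models, since each one must run on every tile.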
Performance vs. prediction time: a trade-off
The results of this challenge highlight the cost-performance trade-off associated with using very large ensembles of very deep CNNs: cannab’s winning code was carefully optimized to fit as many CNNs as possible into the training and inference time limits provided, an effort that was rewarded with the first place prize. However, this algorithm needs a lot of computing time to identify buildings in new imagery: 68 seconds per square kilometer. This means that the algorithm would need about 5–6 hours to identify all of the buildings in a single WorldView 2 collect. By contrast, the #5 solution provided by XD_XD could do the same in approximately 25–30 minutes with only a 5% drop in fidelity. An awareness of this balance is essential when considering model selection for a deployable product.
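The arithmetic behind these estimates is straightforward to reproduce. The per-square-kilometer rates below come from this post; the collect area is not stated here, so `ASSUMED_COLLECT_AREA_KM2` is a hypothetical value chosen only because it lands inside both quoted time ranges.

```python
SECONDS_PER_KM2_WINNER = 68.0  # cannab's ensemble, as quoted in the post
SECONDS_PER_KM2_FIFTH = 6.5    # XD_XD's ensemble, as quoted in the post

# Assumption for illustration only: the post does not give a collect area.
ASSUMED_COLLECT_AREA_KM2 = 275.0


def inference_hours(area_km2, seconds_per_km2):
    """Total inference time in hours for a given area and per-km^2 rate."""
    return area_km2 * seconds_per_km2 / 3600.0


slowdown = SECONDS_PER_KM2_WINNER / SECONDS_PER_KM2_FIFTH
winner_hours = inference_hours(ASSUMED_COLLECT_AREA_KM2, SECONDS_PER_KM2_WINNER)
fifth_minutes = 60.0 * inference_hours(ASSUMED_COLLECT_AREA_KM2, SECONDS_PER_KM2_FIFTH)

print(f"slowdown: {slowdown:.1f}x")
print(f"winner: {winner_hours:.1f} h, fifth place: {fifth_minutes:.0f} min")
```

Swapping in the true area of a collect of interest gives a quick estimate of whether either ensemble fits a given processing budget.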
Competitor code
The competitors’ solutions can be found in this GitHub repository. Stay tuned for their own summaries of the code and instructions on how to download their trained models for use in inference.
There’s more coming soon!
We’ll be back shortly with another post that digs deeper into the advances these algorithms represent over the winners of past SpaceNet Challenges, including input data selection, model training details (loss functions, training objectives, cross-validation), and post-processing strategies. Stay tuned, and follow us on Twitter @CosmiQWorks and @NickWeir09 for more!