The good and the bad in the SpaceNet Off-Nadir Building Footprint Extraction Challenge
Parsing out the similarities, differences, and limitations for different geospatial image segmentation models
SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet member organizations. To learn more visit https://spacenet.ai.
This is the final post in our series about the SpaceNet Challenge Round 4: Off-Nadir Building Detection Challenge. For an intro to the dataset, see this post. For a high-level overview of the winners’ algorithms, start here.
As regular readers will know, we have spent a lot of time digging into the prize-winning solutions to the SpaceNet Off-Nadir Building Footprint Extraction Challenge hosted by TopCoder. In this final post of our series about the challenge, I’ll explore the types of buildings that models identified well and the geographic features that presented a challenge to the competitors. There were some striking similarities and differences between the predictions that the winning algorithms generated. Analyzing the solutions yielded interesting insights into model deployability, as well as what to watch for when developing a geospatial deep learning product.
This post will answer a few questions:
- How accurately did the prize-winning algorithms label each building?
- How would different intersection-over-union thresholds influence scores?
- How did each algorithm perform across different look angles?
- Did different algorithms have different false positive or false negative rates?
- How similar were the predictions from the different algorithms?
- Did building size influence the likelihood that a building would be identified?
- How did trees blocking buildings influence building detection?
Read on for the answers to these questions!
How accurately did prize-winning algorithms label each building?
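This challenge’s task was to identify building footprints. Competitors were asked to generate polygons that traced the boundary of each building. We scored each prediction by its intersection-over-union (IoU) with an actual footprint: the area where the two overlap divided by the total area they cover together. We used an IoU threshold of 0.5 to classify predictions as “good enough” to be called correct.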
The IoU threshold says a lot about what you value in an algorithm’s performance. A low IoU threshold means that you don’t care how much of a building is labeled correctly, only that some part of it is identified. This is closer to object detection methods, which only produce rectangular bounding boxes that mark where target objects exist. By contrast, a high IoU threshold demands that predictions closely trace the actual contours of a building to be deemed correct. For our competitions we use the threshold of 0.5 to strike a balance between these two extremes. It’s important to consider this threshold when evaluating computer vision algorithms for product deployment: how precisely must objects be labeled for the use case?
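To make the threshold concrete, here is a minimal sketch of an IoU check. It assumes footprints are simplified to axis-aligned rectangles given as (xmin, ymin, xmax, ymax), which is not how the challenge scored polygons; the function names and the example coordinates are hypothetical, and treating an IoU exactly equal to the threshold as a match is also an assumption.

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax); 0 if empty."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def iou(a, b):
    """Intersection-over-union of two axis-aligned rectangles."""
    # The overlap region, which may be empty (rect_area then returns 0).
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = rect_area(overlap)
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """True if the prediction overlaps the footprint well enough (assumed >=)."""
    return iou(pred, truth) >= threshold


# Hypothetical 10 x 10 footprint and a prediction shifted sideways by 2 units.
truth = (0, 0, 10, 10)
pred = (2, 0, 12, 10)

print(round(iou(pred, truth), 3))          # intersection 80, union 120
print(is_match(pred, truth, threshold=0.5))
print(is_match(pred, truth, threshold=0.75))
```

In this toy case the same prediction counts as correct at a threshold of 0.5 but not at 0.75, which is the trade-off described above.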
We explored how changing this IoU threshold to be larger or smaller than 0.5 would have affected algorithms’ scores and got some intriguing results. See the graph below for a schematic of what fraction of the buildings each algorithm identifies correctly as we change this threshold:
The graph above shows how important this threshold is: had the threshold been set lower, MaksimovKA (blue) and XD_XD (orange) would probably have nearly matched the winner, cannab (green), on the leaderboard! In the inset you can see that increasing the threshold from 0.2 to 0.5 substantially reduces their recall relative to the other competitors.
Some competitors identified many more buildings than others, but with such low IoU that the identifications were not classified as positive IDs. For example, see the graphs below comparing XD_XD’s predictions and selim_sef’s predictions:
XD_XD’s algorithm correctly identified part of a lot of buildings in the very off-nadir subset — about 70% of the actual buildings in the imagery. By contrast, selim_sef’s algorithm only found a part of about 57% of the buildings. However, almost 30% of XD_XD’s very off-nadir building predictions were not good enough to satisfy the IoU threshold of 0.5, whereas that was only true for 9% of selim_sef’s very off-nadir predictions. As a result, the two had nearly identical recall scores in the very off-nadir imagery.
How would changing the IoU threshold have influenced each competitor’s recall scores? Let’s check three possible thresholds: 0.25, 0.5, and 0.75:
As you can see, each competitor’s correct identification rate (purple) drops as the threshold is increased; however, some drop more than others based on how well each prediction matched its respective building. The orange portion of the bars represents buildings that were identified, but “not well enough”: those with an IoU score greater than zero but less than the threshold. The red bars represent buildings that were missed completely (an IoU of 0), a population which, of course, is unaffected by the IoU threshold.
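The three-way breakdown described above can be sketched in a few lines. This is an illustration only: it assumes we already have, for each actual building, the best IoU achieved by any prediction, and the values in `best_ious` are made up rather than taken from any competitor.

```python
from collections import Counter


def categorize(best_iou, threshold):
    """Label one actual building by the best IoU any prediction achieved on it."""
    if best_iou == 0:
        return "missed"            # no prediction touched the building
    if best_iou >= threshold:
        return "correct"           # good enough to count as a positive ID
    return "below_threshold"       # found, but "not well enough"


def breakdown(best_ious, threshold):
    """Fraction of buildings in each category at a given IoU threshold."""
    counts = Counter(categorize(v, threshold) for v in best_ious)
    n = len(best_ious)
    return {k: counts[k] / n for k in ("correct", "below_threshold", "missed")}


# Hypothetical best-IoU values for ten buildings.
best_ious = [0.0, 0.1, 0.3, 0.45, 0.6, 0.7, 0.8, 0.9, 0.0, 0.55]

for t in (0.25, 0.5, 0.75):
    print(t, breakdown(best_ious, t))
```

Raising the threshold moves buildings from "correct" to "below_threshold", while the "missed" fraction stays fixed.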
Performance by look angle
Next, let’s examine how each competitor’s algorithm performed at every different look angle. We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings, not false positives), and F1 score, the competition metric, which is the harmonic mean of precision and recall:
Unsurprisingly, the competitors performed very similarly in these graphs, consistent with their tight packing at the top of the leaderboard. Most notable is where the separation that did exist arose: the competitors were very tightly packed in the “nadir” range (0–25 degrees). Indeed, the only look angles with substantial separation between the top two (cannab and selim_sef) were those greater than 45 degrees. cannab seems to have won on the strength of his algorithm’s performance on very off-nadir imagery!
An interesting takeaway from the bottom two graphs is that competitors had a bigger separation in their precision than in their recall, meaning that there was more variation in false positive rates than false negative rates. With the exception of a substantial drop-off by selim_sef’s algorithm in the most off-nadir imagery, the five competitors’ recall scores were almost identical throughout the range of look angles. By contrast, selim_sef had markedly better precision in the very off-nadir images than any other competitor, though cannab also clearly beat the other three prize-winners in this metric. cannab and selim_sef were the only two competitors to use gradient boosting machines to filter false positives out of their predictions, which likely gave them the upper hand in precision.
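For reference, the link between these metrics and the error types can be sketched as follows. The function name and the counts are hypothetical; the formulas are the standard definitions of precision, recall, and F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # hurt by false positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # hurt by false negatives
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: two models with the same recall but different precision.
for name, counts in {"model_a": (80, 20, 40), "model_b": (80, 60, 40)}.items():
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Two models that miss the same number of buildings have identical recall, so any difference in their F1 scores comes from false positives, which is the pattern described above.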
One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges. The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings:
This pattern was even stronger in our baseline model. Look angle isn’t all that matters — look direction is also important!
Seeing how similar these patterns are, we next asked how similar the competitors’ predictions are. Did they identify the same buildings and make the same mistakes, or did different algorithms have different success/failure patterns?
Similarity between winning algorithms
We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:
Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means that the algorithms only differed in their ability to identify about 20% of the buildings. The algorithms differed more in the very off-nadir range, but even there only about 30% of buildings were found by some of the competitors but not all of them. Given the substantial difference in computing time needed to train and generate predictions from the different algorithms, we found this notable.
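The counting behind this comparison is simple enough to sketch. The snippet below is not our analysis code; it assumes each competitor's correct identifications are available as a set of building IDs, and the three placeholder competitors and their IDs are invented for illustration.

```python
from collections import Counter


def agreement_histogram(all_buildings, found_by):
    """Count buildings by how many competitors correctly identified them.

    all_buildings: iterable of building IDs.
    found_by: dict mapping competitor name -> set of building IDs it found.
    """
    hist = Counter()
    for building in all_buildings:
        n_found = sum(building in found for found in found_by.values())
        hist[n_found] += 1
    return hist


# Hypothetical data: ten buildings, three placeholder competitors.
all_buildings = range(1, 11)
found_by = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2, 3, 4, 5, 7},
    "C": {1, 2, 3, 4, 5},
}

hist = agreement_histogram(all_buildings, found_by)
unanimous = (hist[0] + hist[len(found_by)]) / len(all_buildings)
print(dict(hist))
print(f"found by all or by none: {unanimous:.0%}")
```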
Another way to explore how similar the algorithms are to one another is to measure the Jaccard similarity between their predictions. For each pair of algorithms, we counted how many buildings both identified, then divided that by the number of buildings that at least one of them found (that is, the IoU of their prediction sets):
Note the scale bar on the right: the lowest Jaccard similarity between any pair of prediction sets was 0.7, and that only occurred between number13 and XD_XD in the very off-nadir angle subset. No two algorithms generated predictions that were less than 80% similar by this metric when considering the entire dataset.
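As a rough sketch of this pairwise comparison, again assuming each competitor's correct identifications are stored as a set of building IDs (the placeholder names and IDs below are hypothetical):

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |A and B| / |A or B|."""
    union = a | b
    # Two empty sets are treated as identical (a convention chosen here).
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(found_by):
    """Jaccard similarity for every pair of competitors."""
    return {
        (p, q): jaccard(found_by[p], found_by[q])
        for p, q in combinations(sorted(found_by), 2)
    }


# Hypothetical sets of correctly identified building IDs.
found_by = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2, 3, 4, 5, 7},
    "C": {1, 2, 3, 4, 5},
}

for pair, score in pairwise_jaccard(found_by).items():
    print(pair, round(score, 3))
```

A score of 1 means two competitors found exactly the same buildings; a score of 0 means they had none in common.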
False positives: where did algorithms fail?
As the algorithms’ correct predictions were very similar, we next asked if the same was true of their false positive predictions — places where algorithms incorrectly predicted buildings in the images. We split the false positives into two sets: all predictions that did not satisfy the IoU threshold of 0.5, and the subset that did not overlap with an actual building at all (an IoU of zero). We then went through every competitor’s false positive predictions, counting how many of them overlapped between pairs of competitors. We used the same Jaccard metric to quantify their similarity:
Though there was more variability in false positives than in correct predictions, we still found these results intriguing: the five different algorithms, comprising different neural net architectures, loss functions, augmentation strategies, and inputs, often generated very similar incorrect predictions. For example, cannab’s and selim_sef’s IoU = 0 false positives (those that did not overlap with an actual building at all) overlapped with one another over 95% of the time! After checking to ensure that these false positives did not correspond to actual buildings that were missed by manual labelers (a few did, but under 10% of the total), we found some interesting examples to show here:
Now that we’ve explored dataset-wide statistics, let’s drill down to what made some buildings easier or harder to find. For this portion, we’ll focus on cannab’s winning algorithm.
Performance vs. building size
The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.
Even the best algorithm performed relatively poorly on small buildings. cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir. This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset. It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.
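A size breakdown like this one can be sketched by binning buildings by footprint area and computing recall per bin. The bin edges below reuse the areas mentioned in the text (20, 40, and 105 square meters) purely for illustration; the actual binning behind the graph is not stated here, and the example buildings are invented.

```python
from bisect import bisect_right


def recall_by_size(buildings, edges):
    """Recall per footprint-area bin.

    buildings: iterable of (area_m2, detected) pairs, detected being a bool.
    edges: ascending lower bin edges; bin i covers [edges[i], edges[i + 1]),
           and the last bin is open-ended.
    Returns one recall value per bin, or None for an empty bin.
    """
    totals = [0] * len(edges)
    hits = [0] * len(edges)
    for area, detected in buildings:
        i = bisect_right(edges, area) - 1
        if i < 0:
            continue  # below the smallest edge, so not scored
        totals[i] += 1
        hits[i] += int(detected)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical buildings: (footprint area in square meters, detected?).
buildings = [
    (15, False), (25, False), (30, True), (35, False), (38, False),
    (50, True), (80, False), (100, True), (120, True), (300, True),
]

print(recall_by_size(buildings, edges=[20, 40, 105]))
```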
For the first time in the history of the SpaceNet Dataset, we included labels identifying buildings occluded by trees. This allows us to explore building detection in the densely treed suburbs surrounding Atlanta (see the image at the beginning of this post). So, how well did the best algorithm do at identifying buildings partially blocked from the satellite by overhanging trees?
cannab’s algorithm showed only a small drop in performance for occluded buildings. This is encouraging: it indicates that algorithms can learn to work around occlusions to find unusually shaped subgroups of a class, still classifying their footprints correctly.
The top five competitors solved this challenge very well, achieving excellent recall with relatively few false positive predictions. Though their neural net architectures varied, their solutions generated strikingly similar predictions, suggesting that advancements in neural net architectures have diminishing returns for building footprint extraction and similar tasks. Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance. Finally, much more can be learned from examining the winning competitors’ code on GitHub and their descriptions of their solutions, and we encourage you to explore their solutions more!
This brings us to the end of the SpaceNet Challenge Round 4: Off-Nadir Building Detection. Thank you for reading, and we hope you learned as much as we did. Follow The DownlinQ on Medium and the authors on Twitter @CosmiQWorks and @NickWeir09 for updates on the next SpaceNet Challenge, coming soon!