The good and the bad in the SpaceNet Off-Nadir Building Footprint Extraction Challenge

Parsing out the similarities, differences, and limitations for different geospatial image segmentation models

SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet member organizations. To learn more visit https://spacenet.ai.

An image from the SpaceNet Atlanta dataset with building footprints colored by the number of competitors that successfully identified them, from none (red) to some (yellow/green) to all (blue). Imagery courtesy of DigitalGlobe.

This is the final post in our series about the SpaceNet Challenge Round 4: Off-Nadir Building Detection Challenge. For an intro to the dataset, see this post. For a high-level overview of the winners’ algorithms, start here.

As regular readers will know, we have spent a lot of time digging into the prize-winning solutions to the SpaceNet Off-Nadir Building Footprint Extraction Challenge, hosted by TopCoder. In this final post of our series about the challenge, I’ll explore the types of buildings that models identified well and the geographic features that presented a challenge to competitors. There were some striking similarities and differences between the predictions the winning algorithms generated, and analyzing them yielded interesting insights into model deployability and into what to watch for when developing a geospatial deep learning product.

Topics

This post will answer a few questions:

  1. How accurately did the prize-winning algorithms label each building?
  2. How would different intersection-over-union thresholds influence scores?
  3. How did each algorithm perform across different look angles?
  4. Did different algorithms have different false positive or false negative rates?
  5. How similar were the predictions from the different algorithms?
  6. Did building size influence the likelihood that a building would be identified?
  7. How did trees blocking buildings influence building detection?

Read on for the answers to these questions!

How accurately did prize-winning algorithms label each building?

This challenge’s task was to identify building footprints: competitors were asked to generate polygons tracing the boundary of each building. We used an intersection-over-union (IoU) threshold of 0.5 to classify predictions as “good enough” to be counted as correct:

A schematic representation of the Intersection over Union metric. The overlapping area between the manually labeled ground truth (blue) and the predicted building (red) is divided by the combined area covered by both together. For more on the metric, see this explainer post. Imagery courtesy of DigitalGlobe.

The IoU threshold says a lot about what you value in an algorithm’s performance. A low IoU threshold means that you don’t care how much of a building is labeled correctly, only that some part of it is identified. This is closer to object detection methods, which only produce rectangular bounding boxes that mark where target objects exist. By contrast, a high IoU threshold demands that predictions closely trace the actual contours of a building to be deemed correct. For our competitions we use the threshold of 0.5 to strike a balance between these two extremes. It’s important to consider this threshold when evaluating computer vision algorithms for product deployment: how precisely must objects be labeled for the use case?
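To make the metric concrete, here is a minimal sketch of the IoU calculation for axis-aligned boxes. Real footprints are polygons, and libraries such as Shapely compute the same ratio for arbitrary shapes; the function name and box format here are illustrative, not from the challenge’s scoring code.

```python
def box_iou(a, b):
    """IoU for axis-aligned boxes given as (xmin, ymin, xmax, ymax).
    Intersection area divided by the area covered by either box."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# A prediction shifted half a building-width off the ground truth:
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3, misses the 0.5 bar
```

Note how quickly IoU falls off: a prediction covering half of the true footprint, offset by half a width, scores only 0.33 and would be rejected at our threshold.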

We explored how changing this IoU threshold to be larger or smaller than 0.5 would have affected algorithms’ scores and got some intriguing results. The graph below shows what fraction of the buildings each algorithm identified correctly as this threshold changes:

Recall, or the fraction of actual buildings identified by algorithms, depends on the IoU threshold. Some algorithms identified part of many buildings, but not enough to be counted as a successful identification at our threshold of 0.5. The inset shows the range of IoU thresholds where XD_XD (orange)’s algorithm went from among the best of the top five to among the worst of the prize-winners.

The graph above shows how important this threshold is: had the threshold been set lower, MaksimovKA (blue) and XD_XD (orange) would probably have nearly matched the winner, cannab (green), on the leaderboard! In the inset you can see that increasing the threshold from 0.2 to 0.5 substantially reduces their recall relative to the other competitors.

Some competitors identified many more buildings than others, but with such low IoU that the identifications were not classified as positive IDs. For example, see the below graphs comparing XD_XD’s predictions and selim_sef’s predictions:

Building footprint recall is plotted against the IoU threshold for selim_sef and XD_XD, stratified by look angle: nadir (≤25 degrees, blue), off-nadir (26–40 degrees, orange), and very off-nadir (>40 degrees, green). selim_sef achieves relatively similar recall values at IoU thresholds ≤0.5, whereas XD_XD’s performance drops precipitously in this range. At an IoU threshold of 0.5, their recall is more or less identical.

XD_XD’s algorithm correctly identified at least part of many buildings in the very off-nadir subset — about 70% of the actual buildings in the imagery. By contrast, selim_sef’s algorithm found part of only about 57% of the buildings. However, almost 30% of XD_XD’s very off-nadir building predictions were not good enough to satisfy the IoU threshold of 0.5, whereas that was true for only 9% of selim_sef’s very off-nadir predictions. As a result, the two had nearly identical recall scores in the very off-nadir imagery.

How would changing the IoU threshold have influenced each competitor’s recall scores? Let’s check three possible thresholds: 0.25, 0.5, and 0.75:

The fraction of ground truth footprints correctly identified (purple), identified at too low an IoU score (orange), or missed completely (red) at three different IoU thresholds, stratified by look angle. Some competitors showed more dramatic performance drops than others as the threshold increased.

As you can see, each competitor’s correct identification rate (purple) drops as the threshold is increased; however, some drop more than others based on how well each prediction matched its respective building. The orange portion of each bar represents buildings that were identified, but “not well enough” — with an IoU score greater than zero but below the threshold. The red bars represent buildings that were missed completely — an IoU of 0 — a population which, of course, is unaffected by the IoU threshold.
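The threshold sweep behind these bars can be sketched in a few lines. This is an illustrative recomputation, not the competition’s scoring code; `best_ious` is a hypothetical list holding the best IoU each ground-truth building achieved against any prediction (0.0 for buildings missed entirely).

```python
def recall_by_threshold(best_ious, thresholds=(0.25, 0.5, 0.75)):
    """Recall at each IoU threshold: the fraction of ground-truth
    buildings whose best-matching prediction clears the bar."""
    n = len(best_ious)
    return {t: sum(iou >= t for iou in best_ious) / n for t in thresholds}

# Hypothetical per-building best IoUs for one algorithm:
ious = [0.0, 0.1, 0.3, 0.45, 0.55, 0.6, 0.8, 0.9]
print(recall_by_threshold(ious))  # {0.25: 0.75, 0.5: 0.5, 0.75: 0.25}
```

The buildings with an IoU of exactly 0 stay “missed” no matter where the threshold sits, which is why the red portion of the bars above never changes.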

Performance by look angle

Next, let’s examine how each competitor’s algorithm performed at each look angle. We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings rather than false positives), and F1 score, the competition metric, which combines the two:

F1 score, recall, and precision for the top five competitors stratified by look angle. Though F1 scores and recall are relatively tightly packed except in the most off-nadir look angles, precision varied dramatically among competitors.
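For reference, all three metrics reduce to simple ratios of true-positive (TP), false-positive (FP), and false-negative (FN) building counts, where a TP is a prediction with IoU ≥ 0.5 against a ground-truth footprint. The counts below are made up for illustration.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from TP/FP/FN building counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts for one look angle:
print(prf1(tp=800, fp=100, fn=200))
# precision ≈ 0.889, recall = 0.8, F1 ≈ 0.842
```

Because F1 is the harmonic mean of precision and recall, a competitor who filters out false positives aggressively can gain in F1 even without finding a single additional building.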

Unsurprisingly, the competitors had very similar performance in these graphs, consistent with their tight packing at the top of the leaderboard. Most notable is where separation did arise: the competitors were very tightly packed in the “nadir” range (0–25 degrees). Indeed, the only look angles with substantial separation between the top two (cannab and selim_sef) were those greater than 45 degrees. cannab seems to have won on the strength of his algorithm’s performance on very off-nadir imagery!

An interesting takeaway from the bottom two graphs is that competitors had a bigger separation in their precision than in their recall, meaning that there was more variation in false positive rates than false negative rates. With the exception of a substantial drop-off by selim_sef’s algorithm in the most off-nadir imagery, the five competitors’ recall scores were almost identical throughout the range of look angles. By contrast, selim_sef had markedly better precision in the very off-nadir images than any other competitor, though cannab also clearly beat the other three prize-winners in this metric. cannab and selim_sef were the only two competitors to use gradient boosting machines to filter false positives out of their predictions, which likely gave them the upper hand in precision.

One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges. The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings:

Two looks at the same buildings at nearly the same look angle, but from different sides of the city. It’s visually much harder to see buildings in the South-facing imagery, and apparently the same is true for neural nets! Imagery courtesy of DigitalGlobe.

This pattern was even stronger in our baseline model. Look angle isn’t all that matters — look direction is also important!

Seeing how similar these patterns are, we next asked how similar the competitors’ predictions are. Did they identify the same buildings and make the same mistakes, or did different algorithms have different success/failure patterns?

Similarity between winning algorithms

We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:

Histograms showing how many competitors identified each building in the dataset, stratified by look angle subset. The vast majority of buildings were identified by all or none of the top five algorithms — very few were identified by only some of the top five.

Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means the algorithms differed only in their ability to identify the remaining roughly 20% of buildings. The algorithms diverged more in the very off-nadir range, but even there only about 30% of buildings were found by some, but not all, of the competitors. Given the substantial differences in computing time needed to train and generate predictions from the different algorithms, we found this notable.
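A sketch of how such an agreement histogram can be tallied, assuming each competitor’s correct identifications are available as a set of ground-truth building IDs (the IDs and hit sets below are toy data, not the actual challenge results):

```python
from collections import Counter

def agreement_histogram(ground_truth_ids, per_competitor_hits):
    """For each ground-truth building, count how many competitors
    found it, then histogram those counts.
    ground_truth_ids: iterable of all building IDs in the dataset.
    per_competitor_hits: one set of identified IDs per competitor."""
    found_by = Counter()
    for hits in per_competitor_hits:
        found_by.update(hits)
    return Counter(found_by.get(b, 0) for b in ground_truth_ids)

# Toy example: 5 "competitors", 6 buildings.
gt = range(6)
hits = [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}, {0, 1}, {0, 1, 3}]
print(agreement_histogram(gt, hits))  # most buildings found by all or none
```

Passing the full ground-truth ID list matters here: buildings that no competitor found never appear in any hit set, so they can only be counted in the zero bin by iterating over the ground truth itself.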

Another way to explore how similar the algorithms were to one another is to measure the Jaccard similarity between their predictions. For each pair of algorithms, we counted how many buildings both identified, then divided that by the number of buildings that at least one of them found (that is, the IoU of their prediction sets):

The Jaccard similarity between competitors’ prediction sets. Higher scores mean that two competitors predicted more of the same buildings and fewer different ones. Though similarity decreased as look angle increased, it remained very high throughout — the lowest similarity score was 0.7, between XD_XD’s and number13’s algorithms in the very off-nadir images (>40 degrees off-nadir).

Note the scale bar on the right — the lowest Jaccard similarity between any prediction set was 0.7, and that only occurred between number13 and XD_XD in the very off-nadir angle subset. No two algorithms generated predictions that were less than 80% similar by this metric when considering the entire dataset.
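The pairwise comparison can be sketched with Python sets. The building IDs below are hypothetical; in practice each set would hold the ground-truth buildings that a competitor’s predictions successfully matched.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity between two sets of matched building IDs:
    |A intersect B| / |A union B|."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Two hypothetical competitors sharing 6 of 10 identified buildings:
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 9, 10}
print(jaccard(a, b))  # 6 shared / 10 total = 0.6
```

By this metric, even the least similar pair of prize-winning prediction sets (0.7) agreed far more than the toy pair above.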

False positives: where did algorithms fail?

As the algorithms’ correct predictions were very similar, we next asked if the same was true of their false positive predictions — places where algorithms incorrectly predicted buildings in the images. We split the false positives into two sets: all predictions that did not satisfy the IoU threshold of 0.5, and the subset that did not overlap with an actual building at all (an IoU of zero). We then went through every competitor’s false positive predictions, counting how many of them overlapped between pairs of competitors. We used the same Jaccard metric to quantify their similarity:

The Jaccard similarity between false positive predictions — that is, the fraction of false positives that overlapped with false positives from another competitor. The IoU threshold for an overlap between false positives was set to >0 — that is, any overlap between two false positive predictions was counted, with no threshold. The top panel represents all false positive predictions (IoU < 0.5 with buildings in the actual dataset), whereas the bottom panel represents predictions that did not overlap with actual buildings at all.

Though there was more variability in false positives than in correct predictions, we still found these results intriguing: the five different algorithms, built with different neural net architectures, loss functions, augmentation strategies, and inputs, often generated very similar incorrect predictions. For example, cannab and selim_sef’s IoU = 0 false positives (those that did not overlap with an actual building at all) overlapped with one another over 95% of the time! After checking to ensure that these false positives did not correspond to actual buildings that were missed by manual labelers (a few were, but fewer than 10% of the total), we found some interesting examples to show here:

Examples of false positive predictions from cannab (red) and selim_sef (blue). The purple overlaps at the top and bottom right represent predicted buildings where none exist — at the top, a tree between two sheds, and at the bottom right, an especially dark shadow cast by a tree. Note that these false positives were very rare!
Another example of false positives from cannab (red) and selim_sef (blue). They both predicted that the dilapidated foundation in between these two houses was an actual building. Imagery courtesy of DigitalGlobe.

Now that we’ve explored dataset-wide statistics, let’s drill down to what made some buildings easier or harder to find. For this portion, we’ll focus on cannab’s winning algorithm.

Performance vs. building size

The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.

Building recall (left y axis) stratified by building footprint size (x axis). The blue, orange, and green lines show the fraction of building footprints of a given size that were correctly identified. The red line denotes the number of building footprints of that size in the dataset (right y axis).

Even the best algorithm performed relatively poorly on small buildings. cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir. This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset. It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.
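This size stratification can be sketched as below, assuming per-building footprint areas and a boolean detection flag are available. The data are toy values, and the bin edges are illustrative, loosely following the 20, 40, and 105 square-meter breakpoints discussed above.

```python
from collections import defaultdict

def recall_by_size(areas, detected, bins=(20, 40, 105, float("inf"))):
    """Recall stratified by footprint area in square meters.
    areas[i]: building i's footprint area; detected[i]: True if found."""
    hits, totals = defaultdict(int), defaultdict(int)
    for area, hit in zip(areas, detected):
        for lo, hi in zip(bins, bins[1:]):  # consecutive edge pairs
            if lo <= area < hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += hit  # bool counts as 0 or 1
                break
    return {b: hits[b] / totals[b] for b in totals}

# Toy data: small buildings are found less often than large ones.
areas = [25, 30, 50, 60, 120, 150, 200]
detected = [False, False, True, False, True, True, True]
print(recall_by_size(areas, detected))
# {(20, 40): 0.0, (40, 105): 0.5, (105, inf): 1.0}
```

Binning by an object attribute like this, rather than reporting a single dataset-wide recall, is what surfaces failure modes such as the small-building weakness described above.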

Occluded buildings

For the first time in the history of the SpaceNet Dataset, we included labels identifying buildings occluded by trees. This allows us to explore building detection in the densely treed suburbs surrounding Atlanta (see the image at the beginning of this post). So, how well did the best algorithm do at identifying buildings partially blocked from the satellite by overhanging trees?

cannab’s algorithm showed a small but appreciable drop in performance when measuring its ability to segment buildings occluded by trees. X axis, look angle of the image; y axis, recall.

cannab’s algorithm showed only a small drop in performance for occluded buildings. This is encouraging: it indicates that algorithms can learn to work around occlusions, finding unusually shaped members of a class and still delineating their footprints correctly.

Conclusion

The top five competitors solved this challenge very well, achieving excellent recall with relatively few false positives. Though their neural net architectures varied, their solutions generated strikingly similar predictions, suggesting that advancements in neural net architecture have diminishing returns for building footprint extraction and similar tasks. Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance. Finally, much more can be learned from examining the winning competitors’ code on GitHub and their descriptions of their solutions, and we encourage you to explore them further!

What’s next?

This brings us to the end of the SpaceNet Challenge Round 4: Off-Nadir Building Detection series. Thank you for reading, and we hope you learned as much as we did. Follow The DownlinQ on Medium and the authors on Twitter @CosmiQWorks and @NickWeir09 for updates on the next SpaceNet Challenge, coming soon!