SpaceNet 5 Results Deep Dive Part 1 — Geographic Diversity

Adam Van Etten
Dec 18, 2019 · 7 min read

Preface: SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e. building footprint & road network detection). SpaceNet is run in collaboration with CosmiQ Works, Maxar Technologies, Intel AI, Amazon Web Services (AWS), Capella Space, Topcoder, and IEEE GRSS.

In previous posts [1, 2, 3, 4] we discussed at length both the utility and challenges of the focus of SpaceNet 5: extracting road networks with travel time estimates directly from satellite imagery. In this post we outline the approaches of the winners, and discuss one key feature of SpaceNet 5: geographic diversity. For the initial public baseline of SpaceNet 5, we scored contestants on regions in Moscow, Russia, Mumbai, India, and San Juan, Puerto Rico. In an attempt to encourage contestants to create algorithms that generalize to new locales, the final standings were scored on different regions of those three cities, as well as on a final “mystery” holdback city: Dar Es Salaam, Tanzania. In the following sections we explore the robustness of models to unseen geographies, and find that for the SpaceNet dataset, neighborhood-level differences have a greater impact on model performance than inter-city variations.

1. SpaceNet 5 Cities and Road Properties

For training purposes we utilize both the SpaceNet 3 and SpaceNet 5 data, as described here. These six cities provide 8,900 km of training data and over 90,000 labeled roadways, split over 4,900 image chips of roughly 400 x 400 m each. Each hand-labeled roadway carries metadata such as road type (highway, residential, etc.), number of lanes, and surface type (paved or unpaved). We use these features to infer road travel speed via APLS functionality (this post provides an example of speed limit inference). See Figures 1, 2, and 3 for dataset details.
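The conversion from road metadata to travel speed can be as simple as a lookup keyed on road type and surface. The sketch below is illustrative only: the speed values and the function name are our own assumptions, not the actual conversion table used to build the SpaceNet speed labels.

```python
# Illustrative sketch: derive an estimated travel speed from SpaceNet road
# metadata. The speed values below are assumptions for illustration only,
# not the actual table used to generate the SpaceNet speed labels.
ROAD_TYPE_SPEED_MPH = {"motorway": 65, "primary": 45, "residential": 25}

def infer_speed_mph(road_type: str, paved: bool, num_lanes: int) -> float:
    """Return an estimated travel speed (mph) for one labeled road segment."""
    speed = ROAD_TYPE_SPEED_MPH.get(road_type, 20)  # default for minor roads
    if not paved:
        speed = min(speed, 15)                      # unpaved roads are slower
    if num_lanes >= 3:
        speed += 5                                  # wide roads tend to be faster
    return float(speed)
```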

Figure 1. SpaceNet roads training dataset size.

SpaceNet 5 utilized two different testing corpora. The test set used for the public leaderboard on Topcoder contained three cities and a total of ~1,000 km of hand-labeled roads. Final results were computed from a separate private test set with ~1,500 km of labeled roads; this test set was curated from distinct regions of the three initial test cities (Moscow, Mumbai, San Juan), plus the “mystery” city (Dar Es Salaam), whose existence was known to competitors, though of course its location was not revealed until after the challenge concluded.

2. SpaceNet 5 Algorithmic Approaches

All five of the top submissions followed the approach of CosmiQ’s baseline CRESI algorithm. CRESI casts the ground truth GeoJSON labels into multi-channel masks, with each channel corresponding to a unique speed range. The next step is to train a segmentation model with a ResNet34 + U-Net backbone (also utilized in albu’s winning SpaceNet 3 submission). We then refine and clean the mask, skeletonize the refined mask, extract a graph structure from the skeleton, infer roadway speeds and travel times for each graph edge, and perform some final post-processing to remove spurious edges and complete missing connections.
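To make the skeletonization and graph-extraction steps concrete, here is a minimal sketch of turning a binary road mask into a routable graph. It assumes the scikit-image and sknw packages (the latter builds a NetworkX graph from a skeleton); the speed-channel handling and post-processing are omitted, and the helper name is our own.

```python
# Minimal sketch of the mask -> skeleton -> graph steps of a CRESI-style
# pipeline. Assumes scikit-image and sknw; speed inference and post-processing
# are omitted for brevity.
import numpy as np
from skimage.morphology import skeletonize
import sknw  # builds a NetworkX graph from a skeletonized mask

def mask_to_graph(road_mask: np.ndarray, meters_per_pixel: float = 0.3):
    """Convert a binary road mask into a graph with edge lengths in meters."""
    skeleton = skeletonize(road_mask > 0)      # 1-pixel-wide road centerlines
    graph = sknw.build_sknw(skeleton)          # nodes = junctions / endpoints
    for _, _, edge in graph.edges(data=True):
        # sknw stores the pixel coordinates of each edge in edge["pts"]
        edge["length_m"] = len(edge["pts"]) * meters_per_pixel
    return graph
```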

The algorithms submitted to SpaceNet 5 all used this same general approach, though with significant differences in the segmentation models and post-processing parameters. See Table 1 for details of the segmentation models used by competitors. A subsequent post will dive deeper into the speed/performance trade-offs of the various approaches.

Table 1. SpaceNet 5 algorithm details.

3. SpaceNet 5 Performance

Let’s take a look at how competitors performed on the challenge. Figure 4 shows performance on the public test set of image chips, which were distributed to competitors without attendant labels. Note that scores in San Juan were consistently higher than in Moscow or Mumbai. Scores of APLS_time ~ 0.50 imply that while the extracted road networks certainly are not perfect, they are generally still routable, with an expected error in arrival time of ~50% (inspection of Figures 7, 8, 10, and 11 may help elucidate this point).
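For intuition on what an APLS_time of ~0.50 means, recall that APLS compares shortest-path costs (here, travel times) between corresponding node pairs in the ground-truth and proposal graphs, capping each per-path contribution at 1. Below is a minimal sketch of that per-path term; graph matching and path extraction are omitted, and the function name is our own.

```python
# Sketch of the per-path term inside the APLS metric: the normalized
# difference between ground-truth and proposal path costs (travel times for
# APLS_time), capped at 1. A missing route contributes the maximum penalty.
from typing import Optional

def apls_path_term(gt_time: float, prop_time: Optional[float]) -> float:
    """Return min(1, |T_gt - T_prop| / T_gt); a missing route scores 1."""
    if prop_time is None:            # route absent from the proposal graph
        return 1.0
    return min(1.0, abs(gt_time - prop_time) / gt_time)

# APLS is then 1 - mean(path terms), symmetrized over both graphs, so a score
# of ~0.5 corresponds to path-time discrepancies on the order of 50%.
```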

Figure 4. Competitor scores on the public test set.

Figure 5 displays the final results on the private test set, which was not made available to competitors; results therefore could not be optimized for this imagery.

Figure 5. Final results for SpaceNet 5.

Let’s dive into these results in a bit more depth.

3A. High variance in APLS scores

Even within cities, there is a large variance in score between image chips. This is evidenced by Figure 6, which shows the histogram of APLS_time scores for the winning algorithm. The reasons for this variance are myriad and will be explored in further detail in upcoming blogs. Certainly dirt roads and overhanging trees complicate road extraction, as illustrated by Figures 7 and 8.

3B. Performance drops in private test, even within the same city

The drop in aggregate score between the public and private datasets surprised many contestants, and is illustrated in Figure 9.

Figure 9. Difference in APLS_time score between the public test set and the private test set.

What Figure 9 indicates is that even in cities with training data present, it is easy to overtrain a model to a specific test region. Overtraining explains at least part of why some of the leading competitors on the public leaderboard dropped out of the top 5 on the private test set (compare Tables 1 and 2 in our previous blog). When the trained models were tested on a different region of the exact same city, performance dropped for all competitors and all cities (see Figure 10).

3C. Comparable scores on the “mystery” city

Across all 5 of the winning algorithms and the baseline model, we observe no significant decrease in performance on the “mystery” city of Dar Es Salaam compared to previously seen cities.

In Figure 12 we plot the APLS_time score on the private test set for the aggregate of the three training cities, along with the “mystery” city. We actually observe a slight increase in performance for Dar Es Salaam versus the training cities.

Figure 12. APLS_time performance on the private test set. Error bars denote the standard deviation of mean scores for each competitor.

Another way to view the data contained in Figure 12 is to inspect the average drop in score between the public and private test sets for each test city, as well as for the mean of the three cities present in both sets (Moscow, Mumbai, and San Juan). We compare these drops to the difference between the Dar Es Salaam score and the mean score on the public test set (see Figure 13). Note that the unseen city drops by less than the three-city mean.

Figure 13. APLS_time performance drop across cities between the public and private test sets, averaged across the top 5 competitors and the baseline. Error bars denote the standard deviation of mean scores for each competitor. The Dar Es Salaam bar is the drop in performance from the public test mean of all cities.

Figures 12 and 13 illustrate that, when averaging scores across the hundreds of testing chips in each city for multiple competitors, we see no significant reduction in performance when applying trained models to an unseen locale. In fact, we observe a (statistically insignificant) slight increase in performance.

Conclusions

In this post we dove into the variation in performance by geography of the best SpaceNet 5 algorithms. While all competitors used an approach similar to the baseline CRESI model (i.e. multi-class segmentation + skeletonization + graph extraction + post-processing + speed inference), there is significant variation in the segmentation models and post-processing techniques used. Nevertheless, we observe similar trends across all models: a large drop in score between the public and private test regions (especially in San Juan), high intra-city variability, and comparable aggregate performance on the mystery city of Dar Es Salaam. So what does this all mean?

As intended, the SpaceNet 5 challenge structure did indeed produce generalizable models that can be applied to unseen geographies with reasonable performance.

We can also conclude that, with the training set provided by SpaceNet (6 cities on 4 continents, ~9,000 km of roads, and >90,000 individual labeled road segments), intra-city variations in performance are larger than inter-city variations. This is evidenced by the fact that, for all six tested algorithms, aggregate performance on the unseen city of Dar Es Salaam is consistent with performance on the training cities. The variation within each city is high, implying that neighborhood-level details matter more for road network extraction than broader city-scale specifics such as road widths, background color, and lighting conditions.

Stay tuned for upcoming posts where we explore the neighborhood-level features that are predictive of road network model performance, as well as a look at speed/performance tradeoffs.

Thanks to Jake Shermeyer and Nick Weir
