SpaceNet 5 Results Deep Dive Part 2: APLS to APLS

Published in The DownLinQ, Jan 23, 2020

Preface: SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e. building footprint & road network detection). SpaceNet is run in collaboration with CosmiQ Works, Maxar Technologies, Intel AI, Amazon Web Services (AWS), Capella Space, Topcoder, and IEEE GRSS.

In this post we dive into performance specifics and APLS metric scores for each of the winning participants in SpaceNet 5. This APLS to APLS comparison of algorithmic predictions allows us both to establish the level of divergence among proposed solutions and to explore which features are predictive of road network prediction performance.

In our previous post we took a detailed look at how geographic diversity affects predictions for the SpaceNet 5 Challenge, which sought to extract road networks with travel time estimates directly from satellite imagery. In that post we found, somewhat surprisingly, that neighborhood-level details are more important to road network extraction than broader city-scale properties like road widths, background color, and lighting conditions. This post extends that geographic analysis to explore the neighborhood-level score variances and what can be deduced from these divergences. We find a surprising amount of variance in individual scenes, and differing behavior among cities.

1. Algorithm and Metric Overview

All of the top 5 algorithms used the same general approach proposed by the baseline algorithm (a multi-class segmentation model followed by mask skeletonization and graph extraction), yet there was significant variation in the segmentation models used and in the post-processing parameters. See Section 2 of our previous post for further details.

Recall that the APLS metric compares two road graphs by aggregating the difference in shortest paths between important nodes (e.g. intersections) in the ground truth and proposal graphs. Therefore small changes in the predicted road network can have an outsized impact on the APLS score, since missing an important intersection will have cascading effects on multiple shortest path predictions. Furthermore, in SpaceNet 5 we used the APLS_time metric, where shortest paths are defined by the shortest travel time (rather than the commonly used shortest geographic distance); the use of shortest travel time as the measure of “shortest path” allows true optimized routing. Even if all the predicted roadways are geographically correct, the use of time adds another level of difficulty, as erroneous speed limit / travel time assignment will adversely affect APLS_time score.
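To make the cascading effect concrete, here is a minimal sketch of an APLS-style score. It is not the official SpaceNet implementation: it assumes both graphs share node labels (the real metric snaps nodes between graphs and also scores in the reverse direction), and the `simplified_apls` function and the `length` / `travel_time` edge attribute names are hypothetical choices made for illustration.

```python
import networkx as nx


def simplified_apls(gt, prop, weight="length"):
    """Simplified, one-directional APLS-style score (illustrative only).

    Assumes ground truth and proposal share node labels, so no snapping
    of nodes between the two graphs is performed.
    """
    # Shortest-path cost between every pair of mutually reachable nodes.
    gt_costs = dict(nx.all_pairs_dijkstra_path_length(gt, weight=weight))
    prop_costs = dict(nx.all_pairs_dijkstra_path_length(prop, weight=weight))

    penalties = []
    for a, targets in gt_costs.items():
        for b, gt_len in targets.items():
            if a == b or gt_len == 0:
                continue
            prop_len = prop_costs.get(a, {}).get(b)
            if prop_len is None:
                # Route does not exist in the proposal: maximum penalty.
                penalties.append(1.0)
            else:
                # Relative path-length difference, capped at 1.
                penalties.append(min(1.0, abs(gt_len - prop_len) / gt_len))
    if not penalties:
        return 0.0
    return 1.0 - sum(penalties) / len(penalties)


# Toy ground truth: a four-node chain with unit length and unit travel time.
gt = nx.Graph()
for u, v in [("A", "B"), ("B", "C"), ("C", "D")]:
    gt.add_edge(u, v, length=1.0, travel_time=1.0)

# Proposal 1: the middle edge is missing, splitting the network in two.
broken = gt.copy()
broken.remove_edge("B", "C")

# Proposal 2: geometry is correct, but the middle travel time is doubled.
slow = gt.copy()
slow["B"]["C"]["travel_time"] = 2.0

print(simplified_apls(gt, broken, weight="length"))
print(simplified_apls(gt, slow, weight="length"))
print(simplified_apls(gt, slow, weight="travel_time"))
```

In this toy chain, dropping one of the three edges breaks eight of the twelve ordered routes and lowers the score to about 0.33. The second proposal scores 1.0 when weighted by length but only about 0.61 when weighted by travel time, even though every road is in the right place.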

2. Aggregate Scores

The top submissions for SpaceNet 5 were all quite close, with the winning score of APLS_time = 0.48 only 0.03 better than the fifth place finish (and the baseline algorithm). Figure 1 displays the APLS performance histograms for both the winning entry of XD_XD and the aggregate of the six algorithms under consideration (the top 5 submissions plus the baseline). Of note in this figure is that the general shape of the histogram for each city (and for ‘All’, which denotes all cities combined) is remarkably similar between the two panels, just like the total APLS scores.

Figure 1. APLS performance histograms. Top: Aggregate scores for all 6 top submissions. Bottom: Winning scores from XD_XD, with a very similar shape.

3. Visual Comparison

While averaged scores are quite close for all 6 algorithms, the scene-level networks tell a somewhat different story. In Figure 2 we display a grid of predictions for various locales for each of the top 5 challenge winners, along with the baseline model. As anticipated, Figure 2 illustrates that seemingly small changes in prediction have a significant impact on APLS_time score.

Figure 2. Predicted road networks for each of the winning participants, with rows in descending order of place (XD_XD placed first). The ground truth network is underlaid in gray, while predicted roadways are colored by inferred speed, with width also proportional to speed.

4. Scene-level Variance

The significant variance in score across testing scenes merits further investigation. In Figure 3 we sort the 842 chips in the private test set by the baseline model's APLS_time score. In the top panel we also plot the scores of the remaining models, all of which show significant scatter about the baseline trend line. Put another way, while XD_XD’s score is 0.03 better than the baseline, this improved score is not due to a 0.03 increase in score among most chips, but rather to a significantly different score on each chip (some better, some worse) that in the aggregate yields a higher score. This scatter is more easily seen in Figure 3 (bottom), where we illustrate the standard deviation of the APLS_time score at each image chip.

Figure 3. Top: APLS_time performance on the 842 final test chips, with all 6 models plotted. The x-axis denotes the sorted rank of each test chip. Bottom: Performance of the baseline model, with the error bars denoting the APLS_time standard deviation of the 6 predictions at each chip.

It is worth diving a bit deeper into the variance at the chip level observed in Figure 3. In Figure 4 we plot the standard deviation of the 6 predictions for each test chip. Evidently, scores in Moscow had a much lower variance than those in San Juan. Across all cities, the average standard deviation of each chip (horizontal line of Figure 4) is 0.08, which is fairly high given that APLS ranges from 0 to 1.
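The chip-level statistics described above can be sketched in a few lines of NumPy. The score values below are made up for illustration and are not SpaceNet 5 results; the plain mean over chips used for the aggregate and the population standard deviation (NumPy's default, `ddof=0`) are also assumptions, since the post does not specify either.

```python
import numpy as np

# Hypothetical APLS_time scores: one row per model, one column per test chip.
scores = np.array([
    [0.60, 0.20, 0.50, 0.40],  # "baseline" (made-up values)
    [0.40, 0.50, 0.45, 0.50],  # "model_a" (made-up values)
    [0.70, 0.10, 0.65, 0.30],  # "model_b" (made-up values)
])

# Aggregate score per model, taken here as a plain mean over chips.
aggregate = scores.mean(axis=1)

# Standard deviation across models at each chip, as plotted in Figure 4.
chip_std = scores.std(axis=0)

# Mean of the chip-level standard deviations (the horizontal line).
mean_chip_std = chip_std.mean()

# Chip order sorted by baseline score, as used for the x-axis of Figure 3.
order = np.argsort(scores[0])

print("aggregate per model:", aggregate)
print("per-chip std (sorted by baseline):", chip_std[order])
print("mean per-chip std:", mean_chip_std)
```

Even in this tiny example the aggregate scores of the three rows differ by less than 0.04, while individual chips differ by as much as 0.3 between models, which is the pattern the figures describe.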

Figure 4. Standard deviation of the 6 APLS_time scores of each test chip for each city. As in Figure 3, the x-axis denotes the sorted rank of each test chip. The horizontal line marks the mean of the scene-level standard deviations.

5. APLS_delta

One final item to investigate is how well travel times are extracted by the top six algorithms. The APLS_length score measures the ability of proposals to produce graphs with correct connections and edges of the correct geometric length. If travel times were perfectly recreated, then the APLS_time score would exactly equal the APLS_length score. In reality, the added complexity of inferring road speeds and travel times adds another source of error, so APLS_time scores are lower than APLS_length scores. Figure 5 illustrates APLS_delta: the aggregated difference between APLS_length and APLS_time. We noted in Figure 4 that Moscow had a much lower variance in score than San Juan; Figure 5 illustrates that travel times are also better inferred in Moscow than in San Juan, given the lower delta between APLS_length and APLS_time. On average, we note a drop of 0.09 in APLS score when inferring road travel time versus geographic distance.
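As a rough sketch of how APLS_delta could be aggregated, the snippet below subtracts per-chip APLS_time from APLS_length and averages the result by city. The score values and the `city_a` / `city_b` labels are invented for illustration, and a simple per-city mean is assumed as the aggregation.

```python
import numpy as np

# Hypothetical per-chip scores for one model (made-up values).
apls_length = np.array([0.70, 0.55, 0.40, 0.62])
apls_time = np.array([0.62, 0.45, 0.25, 0.42])
city = np.array(["city_a", "city_a", "city_b", "city_b"])

# APLS_delta per chip: the score lost by routing on inferred travel time
# instead of geometric length.
apls_delta = apls_length - apls_time

# Mean delta per city, plus the overall mean across all chips.
delta_by_city = {
    name: float(apls_delta[city == name].mean()) for name in np.unique(city)
}
overall_delta = float(apls_delta.mean())

print(delta_by_city)
print(overall_delta)
```

A smaller delta for a city indicates that speed and travel time inference costs less there relative to pure geometry, which is how the Moscow and San Juan comparison above should be read.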

6. Conclusions

In this post we dove into the scene-level scores of the top SpaceNet 5 submissions, making an APLS to APLS comparison on the 842 individual final test chips. Though aggregate APLS_time scores differ by only 0.03 among the winners, we observe a surprisingly high variance in score among individual scenes. Interestingly, even though scores were higher in San Juan than Moscow for most competitors (see our previous post, Figure 5), scene variance is actually lower in Moscow, as is the delta between APLS_length and APLS_time. Evidently, the topography of Moscow is similarly challenging among all scenes, and road speeds are more easily extracted, whereas San Juan has more variance between winning submissions and greater difficulty in extracting travel times.

Inspection of the road networks shows that small changes in predictions have a large impact on APLS scores. While large networks are more robust to minute changes in prediction (see CRESIv2), compact road networks like the ones used in the test set are fragile and chaotic in nature, particularly when considering route travel time. Despite the chaotic nature of small graphs, in a following post we will explore which features (if any) are predictive of APLS score. For now, we will sign off with another grid of road predictions for perusal.