Robustness of Limited Training Data: Part 4

Daniel Hogan
Sep 25, 2019 · 7 min read

When it comes to the relationship between geospatial neural network performance and the amount of training data, do geographic differences matter? In a previous post, we examined this question by training the same building footprint model using various amounts of data from four different cities: Las Vegas, Paris, Shanghai, and Khartoum. That led to a plot (Figure 1) of performance for each city, either using a model trained on the city in question or using a model trained on the combined data of all four cities. In this post, we’ll take a closer look at two questions that went unanswered in the previous post. First, what happens if we take a model trained on one city and apply it to a different one it’s never seen before? And second, why does just one of the cities — Khartoum, Sudan — respond to a training data deficit more resiliently than the others?

Figure 1: Average F1 score versus number of training images per city. “Individual model” denotes models trained only on the city for which their F1 score is shown, while “combined model” denotes models trained on all four cities (i.e., with four times as much training data). Reprinted from here.


To assess model transferability, models trained on one city are tested on the others. The training is repeated four times for each training city, using 759 randomly selected tiles each time. Figure 2 shows the resulting performance, as measured by average F1 score.

Figure 2: Mean F1 score for models trained on one city (horizontal axis) and tested on another (vertical axis). Training is done with 759 tiles that are 200m on a side.

Not surprisingly, the best results are achieved when the model is tested on the same city it’s been trained on: none of the off-diagonal terms exceeds those on the diagonal. Beyond that, the matrix is strongly asymmetric in places. For example, a model trained on Khartoum and tested on Vegas does far better (F1 = 0.44) than one trained on Vegas and tested on Khartoum (F1 = 0.07). This illustrates that transferability is not symmetric.
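To put a number on that asymmetry, here is a minimal sketch that uses only the two F1 scores quoted above. The dictionary layout and the `transfer_gap` helper are illustrative names, not part of the original analysis, and the rest of Figure 2 is deliberately left out rather than guessed.

```python
# Cross-city F1 scores keyed by (training city, testing city).
# Only the two values quoted in the text are included.
f1 = {
    ("Khartoum", "Vegas"): 0.44,
    ("Vegas", "Khartoum"): 0.07,
}


def transfer_gap(scores, city_a, city_b):
    """Return F1(train A, test B) minus F1(train B, test A)."""
    return scores[(city_a, city_b)] - scores[(city_b, city_a)]


gap = transfer_gap(f1, "Khartoum", "Vegas")
print(f"Khartoum->Vegas minus Vegas->Khartoum: {gap:.2f}")
```

A gap of zero for every pair would mean a symmetric matrix; a large gap in either direction flags a pair where transfer works much better one way than the other.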

We can also use Figure 2 as a way of understanding which cities are most similar in the appearance of their imagery, at least as concerns traits relevant to building footprint identification. For each pair of cities we assign a similarity score as defined in Figure 3.

Figure 3: Similarity score for two cities, denoted A and B. S is similarity, F is an F1 score, and the subscript on F indicates which city the model is trained on and which city it is tested on.

Then, we can make a “map” by plotting the network of cities in such a way that the more dissimilar cities are pushed further apart, as shown in Figure 4.

Figure 4: Graph of cities, with proximity based on similarity score.
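One way such a layout could be produced is by minimizing "stress," the squared mismatch between plotted distances and target distances. The sketch below is an assumed stand-in, since the post does not say which layout method produced Figure 4. The target distances in `dissimilarity` are hypothetical placeholders, not values derived from Figure 2 or Figure 3.

```python
import math
import random


def stress(pos, target):
    """Sum of squared differences between plotted and target distances."""
    return sum((math.dist(pos[a], pos[b]) - d) ** 2 for (a, b), d in target.items())


def layout(target, steps=500, seed=0):
    """Place nodes in 2D by gradient descent on stress, with backtracking."""
    rng = random.Random(seed)
    nodes = sorted({n for pair in target for n in pair})
    pos = {n: (rng.random(), rng.random()) for n in nodes}
    for _ in range(steps):
        grad = {n: [0.0, 0.0] for n in nodes}
        for (a, b), d in target.items():
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            coeff = 2.0 * (dist - d) / dist
            grad[a][0] += coeff * dx
            grad[a][1] += coeff * dy
            grad[b][0] -= coeff * dx
            grad[b][1] -= coeff * dy
        current = stress(pos, target)
        step = 0.1
        # Backtracking: halve the step until the stress actually decreases.
        for _ in range(30):
            trial = {n: (pos[n][0] - step * grad[n][0],
                         pos[n][1] - step * grad[n][1]) for n in nodes}
            if stress(trial, target) < current:
                pos = trial
                break
            step /= 2
    return pos


# Hypothetical target distances (larger = more dissimilar); placeholders only.
dissimilarity = {
    ("Khartoum", "Paris"): 1.0,
    ("Khartoum", "Shanghai"): 0.6,
    ("Khartoum", "Vegas"): 0.5,
    ("Paris", "Shanghai"): 0.7,
    ("Paris", "Vegas"): 0.8,
    ("Shanghai", "Vegas"): 0.4,
}

positions = layout(dissimilarity)
for city, (x, y) in positions.items():
    print(f"{city}: ({x:.2f}, {y:.2f})")
```

Only the relative distances in the result are meaningful; the orientation and absolute position of the "map" are arbitrary.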

Khartoum and Paris are the two most dissimilar cities; Paris is the most distinct overall. An important point is that this might be due to meaningful features on the ground (e.g., Paris’s abundance of trees or distinctive architecture), but it could just as easily be due to incidental details of this collect (e.g., low light at the time the Paris photo was taken, or the choice to include a larger amount of rural terrain in the area of interest). Using more cities could help elucidate which factors are actually important.

Geography and the Low-Data Falloff

We’ll now switch gears and consider an anomalous feature back in Figure 1. As the amount of training data is decreased, model performance declines along with it. By the time there’s very little data, on the order of 12 tiles per city, performance plunges with even a small reduction in data. Three of the cities seem to fall off almost in unison, but the results for Khartoum in Figure 1 show a somewhat more gradual decline in performance as training data is reduced. This qualitative observation can be seen quantitatively in the parameters of the fitted curves. Those curves (shown as dotted lines in Figure 1) follow a constant minus an inverse power law term. The power law exponents for the six curves from testing on cities other than Khartoum vary from city to city, but their average is 0.48±0.10 (taking their standard deviation to be the uncertainty). The two curves from testing on Khartoum, however, have an average exponent of only 0.20±0.02, reflecting a less rapid change in performance in the low-data regime. Additionally, the uncertainty for the model trained on Khartoum with 12 tiles is notably lower than that of the other individual cities’ 12-tile models.
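For readers who want to try this kind of fit, here is a minimal sketch of fitting "a constant minus an inverse power law," written as score = a - b * N^(-c). The functional form comes from the text; the fitting procedure (scan the exponent, solve the other two parameters by linear regression) is an assumption chosen for simplicity, not necessarily how the curves in Figure 1 were fitted. The data points are synthetic, generated from made-up parameters purely to show that the fit recovers them.

```python
def fit_power_law(n_values, scores, exponents):
    """Fit score = a - b * n**(-c) by scanning c and regressing for a and b."""
    best = None
    for c in exponents:
        x = [n ** (-c) for n in n_values]
        mean_x = sum(x) / len(x)
        mean_y = sum(scores) / len(scores)
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            # slope multiplies n**(-c), so b is its negative
            best = (sse, intercept, -slope, c)
    _, a, b, c = best
    return a, b, c


# Synthetic example: tile counts and parameters are illustrative, not from Figure 1.
tile_counts = [12, 50, 100, 200, 400, 759]
true_a, true_b, true_c = 0.8, 1.0, 0.5
synthetic_scores = [true_a - true_b * n ** (-true_c) for n in tile_counts]

candidate_exponents = [i / 100 for i in range(5, 101)]
a, b, c = fit_power_law(tile_counts, synthetic_scores, candidate_exponents)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}")
```

In this parameterization, the exponent c is the quantity whose averages are compared in the paragraph above.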

In short, Khartoum is more resilient in a data-constrained situation. And since Khartoum is neither the best- nor worst-performing city in this regime, the issue appears not to be a simple matter of overall difficulty. Instead, something specific to Khartoum is going on.

So why is Khartoum different? We can rule out a chance statistical fluctuation. Here, we are rewarded for our patience in running enough trials to generate error bars. Comparing the one-standard-deviation error bars on the curves for Khartoum and Shanghai in Figure 1 shows that it is quite unlikely that their diverging behavior at low training data levels is due to chance alone.

We can also rule out an unusual distribution of building sizes in Khartoum. Figure 5 shows that the city’s distribution of building sizes is not particularly different from those of the other cities. Even the bimodality of Khartoum’s building size distribution is not unique, as Las Vegas shows the same feature.

Figure 5: Normalized histogram of building sizes in each city, as a function of the square root of their area.
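A histogram like the one in Figure 5 can be sketched in a few lines. The version below bins buildings by the square root of their footprint area and normalizes so the bins sum to 1; that normalization, the bin width, and the sample areas are all assumptions for illustration, since the post does not specify them.

```python
import math


def size_histogram(areas_m2, bin_width=5.0, max_size=50.0):
    """Normalized histogram of sqrt(area); the returned fractions sum to 1."""
    n_bins = int(max_size / bin_width)
    counts = [0] * n_bins
    for area in areas_m2:
        size = math.sqrt(area)  # side length of an equal-area square, in meters
        index = min(int(size / bin_width), n_bins - 1)  # clamp overflow to last bin
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical footprint areas in square meters, not taken from the data set.
sample_areas = [25, 100, 100, 225, 400, 900]
histogram = size_histogram(sample_areas)
print([round(fraction, 2) for fraction in histogram])
```

Using the square root of the area puts the horizontal axis in units of length, which makes building sizes easier to compare against the 200m tile size.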

Khartoum ranks neither first nor last in median building size, average count of buildings per tile, or percentage of land covered by buildings. As for the collect, the Khartoum imagery is not unusually dark or bright. It is the most off-nadir, but only barely so (compare Khartoum’s 25.7 degrees off nadir to Shanghai’s 20.5, both too low to see the largest effects of off-nadir angles).

What, then, can explain the difference? A visual inspection shows just how different Khartoum looks compared to the other cities. Figure 6 shows typical tiles from Khartoum and Vegas. We’ll compare these two cities in terms of low-level features likely to play a role in the first layers of the neural net: colors and edges.

Figure 6: Typical tiles from (a) Vegas and (b) Khartoum.

Figure 7 shows the range of colors in eight randomly selected tiles from Vegas and Khartoum. The colors are represented in the HSV color space, with hue (color of the rainbow) on the x-axis and value (brightness) on the y-axis. The saturation (intensity) of each point indicates how frequently pixels with the given combination of hue and value occur in the imagery. Vegas is a colorful place, with recognizable features such as the distinctive light green of backyard swimming pools. In contrast, the color palette of Khartoum is more restrained. (Another look at the role of color using this same data set can be found here.)

Figure 7: Distribution of colors across eight randomly-selected tiles from each city. The saturation of each pixel in the histogram scales logarithmically with how often the corresponding hue+value combination occurs in the tiles. Hues from 100 to 180 have been left off because the imagery includes almost no pixels in this range. Example objects of different colors are indicated in the Vegas plot, but this list is not exhaustive.
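The hue-versus-value histogram can be sketched with Python's standard `colorsys` module. The bin counts, the `hue_value_histogram` name, and the sample pixels are illustrative assumptions, and the logarithmic scaling via `log1p` is one reasonable choice among several, not necessarily the one used for Figure 7.

```python
import colorsys
import math


def hue_value_histogram(pixels, hue_bins=18, value_bins=8):
    """Count pixels per (value, hue) bin, then log-scale the counts to [0, 1]."""
    counts = [[0] * hue_bins for _ in range(value_bins)]
    for r, g, b in pixels:
        # colorsys expects floats in [0, 1] and returns h, s, v in [0, 1]
        h, _s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hue_index = min(int(h * hue_bins), hue_bins - 1)
        value_index = min(int(v * value_bins), value_bins - 1)
        counts[value_index][hue_index] += 1
    peak = max(1, max(max(row) for row in counts))  # guard against empty input
    scaled = [[math.log1p(c) / math.log1p(peak) for c in row] for row in counts]
    return counts, scaled


# Hypothetical 8-bit RGB pixels: three pure red and one cyan.
sample_pixels = [(255, 0, 0)] * 3 + [(0, 255, 255)]
counts, scaled = hue_value_histogram(sample_pixels)
print(counts[7][0], scaled[7][0])
```

The log scaling keeps rare colors visible next to very common ones, which matters when a few hues dominate a scene.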

As for edges, we can get a feel for the distribution of edges in the sample tiles by applying a simple edge detector with image editing software. Figure 8 shows the result. Most edges in this Khartoum tile are associated with buildings, whereas edges are prevalent in both building and non-building parts of the Vegas image. That’s due to Khartoum having less variation in color and also due to other aspects of the scenery, such as the lower variation in ground cover type in Khartoum.

Figure 8: Edges in the tiles of (a) Vegas and (b) Khartoum shown in Figure 6. Images are processed with a difference-of-gaussians filter followed by an increase in brightness and contrast.
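For anyone without image editing software at hand, a difference-of-Gaussians filter is simple enough to sketch in plain Python. The sigma values, the 3-sigma kernel truncation, the border clamping, and the toy step-edge image are all assumptions for illustration. The brightness and contrast adjustment mentioned in the caption is omitted.

```python
import math


def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + j - radius, 0), width - 1)]
             for j, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable blur: rows first, then columns via transposition."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Subtract a wide blur from a narrow blur; flat regions cancel out."""
    narrow = gaussian_blur(image, sigma_small)
    wide = gaussian_blur(image, sigma_large)
    return [[a - b for a, b in zip(row_n, row_w)]
            for row_n, row_w in zip(narrow, wide)]


# Toy grayscale image: dark left half, bright right half (a vertical edge).
image = [[0.0] * 8 + [1.0] * 8 for _ in range(16)]
dog = difference_of_gaussians(image)
print(abs(dog[8][7]) > abs(dog[8][0]))
```

The response is near zero in uniform regions and largest in magnitude next to the step, which is why the filter works as a simple edge detector.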

All of this suggests a hypothesis for why building footprint identification holds up better with little training data in Khartoum. It may be that a rule of thumb that “edges imply buildings” works exceptionally well in Khartoum, and such a rule would require very little training data to learn. The high variety of scenery in a city like Vegas (for which the high variety of colors and edges serves as a simplistic proxy) would permit no such shortcut when training data is scarce. Such variation, despite being a liability with little training data, nevertheless becomes advantageous once there is enough training data for the model to learn it.
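One rough way to probe the "edges imply buildings" idea would be to measure what fraction of detected edge pixels land on labeled buildings. The sketch below is a hypothetical diagnostic, not something done in this post. The `edge_building_precision` name and the toy masks are invented for illustration, and a real test would need thresholded edge maps and rasterized footprint labels.

```python
def edge_building_precision(edge_mask, building_mask):
    """Fraction of edge pixels that coincide with building pixels."""
    edge_total = 0
    edge_on_building = 0
    for edge_row, building_row in zip(edge_mask, building_mask):
        for is_edge, is_building in zip(edge_row, building_row):
            if is_edge:
                edge_total += 1
                edge_on_building += bool(is_building)
    return edge_on_building / edge_total if edge_total else 0.0


# Toy binary masks (1 = edge or building pixel); purely hypothetical scenes.
buildings = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
sparse_scene_edges = [  # every edge sits on a building
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
busy_scene_edges = [  # edges also appear on non-building ground
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

print(edge_building_precision(sparse_scene_edges, buildings))
print(edge_building_precision(busy_scene_edges, buildings))
```

If the hypothesis holds, a city like Khartoum would be expected to score higher on a measure of this kind than a city like Vegas, though that comparison is left untested here.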

While further evidence for this hypothesis could possibly come from hand-engineering an edge-based building footprint detector, actually proving it is a more difficult matter. One route would be to generate synthetic data, building synthetic city views with mixes of attributes to find out which ones lead to reduced performance degradation at the lowest training data levels. Another route would be to investigate the actual features being learned by the models with an explainable AI approach. Short of that, an understanding of this issue, like the issue of transferability discussed above, would benefit from studying a greater number of cities.

Next Step

In this post, we’ve investigated model performance in two especially challenging scenarios. First, we looked at models trained on a different city from where they were being put to the test. The performance matrix was shown not to be symmetric, and a way to visualize similarity was proposed. Second, we looked at models trained on very little data (12 tiles = half a square kilometer) to understand why some face steeper performance declines than others in the low-training-data limit. Comparing Vegas and Khartoum, the latter shows lower variation in hue, as well as a stronger relationship between edges and buildings. This inquiry demonstrated the challenges of uncovering what’s happening under the hood in deep learning.

In the next blog post in this series, we’ll look at the effects of changing a different variable. Instead of changing geographic location, it’s time to head back to Atlanta and try changing the model architecture itself.

The DownLinQ

Welcome to the archived blog of CosmiQ Works, an IQT Lab

Thanks to Adam Van Etten, Jake Shermeyer, and Nick Weir

Daniel Hogan

Written by

Daniel Hogan, PhD, is a data scientist at CosmiQ Works, an IQT Lab.

The DownLinQ

As of March 2021, CosmiQ Works has been folded into IQT Labs. An archive will remain here to showcase historical work from CosmiQ Works that took place July 2016 — March 2021.

