Robustness of Limited Training Data: Part 3

Daniel Hogan
Sep 17

When training a deep neural network to identify building footprints in satellite imagery, having more training data never hurts. But how much does more data help, and when is it worth the cost and difficulty of procuring it? We’ve seen that performance rises rapidly with training data quantity when data is limited, but huge amounts of data give diminishing returns. In this post, we will explore how geography affects that trend, by studying performance versus training data set size with satellite images of four cities from around the world. We’ll see that the overall pattern of diminishing returns holds across the board, but that there are also some city-specific characteristics. In a follow-up post, we’ll look more closely at these city-specific features and see what happens when you train a model on one city and test it on another.

This post is the third in a series on the robustness of model performance with limited training data. The previous two posts looked at imagery collected in just one place (Atlanta), but they’re not prerequisites for what follows.


The SpaceNet 2 Challenge saw the release of electro-optical imagery of four world cities from the WorldView-3 satellite, along with accompanying building footprint labels. We’ll use that data here, allowing for a comparison of the diverse urban environments of Las Vegas, USA; Paris, France; Shanghai, China; and Khartoum, Sudan.

Figure 1: Scenic views of (clockwise from top left) Vegas, Paris, Shanghai, and Khartoum.

The SpaceNet 2 imagery is chipped into tiles of 650 pixels on a side. At a ground sample distance of around 0.3m, that corresponds to about 200m. Weighting the four cities equally, the average tile has 21 buildings.

The deep learning model architecture used here for identifying building footprints is a lightly modified version of a prizewinning model architecture submitted to the SpaceNet 4 Challenge by user “XD_XD.” The modified version, which was also used in the previous data robustness study of Atlanta, uses one neural net instead of an ensemble.

For each city, models are trained with four different quantities of training data. These quantities — 12, 48, 192, and 759 tiles — are separated by multiplicative factors of about four. In addition, a fifth set of models is trained using tiles from all four cities at once. These combined models use the same number of tiles per city as the individual city models, so the combined model with the most data is trained on 3036 tiles, or 759 per city. Most one-city models are trained for approximately 100,000 image impressions and most combined models are trained for approximately 400,000 image impressions. After training, every model is tested against four test datasets, one for each city. This study uses the same evaluation metric as SpaceNet building footprint challenges: an F1 score describing how often predicted building footprints are similar to ground truth building footprints.
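
To make the evaluation metric concrete, here is a minimal sketch of a footprint-matching F1 score. It rests on several assumptions that this post does not state: footprints are simplified to axis-aligned boxes rather than polygons, a prediction counts as a match when its intersection over union (IoU) with an unmatched ground truth footprint is at least 0.5 (the threshold commonly cited for SpaceNet challenges), and matching is greedy. The function names `box_iou` and `footprint_f1` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def footprint_f1(predicted, truth, threshold=0.5):
    """F1 score where a prediction is correct if it matches an unused truth box."""
    unmatched = list(truth)
    true_pos = 0
    for pred in predicted:
        # Greedily pair each prediction with its best remaining ground truth box.
        best = max(unmatched, key=lambda t: box_iou(pred, t), default=None)
        if best is not None and box_iou(pred, best) >= threshold:
            unmatched.remove(best)  # each truth box can be matched only once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy boxes for illustration only: two good matches, one miss on each side.
truth = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
predicted = [(1, 1, 10, 10), (20, 20, 30, 36), (70, 70, 80, 80)]
f1 = footprint_f1(predicted, truth)
print(f"F1 = {f1:.3f}")
```

In this toy case two of three predictions match two of three ground truth boxes, so precision and recall are both 2/3.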

For each of the twenty combinations of geography and training data amount, the training and testing procedure is repeated four times with a different randomly selected subset of the data each time. The reported F1 score is the average of the four trials, and its error bar is half their standard deviation. (It’s a half because the error of a mean falls as the inverse square root of the number of samples, and there are four samples.) This method represents an improvement over the technique used in the earlier Atlanta data robustness study. The averaging makes the data less jittery and the error bars smaller. Having a more precise measure of the average model performance allows us to perceive small geographic differences that might otherwise get lost in the random performance fluctuations that occur each time a model is trained anew.
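
The averaging step can be sketched in a few lines. The trial scores below are made up for illustration and are not results from this study, and the use of the sample standard deviation (with n - 1 in the denominator) is an assumption, since the post does not say which estimator was used.

```python
import math
import statistics


def mean_and_error(scores):
    """Return the mean of the trial scores and the standard error of that mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator); an assumption here.
    spread = statistics.stdev(scores)
    # Standard error of the mean: spread / sqrt(n), which is spread / 2 for n = 4.
    return mean, spread / math.sqrt(n)


# Hypothetical F1 scores from four trials of one scenario.
trial_f1 = [0.60, 0.62, 0.58, 0.64]
mean_f1, error_f1 = mean_and_error(trial_f1)
print(f"F1 = {mean_f1:.3f} +/- {error_f1:.3f}")
```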

Results and Conclusions

The main results from this campaign of training and testing 80 models can be found in Figure 2. The figure shows model performance, measured by average F1 score, versus the amount of training data used. The solid lines with error bars are the actual results, and the accompanying dotted lines are curves fitted to those results. The color of each curve indicates the city where performance is being evaluated (as specified in the figure’s legend). For each city, two scenarios are shown. The lower curve, shown in a darker shade, is the performance of a model trained only on data from that city itself. The higher curve, shown in a lighter shade, is the performance of the “combined model” trained on data from all four cities. Note that the x-axis is the amount of training data per city, so the combined-model results are based on four times as much training data as the individual city results alongside them.

Figure 2: Average F1 score versus number of training images per city. “Individual model” denotes models trained only on the city for which their F1 score is shown, while “combined model” denotes models trained on all four cities (i.e., with four times as much training data).

There are a lot of conclusions to be drawn from this, so let’s give it a closer look!

To start, the plot exhibits the trend noted at the start of this post: When data is limited, performance rises rapidly with increased data, but that rate of growth cannot possibly be sustained once data becomes abundant. The logarithmic x-axis of Figure 2 visually disguises just how extreme the change is. Taking the models trained only on Paris as an example, the slope of a line connecting the two lowest data points is sixty times greater than the slope of a line connecting the two highest ones. The trend can be seen more clearly in Figure 3, which shows the same data with a linear x-axis.

Figure 3: Identical to Figure 2, but with a linear x-axis.

This has a fortunate consequence. One can get most of the performance that would come from labeling a large portion of a city, or of a satellite collect, while using only a small fraction of that training data. For all eight scenarios shown in Figure 2, training on 48 tiles per city instead of 759 gives more than 3/4 of the performance using a mere 1/16 of the data. The total area of 48 tiles is less than two square kilometers.
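
The arithmetic behind those figures is easy to check. This sketch uses the tile size and approximate ground sample distance quoted earlier in the post; the helper name `tiles_area_km2` is hypothetical, and the result is only as precise as the "around 0.3m" figure.

```python
TILE_PIXELS = 650   # tile edge length in pixels (from the post)
GSD_METERS = 0.3    # approximate ground sample distance (from the post)


def tiles_area_km2(n_tiles, tile_pixels=TILE_PIXELS, gsd_m=GSD_METERS):
    """Total ground area of n_tiles square tiles, in square kilometers."""
    edge_m = tile_pixels * gsd_m
    return n_tiles * edge_m ** 2 / 1e6


small, large = 48, 759
print(f"tile edge: {TILE_PIXELS * GSD_METERS:.0f} m")
print(f"{small} tiles cover {tiles_area_km2(small):.2f} km^2")
print(f"{large} tiles cover {tiles_area_km2(large):.2f} km^2")
print(f"data fraction: {small / large:.4f} (1/16 = {1 / 16:.4f})")
```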

The dotted lines in Figure 2 are empirical fits to the data points, and here we see another result from our previous analysis of Atlanta that holds up across a variety of geographies. These curves follow a “learning curve” functional form: a constant minus an inverse power law term, with three free parameters. Having a simple empirical fit like this is useful for interpolation and extrapolation, as shown with the Atlanta data set elsewhere.
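
As a rough illustration of that functional form, the sketch below fits F1(n) = a - b * n^(-c) to four points. Everything specific here is an assumption: the parameter names `a`, `b`, and `c`, the fitting approach (a grid search over `c` with a closed-form least-squares solution for `a` and `b`), and the data, which are synthetic points generated from known parameters rather than results from this study. It is not necessarily how the fits in Figure 2 were produced.

```python
def learning_curve(n, a, b, c):
    """Constant minus an inverse power law: a - b * n**(-c)."""
    return a - b * n ** (-c)


def fit_learning_curve(sizes, scores, c_grid):
    """Least-squares fit of (a, b, c): grid search over c, closed form for a and b."""
    best = None
    y_mean = sum(scores) / len(scores)
    for c in c_grid:
        # For fixed c the model is linear in x = n**(-c): score = a - b * x.
        xs = [n ** (-c) for n in sizes]
        x_mean = sum(xs) / len(xs)
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
        slope = sxy / sxx
        a, b = y_mean - slope * x_mean, -slope
        sse = sum((learning_curve(n, a, b, c) - y) ** 2
                  for n, y in zip(sizes, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1:]


# Synthetic example: generate points from known parameters, then recover them.
TRUE_PARAMS = (0.85, 1.2, 0.5)               # hypothetical a, b, c
sizes = [12, 48, 192, 759]                   # tiles per city, as in the post
scores = [learning_curve(n, *TRUE_PARAMS) for n in sizes]

c_grid = [i / 100 for i in range(1, 201)]    # candidate exponents 0.01 .. 2.00
a, b, c = fit_learning_curve(sizes, scores, c_grid)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.2f}")
print(f"extrapolated score at 3000 tiles: {learning_curve(3000, a, b, c):.3f}")
```

With real, noisy measurements a general-purpose nonlinear least-squares routine would be the more usual choice. The grid search is used here only to keep the example dependency-free.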

Although the different scenarios can all be fit with the same type of function, clearly they are not the same. The most noticeable difference is the wide range of F1 scores among cities given the same amount of training data. When it comes to identifying building footprints, some cities are more challenging than others.

Among these training scenarios, there’s a more subtle trend that also has implications for a real-world use case. Suppose I have a fixed amount of data for each city, and I want to get the highest performance possible. Should I train a separate bespoke model for each city, or should I pool all the data together to train one general-purpose model? In other words, does the benefit from having four times as much data outweigh the challenge of trying to create one generalized model for four cities from around the world? As Figure 2 shows, a model trained on four cities’ worth of data performs no worse than a model trained on the quarter of that data from the specific city where the model is being tested. That means the model architecture’s parameter space is big enough to hold four cities’ worth of learning, so it does not force any trade-off between performance and generality. In fact, it’s better than that: the model trained on the larger combined data set of all four cities consistently gets slightly higher performance than the city-specific models.

Figure 4 gives an example of reading Figure 2 to see this effect in action. If my only goal is to make the best possible model for, say, Paris, and all I have are 12 imagery tiles of Paris and 12 tiles apiece for three other cities, then a model trained on all 48 tiles from the four cities will, on average, outperform a city-specific model trained on the 12 tiles from Paris alone. Of course, if I can increase my Paris training data from 12 to 48 tiles, that will bring about an even greater benefit to the Paris model. But the key conclusion is that increasing training data by incorporating different geographies not only makes a more generalized model; it also makes a model that, in every case tested here, outperforms city-specific models trained on less data.

Figure 4: Detail of Figure 2, showing the change in Paris F1 score from quadrupling the amount of training data. Quadrupling the training data by getting four times more data from the same city (diagonal line) shows the larger improvement, but quadrupling the training data by incorporating equal amounts of data from three other cities (vertical line) still helps some.

Finally, since we’ve been comparing the results of this analysis to a previous analysis of Atlanta, it’s worth seeing the results side by side, as shown in Figure 5. The Atlanta data shows lower and more steeply rising F1 scores for the same number of training images. The cause is not a geographic difference, but rather a difference in how the data sets are structured. Although the Atlanta tiles are larger (~450m on a side instead of ~200m), they are lower-resolution and also geographically redundant: each Atlanta location is shown in 27 different tiles from different angles. As a result, about five times as many Atlanta tiles are needed to cover the same amount of area. This, along with the intrinsic difficulty of lower-resolution and off-nadir imagery, affects the shape of the curves and increases the number of tiles needed to achieve performance comparable to the SpaceNet 2 cities.
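
The factor of five follows from the tile sizes and the number of views. This sketch assumes tiles do not overlap within a single view and uses the approximate edge lengths quoted above; `tiles_per_km2` is a hypothetical helper.

```python
def tiles_per_km2(edge_m, views_per_location=1):
    """Tiles needed to cover one square kilometer, once per view."""
    return views_per_location / (edge_m / 1000) ** 2


spacenet2 = tiles_per_km2(200)                         # one view per location
atlanta = tiles_per_km2(450, views_per_location=27)    # 27 views per location
ratio = atlanta / spacenet2

print(f"SpaceNet 2: {spacenet2:.0f} tiles per km^2")
print(f"Atlanta:    {atlanta:.0f} tiles per km^2")
print(f"ratio:      {ratio:.1f}")
```

Each Atlanta tile covers about five times the area of a SpaceNet 2 tile, but every location appears 27 times, so the net effect is roughly five times as many tiles for the same ground coverage.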

Figure 5: F1 score versus amount of training data for the SpaceNet 4 Off-Nadir Atlanta data set and the SpaceNet 2 data set of world cities.

This discussion of training models on different geographic areas has focused on overall trends: the utility of small amounts of data, a simple empirical fitting function, and the value of combining training data to build a global model. In the next post in this series, we’ll take a closer look at two specific questions about the variations between cities: First, what happens if we train a model on one city and test it on a completely different city? Second, why does the performance seem to differ among cities when using the very lowest amounts of data? Stay tuned.

Welcome to the official blog of CosmiQ Works, an IQT Lab dedicated to exploring the rapid advances delivered by artificial intelligence and geospatial startups, industry, academia, and the open source community.

Thanks to Adam Van Etten and Jake Shermeyer

The DownLinQ