Robustness of Limited Training Data: Part 2

Daniel Hogan
Jul 19 · 7 min read

In a previous blog post, we asked how geospatial deep learning performance varies with the amount of training data, and we trained a building footprint detector with different amounts of data to see that variation in action. If you haven’t read that post, go back and take a look now, because next we’re going to dive into the details of how it was done and what it all means.

The model’s performance as a function of the amount of training data is shown in Figure 1 (identical to Figure 2 in the previous post). Performance is measured by F1 score, counting a building footprint as detected when its IoU exceeds a threshold of 0.5. The number of training images is the number of 450 m by 450 m tiles used from a 0.5 m GSD dataset with 27 tiles (at different viewing angles) for each physical location.
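To make the metric concrete, here is a minimal sketch of an F1 score at an IoU threshold. It assumes the IoUs between predicted and ground-truth footprints have already been computed, and it uses a simple greedy one-to-one matching; the function name and the matching order are illustrative assumptions, not the scoring code actually used for these results.

```python
def f1_at_iou_threshold(iou, n_true, threshold=0.5):
    """F1 score from IoUs between predicted and ground-truth footprints.

    iou[i][j] is the IoU between predicted footprint i and ground-truth
    footprint j; n_true is the number of ground-truth footprints.
    Each ground-truth footprint may be matched at most once (assumed rule).
    """
    matched_truth = set()
    true_pos = 0
    for row in iou:
        # Find the best ground-truth footprint that is still unmatched.
        best_j, best_iou = None, 0.0
        for j, value in enumerate(row):
            if j not in matched_truth and value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None and best_iou >= threshold:
            matched_truth.add(best_j)
            true_pos += 1
    false_pos = len(iou) - true_pos   # proposals with no match
    false_neg = n_true - true_pos     # buildings that were missed
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Made-up IoUs for three proposals and two real buildings.
example_iou = [[0.8, 0.1],
               [0.3, 0.4],
               [0.0, 0.6]]
example_f1 = f1_at_iou_threshold(example_iou, n_true=2)
```

In this made-up example two proposals find a building and one does not, so precision is 2/3, recall is 1, and F1 is 0.8.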

Figure 1: Model performance, as measured by F1 score, versus number of images used for training, excluding validation. These images include 27 views of each unique location. Dotted lines are fitted curves. The x-axis is amount of training data, NOT training time.

As seen in Figure 1, the rapid rise of performance with training data when data is scarce contrasts with the diminishing improvements as training data increases further. This means that even small data sets can provide fairly good results. But to better understand what’s happening, we need to look more closely, starting with the question of how long it takes to train the model with different amounts of data.

Training Time

Figure 2 shows overall performance versus training time as each model is trained. Using the full data set, it takes roughly 60 epochs for the model to converge on its optimum performance. Not surprisingly, the process is faster when there’s less data to train on. When the training images only number in the hundreds, the model reaches its maximum performance in less time than it takes to run through five epochs with the full data set.

Figure 2: Overall F1 score vs. training time. Training time is reported in terms of image impressions as well as approximate training time on an Nvidia Titan Xp GPU.

As shown in Figure 2, the cases with the six largest amounts of training data were each trained for eight GPU-days on Titan Xp GPUs, while the cases with the four smallest amounts of training data were trained for less time to save computation. Figure 1 reflects the state of each case after either the maximum training time shown in Figure 2 or the equivalent of about 60 epochs with the full data set, whichever is less.
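The bookkeeping behind "image impressions" can be sketched in a few lines. This assumes that the full training set is the 21,546-image case and that one impression means one image shown to the model once; the function names are hypothetical.

```python
def image_impressions(n_images, epochs):
    """Total number of times any training image is shown to the model."""
    return n_images * epochs


def epochs_for_budget(n_images, impressions):
    """Epochs a data set of n_images needs to reach a given impression count."""
    return impressions / n_images


def reported_impressions(max_trained, cap):
    """Point at which a model is evaluated: the lesser of the two limits."""
    return min(max_trained, cap)


FULL_SET = 21546  # assumed size of the largest training set
cap = image_impressions(FULL_SET, epochs=60)
epochs_small = epochs_for_budget(648, cap)
```

Under these assumptions the cap is 1,292,760 impressions, which a 648-image training set would only reach after 1,995 passes through its data.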

Error Estimation

Simply quoting a machine learning model’s performance metric, without also estimating the uncertainty of the result, is of limited value. Without error bars, it is unknown whether a model with a somewhat higher performance score is significantly better than one with a lower score. Indeed, without error bars it isn’t even clear whether retraining the same model will lead to similar results.

Figure 3 shows the same plots as Figure 1, but with a logarithmic x-axis to show the low-training-data cases more clearly. Error bars were explicitly determined for the cases of 27, 162, 648, and 21,546 images, and logarithmically interpolated for all other points. For each explicitly determined error bar, the model was retrained four times with the same amount of data, and the standard deviation of the four F1 scores was computed. For the two high-training-data cases, bootstrap resampling was used to approximate the effect of sampling from a larger underlying population. Because each error bar is computed from only four trials, the error of the error is itself high.
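The error-bar procedure can be sketched as follows. Two choices here are assumptions rather than details given above: the standard deviation uses the sample (n − 1) form, and "logarithmically interpolated" is read as linear interpolation of the error bar against the logarithm of the number of images. All numbers in the example are placeholders, not measured values.

```python
import math
import statistics


def error_bar(f1_scores):
    """Standard deviation of F1 scores from repeated training runs."""
    return statistics.stdev(f1_scores)  # sample form (n - 1), an assumption


def interpolate_error_bar(n_images, measured):
    """Interpolate an error bar linearly in log(number of images).

    measured maps a number of images to its explicitly determined error bar.
    Values outside the measured range are clamped to the nearest endpoint.
    """
    xs = sorted(measured)
    if n_images <= xs[0]:
        return measured[xs[0]]
    if n_images >= xs[-1]:
        return measured[xs[-1]]
    for lo, hi in zip(xs, xs[1:]):
        if lo <= n_images <= hi:
            t = (math.log(n_images) - math.log(lo)) / (math.log(hi) - math.log(lo))
            return measured[lo] + t * (measured[hi] - measured[lo])


# Placeholder scores from four hypothetical retrainings.
spread = error_bar([0.1, 0.2, 0.3, 0.4])
# Placeholder error bars at two hypothetical data set sizes.
midway = interpolate_error_bar(100, {10: 0.10, 1000: 0.02})
```

With only four runs per point, the standard deviation is itself a noisy estimate, which is the "error of the error" mentioned above.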

Figure 3: Identical to Figure 1, but with a logarithmic x-axis.

This careful analysis of errors, although computationally expensive, provides further context for the results. In particular, it shows that models trained with abundant data are consistent in their performance, while models trained with little data can vary widely in performance. When training on 21,546 images covering 798 locations (27 views of each), the overall F1 score varies by only a few hundredths. Training on 27 images of a single location, however, produces F1 scores ranging from near zero to more than 0.2 depending on the location.

Curve Fitting and Extrapolation

It is useful to fit curves to the results in Figure 3, both for better understanding of the results and to enable extrapolation to other amounts of training data. Several functional forms were tried, including a logarithmic curve (which would make a straight line on Figure 3 because of the logarithmic x-axis) as well as a constant minus an exponentially-decaying term. However, a much better concordance with the data is achieved with a constant minus an inverse power law term, the form used to generate the fits actually shown in Figure 3 (and Figure 1). We’ve already seen this expression — it’s the same function that’s been shown to be a good fit for learning curves in classification problems.

$y = a - b\,x^{-c}$

A constant minus an inverse power law. The parameters a, b, and c are all positive. Here, the input x is the number of training images, and the output y is the predicted F1 score.
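A fit of this kind can be sketched without any external libraries. For a fixed exponent c, the model y = a − b·x^(−c) is linear in a and b, so this sketch scans a grid of c values and solves a weighted least-squares line for each. The grid, the function name, and the synthetic data are all assumptions for illustration; a general-purpose optimizer such as SciPy's `curve_fit` would be the more usual tool, and this is not necessarily how the fits in Figure 3 were produced.

```python
def fit_inverse_power_law(xs, ys, sigmas=None, c_grid=None):
    """Fit y = a - b * x**(-c) by scanning c and solving for a, b.

    Returns (a, b, c, chi2) for the grid value of c with the smallest
    error-bar-weighted sum of squared residuals.
    """
    if sigmas is None:
        sigmas = [1.0] * len(xs)
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # c = 0.01 ... 3.00
    weights = [1.0 / s ** 2 for s in sigmas]
    w_sum = sum(weights)
    best = None
    for c in c_grid:
        us = [x ** -c for x in xs]  # the model is linear in u = x**(-c)
        u_bar = sum(w * u for w, u in zip(weights, us)) / w_sum
        y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
        s_uu = sum(w * (u - u_bar) ** 2 for w, u in zip(weights, us))
        s_uy = sum(w * (u - u_bar) * (y - y_bar)
                   for w, u, y in zip(weights, us, ys))
        slope = s_uy / s_uu
        a = y_bar - slope * u_bar
        b = -slope
        chi2 = sum(w * (y - (a - b * u)) ** 2
                   for w, u, y in zip(weights, us, ys))
        if best is None or chi2 < best[3]:
            best = (a, b, c, chi2)
    return best


# Synthetic, noise-free data generated from known parameters (not real results).
xs = [27, 54, 162, 648, 2700, 21546]
ys = [0.9 - 1.5 * x ** -0.5 for x in xs]
a_fit, b_fit, c_fit, chi2_fit = fit_inverse_power_law(xs, ys)
```

The sketch does not enforce positivity of a, b, and c; with real, noisy data it would be worth checking the fitted values and refining c beyond the grid resolution.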

With fitted functions in hand, we can make some extrapolations. If this simple fit holds to arbitrarily large training data set sizes, it would imply the existence of an asymptotic maximum F1 score as the amount of data is increased. The maximum (if infinite data existed) would be 0.87 ± 0.04 for nadir and a statistically indistinguishable 0.90 ± 0.06 for off-nadir. (The far-off-nadir value cannot be measured to reasonable accuracy without training many more models to reduce statistical uncertainty.) It should be emphasized that these scores are significantly higher than what was actually observed, and it is an open question whether such an extreme extrapolation is valid.

Even if the simple function doesn’t hold to infinitely large amounts of training data, we stand on firmer ground in asking what would happen if we’d simply had twice as much data as was actually available. The answer there is straightforward: all the effort of doubling the data set size is predicted to produce only a modest 3% improvement in overall F1 score. The relative slopes of the curves indicate that the far-off-nadir score would see the greatest benefit from additional data, with off-nadir and nadir showing smaller gains.
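Both extrapolations follow directly from the fitted form, a constant minus an inverse power law with positive parameters a, b, and c. As a short worked derivation:

```latex
y(x) = a - b\,x^{-c}, \qquad a, b, c > 0

% Asymptotic maximum as the amount of data grows without bound
\lim_{x \to \infty} y(x) = a

% Gain from doubling the amount of training data
y(2x) - y(x) = b\,x^{-c} - b\,(2x)^{-c} = b\,x^{-c}\left(1 - 2^{-c}\right)
```

The gain from doubling is always positive but shrinks as x grows, which is the diminishing-returns behavior described above.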

As a technical aside, we can check whether the fluctuations of the points in Figure 3 about their respective fitted curves are consistent with the given error bars on those points. We numerically calculate the effective degrees of freedom for each regression and assume the error-bar-normalized residuals follow the corresponding chi-squared distribution. The resulting p-values, which are respectively 0.36, 0.16, 0.91, and 0.25 for nadir, off-nadir, far-off-nadir, and overall, are quite plausible for a correct fit.
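The consistency check can be sketched as a standard chi-squared test. The sketch takes the degrees of freedom as an input (it may be non-integer) rather than reproducing the numerical calculation of effective degrees of freedom described above, and it evaluates the chi-squared tail probability with a series for the incomplete gamma function so that only the standard library is needed. Function names are hypothetical.

```python
import math


def chi2_survival(chi2, dof):
    """P(X >= chi2) for a chi-squared variable with dof degrees of freedom."""
    if chi2 <= 0:
        return 1.0
    s, x = dof / 2.0, chi2 / 2.0
    # Series for the regularized lower incomplete gamma function P(s, x).
    term = 1.0 / s
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= x / (s + n)
        total += term
        n += 1
    lower = total * math.exp(-x + s * math.log(x) - math.lgamma(s))
    return 1.0 - lower


def goodness_of_fit(ys, fitted, sigmas, dof):
    """Chi-squared statistic and p-value for residuals scaled by error bars."""
    chi2 = sum(((y - f) / s) ** 2 for y, f, s in zip(ys, fitted, sigmas))
    return chi2, chi2_survival(chi2, dof)


# Placeholder values: three points, each one error bar away from its curve.
chi2_value, p_value = goodness_of_fit(
    ys=[0.50, 0.60, 0.70], fitted=[0.52, 0.58, 0.72],
    sigmas=[0.02, 0.02, 0.02], dof=2)
```

A very small p-value would suggest that the curve or the error bars are wrong; a p-value very close to 1 would suggest the error bars are overestimated.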

A Look at Ensembling

A standard technique to increase deep learning model performance is to replace a single neural net with an ensemble of neural nets, which can be helpful even if all the neural nets in the ensemble have the same architecture. We can ask whether the amount of training data influences the efficacy of this technique.

XD_XD’s original code uses an ensemble, where each model in the ensemble is trained with a different quarter of the original data being held out for validation. That makes it difficult to isolate the improvement due specifically to ensembling. After all, if the ensemble outperforms one of its constituent models, is that because of the intrinsic benefits of the ensemble, or because the one model in isolation has access to 25% less data?

To measure the effect of ensembling directly, we instead train each model in the ensemble on the same training data set, withholding no data for validation (which we can forgo here because appropriate training times were already found above). This was done in a “low-data” case with 216 images covering 8 locations, as well as a “high-data” case with all 28,728 images covering 1,064 locations.

In the low-data case, an ensemble of four models performs a full 10% better than the average performance of its constituent models. In the high-data case, however, the improvement is only 3%. This suggests that ensembling is more effective when training data is limited and provides less benefit when training data is abundant.
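As a rough illustration of the comparison, the sketch below averages per-pixel building probabilities from several models and expresses the ensemble's score relative to the mean score of its members. Both choices are assumptions: the way the ensemble combines its members is not spelled out here, and the percentages above could be read as relative or absolute differences. All numbers are placeholders.

```python
def ensemble_average(prob_maps):
    """Average per-pixel probabilities from several models.

    prob_maps is a list of equally sized 2-D lists, one per model.
    """
    n_models = len(prob_maps)
    n_rows = len(prob_maps[0])
    n_cols = len(prob_maps[0][0])
    return [[sum(m[r][c] for m in prob_maps) / n_models
             for c in range(n_cols)]
            for r in range(n_rows)]


def relative_gain(ensemble_f1, member_f1s):
    """Ensemble F1 relative to the mean F1 of its constituent models."""
    mean_f1 = sum(member_f1s) / len(member_f1s)
    return (ensemble_f1 - mean_f1) / mean_f1


# Two hypothetical 1x2 probability maps.
averaged = ensemble_average([[[0.2, 0.8]], [[0.4, 0.6]]])
# Placeholder scores for a four-model ensemble.
gain = relative_gain(0.55, [0.5, 0.5, 0.5, 0.5])
```

With these placeholder scores the relative gain is 10%; the averaged probability map would still need thresholding and polygon extraction before any F1 score could be computed.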


Conclusions

We have shown that the same functional form used to fit learning curves is also a good fit to plots of F1 vs. data set size for this specific combination of problem, model architecture, data set, and evaluation metric. We conjecture that it would be a good fit for many other segmentation problems using this metric as well.

The takeaways from this analysis are of two kinds: general methods for studying training data set size dependence, and specific results for building footprint detection.

As for general methods, the key takeaway is that being thoughtful about how much training data one needs is an important part of any well-designed deep learning project. The approach of fixing model architecture and varying training data can provide insight into what can, and can’t, be gained from having more data. Although the needs of each project differ, the ideal is to aim for the “sweet spot”: enough data to ensure consistent model performance, but not so much as to pay a high price for diminishing returns. The methods for error estimation and curve fitting applied here are helpful tools for finding that sweet spot.

For building footprints specifically, the key takeaway is the high return from even limited training data. It doesn’t take “millions of images” to train a deep learning geospatial model: fewer than a thousand tiles of about a fifth of a square kilometer each, showing different views of the same few square kilometers, are enough to get two-thirds of the performance of a training data set that’s more than thirty times larger. The same feature of F1-vs-data-amount curves that creates diminishing returns from increasing the training data also makes model performance surprisingly robust when the amount of data is low.

Still, questions remain about how generalizable all of this is, and whether it carries over to other geographic regions or model architectures. Those questions will be taken up in future blog posts in this series.

The DownLinQ

Welcome to the official blog of CosmiQ Works, an IQT Lab dedicated to exploring the rapid advances delivered by commercial aerospace startups and the open source community

Thanks to Adam Van Etten

Written by Daniel Hogan

Daniel Hogan, PhD, is a data scientist at CosmiQ Works, an IQT Lab.
