Robustness of Limited Training Data: Part 5

Daniel Hogan
Oct 16, 2019

Deep learning models for interpreting satellite imagery perform better as the amount of training data increases. We’ve looked at the details of how this plays out, but our previous studies of the topic all shared one weakness: they used a single model architecture, meaning the layout of artificial neurons and other details of the algorithm were the same each time. In this post, we’ll recreate our first analysis with an entirely different model architecture, to see what changes and what stays the same.

This is the last post in the Robustness of Limited Training Data series, so we’ll close with some overall conclusions. You can read all the previous posts here: part 1, part 2, part 3, part 4. However, they are not prerequisites for what follows.

Two Model Architectures

To explore how model performance depends on the amount of training data, we train a model to identify building footprints in satellite imagery. Our imagery comes from the SpaceNet 4 data set, which contains 27 views of Atlanta taken with Maxar’s WorldView-2 satellite. The imagery, with a ground sample distance of about half a meter, is chipped into tiles about 450m on a side. The model architecture is a lightly-modified version of the 5th-place prize-winning SpaceNet 4 model submitted by challenge participant XD_XD. This model features fast inference, and the version used here replaces the original ensemble of three neural networks with a single neural network to speed up training. For brevity, we will refer to this model architecture as “architecture A.”

Performance is evaluated with an object-wise F1 score (the SpaceNet metric) for approximately-correct building footprints. Figure 1 shows how that performance varies with the amount of training data. The red, green, and blue lines show the results broken down by viewing angle, while the black line is the overall result. Details can be found in previous posts.

Figure 1: Performance of model architecture A, as measured by F1 score, versus number of images used for training. Dotted lines are fitted curves.
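The object-wise F1 score treats each building footprint as a single object: a predicted footprint counts as a true positive if it overlaps a ground-truth footprint closely enough (the SpaceNet metric requires an intersection-over-union of at least 0.5). Here is a minimal sketch of the idea, assuming the pairwise IoU values have already been computed and using a simple greedy matching; the official scoring code is more involved, so treat this as illustration only:

```python
def f1_from_iou(iou, thresh=0.5):
    """Object-wise F1: greedily match predicted footprints (rows) to
    ground-truth footprints (cols) at an IoU threshold."""
    n_pred = len(iou)
    n_true = len(iou[0]) if iou else 0
    matched_true = set()
    tp = 0
    for p in range(n_pred):
        # best still-unmatched ground-truth footprint for this prediction
        best, best_iou = None, thresh
        for t in range(n_true):
            if t not in matched_true and iou[p][t] >= best_iou:
                best, best_iou = t, iou[p][t]
        if best is not None:
            matched_true.add(best)
            tp += 1
    fp = n_pred - tp   # unmatched predictions
    fn = n_true - tp   # unmatched ground-truth buildings
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

iou = [[0.80, 0.10],
       [0.20, 0.60],
       [0.00, 0.55]]  # 3 predictions vs. 2 ground-truth buildings
print(f1_from_iou(iou))  # 2 TP, 1 FP, 0 FN -> F1 = 0.8
```

The third prediction overlaps a building that has already been claimed, so it counts as a false positive rather than a second match.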

Next, we repeat the process with a different model architecture to see whether the results in Figure 1 hold up. Our second architecture is a lightly-modified version of the 1st-place SpaceNet 4 submission, from challenge participant cannab. The original submission’s ensemble of 28 neural nets is pared down to an ensemble of four for faster training. Each neural net uses a U-net layout with an SE-ResNeXt50 encoder for pixel segmentation. Candidate detections are then further winnowed before vector building footprints are generated. We’ll refer to this model architecture as “architecture B.”
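The ensembling step in architecture B amounts to averaging the per-pixel building probabilities produced by the member networks and thresholding the result. A minimal sketch of that idea (toy arrays stand in for the actual U-net outputs, and the 0.5 threshold is an illustrative choice):

```python
import numpy as np

def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel building probabilities across ensemble members,
    then threshold to a binary building mask."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return mean_prob > threshold

# Three toy 2x2 probability maps standing in for per-net predictions
maps = [np.array([[0.9, 0.2], [0.4, 0.8]]),
        np.array([[0.8, 0.1], [0.7, 0.9]]),
        np.array([[0.7, 0.3], [0.1, 0.7]])]
mask = ensemble_mask(maps)
print(mask)  # only pixels the ensemble agrees on survive
```

Averaging smooths out the idiosyncratic mistakes of any single network, which is one reason ensembles tend to give more consistent results.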

Architectures A and B both use U-nets to create pixel maps from which building footprint polygons are generated, but there the similarities end. The two architectures use different encoders and different amounts of post-processing. Architecture B is an ensemble while A is not. Two model architectures clearly can’t span the space of everything that’s possible, but these are different enough to be a reasonable test of any highly-model-specific tendencies.

With that in mind, we measure performance as a function of training data quantity for model architecture B. We follow the procedure in post 1, except for using the method to calculate error bars introduced in post 3. Figure 2 shows the result.

Figure 2: Performance of model architecture B, as measured by F1 score, versus number of images used for training. Dotted lines are fitted curves.

The results of model architectures A and B are broadly similar. In both cases, a rapid performance increase with low amounts of training data gives way to diminishing returns with higher amounts of data. Performance initially rises more quickly with architecture B, getting closer to its maximum value with less training data. Architecture B also shows more consistent performance, with shorter error bars (even after taking into account a factor of two from a change in the analysis procedure).

As previously shown for architecture A, architecture B results can be roughly fit by curves scaling as a constant minus an inverse power law. Such fitted curves for architecture B are included in Figure 2. In this case, the actual results do deviate slightly from the simple model, with the curves underpredicting the measured results at 67 tiles and overpredicting the results at 266 tiles. This deviation is too big to ascribe to a chance statistical fluctuation. In fact, a chi-squared test rejects the hypothesis that the simple curves alone can fully explain the variation in the observed performance. Nevertheless, the deviations from the fitted curves are still small compared to the dynamic range of the performance, so the model is still useful even though it is something of an oversimplification.
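Fitting such a curve is straightforward with standard least-squares tools. A sketch of the "constant minus inverse power law" fit, using synthetic data rather than the actual measurements behind Figure 2 (the parameter values here are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def f1_curve(n, a, b, c):
    """Constant minus an inverse power law: F1 approaches a as n grows."""
    return a - b * n ** (-c)

# Synthetic, noiseless example data (not the measurements from Figure 2)
n_tiles = np.array([67.0, 266.0, 1064.0, 4256.0, 17024.0])
f1 = f1_curve(n_tiles, 0.80, 1.2, 0.55)

params, _ = curve_fit(f1_curve, n_tiles, f1, p0=[0.7, 1.0, 0.5])
a, b, c = params
print(f"asymptote a = {a:.3f}, scale b = {b:.3f}, exponent c = {c:.3f}")
```

The fitted constant `a` is the performance ceiling the architecture approaches with unlimited data; with real, noisy measurements and per-point error bars, a chi-squared test on the residuals is what tells you whether the simple curve fully explains the data.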

Finally, we ask what seems like a simple question: which architecture is better? Figure 3 shows the architecture B results again, with the architecture A results overlaid in lighter colors.

Figure 3: F1 score versus number of training images for architecture A (lighter shades, from Figure 1) and architecture B (darker shades, from Figure 2).

At around 1000 training images, architecture B shows much higher performance than architecture A. By around 20,000 training images, architecture B only barely outperforms A. Extrapolating with the fitted curves indicates that, with enough data, architecture A overtakes architecture B.
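Given two fitted curves of this form, the crossover point can be found numerically. A sketch with invented parameters chosen so that B starts ahead but A has the higher asymptote (these are not the actual fitted values from Figures 1 and 2):

```python
def f1_a(n):  # architecture A: higher ceiling, slower rise (hypothetical)
    return 0.85 - 2.0 * n ** -0.45

def f1_b(n):  # architecture B: lower ceiling, faster rise (hypothetical)
    return 0.83 - 1.5 * n ** -0.50

def crossover(lo, hi, steps=100):
    """Bisect for the training-set size where curve A overtakes curve B."""
    assert f1_a(lo) < f1_b(lo) and f1_a(hi) > f1_b(hi)
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f1_a(mid) < f1_b(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n_cross = crossover(1e3, 1e6)
print(f"A overtakes B at roughly {n_cross:,.0f} training images")
```

With these parameters B wins at 1,000 images and A wins by 1,000,000, so the ranking genuinely flips somewhere in between, which is exactly the behavior the extrapolated fits suggest.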

What this demonstrates is that the model architectures cannot be absolutely ranked. The question “which architecture gets better performance?” may have no answer unless one also specifies how much training data will be used. Deep learning papers often pronounce a model a success if it outperforms the previous best model on some canonical data set. But a canonical data set necessarily has a fixed size. The results with architectures A and B give a clean example of a case where a comparison at a single quantity of training data cannot tell the full story.

Project Conclusions

To close this series on the robustness of models with limited training data, we revisit some of the key points we’ve come across along the way.

  • Foundational mapping model performance rises rapidly when training data is scarce and slowly when it is abundant. This holds across model architectures and across geographies. A whole city’s worth of data is not a requirement for geospatial deep learning: a large fraction of peak performance can be achieved with a small fraction of the data.

But in the end, the most important lesson is this: Geospatial deep learning works surprisingly well with surprisingly little training data. Millions of images are not required — the training data barrier to entry for geospatial deep learning is lower than many people think it is.

The DownLinQ

Welcome to the archived blog of CosmiQ Works, an IQT Lab

Thanks to Nick Weir and Adam Van Etten

Written by

Daniel Hogan, PhD, is a data scientist at CosmiQ Works, an IQT Lab.

As of March 2021, CosmiQ Works has been folded into IQT Labs. An archive will remain here to showcase historical work from CosmiQ Works that took place July 2016 — March 2021.

