Deep learning models for interpreting satellite imagery show increasing performance as the amount of training data is increased. We’ve looked at the details of how this plays out, but our previous studies of this topic have suffered from a weakness: they’ve all used just one model architecture, meaning the layout of artificial neurons and other details of the algorithm were the same each time. In this post, we’ll recreate our first analysis with a whole new model architecture, to see what changes and what stays the same.
This is the last post in the Robustness of Limited Training Data series, so we’ll close with some overall conclusions. You can read all the previous posts here: part 1, part 2, part 3, part 4. However, they are not prerequisites for what follows.
Two Model Architectures
To explore the dependence of model performance on the amount of training data, we train a model to identify building footprints in satellite imagery. Our imagery comes from the SpaceNet 4 data set, which contains 27 views of Atlanta taken with Maxar’s WorldView-2 satellite. The imagery, with a ground sample distance of about half a meter, is chipped into tiles of about 450m on a side. The model architecture is a lightly-modified version of the 5th-place SpaceNet 4 prize-winning model submitted by challenge participant XD_XD. This model features fast inference, and the version used here replaces the original ensemble of three neural networks with a single neural network to speed up training. For brevity, we will refer to this model architecture as “architecture A.” Performance is evaluated with an object-wise F1 score (the SpaceNet metric) for approximately-correct building footprints. Figure 1 shows how that performance varies with the amount of training data. The red, green, and blue lines show the results broken down by viewing angle, while the black line is an overall result. Details can be found in the previous posts.
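To make the object-wise F1 score concrete: a proposed footprint counts as a true positive if it overlaps a ground-truth footprint with intersection-over-union (IoU) of at least 0.5, and each ground-truth footprint can be claimed at most once. The sketch below, using the `shapely` geometry library, shows the idea with a simple greedy matcher; SpaceNet's actual scoring code handles matching order and edge cases more carefully, so treat this as illustrative rather than the official implementation.

```python
from shapely.geometry import Polygon


def iou(a, b):
    """Intersection-over-union of two polygons."""
    union = a.union(b).area
    return a.intersection(b).area / union if union else 0.0


def f1_score(proposals, ground_truth, threshold=0.5):
    """Greedy object-wise F1: a proposal is a true positive if it
    matches an as-yet-unclaimed ground-truth footprint with
    IoU >= threshold."""
    unmatched = list(ground_truth)
    tp = 0
    for prop in proposals:
        for gt in unmatched:
            if iou(prop, gt) >= threshold:
                tp += 1
                unmatched.remove(gt)
                break
    fp = len(proposals) - tp
    fn = len(ground_truth) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, detecting exactly one of two ground-truth buildings with no false positives gives precision 1, recall 0.5, and F1 = 2/3.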
Next, we repeat this process with a different model architecture to see if the results in Figure 1 hold up. Our second architecture is a lightly-modified version of the 1st-place SpaceNet 4 submission, from challenge participant cannab. The original submission’s ensemble of 28 neural nets is pared down to an ensemble of four for faster training. Each neural net uses a U-net layout with a SE-ResNeXt50 encoder for pixel segmentation. Possible positives are subjected to further winnowing prior to generating vector building footprints. We’ll refer to this model architecture as “architecture B.”
Architectures A and B both use U-nets to create pixel maps from which building footprint polygons are generated, but there the similarities end. The two architectures use different encoders and different amounts of post-processing. Architecture B is an ensemble while A is not. Two model architectures clearly can’t span the space of everything that’s possible, but these are different enough to be a reasonable test of any highly-model-specific tendencies.
With that in mind, we measure performance as a function of training data quantity for model architecture B. We follow the procedure in post 1, except for using the method to calculate error bars introduced in post 3. Figure 2 shows the result.
The results of model architectures A and B are broadly similar. In both cases, a rapid performance increase with low amounts of training data gives way to diminishing returns with higher amounts of data. Performance initially rises more quickly with architecture B, getting closer to its maximum value with less training data. Architecture B also shows more consistent performance, with shorter error bars (even after taking into account a factor of two from a change in the analysis procedure).
As previously shown for architecture A, architecture B results can be roughly fit by curves scaling as a constant minus an inverse power law. Such fitted curves for architecture B are included in Figure 2. In this case, the actual results do deviate slightly from the simple model, with the curves underpredicting the measured results at 67 tiles and overpredicting the results at 266 tiles. This deviation is too big to ascribe to a chance statistical fluctuation. In fact, a chi-squared test rejects the hypothesis that the simple curves alone can fully explain the variation in the observed performance. Nevertheless, the deviations from the fitted curves are still small compared to the dynamic range of the performance, so the model is still useful even though it is something of an oversimplification.
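A fit of this kind can be sketched in a few lines with `scipy.optimize.curve_fit`. The functional form is f(n) = a − b·n^(−c), where a is the asymptotic performance with unlimited training data. The tile counts and F1 values below are illustrative stand-ins, not the measured results from Figure 2.

```python
import numpy as np
from scipy.optimize import curve_fit


def perf_model(n, a, b, c):
    """Performance as a constant minus an inverse power law."""
    return a - b * n ** (-c)


# Illustrative (not measured) training-set sizes and F1 scores
tiles = np.array([67, 266, 1064, 4255, 17021])
f1 = np.array([0.35, 0.52, 0.61, 0.66, 0.68])

params, cov = curve_fit(perf_model, tiles, f1, p0=[0.7, 1.0, 0.5])
a, b, c = params
# a estimates the asymptotic F1 score with unlimited training data
```

Given per-point error bars, the goodness-of-fit test mentioned above amounts to computing chi-squared, the sum of squared residuals divided by their variances, and comparing it against a chi-squared distribution with (number of points − 3) degrees of freedom.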
Finally, we ask what seems like a simple question: which architecture is better? Figure 3 shows the architecture B results again, with the architecture A results overlaid in lighter colors.
At around 1000 training images, architecture B shows much higher performance than architecture A. By around 20,000 training images, architecture B only barely outperforms A. Extrapolating with the fitted curves indicates that, with enough data, architecture A overtakes architecture B.
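Given two fitted curves, the crossover point where one architecture overtakes the other can be found numerically with a root finder. The parameters below are hypothetical, chosen only to reproduce the qualitative behavior described above (B ahead at low data volumes, A overtaking with enough data); they are not the actual fitted values.

```python
from scipy.optimize import brentq


def perf(n, a, b, c):
    """Fitted performance curve: constant minus inverse power law."""
    return a - b * n ** (-c)


# Hypothetical parameters: A has the higher asymptote,
# B the faster initial rise.
params_a = (0.72, 3.0, 0.45)
params_b = (0.70, 1.5, 0.50)


def gap(n):
    """Performance of A minus performance of B at n training images."""
    return perf(n, *params_a) - perf(n, *params_b)


# Training-set size at which A's extrapolated curve overtakes B's
crossover = brentq(gap, 1e3, 1e7)
```

With these particular numbers, B leads at 1,000 images and the gap has nearly closed by 20,000, with the crossover falling somewhere beyond that, which mirrors the qualitative story in Figure 3.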
What this demonstrates is that the model architectures cannot be absolutely ranked. It might not be possible to answer the question “which architecture gets better performance?” — unless one specifies how much training data is to be used. Deep learning papers often pronounce a model to be a success if it outperforms a previous best model on some canonical data set. But a canonical data set necessarily has a fixed size. The results with architectures A and B give a clean example of a case where a comparison using a single quantity of training data cannot tell the full story.
To close this series on the robustness of models with limited training data, we revisit some of the key points we’ve come across along the way.
- Foundational mapping model performance rises rapidly with additional data when data is scarce, and only slowly when data is abundant. This holds across model architectures and across geographies. Having a whole city’s worth of data is not a requirement for geospatial deep learning. A large fraction of that performance can be achieved with a small fraction of the data.
- Various statistical methods enable more thorough analyses. Error bars are critically important for deep learning research, revealing when results are repeatable and when they’re even real.
- To get the maximum performance across multiple cities with a fixed (and equal) amount of data for each one, building a separate model for each city gives worse results than pooling all that data to train one general-use model.
- A simple functional form works well for fitting building footprint model performance across a variety of architectures and geographies. Differences in the fitted parameters point to poorly-understood differences in the cities themselves.
- A deep learning architecture that performs the best with low training data might not be the best given lots of training data.
But in the end, the most important lesson is this: Geospatial deep learning works surprisingly well with surprisingly little training data. Millions of images are not required — the training data barrier to entry for geospatial deep learning is lower than many people think it is.