Robustness of Limited Training Data for Building Footprint Identification: Part 1

Daniel Hogan
Published in The DownLinQ · Jul 9, 2019 · 5 min read

It’s a question that gets asked over and over again: How much data do I need to train my neural network? In this blog post, we will explore that question and answer it in the context of a specific case from the field of remote sensing imagery analysis. We’ll show that small amounts of data can perform surprisingly well. Subsequent blog posts will look at whether the answer is affected by geography and model architecture.

Motivation: The Utility of Limited Data

Many different variables determine the ultimate mission impact of satellite imagery. To make sense of it all, CosmiQ Works approaches this topic through the conceptual framework of the Satellite Utility Manifold. The utility of any given satellite imagery data set depends simultaneously on a number of attributes, each of which warrants careful investigation. Previous DownLinQ blog posts have explored the utility of satellite imagery as a function of its resolution, revisit rate, number of bands, and viewing angle. Here, we investigate the utility of an imagery data set as a function of how much data it contains.

To define the scope of this analysis, we’ll look at the task of building footprint detection in high-resolution satellite imagery, with a ground sample distance (GSD) of less than a meter. We ask the question: is model performance largely data-limited, even with a city’s worth of data? Or are there opportunities for meaningful results with even a tiny amount of data? The time and expense of data labeling make this an important issue, so let’s find out.

Background: Learning Curves

Although the importance of data set size for deep learning is widely acknowledged, surprisingly little attention has been given to quantitative analyses of the relationship between data set size and model performance for a fixed model architecture.

Where such research has been done, it has primarily focused on multiclass classification tasks. In a classification context, the term “learning curve” describes a plot of either accuracy or error as a function of training data set size. An early paper on this topic (Seung et al., 1992) modeled neural networks as thermodynamic systems (and got published in a theoretical physics journal). One of that paper’s conclusions is that learning curves should scale as a constant plus or minus an inverse power law term. Cortes et al. (1993) highlighted the practical applications of this result, and it was recently put to use by Cho et al. (2016) in a study of classifying medical images.
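Written out, the scaling described in those papers looks roughly like the sketch below, where n is the training set size; the notation here is ours, and each paper’s exact parameterization differs.

```latex
% Schematic learning-curve scaling suggested by the classification literature above:
% test error falls toward an asymptotic value as an inverse power law of the
% training set size n (equivalently, accuracy rises toward an asymptote).
\epsilon(n) \;\approx\; \epsilon_{\infty} + \frac{b}{n^{c}}, \qquad b, c > 0
```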

However, we have not found any analogously thorough treatment of F1 scores for instance segmentation, of which building footprint detection is an example. Furthermore, we have not found much treatment of data set size issues in a satellite imagery context. (An exception to the latter, albeit not a deep learning example, is an early CosmiQ Works effort to use machine learning to measure ship headings.)

The Plan: SpaceNet4 Resources

To study the effect of imagery data set size on utility for building footprint identification, we train the same SpaceNet4 competitor model on varying amounts of data. The data sets released through SpaceNet are among the highest-quality openly available labeled satellite imagery, and the associated SpaceNet Challenge competitions result in the development and open-sourcing of high-performing geospatial deep learning models.

For this analysis of data set size, we use the SpaceNet4 data set, with imagery of Atlanta taken from a variety of viewing angles. The imagery is grouped into three categories: “nadir” with viewing angles within 25 degrees of nadir, “off-nadir” with viewing angles from 26 to 40 degrees off nadir, and “far-off-nadir,” with viewing angles exceeding 40 degrees off nadir. Where “overall” performance is reported, this is defined as a simple average of performance for the three categories. Some example imagery is shown in Figure 1. A follow-up study will look at nadir imagery from cities in the SpaceNet2 data set, to see whether the results are geography-dependent.
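As a concrete illustration of that grouping and of how the “overall” number is formed, here is a minimal sketch; the function names and dictionary layout are ours, not part of the SpaceNet tooling.

```python
def angle_category(off_nadir_deg):
    """Bin an image by its viewing angle, following the SpaceNet4 groupings."""
    if off_nadir_deg <= 25:
        return "nadir"
    elif off_nadir_deg <= 40:
        return "off-nadir"
    return "far-off-nadir"

def overall_f1(f1_by_category):
    """'Overall' performance is a simple, unweighted average of the three categories."""
    return sum(f1_by_category[c] for c in ("nadir", "off-nadir", "far-off-nadir")) / 3

# Illustrative numbers only, not results from the experiment:
print(angle_category(32))                                                  # 'off-nadir'
print(overall_f1({"nadir": 0.7, "off-nadir": 0.6, "far-off-nadir": 0.4}))  # 0.5666...
```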

Figure 1: (a-c) Different imagery tiles for the same physical location: (a) An on-nadir view, (b) an off-nadir view from the north, (c) a far-off-nadir view from the south; (d) the images’ common ground truth building footprint labels.

As for the model, we use the building footprint detection model that placed fifth in the SpaceNet4 Challenge. Although not the top-performing submission, this model outpaces the winner on inference times by more than a factor of ten. It also has the advantage of a widely familiar architecture, consisting of a U-Net with a VGG-16 encoder and corresponding decoder. Nevertheless, the model, contributed under the screen name “XD_XD,” performs within 5% of the top-performing model overall. Another follow-up experiment will see how the results change if a different model architecture is substituted for this one.
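For readers who want a hands-on feel for that architecture, a broadly comparable U-Net with a VGG-16 encoder can be instantiated in a few lines, for example with the segmentation_models_pytorch library. This is a generic stand-in, not XD_XD’s actual code, and the settings shown are placeholders.

```python
import torch
import segmentation_models_pytorch as smp

# A U-Net with a VGG-16 encoder, similar in spirit to XD_XD's model
# (his implementation, inputs, and hyperparameters differ).
model = smp.Unet(
    encoder_name="vgg16",        # VGG-16 backbone as the encoder
    encoder_weights="imagenet",  # start from ImageNet-pretrained weights
    in_channels=3,               # RGB here; SpaceNet imagery offers more bands
    classes=1,                   # one output channel: building-footprint mask
)

x = torch.randn(1, 3, 512, 512)  # dummy image tile
with torch.no_grad():
    mask_logits = model(x)       # shape: (1, 1, 512, 512)
```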

As in the SpaceNet4 Challenge, model performance is evaluated based on F1 score (the harmonic mean of precision and recall) for finding building footprints with IoU (intersection-over-union, a measure of overlap with ground truth) above 0.5. The score is evaluated with testing data that is different from the training/validation data.
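In code-level terms, the scoring amounts to the following: a predicted footprint counts as a true positive if it overlaps an unclaimed ground-truth footprint with IoU above 0.5, and F1 is then computed from the resulting precision and recall. The greedy matcher below is a simplified sketch, not the actual SpaceNet scoring utility.

```python
from shapely.geometry import Polygon

def f1_at_iou(pred_polys, truth_polys, iou_threshold=0.5):
    """Simplified building-footprint F1: greedy matching at an IoU threshold."""
    unmatched = list(truth_polys)
    tp = 0
    for pred in pred_polys:
        for truth in unmatched:
            union = pred.union(truth).area
            iou = pred.intersection(truth).area / union if union > 0 else 0.0
            if iou > iou_threshold:
                tp += 1
                unmatched.remove(truth)
                break
    fp = len(pred_polys) - tp
    fn = len(truth_polys) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Toy example: one perfect match, one missed building, one spurious prediction -> F1 = 0.5
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
far_square = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])
other_truth = Polygon([(10, 10), (11, 10), (11, 11), (10, 11)])
print(f1_at_iou([square, far_square], [square, other_truth]))  # 0.5
```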

To expedite the analysis, some modifications are made to XD_XD’s design. Most importantly, the original ensemble of three models is replaced by just one, since this threefold reduction in training time results in a mere 3% decrease in performance.

The model is first trained with the full SpaceNet4 training data set. This data set contains imagery tiles of 900 by 900 pixels, each corresponding to an area of 450m on a side, for 50cm GSD. The average tile has 63 building footprints. The SpaceNet4 training data includes 1064 tile locations within Atlanta, with 27 views of each location. Since a quarter of the tile locations are set aside for validation, the number of images actually used for training is 1064 * 27 * 3/4, or about 21,500 images. After training with the full training data set, the process is repeated nine more times, reducing the amount of data by about one-half each time.
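To make that bookkeeping concrete, here is the arithmetic as a quick sketch, using only the numbers quoted above:

```python
tile_locations = 1064       # unique tile locations in the SpaceNet4 training set
views_per_location = 27     # views of each location, at different angles

# One quarter of the tile locations are set aside for validation.
train_locations = tile_locations * 3 // 4              # 798 locations
full_training_images = train_locations * views_per_location
print(full_training_images)                            # 21546, i.e. about 21,500 images

# Nine further runs each use roughly half as much data as the previous one;
# the smallest run, a single tile location, has just 27 training images.
print(full_training_images / (1 * views_per_location))  # 798.0, i.e. ~1/800 of the data
```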

The Result: Limited Data Rises to the Occasion

The results of training the same model with different amounts of training data are shown in Figure 2. The key result is immediately evident: Model performance rises rapidly with training data when there is not much data available, but further increases in data provide diminishing returns. The shape of the curves is roughly logarithmic, although a more exact functional form will be developed in a subsequent blog post.

Figure 2: Model performance, as measured by F1 score, versus number of images used for training, excluding validation. These images include 27 views of each unique location. Dotted lines are fitted curves. The x-axis is amount of training data, NOT training time.

This means that the model performs remarkably well even when trained with only a small amount of data. Compared to using the full data set, using just 3% of the data still provides 2/3 of the performance. To take an extreme example, training on the 27 views of the single location randomly selected for use here provides about 2/9 of the performance of training with the full data set. That’s almost a quarter of the maximum performance, achieved with one eight-hundredth of the data.
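For readers who want to experiment with curves like the dotted lines in Figure 2, a fit takes only a few lines with scipy. The saturating power-law form below is an illustrative guess motivated by the classification literature cited earlier, not necessarily the form actually used for Figure 2 (that reveal is saved for the next post), and the data points are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n, f1_max, b, c):
    """F1 rises toward an asymptote f1_max as an inverse power law of data set size n."""
    return f1_max - b * n ** (-c)

# Synthetic (n_images, F1) points standing in for real measurements.
rng = np.random.default_rng(0)
n = np.logspace(np.log10(27), np.log10(21546), 10)
f1 = saturating_power_law(n, 0.7, 1.2, 0.35) + rng.normal(0.0, 0.01, n.size)

params, _ = curve_fit(saturating_power_law, n, f1, p0=[0.7, 1.0, 0.3])
print(dict(zip(["f1_max", "b", "c"], params.round(3))))
```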

There’s much more to unpack here, including discussions of just where those error bars and fitted curves in Figure 2 come from. (Here’s a preview: the error isn’t constant, and the fitted curves aren’t logarithmic!) Look for that, along with matters of training time, ensemble-building, and what it all means, in the next blog post in this series.

Daniel Hogan, PhD, is a senior data scientist at IQT Labs and was a member of CosmiQ Works.