Panchromatic to Multispectral: Object Detection Performance as a Function of Imaging Bands
By Adam Van Etten and Lee Cohn
The value of satellite imagery depends on a number of factors, a concept we refer to as the satellite utility manifold. In a previous post we discussed a three-dimensional manifold in which satellite resolution and revisit rate formed the independent axes. In this post we explore a different dimension: the number of imaging bands.
Satellite imagery typically has three bands: red, green, and blue (RGB) channels. Single-band grayscale (or panchromatic) imagery is also common. Far less common are multispectral images comprising more than three bands. These extra bands are very useful for studying aerosols, crops, coastlines, material type, and surface temperatures, to name just a few applications [1]. In the sections below we explore the effect that various band combinations have on building footprint detection using SpaceNet data. We examine the performance of two algorithms that have been engineered to handle an arbitrary number of imaging bands: the YOLT pipeline, as well as a heavily modified version of the MNC algorithm.
We find that the utility of extra bands asymptotes quickly for building footprint detection, and multispectral VNIR data does not improve results over standard RGB images.
1. SpaceNet Data
The second SpaceNet dataset provides satellite imagery for four different cities (Las Vegas, Paris, Shanghai, Khartoum) with attendant GeoJSON labels for building footprints (see Figure 1). The imagery consists of 30 cm GSD grayscale panchromatic, pan-sharpened 30 cm 3-band RGB, and pan-sharpened 30 cm 8-band VNIR multispectral images.
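The SpaceNet imagery is distributed as GeoTIFFs, and many standard image libraries do not handle 8-band, 16-bit files gracefully. As a point of reference, a minimal sketch for reading all bands with the rasterio library might look like the following (the file name is illustrative):

```python
import numpy as np
import rasterio  # reads GeoTIFFs with an arbitrary number of bands

# Illustrative file name for an 8-band pan-sharpened SpaceNet tile
image_path = "MUL-PanSharpen_AOI_2_Vegas_img1.tif"

with rasterio.open(image_path) as src:
    img = src.read()               # numpy array of shape (bands, height, width)
    print(src.count, src.dtypes)   # e.g. 8 bands, typically 16-bit data

# Rescale 16-bit pixel values to [0, 1] floats before feeding a network
img = img.astype(np.float32) / np.iinfo(np.uint16).max
```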
2. Object Detection Algorithms
YOLT is a rapid satellite imagery object detection pipeline that outputs bounding-box predictions for objects of interest (1, 2, 3, 4, 5, 6). For this post we have extended YOLT to train and infer on an arbitrary number of imaging bands.
Multi-task Network Cascades (MNC) is built atop Faster R-CNN and outputs polygonal predictions rather than bounding boxes. In theory, this algorithm should therefore be better suited to building footprint detection than YOLT, as discussed in our previous post. As with YOLT, the MNC algorithm has been enhanced to handle multispectral imaging data.
The mathematics of neural networks, back-propagation, and stochastic gradient descent are indifferent to the number of imaging bands. Yet expanding from 3-band to multispectral images proves challenging in practice, since most computer vision libraries are built to handle only 3-band imagery (or perhaps 4-band, if the fourth band is a transparency layer). Incorporating the extra imaging bands into the deep learning frameworks therefore requires some engineering effort. A further complication is that pre-trained models can no longer be used, as publicly available pre-trained models are almost exclusively trained on 3-band RGB images.
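The structural change itself is small: the first convolutional layer simply accepts N input channels instead of 3. The sketch below is purely illustrative (it is written in PyTorch and is not the actual YOLT or MNC code); it also shows why pretrained RGB weights no longer fit.

```python
import torch
import torch.nn as nn

class MultibandInputStem(nn.Module):
    """Hypothetical first block of a detector, generalized to N input bands."""

    def __init__(self, num_bands: int = 8, out_channels: int = 64):
        super().__init__()
        # The only structural change vs. an RGB network: in_channels = num_bands
        self.conv1 = nn.Conv2d(num_bands, out_channels,
                               kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn1(self.conv1(x)))

stem = MultibandInputStem(num_bands=8)
x = torch.randn(1, 8, 416, 416)   # one 8-band image tile
print(stem(x).shape)              # torch.Size([1, 64, 208, 208])

# Pretrained RGB weights for this layer have shape (64, 3, 7, 7) and no longer
# fit the (64, 8, 7, 7) kernel, which is why training must start from scratch.
```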
3. Model Training and Evaluation Metric
We train a separate model for each of the four cities and each image type (1-band grayscale, 3-band RGB, and 8-band multispectral). This process yields 24 unique models across the two algorithms, which we evaluate on the SpaceNet test dataset. Each model takes 2–4 days to train, depending on the size of the training corpus. Predictions are evaluated via the F1 score, the harmonic mean of precision and recall, which varies from 0.0 (all predictions are erroneous) to 1.0 (all predictions are correct). We define a true positive as any prediction with a Jaccard index (also known as intersection over union, or IOU) of 0.5 or greater with a ground truth footprint; a prediction therefore need not be perfectly aligned with the building footprint to be counted as a success.
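For concreteness, a minimal sketch of the bounding-box variant of this scoring logic follows (the polygonal MNC case is analogous, with polygon intersection and union areas in place of box areas; the function names are ours):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(n_true_pos, n_pred, n_truth):
    """Harmonic mean of precision and recall; a prediction counts as a true
    positive when it matches a ground-truth footprint with IOU >= 0.5."""
    precision = n_true_pos / n_pred if n_pred else 0.0
    recall = n_true_pos / n_truth if n_truth else 0.0
    denom = precision + recall
    return 2.0 * precision * recall / denom if denom else 0.0
```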
4. Error Bars via Bootstrapping
For each city, we use bootstrap resampling of the test dataset to estimate error bars. Bootstrapping is a way of estimating statistical parameters by means of resampling data with replacement. Like other non-parametric statistical methods, bootstrapping does not make assumptions about the distribution of the sample (e.g. whether it’s normally distributed and hence can be characterized by parameters such as mean and variance). Instead, the assumption behind bootstrapping is that the sample distribution is a good approximation to the population distribution.
For each city, we compute error bars for the YOLT and MNC F1 scores via bootstrapping as follows. We resample with replacement N test images, where N is the total number of images in the test set, and compute the F1 score for the bootstrapped sample. We repeat this 10,000 times to obtain 10,000 bootstrapped F1 scores; the mean and variance of this distribution yield confidence intervals for our F1 scores.
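A minimal sketch of that procedure, assuming the per-image true-positive, prediction, and ground-truth counts have already been tallied (the function and variable names below are illustrative, not part of the SpaceNet utilities):

```python
import numpy as np

def bootstrap_f1(per_image_counts, n_boot=10_000, seed=0):
    """per_image_counts: array of shape (N, 3) holding
    (true positives, predictions, ground-truth footprints) per test image."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(per_image_counts, dtype=float)
    n_images = len(counts)
    f1_scores = np.empty(n_boot)
    for i in range(n_boot):
        # Resample N images with replacement and pool their counts
        sample = counts[rng.integers(0, n_images, size=n_images)]
        tp, n_pred, n_truth = sample.sum(axis=0)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_truth if n_truth else 0.0
        denom = precision + recall
        f1_scores[i] = 2.0 * precision * recall / denom if denom else 0.0
    # Mean and spread of the bootstrap distribution give the error bars
    return f1_scores.mean(), f1_scores.std()
```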
5. Results
We compute and visualize results with a modified version of the SpaceNet Visualizer, altered to return scores for each image, and to utilize the following color scheme: red = false positive prediction, yellow = ground truth for a false negative, green = true positive prediction, blue = ground truth for a true positive. Figures 2 and 3 illustrate aggregate results for each city and band combination.
As evidenced by Figures 2 and 3, neither model is able to exploit the extra information in the five additional multispectral bands, and 8-band multispectral F1 scores are typically within the error bars of the 3-band RGB results. MNC suffers in the grayscale domain, though YOLT achieves nearly the same score with grayscale as with RGB or multispectral imagery. Las Vegas is the easiest city, as most buildings are well-separated single-family homes. Paris and Shanghai are somewhat more difficult, with larger apartment complexes and industrial regions. Khartoum is difficult due to low contrast between buildings and background, as well as many very small structures. Figures 4 and 5 below illustrate how the different models and image types compare to one another.
Close inspection of Figures 4 and 5 reveals the advantages of each algorithm. YOLT predictions generally have a higher F1 score, and the scores are robust to image type (grayscale images yield results comparable to 3-band or 8-band images). MNC performance degrades on grayscale images and its overall F1 score is slightly lower, yet its use of polygon predictions rather than bounding boxes yields more precise localization: true positive predictions for MNC have a significantly higher Jaccard index than those for YOLT.
6. Conclusions
In this post we demonstrated the ability to ingest multispectral data into two convolutional neural network object detection frameworks: YOLT and MNC. While the engineering challenges of adapting these frameworks were non-trivial, we find no performance enhancement when using VNIR multispectral data compared to standard 3-band RGB imagery for detecting building footprints. For the MNC algorithm there is a significant gain when moving from 1-band grayscale imagery to 3-band RGB or 8-band multispectral imagery, though YOLT results are robust to image type. The bounding-box predictions of YOLT are adequate for providing building location and a rough estimate of building area, though true positive predictions for MNC have a significantly higher Jaccard index than those for YOLT.
There are many scenarios where multispectral VNIR data is beneficial (e.g., estimating vegetation cover or building material type). It appears, however, that building footprint detection in satellite imagery is not one of them. We look forward to applying our multispectral object detection algorithms to additional object types and labeling schemes (e.g., "house", "farm", "gas station" instead of simply "building") that may elucidate the utility of multispectral imagery in the object detection realm.
*This post is the product of research by both Adam Van Etten and Lee Cohn.
- Thanks to lporter for helpful comments.
May 29, 2018 Addendum: See this post for paper and code details.