The Effect of Resolution on Deep Neural Network Image Classification Accuracy

In this post, we further explore the boat-heading classification problem examined in previous posts (1, 2, 3). Specifically, we examine the impact of both spatial resolution and training dataset size on the classification performance of deep neural networks (DNNs). Results are similar to those achieved with a HOG-based classifier, and we provide a full comparison later in the post*.

2. Boat Heading Classification Datasets

Training and validation datasets consist of DigitalGlobe imagery cutouts of both boats and background regions, described in 1. High-resolution image cutouts are augmented by re-projecting them to lower resolutions (see Section 1 of 3). In brief: for classifier training we utilize labeled cutouts from two DigitalGlobe images at native resolution (0.34m and 0.5m), plus imagery down-sampled to 0.5m and 1.0m ground sample distance (GSD). For evaluation, we use DigitalGlobe imagery from a third validation image re-projected to [0.5, 1, 2, 3, 4, 5, 10] meter GSD, as well as Planet data assumed to be at its native 3m GSD.
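The down-sampling step can be sketched in a few lines. The snippet below is a simplified stand-in (Gaussian anti-aliasing blur plus decimation, with a heuristic kernel width) for the re-projection pipeline referenced above, not our exact code:

```python
import numpy as np
from scipy import ndimage

def reproject_to_gsd(chip, native_gsd, target_gsd):
    """Down-sample an image chip from its native GSD to a coarser target GSD.

    A Gaussian blur crudely approximates the sensor point-spread function
    before decimation, which avoids aliasing.  Sketch only; the actual
    re-projection is described in Section 1 of post 3.
    """
    if target_gsd <= native_gsd:
        return chip  # no-op: we only degrade resolution
    scale = native_gsd / target_gsd          # e.g. 0.5m -> 3m gives 1/6
    sigma = (1.0 / scale) / 2.0              # heuristic anti-aliasing width
    blurred = ndimage.gaussian_filter(chip.astype(float), sigma=sigma)
    return ndimage.zoom(blurred, zoom=scale, order=1)

# Degrade a 0.5m GSD chip (200x200 px, ~100m x 100m footprint) to 3m GSD
chip = np.random.rand(200, 200)
low_res = reproject_to_gsd(chip, native_gsd=0.5, target_gsd=3.0)
print(low_res.shape)
```

The same routine applied with target GSDs of [0.5, 1, 2, 3, 4, 5, 10] meters yields the evaluation series used throughout this post.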

3. AlexNet Classifier

Using Caffe and DIGITS, we trained a 73-class (72 rotations, plus null) DNN classifier based on the AlexNet architecture. CosmiQ initialized weights and biases from the Caffe implementation of AlexNet that was pre-trained on ImageNet data.
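For readers curious how 72 rotation classes plus a null class might be encoded, a hypothetical labeling function could look like the following; the 5-degree bin width follows from 360/72, though the exact binning scheme is not spelled out in this post:

```python
def heading_to_class(heading_deg=None, n_bins=72):
    """Map a boat heading (degrees) to one of n_bins rotation classes,
    or to a 'null' class (index n_bins) for background chips.

    With 72 bins each class spans 5 degrees of heading.  This labeling
    scheme is an illustrative assumption, not the documented pipeline.
    """
    if heading_deg is None:          # background chip -> null class
        return n_bins
    bin_width = 360.0 / n_bins       # 5 degrees per class
    return int(round(heading_deg / bin_width)) % n_bins

print(heading_to_class(0))      # 0
print(heading_to_class(93))     # 19  (93/5 = 18.6, rounds up)
print(heading_to_class(358))    # 0   (wraps around 360)
print(heading_to_class(None))   # 72  (null/background class)
```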

Training CosmiQ’s DNN required six hours on four high-end consumer NVIDIA GPUs (Titan Xs). The lengthy training (and evaluation) time of our model precludes the bootstrap resampling used to estimate confidence intervals in 2, 3. Since we are counting discrete events in our test dataset, we assume a Poisson distribution and therefore estimate the fractional error as N^(-1/2), where N is the number of boats. Our validation chip set contains 516 DigitalGlobe cutouts and 278 Planet images, with the distribution of boat lengths shown below in Figure 1.
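The Poisson error estimate is simple enough to compute directly for the two validation sets:

```python
import math

def poisson_fractional_error(n):
    """Fractional (1-sigma) error for a Poisson count:
    sigma/N = sqrt(N)/N = N^(-1/2)."""
    return 1.0 / math.sqrt(n)

# Error bars for the two validation sets described above
for name, n in [("DigitalGlobe", 516), ("Planet", 278)]:
    print(f"{name}: N={n}, fractional error = {poisson_fractional_error(n):.3f}")
# DigitalGlobe: ~4.4%, Planet: ~6.0%
```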

Figure 1 (from 3). Counts of boats in each size bin for DigitalGlobe (left) and Planet (right) validation image cutouts. Bins are: 0–10m, 10–20m, 20–40m, 40+ meters. The maximum DigitalGlobe boat length is 97 meters, while the maximum Planet boat length is 349 meters. Minimums are 3m for DigitalGlobe and 14m for Planet. Totals are 516 for DigitalGlobe and 278 for Planet.
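The size bins of Figure 1 can be reproduced with a small helper; the example lengths below are illustrative, not the actual dataset:

```python
from collections import Counter

def length_bin(length_m):
    """Assign a boat length (meters) to one of the four Figure 1 size bins."""
    if length_m < 10:
        return "0-10m"
    elif length_m < 20:
        return "10-20m"
    elif length_m < 40:
        return "20-40m"
    return "40+m"

# Hypothetical lengths spanning the observed extremes (3m to 349m)
lengths = [3, 8, 14, 25, 97, 349]
print(Counter(length_bin(l) for l in lengths))
```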

Using our re-projected dataset, we can study the effects of resolution on classification accuracy. Figures 2 and 3 below detail the performance of the classifier as resolution degrades.

Figure 2. False positive rates (background images classified as boats) and false negative rates (boat images classified as background) as a function of ground sample distance over all boat lengths. Inverted triangles denote Planet data, and lines depict DigitalGlobe data. As the GSD increases, the classifier misclassifies an increasing percentage of the boat images. Even though the classifier is trained only on DigitalGlobe imagery at 0.34m, 0.5m, and 1.0m GSD, its performance against lower-resolution data remains reasonable out to 3m. Note also that the performance on degraded DigitalGlobe imagery at 3m is comparable to that on 3m Planet imagery. Error bars are N^(-1/2), where N equals the number of boats. A comparison with Figure 7 of 3 indicates that the HOG+LogReg model generates similar results, although its false negative rate appears to increase more slowly with degrading resolution.
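As a concrete illustration of the metrics in Figure 2, false positive and false negative rates can be computed from predicted labels as follows, assuming class index 72 is the null/background class per the 73-class setup described earlier:

```python
def fp_fn_rates(y_true, y_pred, null_class=72):
    """False positive rate: background chips labeled as some boat class.
    False negative rate: boat chips labeled as background (the null class).
    Class indices 0-71 are heading bins; 72 is background (an assumed
    convention matching the 73-class description above)."""
    bg_preds = [p for t, p in zip(y_true, y_pred) if t == null_class]
    boat_preds = [p for t, p in zip(y_true, y_pred) if t != null_class]
    fp = sum(p != null_class for p in bg_preds) / len(bg_preds)
    fn = sum(p == null_class for p in boat_preds) / len(boat_preds)
    return fp, fn

# Tiny synthetic example: 4 background chips, 4 boat chips
y_true = [72, 72, 72, 72, 10, 20, 30, 40]
y_pred = [72, 72,  5, 72, 10, 72, 30, 40]
print(fp_fn_rates(y_true, y_pred))  # (0.25, 0.25)
```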

In Figure 3, we present results for the more difficult problem of classifying boat heading. To properly capture the impact of object size, we break out the classifier's performance by vessel size bin. Scoring follows the methodology described in Section 1 of 3.

Figure 3. Heading accuracy by vessel length. The classifier is applied to an independent DigitalGlobe test image (lines) blurred to represent varying GSDs, and to Planet images (inverted triangles). Error bars are conservatively estimated as N^(-1/2). Planet positions are slightly offset horizontally for clarity. These results do not account for the 180-degree ambiguity in boat heading, so accuracy should improve somewhat if it is taken into account, especially for the largest boats (see Figure 5 of 3).
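To make the 180-degree (bow/stern) ambiguity concrete, here is a hypothetical scoring helper that optionally treats headings differing by 180 degrees as equivalent:

```python
def heading_error(pred_deg, true_deg, mod180=False):
    """Smallest angular difference between predicted and true heading.

    With mod180=True, headings that differ by 180 degrees (bow/stern
    ambiguity) are treated as identical.  Illustrative helper, not the
    scoring code used for the figures.
    """
    period = 180.0 if mod180 else 360.0
    diff = abs(pred_deg - true_deg) % period
    return min(diff, period - diff)

print(heading_error(10, 190))               # 180.0
print(heading_error(10, 190, mod180=True))  # 0.0 (bow/stern swap forgiven)
print(heading_error(355, 5))                # 10.0 (wraps around 360)
```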

There are a few results worth noting in Figure 3. Not surprisingly, headings of larger vessels are better classified than those of smaller vessels as the GSD increases. For the DNN classifier, the Planet data yield results somewhat worse than those from the corresponding blurred DigitalGlobe imagery, and worse than the HOG predictions. While we cannot be certain why this is the case, it may be an example of overtraining of the DNN (recall that no Planet data were used for training). The HOG-based classifier relies on relatively simple gradient-based features that may transfer well between different datasets, whereas the DNN's far greater number of parameters may allow it to overtrain on features specific to DigitalGlobe data.

We combine the results of Figure 3 above with results from 3 (Figure 5) below. Recall that the HOG+LogReg classifier utilizes bootstrap resampling in estimating error bars, as opposed to the simple N^(-1/2) scaling used for the DNN.

Figure 4. Comparison between HOG and DNN classifier heading accuracies for the four boat length bins. Lines denote DigitalGlobe data, and points with error bars show Planet data. Note that results are largely within error bars, though the HOG model performs better at inferring heading from Planet data. Planet positions are slightly offset horizontally for clarity. Averaged over all boat lengths, the DNN model achieves a 75.5 +/- 3.3% heading accuracy for 0.5m GSD DigitalGlobe imagery, while the HOG classifier achieves 83.7 +/- 1.7%.

4. Labeled Data Dependence

The effective use of machine learning algorithms for computer vision problems requires supervision with large amounts of labeled data. To gain insight into the impact of dataset size, we ran an experiment relating the accuracy of a trained classifier to the amount of labeled data used to train it. It is important to note that the scope of this classification problem is tightly bounded, so the required amount of training data may not carry over to other classification problems.

We achieve results qualitatively similar to the HOG+LogReg model (see Figure 10 of 3), with accuracy generally converging after about 400 samples in the training dataset. This is higher than the ~100 training samples required for convergence in the HOG+LogReg classifier (Figure 10 of 3). Either model leads us to conclude that modest-sized training sets may be sufficient for certain classes of problems. On the other hand, these results also suggest that one cannot significantly improve accuracy at a given resolution for this type of problem simply by growing the training set beyond a certain threshold.
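The experiment described above can be sketched as a generic learning-curve loop. Here `eval_fn` is a placeholder for "train a classifier on the subset and return validation accuracy", and the toy saturation function merely mimics the ~400-sample convergence we observed; neither is our actual training code:

```python
import random

def learning_curve(train_set, eval_fn, sizes, seed=0):
    """Evaluate accuracy on nested random subsets of increasing size to
    trace accuracy vs. training-set size (the Section 4 experiment).
    eval_fn(subset) should train a classifier and return its accuracy."""
    rng = random.Random(seed)
    shuffled = train_set[:]
    rng.shuffle(shuffled)
    return [(n, eval_fn(shuffled[:n])) for n in sizes]

# Toy stand-in eval_fn: accuracy saturates as the subset grows
toy_train = list(range(1000))
curve = learning_curve(
    toy_train,
    lambda s: round(0.8 * (1 - 2 ** (-len(s) / 100)), 3),
    sizes=[50, 100, 200, 400, 800],
)
print(curve)  # accuracy rises quickly, then flattens past ~400 samples
```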

5. Scaling

In general, a logistic regression classifier trained on HOG features yields results that are comparable to those of the DNN. Model accuracy is not the only consideration, however; implementation speed is also of critical importance given the ever-increasing volume of data. In this section we investigate the computational requirements of various approaches.

The first step in image classification is model training. For a corpus of 44,000 images, training AlexNet on our four-GPU server takes approximately six hours. HOG feature descriptors coupled with logistic regression take less than one minute to train on a single CPU with the same image corpus. However, training a classifier is typically an infrequent task, with minimal fine-tuning required for retraining with additional data. The evaluation time of an image classifier (the time to run one image through the classifier) is a more important metric than training time, as it better reflects the true operational cost of utilizing machine learning algorithms; the computational costs of various scenarios are shown below in Figure 6. We also include results for GoogLeNet, an alternative (and deeper) architecture to AlexNet. As can be seen, a fundamental driver of computational cost is the sheer number of images to be evaluated, though preprocessing steps and region proposal techniques may provide significant savings when applying image classifiers to large areas.

Figure 6. Applying image classifiers to a large imagery database poses a computational challenge; for objects of size 50m and a search area of 10,000 km2, we expect to inspect ~16 million sub-images. Neural networks must be run on GPUs, whereas the HOG+LogReg approach can use either GPUs or CPUs. Chipping a large image in the manner shown here scales reasonably well for the HOG approach, though poorly for DNNs. Advanced DNN techniques that reduce the number of sub-images to be analyzed are an area of active and very promising research, however, so DNNs should not be dismissed as incompatible with large datasets.
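The ~16 million figure is consistent with a sliding window at 50% overlap (an assumption on our part about the chipping scheme): a 50m window with a 25m stride over 10,000 km² gives 10^10 m² / (25m)² chips:

```python
def n_subimages(area_km2, window_m, overlap=0.5):
    """Number of sliding-window chips needed to cover a search area.

    With 50% overlap the stride is half the window size.  Edge effects
    are ignored; this is back-of-the-envelope arithmetic, not a tiler.
    """
    stride = window_m * (1 - overlap)    # 25m for a 50m window
    area_m2 = area_km2 * 1e6
    return int(area_m2 / stride ** 2)

# The Figure 6 scenario: 10,000 km^2 searched with 50m windows
print(n_subimages(10_000, 50))  # 16000000
```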

6. Conclusion

In this post we built upon the results of 3 and explored the performance of deep learning classifiers applied to differing resolutions and various training dataset sizes.

At the highest resolution, heading accuracies ranged from 65–80% depending on boat length. It is possible that a different neural network architecture or an ensemble approach combining both DNN and HOG+LogReg results would improve accuracy. As we noted in 3, classifier performance is strongly dependent on vessel length and degrades as GSD increases; the shape of the classification curve should help inform satellite imagery resolution requirements for various problems.

In the last few years, there have been several technological breakthroughs that have demonstrated DNN capabilities beyond what was considered possible. For the vast majority of computer vision tasks, DNNs are rapidly becoming the tool of choice. Nevertheless, our comparison of DNN and HOG+LogReg results demonstrates that for some classes of problems classical machine learning techniques can still compete with neural networks both in terms of speed and accuracy.

*Footnote: This post is the work of the entire CosmiQ team (Medium handles: @avanetten, @david.lindenbaum, @hagerty, @lisa_porter, @rlewis2016, @toddstavish).