The Effect of Resolution on Deep Neural Network Image Classification Accuracy
In this post, we further explore the boat-heading classification problem that we examined in previous posts (1, 2, 3). Specifically, we explore the impact of both spatial resolution and training dataset size on the classification performance of deep neural networks (DNNs). Results are similar to those achieved with a HOG-based classifier, and we provide a full comparison later in the post*.
2. Boat Heading Classification Datasets
Training and validation datasets consist of DigitalGlobe imagery cutouts of both boats and background regions, described in 1. High-resolution image cutouts are augmented by re-projecting to a lower resolution (see Section 1 of 3). In brief: for classifier training we utilize labeled cutouts from two DigitalGlobe images at native resolution (0.34m and 0.5m) and down-sampled imagery at 0.5m and 1.0m ground sample distance (GSD). For evaluation, we use DigitalGlobe imagery from a third validation image re-projected to [0.5, 1, 2, 3, 4, 5, 10] meter GSD, as well as Planet data assumed to be at native 3m GSD.
3. AlexNet Classifier
Using Caffe and Digits, we trained a 73-class (72 rotations, plus null) DNN classifier based on the AlexNet architecture. CosmiQ initialized weights and biases from the Caffe implementation of AlexNet that was trained using ImageNet data.
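To make the 73-class layout concrete, here is a minimal sketch of how a heading could be mapped to a class index. The 5-degree bin width follows directly from 72 rotation classes over 360 degrees; everything else (bins starting at 0 degrees, the null class sitting at index 72, and the function names) is an assumption made for illustration and may not match the labeling used in the actual model.

```python
NUM_HEADING_CLASSES = 72                      # rotation classes, from the post
NULL_CLASS = NUM_HEADING_CLASSES              # assumed: null class takes the last index (72)
BIN_WIDTH_DEG = 360.0 / NUM_HEADING_CLASSES   # 5 degrees per rotation class


def heading_to_class(heading_deg):
    """Map a heading in degrees to a class index 0..71; None maps to the null class."""
    if heading_deg is None:
        return NULL_CLASS
    # Wrap into [0, 360), floor into a bin, and guard against float wrap-around.
    return int((heading_deg % 360.0) // BIN_WIDTH_DEG) % NUM_HEADING_CLASSES


def class_to_heading(class_index):
    """Return the lower edge (degrees) of a rotation bin, or None for the null class."""
    if class_index == NULL_CLASS:
        return None
    return class_index * BIN_WIDTH_DEG


if __name__ == "__main__":
    for heading in (0.0, 7.5, 359.9, -5.0, None):
        print(heading, "->", heading_to_class(heading))
```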
Training CosmiQ’s DNN required six hours on four high-end consumer NVIDIA GPUs (Titan Xs). The lengthy training (and evaluation) time of our model makes it impractical to estimate confidence intervals with the bootstrap resampling employed in 2 and 3. Since we are counting discrete events in our test dataset, we assume a Poisson distribution and therefore estimate the fractional error as N^(-1/2), where N is the number of boats. Our validation chipset contains 516 DigitalGlobe cutouts and 278 Planet images, with the distribution of boat lengths shown below in Figure 1.
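As a rough illustration of the N^(-1/2) estimate, the sketch below computes the fractional error for a given count. Plugging in the full validation set sizes (516 and 278) is done purely for illustration: the error bars in the figures would be driven by the number of boats in each size bin, which is smaller, so the per-bin errors would be larger. The function name is hypothetical.

```python
import math


def poisson_fractional_error(n):
    """Fractional error N^(-1/2) for a count of n discrete events."""
    if n <= 0:
        raise ValueError("n must be a positive count")
    return 1.0 / math.sqrt(n)


if __name__ == "__main__":
    # Set sizes quoted in the post; used here as N only for illustration.
    for label, n in (("DigitalGlobe", 516), ("Planet", 278)):
        print(f"{label}: N = {n}, fractional error = {poisson_fractional_error(n):.3f}")
```

For these counts the fractional error works out to roughly 4.4% and 6.0% respectively.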
Using our re-projected dataset we can study the effects of resolution on classification accuracy. Figures 2 and 3 below detail the performance of the classifier as resolution degrades.
In Figure 3, we present the results for the more difficult classification problem of boat heading. To properly capture the impact of object size on the results, we break out performance of the classifier by vessel size bin. Scoring was calculated consistent with the methodology described in Section 1 of 3.
There are a few results worth noting in Figure 3. Not surprisingly, headings of larger vessels are better classified than those of smaller vessels as the GSD increases. For the DNN classifier the Planet data yield results somewhat worse than those from the corresponding blurred DigitalGlobe imagery, and worse than those from the HOG predictions. While we cannot be certain why this is the case, this may be an example of overtraining of the DNN (recall that no Planet data was used for training). The HOG-based classifier relies on relatively simple gradient-based features for classification that may translate well between different datasets, whereas the far greater number of parameters of the DNN may be overtraining on features specific to DigitalGlobe data.
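One way to build intuition for why larger vessels hold up better as GSD coarsens is to count how many pixels span the hull at each resolution. The sketch below does this simple division for the evaluation GSDs listed earlier in the post; the boat lengths are hypothetical round numbers chosen for illustration, not the size bins used in the figures.

```python
GSDS_M = [0.5, 1, 2, 3, 4, 5, 10]   # evaluation GSDs from the post, in meters


def pixels_along_hull(boat_length_m, gsd_m):
    """Approximate number of pixels spanning a boat of the given length."""
    return boat_length_m / gsd_m


if __name__ == "__main__":
    # Hypothetical boat lengths (meters), for illustration only.
    for length_m in (10, 30, 100):
        row = ", ".join(f"{pixels_along_hull(length_m, g):.1f}" for g in GSDS_M)
        print(f"{length_m:>3} m boat -> pixels at each GSD: {row}")
```

A 10 m boat shrinks to a single pixel at 10 m GSD, while a 100 m vessel still spans ten pixels, which is consistent with heading remaining recoverable for larger vessels at coarser resolutions.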
We combine the results of Figure 3 above with results from 3 (Figure 5) below. Recall that the HOG+LogReg classifier utilizes bootstrap resampling in estimating error bars, as opposed to the simple N^(-1/2) scaling used for the DNN.
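For readers unfamiliar with the distinction, the following is a minimal sketch of a percentile bootstrap for classifier accuracy, the general technique that the N^(-1/2) scaling stands in for. It is not the code used in the earlier posts; the function name, the number of resamples, and the 95% interval are all assumptions. It only needs per-sample correct/incorrect outcomes, so its cost is dominated by how expensive those outcomes are to produce.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    outcomes: list of 1 (correct) / 0 (incorrect) per validation sample.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the validation set with replacement and rescore it.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up outcomes (70 correct out of 100), for illustration only.
    outcomes = [1] * 70 + [0] * 30
    print(bootstrap_accuracy_ci(outcomes))
```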
4. Labeled Data Dependence
The effective use of machine learning algorithms for computer vision problems requires supervision with large amounts of labeled data. To gain insight into the impact of larger datasets, we ran an experiment relating the accuracy of a trained classifier to the amount of labeled data used to train it. It is important to note that the scope of this classification problem is extremely bounded, so the amount of training data required here may not be representative of what other classification problems require.
We achieve results qualitatively similar to the HOG+LogReg model (see Figure 10 of 3), though accuracy generally converges after about 400 samples in the training dataset. This number is higher than the ~100 training samples required for convergence in the HOG+LogReg classifier (Figure 10 of 3). Yet either model leads us to conclude that modest-sized training sets may be sufficient for certain classes of problems. On the other hand, these results also suggest that one cannot significantly improve accuracy at a given resolution for this type of problem simply by increasing the size of the training set beyond a certain threshold.
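One simple way to put a number on "converging" is to find the smallest training size after which accuracy stays within a tolerance of its final value. The sketch below is one such definition, not necessarily the criterion used to read the figures in this post. The curve it runs on is synthetic (a saturating exponential with made-up parameters) and is not data from our experiments; the function name and tolerance are likewise assumptions.

```python
import math


def convergence_size(sizes, accuracies, tolerance=0.02):
    """Smallest training size from which all accuracies stay within
    `tolerance` of the accuracy at the largest training size."""
    final = accuracies[-1]
    converged_at = sizes[-1]
    # Walk backward from the largest size until accuracy leaves the band.
    for size, acc in zip(reversed(sizes), reversed(accuracies)):
        if abs(acc - final) > tolerance:
            break
        converged_at = size
    return converged_at


if __name__ == "__main__":
    # Synthetic learning curve for illustration only (not results from the post).
    sizes = [25, 50, 100, 200, 400, 800, 1600]
    accuracies = [0.7 * (1.0 - math.exp(-n / 100.0)) for n in sizes]
    print("converges at about", convergence_size(sizes, accuracies), "samples")
```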
In general, a logistic regression classifier trained on HOG features yields results that are comparable to those of the DNN. Model accuracy is not the only consideration, however, and implementation speed is also of critical importance given the ever-increasing amount of data. In this section we investigate the computational requirements of various approaches.
The first step in image classification is model training. For a corpus of 44,000 images, training AlexNet on our four-GPU server takes approximately six hours. HOG feature descriptors coupled with logistic regression take less than one minute to train on a single CPU on the same image corpus. However, training a classifier is typically an infrequent task, with minimal fine-tuning required for retraining with additional data. The evaluation time of an image classifier (the time to run one image through the classifier) is a more important value than the training time, as this better reflects the true operational cost of utilizing machine learning algorithms; the computation costs of various scenarios are shown below in Figure 6. We also include results for evaluating with GoogLeNet, an alternate (and deeper) architecture to AlexNet. As can be seen, a fundamental driver in computational cost is the high number of images to be evaluated, though the use of preprocessing steps and region proposal techniques may provide significant computational savings when applying image classifiers to large areas.
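To see why the number of images dominates the cost, consider a back-of-the-envelope estimate for scanning a large area with overlapping windows. The sketch below assumes a plain sliding-window scheme, which is an illustrative choice and not necessarily the scenario behind Figure 6; the area, window size, stride, and per-image evaluation times are all hypothetical placeholders rather than measured values.

```python
def num_windows(area_width_m, area_height_m, gsd_m, window_px, stride_px):
    """Count sliding-window cutouts needed to cover a rectangular area."""
    width_px = round(area_width_m / gsd_m)
    height_px = round(area_height_m / gsd_m)
    if width_px < window_px or height_px < window_px:
        return 0
    n_x = (width_px - window_px) // stride_px + 1
    n_y = (height_px - window_px) // stride_px + 1
    return n_x * n_y


def total_eval_seconds(n_images, seconds_per_image):
    """Total evaluation time if images are classified one after another."""
    return n_images * seconds_per_image


if __name__ == "__main__":
    # Hypothetical scenario: 10 km x 10 km at 0.5 m GSD, 100 px windows, 50 px stride.
    n = num_windows(10_000, 10_000, 0.5, window_px=100, stride_px=50)
    # Placeholder per-image times in seconds (not measurements from the post).
    for label, t in (("fast classifier", 0.001), ("slow classifier", 0.01)):
        print(f"{label}: {n} windows, {total_eval_seconds(n, t) / 60:.1f} minutes")
```

Because the window count scales with the inverse square of the GSD, halving the GSD roughly quadruples the number of images to evaluate, which is why a region proposal step that discards most windows up front can matter more than the per-image speed of the classifier.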
In this post we built upon the results of 3 and explored the performance of deep learning classifiers applied to differing resolutions and various training dataset sizes.
At the highest resolution, heading accuracies ranged from 65–80% depending on boat length. It is possible that a different neural network architecture, or an ensemble approach that combines DNN and HOG+LogReg results, would improve accuracy rates. As we noted in 3, classifier performance is strongly dependent on vessel length and degrades as GSD increases; the shape of the classification curve should help inform satellite imagery resolution requirements for various problems.
In the last few years, there have been several technological breakthroughs that have demonstrated DNN capabilities beyond what was considered possible. For the vast majority of computer vision tasks, DNNs are rapidly becoming the tool of choice. Nevertheless, our comparison of DNN and HOG+LogReg results demonstrates that for some classes of problems classical machine learning techniques can still compete with neural networks both in terms of speed and accuracy.
*Footnote: This post is the work of the entire CosmiQ team (Medium handles: @avanetten, @david.lindenbaum, @hagerty, @lisa_porter, @rlewis2016, @toddstavish).