You Only Look Twice — Multi-Scale Object Detection in Satellite Imagery With Convolutional Neural Networks (Part I)

Adam Van Etten
The DownLinQ
Nov 7, 2016


Detection of small objects over large swaths of imagery is one of the primary drivers of interest in satellite imagery analytics. Previous posts (4, 5) detailed efforts to localize boats in DigitalGlobe images using sliding windows and HOG feature descriptors. These efforts proved successful in both open water and harbor regions, though such techniques struggle in regions of highly non-uniform background. To address the shortcomings of classical object detection techniques we implement an object detection pipeline based upon the You Only Look Once framework. This pipeline (which we dub You Only Look Twice) greatly improves background discrimination over the HOG-based approach, and proves able to rapidly detect objects at vastly different scales and across multiple sensors.

1. Satellite Imagery Object Detection Overview

The ImageNet competition has helped spur rapid advancements in the field of computer vision object detection, yet there are a few key differences between the ImageNet data corpus and satellite imagery. Four issues create difficulties: in satellite imagery objects are often very small (~20 pixels in size), they are rotated about the unit circle, input images are enormous (often hundreds of megapixels), and there’s a relative dearth of training data (though efforts such as SpaceNet are attempting to ameliorate this issue). On the positive side, the physical and pixel scales of objects are usually known in advance, and there’s little variation in observation angle. One final issue of note is deception; sensors observing from hundreds of kilometers away can sometimes be easily fooled. In fact, the front page of The New York Times on October 13, 2016 featured a story about Russian weapon mock-ups (Figure 1).

Figure 1. Screenshot of The New York Times on October 13, 2016 showing inflatable Russian weapons mock-ups designed to fool remote sensing apparatus.

2. HOG Boat Detection Challenges

The HOG + Sliding Window object detection approach discussed in previous posts (4, 5) demonstrated impressive results in both open water and harbor regions (F1 ~ 0.9). Recall from Section 2 of 5 that we evaluate true and false positives and negatives by defining a true positive as a proposal with a Jaccard index (also known as intersection over union) of greater than 0.25 with a ground truth box. Also recall that the F1 score is the harmonic mean of precision and recall and varies from 0 (all predictions are wrong) to 1 (perfect prediction).
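
For concreteness, here is a minimal Python sketch of these two metrics; the box format and function names are illustrative rather than the actual evaluation code used in these posts.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / float(area_a + area_b - intersection)


def f1_score(n_true_pos, n_false_pos, n_false_neg):
    """Harmonic mean of precision and recall: 0 = all predictions wrong, 1 = perfect."""
    precision = n_true_pos / float(n_true_pos + n_false_pos)
    recall = n_true_pos / float(n_true_pos + n_false_neg)
    return 2 * precision * recall / (precision + recall)


# A proposal counts as a true positive if its Jaccard index with a
# ground truth box exceeds the 0.25 threshold.
```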

To explore the limits of the HOG + Sliding Window pipeline, we apply it to a scene with a less uniform background, taken from a different sensor. Recall that our classifier was trained on DigitalGlobe data with 0.5 meter ground sample distance (GSD), whereas our test image below is a Planet image at 3m GSD.

Figure 2. HOG + Sliding Window results applied to a different sensor (Planet) than the training data corpus (DigitalGlobe). This December 2015 image shows Mischief Reef, one of the artificial islands recently created by the Chinese, in the South China Sea. Enumerating and locating the vessels in this image is complicated by many false positives (red) derived from linear features on land, and the F1 score is quite poor. The bounding box colors here are the same as in 5, namely: false negatives are in yellow, false positives in red, hand-labeled ground truth is in blue, and true positives (which will overlap blue ground truth boxes) are in green. Running the HOG + Sliding Window detection pipeline on this image takes 125 seconds on a single CPU.

3. Object Detection With Deep Learning

We adapt the You Only Look Once (YOLO) framework to perform object detection on satellite imagery. This framework uses a single convolutional neural network (CNN) to predict classes and bounding boxes. The network sees the entire image at train and test time, which greatly improves background differentiation since the network encodes contextual information for each object. It utilizes a GoogLeNet-inspired architecture, and runs at real-time speed for small input test images. The high speed of this approach, combined with its ability to capture background information, makes a compelling case for its use with satellite imagery.

The attentive reader may wonder why we don’t simply adapt the HOG + Sliding Window approach detailed in previous posts to instead use a deep learning classifier rather than HOG features. A CNN classifier combined with a sliding window can yield impressive results, yet quickly becomes computationally intractable. Evaluating a GoogLeNet-based classifier is roughly 50 times slower on our hardware than a HOG-based classifier; evaluation of Figure 2 changes from ~2 minutes for the HOG-based classifier to ~100 minutes. Evaluation of a single DigitalGlobe image of ~60 square kilometers could therefore take multiple days on a single GPU without any preprocessing (and pre-filtering may not be effective in complex scenes). Another drawback to sliding window cutouts is that they only see a tiny fraction of the image, thereby discarding useful background information. The YOLO framework addresses the background differentiation issues, and scales far better to large datasets than a CNN + Sliding Window approach.

Figure 3. Illustration of the default YOLO framework. The input image is split into a 7x7 grid and the convolutional neural network classifier outputs a matrix of bounding box confidences and class probabilities for each grid square. These outputs are filtered and overlapping detections suppressed to form the final detections on the right.
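
As a rough illustration of the decoding step sketched in Figure 3, the Python snippet below thresholds the per-cell box confidences and suppresses overlapping detections. The tensor layout, grid size, and thresholds are assumptions for illustration (this is not the actual YOLO implementation), and jaccard_index is the helper sketched above.

```python
import numpy as np


def decode_yolo_grid(confidences, boxes, class_probs, conf_thresh=0.2):
    """confidences: (S, S, B); boxes: (S, S, B, 4) as (xmin, ymin, xmax, ymax)
    in image coordinates; class_probs: (S, S, n_classes).
    Returns candidate detections as (score, class_id, box) tuples."""
    detections = []
    S, _, B = confidences.shape
    for row in range(S):
        for col in range(S):
            cls = int(np.argmax(class_probs[row, col]))
            for b in range(B):
                score = confidences[row, col, b] * class_probs[row, col, cls]
                if score > conf_thresh:
                    detections.append((score, cls, boxes[row, col, b]))
    return detections


def non_max_suppression(detections, iou_thresh=0.5):
    """Greedily keep the highest-scoring boxes and drop overlapping duplicates."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for det in detections:
        if all(jaccard_index(det[2], k[2]) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```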

The framework does have a few limitations, however, encapsulated by three quotes from the paper:

  1. “Our model struggles with small objects that appear in groups, such as flocks of birds”
  2. “It struggles to generalize objects in new or unusual aspect ratios or configurations”
  3. “Our model uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the original image”

To address these issues we implement the following modifications, yielding a pipeline we name YOLT: You Only Look Twice (the reason for the name will become apparent later):

“Our model struggles with small objects that appear in groups, such as flocks of birds”

  • Upsample via a sliding window to look for small, densely packed objects
  • Run an ensemble of detectors at multiple scales

“It struggles to generalize objects in new or unusual aspect ratios or configurations”

  • Augment training data with re-scalings and rotations

“Our model uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the original image”

  • Define a new network architecture such that the final convolutional layer has a denser final grid
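
To make the last point concrete, the density of the output grid is set by the network's cumulative downsampling stride. A back-of-the-envelope sketch with illustrative numbers (the default YOLO configuration pairs a 448-pixel input with a 7x7 grid; the actual YOLT architecture is not reproduced here):

```python
def output_grid_size(input_px, total_stride):
    """Side length of the final detection grid for a square input image."""
    return input_px // total_stride


# Default YOLO-style configuration: 448-pixel input downsampled 64x -> 7x7 grid.
print(output_grid_size(448, 64))  # 7
# Halving the cumulative downsampling yields a 4x denser grid over the same
# input, which is better suited to small, closely packed objects.
print(output_grid_size(448, 32))  # 14
```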

The output of the YOLT framework is post-processed to combine the ensemble of results for the various image chips on our very large test images. These modifications reduce speed from 44 frames per second to 18 frames per second. Our maximum image input size is ~500 pixels on an NVIDIA GTX Titan X GPU; the high number of parameters for the dense grid we implement saturates the 12GB of memory available on our hardware for images larger than this size. It should be noted that the maximum image size could be increased by a factor of 2–4 if searching for closely packed objects is not required.
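
A minimal sketch of that post-processing step, assuming each chip's detections arrive in chip-local pixel coordinates along with the chip's offset in the full image; the data structures are illustrative, and non_max_suppression is the helper sketched above.

```python
def merge_chip_detections(chip_results, iou_thresh=0.5):
    """chip_results: list of (x_offset, y_offset, detections) tuples, where each
    detection is (score, class_id, (xmin, ymin, xmax, ymax)) in chip-local pixels.
    Returns detections in full-image coordinates with duplicates suppressed."""
    global_dets = []
    for x_off, y_off, detections in chip_results:
        for score, cls, (xmin, ymin, xmax, ymax) in detections:
            global_box = (xmin + x_off, ymin + y_off, xmax + x_off, ymax + y_off)
            global_dets.append((score, cls, global_box))
    # Objects near chip boundaries, or seen at both window scales, appear more
    # than once; a final non-maximum suppression removes the duplicates.
    return non_max_suppression(global_dets, iou_thresh)
```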

4. YOLT Training Data

Training data is collected from small chips of large images from both DigitalGlobe and Planet. Labels consist of a bounding box and a category identifier for each object.

We initially focus on four categories:

  • Boats in open water
  • Boats in harbor
  • Airplanes
  • Airports

Figure 4. YOLT Training data. The top row displays labels for boats in harbor (green) and open water (blue) for DigitalGlobe data. The middle row shows airplanes (red) in DigitalGlobe data. The bottom row shows airports and airfields (orange) in Planet data.

We label 157 images containing boats, averaging 3–6 boats per image. We label 64 image chips with airplanes, averaging 2–4 airplanes per chip, and collect 37 airport chips, each containing a single airport. We also rotate the images and randomly scale them in HSV (hue-saturation-value) to increase the robustness of the classifier to varying sensors, atmospheric conditions, and lighting conditions.
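
A minimal sketch of this style of augmentation using OpenCV; the rotation and HSV jitter ranges are illustrative choices rather than the exact values used to build the training corpus.

```python
import cv2
import numpy as np


def augment_chip(image_bgr, max_hue_shift=10, sat_range=(0.8, 1.2), val_range=(0.8, 1.2)):
    """Randomly rotate an image chip and jitter it in HSV space.
    Returns the augmented chip and the rotation angle so that bounding box
    labels can be transformed to match (label handling omitted here)."""
    # Random rotation about the chip center: overhead objects can appear at any heading.
    h, w = image_bgr.shape[:2]
    angle = np.random.uniform(0, 360)
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    rotated = cv2.warpAffine(image_bgr, rot, (w, h))

    # Random rescaling in hue, saturation, and value (OpenCV 8-bit hue spans 0-179).
    hsv = cv2.cvtColor(rotated, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-max_hue_shift, max_hue_shift)) % 180
    hsv[..., 1] = np.clip(hsv[..., 1] * np.random.uniform(*sat_range), 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * np.random.uniform(*val_range), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR), angle
```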

Figure 5. Training images rotated and rescaled in hue and saturation.

With this input corpus, training takes 2–3 days on a single NVIDIA Titan X GPU. Our initial YOLT classifier is trained only for boats and airplanes; we will treat airports in Part II of this post. For the YOLT implementation we run a sliding window across our large test images at two different scales: a 120 meter window optimized to find small boats and aircraft, and a 225 meter window, which is a more appropriate size for larger vessels and commercial airliners.

This implementation is designed to maximize accuracy, rather than speed. We could greatly increase speed by running only at a single sliding window size, or by downsampling the image so that each sliding window covers a larger area. Since we are looking for very small objects, however, this would adversely affect our ability to differentiate small objects of interest (such as a 15m boat) from background objects (such as a 15m building). Also recall that raw DigitalGlobe images are roughly 250 megapixels, and inputting a raw image of this size into any deep learning framework far exceeds current hardware capabilities. Therefore either drastic downsampling or image chipping is necessary, and we adopt the latter.
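
A minimal sketch of the chipping step, assuming square windows with a modest overlap so that objects straddling chip boundaries are seen whole in at least one chip; the overlap fraction and helper names are illustrative.

```python
def window_size_px(window_m, gsd_m):
    """Convert a window size in meters to pixels at a given ground sample distance."""
    return int(round(window_m / gsd_m))


def chip_image(image, window_px, overlap=0.2):
    """Yield (x_offset, y_offset, chip) tuples tiling a large image with
    overlapping square windows, including a final window flush with each edge."""
    step = max(1, int(window_px * (1 - overlap)))
    height, width = image.shape[:2]

    def offsets(length):
        last = max(length - window_px, 0)
        offs = list(range(0, last + 1, step))
        if offs[-1] != last:
            offs.append(last)  # cover the trailing edge of the image
        return offs

    for y in offsets(height):
        for x in offsets(width):
            yield x, y, image[y:y + window_px, x:x + window_px]


# At DigitalGlobe's 0.5m GSD the two YOLT scales correspond to
# window_size_px(120, 0.5) = 240 and window_size_px(225, 0.5) = 450 pixel windows;
# at Planet's 3m GSD the same windows are 40 and 75 pixels on a side.
```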

5. YOLT Object Detection Results

We evaluate test images using the same criteria as Section 2 of 5, also detailed in Section 2 above. For maritime region evaluation we use the same areas of interest as in (4, 5). Running on a single NVIDIA Titan X GPU, the YOLT detection pipeline takes 4–15 seconds for the images below, compared to 15–60 seconds for the HOG + Sliding Window approach running on a single laptop CPU. Figures 6–10 below are as close to an apples-to-apples comparison between the HOG + Sliding Window and YOLT pipelines as possible, though recall that the HOG + Sliding Window pipeline is trained to classify the existence and heading of boats, whereas YOLT is trained to produce boat and airplane localizations (not heading angles). All plots use a Jaccard index detection threshold of 0.25 to mimic the results of 5.

Figure 6. YOLT performance on AOI1. The majority of the false positives (red) are due to incorrectly sized bounding boxes for small boats (thereby yielding a Jaccard index below the threshold), even though the location is correct. The HOG + sliding window approach returns many more false positives, and yields a lower F1 score of 0.72 (see Figure 5 of 5). Unsurprisingly (and encouragingly), no airplanes are detected in this scene.
Figure 7. YOLT Performance on AOI2. As above, the incorrect detections are primarily due to incorrectly sized boxes for boats under 10m in length. Relaxing the Jaccard index threshold from 0.25 to 0.15 reduces the penalty on the smallest objects, and with this threshold the YOLT pipeline returns an F1 score of 0.93, comparable to the score of 0.96 achieved by the HOG + Sliding Window approach (see Figure 6 of 5).
Figure 8. YOLT Performance on AOI3. The large false positive (red) in the right-center of the plot is an example of a labeling omission (error) which degrades our F1 score. Recall that for the HOG + Sliding Window approach the F1 score was 0.61 (see Figure 7 of 5).
Figure 9. YOLT Performance on AOI4. The F1 score of 0.67 is not great, though it is actually better than the F1 of 0.57 returned by the naive implementation of HOG + Sliding Windows (see the inset of Figure 8 of 5). Incorporating rotated rectangular bounding boxes improved the score of Figure 8 of 5 from 0.57 to 0.86. Including heading information in the YOLT pipeline would require significant effort, though it may be a worthwhile undertaking given the promise of this technique in crowded regions. Nevertheless, despite the modifications made to YOLO there may be a performance ceiling for densely clustered objects; a high-overlap sliding window approach can center objects at almost any location, so sliding windows combined with HOG (or other) features have inherent advantages in such locales.

The YOLT pipeline performs well in open water, though without further post-processing it is suboptimal for extremely dense regions, as Figure 9 demonstrates. The four areas of interest discussed above all possess relatively uniform backgrounds, an arena where the HOG + Sliding Window approach performs well. As we showed in Figure 2, however, in areas of highly non-uniform background the HOG + Sliding Window approach struggles to differentiate boats from linear background features; convolutional neural networks offer promise in such scenes.

Figure 10. YOLT results for Mischief Reef using the same Planet test image as in Figure 2. Recall that only DigitalGlobe data is used for boat and airplane training. The classifier misses the docked boats, which is unsurprising since none of the training images contained boats docked adjacent to shore. Overall, the YOLT pipeline is far superior to the HOG + Sliding Window approach for this image, with ~20x fewer false positives and a nearly 3x increase in F1 score. This image demonstrates one of the strengths of a deep learning approach, namely the transferability of deep learning models to new domains. Running the YOLT pipeline on this image on a single GPU takes 19 seconds.

To test the robustness of the YOLT pipeline we analyze another Planet image with a multitude of boats (see Figure 11 below).

Figure 11. YOLT pipeline applied to a Planet image at the southern entrance to the Suez Canal. As in previous images, accuracy for boats in open water is very high. The only false negatives are either very small boats, or boats docked at piers (a situation poorly covered by training data). The five false positives are all actually located correctly, though the bounding boxes are incorrectly sized and therefore do not meet the Jaccard index threshold; further post-processing could likely remedy this situation.

A final test is to see how well the classifier performs on airplanes, as we show below.

Figure 12. YOLT Detection pipeline applied to a DigitalGlobe image taken over Heathrow Airport. This is a complex scene, with industrial, residential, and aquatic background regions. The number of false positives is approximately equal to the number of false negatives, such that the total number of reported detections (103) is close to the true number of ground truth objects (104), though obviously not all positions are correct.

6. Conclusion

In this post we demonstrated one of the limitations of classical machine learning techniques as applied to satellite imagery object detection: namely, poor performance in regions of highly non-uniform background. To address these limitations we implemented a fully convolutional neural network classifier (YOLT) to rapidly localize boats and airplanes in satellite imagery. The non-rotated bounding box output of this classifier is suboptimal in very crowded regions, but in sparse scenes the classifier proves far better than the HOG + Sliding Window approach at suppressing background detections and yields an F1 score of 0.7–0.85 on a variety of validation images. We also demonstrated the ability to train on one sensor (DigitalGlobe), and apply our model to a different sensor (Planet). While the F1 scores may not be at the level many readers are accustomed to from ImageNet competitions, we remind the reader that object detection in satellite imagery is a relatively nascent field and has unique challenges, as outlined in Section 1. We have also striven to show both the success and failure modes of our approach. The F1 scores could possibly be improved with a far larger training dataset and further post-processing of detections. Our detection pipeline accuracy might also improve with a greater number of image chips, though this would also reduce the current processing speed of 20–50 square kilometers per minute for objects of size 10m–100m.

In Part II of this post, we will explore the challenges of simultaneously detecting objects at vastly different scales, such as boats, airplanes, and airstrips.

May 29, 2018 Addendum: See this post for paper and code details.
