You Only Look Twice (Part II) — Vehicle and Infrastructure Detection in Satellite Imagery

Rapid detection of objects of vastly different scales over large areas is of great interest in the arena of satellite imagery analytics. In the previous post (6) we implemented a fully convolutional neural network classifier (You Only Look Twice: YOLT) to rapidly localize boats and airplanes in satellite imagery. In this post we detail efforts to extend the YOLT classifier to multiple scales, both at the vehicle level and at infrastructure scales.

1. Combined classifier

Recall that our YOLT training data consists of bounding box delineations of airplanes, boats, and airports.

Figure 1. YOLT Training data (duplicate of Figure 4 from 6). The top row displays labels for boats in harbor and open water in DigitalGlobe data. The middle row shows airplanes in DigitalGlobe data. The bottom row shows airports and airfields in Planet data.

Our previous post (6) demonstrated the ability to localize boats and airplanes via training a 3-class YOLT model. Expanding the model to four classes and including airports is relatively unsuccessful, however, as we show below.

Figure 2. Results of four-class model applied to SpaceNet data on three different scales (120m, 200m, 1500m). Airplanes are in red. The cyan boxes mark detections of airports; only the largest box in the top image is a true positive. The remainder of the cyan detections are false positives caused by confusion from small scale linear structures such as highways.

2. Scale Confusion Mitigation

There are multiple ways one could address the false positive issue noted in Figure 2. Recall from (6) that for this exploratory work our training set consists of only a few dozen airports and a couple hundred airplanes, far smaller than usual for deep learning models. Increasing the training set size could greatly improve our model, particularly if the background is highly varied. A second option would be to use post-processing to remove any detections at the incorrect scale (e.g., an airport with a size of 50 meters). A third option is to simply build dual classifiers, one for each relevant scale. We explore this final option below.
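As a rough illustration of the post-processing option (which we do not pursue further in this post), the sketch below drops any detection whose physical size is implausible for its class. The `Detection` type, the `SIZE_RANGE_M` limits, and the example ground sample distance are all hypothetical values chosen for illustration; they are not taken from the YOLT code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    x_min: float  # box corners in pixel coordinates
    y_min: float
    x_max: float
    y_max: float
    prob: float


# Hypothetical plausible ranges (meters) for the longest box side of each class.
SIZE_RANGE_M = {
    "airport": (500.0, 10000.0),
    "airplane": (5.0, 100.0),
    "boat": (3.0, 500.0),
}


def longest_side_m(det, gsd_m):
    """Longest side of the bounding box in meters, given the GSD in m/pixel."""
    return max(det.x_max - det.x_min, det.y_max - det.y_min) * gsd_m


def filter_by_scale(dets, gsd_m, size_range=SIZE_RANGE_M):
    """Keep only detections whose physical size is plausible for their class."""
    kept = []
    for det in dets:
        # Classes without a configured range are passed through unchanged.
        lo, hi = size_range.get(det.label, (0.0, float("inf")))
        if lo <= longest_side_m(det, gsd_m) <= hi:
            kept.append(det)
    return kept
```

With an assumed GSD of 0.5 m, an "airport" box 100 pixels wide spans only 50 m and would be discarded, which is the kind of false positive seen in Figure 2.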

3. Infrastructure Classifier

We train a classifier to recognize airports and airstrips using the training data described in (6): 37 Planet images at ~3m ground sample distance (GSD). These images are augmented by rotations and by rescaling in hue-saturation-value (HSV) space.
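For readers unfamiliar with this style of augmentation, here is a minimal sketch using only the Python standard library. It assumes images are row-major grids of (r, g, b) floats in [0, 1], restricts rotation to 90-degree turns, and uses normalized box coordinates; the function names and scale parameters are hypothetical, and the actual YOLT augmentation may differ.

```python
import colorsys


def rotate90(pixels):
    """Rotate a row-major grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def rotate_box_90(box):
    """Rotate a normalized (x_min, y_min, x_max, y_max) box 90 degrees
    clockwise so the label follows the rotated image."""
    x_min, y_min, x_max, y_max = box
    return (1.0 - y_max, x_min, 1.0 - y_min, x_max)


def scale_hsv(pixels, sat_scale=1.0, val_scale=1.0):
    """Rescale saturation and value of every pixel, clipping to [0, 1]."""
    out = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            s = min(1.0, s * sat_scale)
            v = min(1.0, v * val_scale)
            new_row.append(colorsys.hsv_to_rgb(h, s, v))
        out.append(new_row)
    return out
```

Each training image can then be expanded into several variants by combining rotations with a handful of saturation and value scale factors, provided the bounding box labels are transformed alongside the pixels.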

Figure 3. Successful YOLT detections of airports and airstrips (orange) in Planet images over both maritime backgrounds and complex urban backgrounds. Note that clouds are present in most images. The middle-right image demonstrates robustness to low contrast images. The bottom right image displays an airstrip on one of the reefs recently reclaimed by the Chinese in the South China Sea.

Figure 4. Challenges for the YOLT airport classifier. Left: the classifier correctly pulls out the airport in the bottom right, though it also registers two false positives in the upper left. Right: the airport is correctly identified, though the overlapping detections are redundant.

Over the entire corpus of airport test images, we achieve an F1 score of 0.87, and each image takes between 4 and 10 seconds to analyze, depending on its size.
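As a reminder of how the F1 score relates to detection counts, the short sketch below computes it from true positives, false positives, and false negatives. The function name is our own, and how a detection is matched to ground truth to count as a true positive is left outside this example.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct detections; also avoids division by zero
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of true objects that are found
    return 2 * precision * recall / (precision + recall)
```

A perfect detector (no false positives, no false negatives) scores 1.0, and the score falls as either kind of error accumulates.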

4. Dual Classifiers — Infrastructure + Vehicle

We are now in a position to combine the vehicle-scale classifier trained in (6) with the infrastructure classifier of Section 3 above. For large validation images, we run the classifier at three different scales: 120m, 200m, and 2500m. The first scale is designed for small boats, the second captures commercial ships and aircraft, and the largest is optimized for large infrastructure such as airports. We break the validation image into appropriately sized bins and run each image chip through the appropriate classifier. The myriad results from the many image chips and multiple classifiers are combined into one final image.
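The chipping step can be sketched as follows. This is an illustrative outline only: the window sizes of 120m, 200m, and 2500m come from the text above, while the function names, the fractional `overlap` between neighboring chips, and the example GSD are assumptions of ours.

```python
def chip_origins(length_px, chip_px, overlap=0.2):
    """Start offsets of sliding windows that cover [0, length_px)."""
    if length_px <= chip_px:
        return [0]  # the image fits inside a single chip along this axis
    stride = max(1, chip_px - int(chip_px * overlap))
    starts = list(range(0, length_px - chip_px, stride))
    starts.append(length_px - chip_px)  # final window sits flush with the edge
    return starts


def chip_windows(width_px, height_px, gsd_m, window_m, overlap=0.2):
    """Return (x, y, w, h) pixel windows for chips spanning window_m meters."""
    chip_px = max(1, round(window_m / gsd_m))
    w = min(chip_px, width_px)    # clip the chip to the image extent
    h = min(chip_px, height_px)
    return [
        (x, y, w, h)
        for y in chip_origins(height_px, chip_px, overlap)
        for x in chip_origins(width_px, chip_px, overlap)
    ]


# One chip list per scale; each list would be sent to the matching classifier.
SCALES_M = (120, 200, 2500)


def windows_per_scale(width_px, height_px, gsd_m):
    return {m: chip_windows(width_px, height_px, gsd_m, m) for m in SCALES_M}
```

Detections from each chip are reported in chip coordinates, so the window's (x, y) offset must be added back before results from different chips and scales are combined in the full image.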

Overlapping detections are merged via non-maximum suppression, and all detections above a certain threshold are plotted. The relative abundance of false positives and false negatives is a function of the probability threshold. A higher threshold means that only highly probable detections are plotted, yielding fewer detections and therefore fewer false positives and more false negatives. A lower threshold yields more detections and therefore more false positives and fewer false negatives. We find that a detection probability threshold between 0.3 and 0.4 yields the highest F1 score for our validation images. Figure 5 below shows all detections above a threshold of 0.3.
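For concreteness, a minimal greedy non-maximum suppression with a probability cut is sketched below. The 0.3 probability threshold matches the value quoted above; the intersection-over-union (IoU) threshold of 0.5 and the choice to suppress only within a class are assumptions of this sketch, not necessarily what the YOLT pipeline does.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(dets, prob_thresh=0.3, iou_thresh=0.5):
    """Greedy NMS over (box, prob, label) tuples.

    Detections below prob_thresh are dropped first. The rest are visited in
    order of decreasing probability, and a detection is kept only if it does
    not overlap an already kept detection of the same label too strongly.
    """
    cands = sorted(
        (d for d in dets if d[1] >= prob_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, prob, label in cands:
        if all(label != k[2] or iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, prob, label))
    return kept
```

Raising `prob_thresh` in this sketch shrinks the output list, which is the trade-off described above: fewer false positives at the cost of more false negatives.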

Figure 5. YOLT classifier applied to a SpaceNet DigitalGlobe image containing airplanes, boats, and runways. Airplanes are in blue, while boats are in red and the airport detection in orange. Our plotting threshold of 0.3 yields few false negatives, though a number of false positives. In this image we note the following F1 scores: airplanes = 0.83, boats = 0.84, airports = 1.0.

5. Conclusions

In this post we applied the You Only Look Twice (YOLT) pipeline to localizing both vehicles and large infrastructure projects, such as airports. We noted poor results from a combined classifier, due to confusion between small and large features, such as highways and runways. We were, however, able to successfully train a YOLT classifier to localize airports.

Running the boat + airplane (vehicle) and infrastructure (airport) classifiers in parallel at the appropriate scale yields much better results. We achieve an F1 score of greater than 0.8 for all categories. Our detection pipeline optimizes for accurate localization of clustered objects (not for speed), and even so it processes vehicles at a rate of 20–50 km² per minute, and airports at 900–1500 km² per minute. Results so far are encouraging, and it will be interesting to explore in future work how well the YOLT pipeline performs as the number of object categories is increased.

May 29, 2018 Addendum: See this post for paper and code details.