Super-Resolution and Object Detection: A Love Story- Part 3

Jake Shermeyer & Adam Van Etten

Example output of the YOLT model at native 30 cm resolution. Cars are in green, buses/trucks in blue, and airplanes in orange.

In our previous two posts [1,2], we unveiled an introductory outline and showcased quantitative and qualitative results of super-resolution techniques on satellite imagery. In this post, we explore the final results of the interactions between resolution, super-resolution, and object detection performance. We evaluate performance across various transportation classes including small vehicles, buses/trucks, boats, small aircraft, and large aircraft for resolutions ranging from 30 cm to 4.8 m, and evaluate a unique enhanced product: 15 cm GSD super-resolved imagery.

Additionally, in this blog we are pleased to announce the release of our arXiv paper.

Precision-recall curves for YOLT (left) and SSD (right) using an IOU threshold of 0.25.

For each model we compute mean average precision (mAP) on a 338-image test set at each resolution. We also train and test a model on native-resolution imagery at double the sampling rate (with bicubic upsampling), giving a window size of 82 meters (versus 164 meters for the 30–480 cm resolutions). We perform this test to disentangle the effects of resolution and window size for the 15 cm super-resolved data: comparing the 15 cm super-resolved predictions to the 2x oversampled imagery indicates whether any difference in performance is due to the smaller window size or to the super-resolution technique at 15 cm. Example precision-recall curves are shown above. The YOLT model is clearly superior to SSD, particularly for small objects.
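To make the scoring procedure concrete, here is a minimal sketch (our own illustration, not the paper's evaluation code) of matching detections to ground truth at an IOU threshold of 0.25 and computing average precision from the resulting precision-recall curve:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truth, iou_thresh=0.25):
    """detections: list of (confidence, box); ground_truth: list of boxes."""
    detections = sorted(detections, key=lambda d: -d[0])  # highest confidence first
    matched = set()
    tp, fp, precisions, recalls = 0, 0, [], []
    for conf, box in detections:
        # Greedily match each detection to the best unmatched ground-truth box
        best, best_i = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            o = iou(box, gt)
            if o > best:
                best, best_i = o, i
        if best >= iou_thresh:
            matched.add(best_i)
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(ground_truth))
    # AP as the area under the precision-recall curve (step integration)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP is then simply the mean of this per-class AP over all object classes.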

We next calculate changes in mAP for all models across all resolutions, which allows us to determine the degradation of performance as a function of resolution (see figures below). Sensor resolution is reported on the lower x-axis of each figure; for the super-resolution models this is the resolution of the input data, which is subsequently enhanced 2x or 4x. The performance of the “native” imagery (solid blue line) demonstrates how performance degrades with decreasing sensor resolution. We also plot 1-sigma bootstrap error bars for each model group. RFSR is slightly more robust than both native imagery and VDSR at lower resolutions, though VDSR provides a significant boost at the highest resolution. We observe worse performance when training on native imagery and testing on SR data (purple and red curves), or training on SR data and testing on native imagery (not shown). We note similar results between the 2x and 4x enhancements.
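A hedged sketch of how 1-sigma bootstrap error bars can be computed, analogous to (though not necessarily identical with) our approach; `scores` is a hypothetical list of per-image AP values for one model/resolution pair:

```python
import random

def bootstrap_sigma(scores, n_boot=1000, seed=0):
    """Bootstrap mean and 1-sigma spread of the mean of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample the per-image scores with replacement
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    mu = sum(means) / n_boot
    var = sum((m - mu) ** 2 for m in means) / n_boot
    return mu, var ** 0.5  # center and 1-sigma error bar
```

The 1-sigma value is what gets drawn as the error bar for each model group, and the separation between two models in units of this sigma is the "statistical difference" quoted in the tables below.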

YOLT object detection performance (mean average precision) based on resolution for all classes. Sensor resolution indicates the native resolution of imagery or input resolution into our Super-Resolution pipeline. Error bars are calculated using a bootstrapping approach.
SSD object detection performance (mean average precision) based on resolution for all classes. Sensor resolution indicates the native resolution of imagery or input resolution into our Super-Resolution pipeline. Error bars are calculated using a bootstrapping approach.

A deliberate byproduct of this study is the establishment of an object detection performance curve as a function of sensor resolution. The solid blue lines shown above indicate that, across broad object classes, object detection performance decreases by 25–35% as resolution degrades from 30 cm to 120 cm, and by another 70–100% from 120 cm to 480 cm.

Performance for all classes. For RFSR and VDSR at each resolution we note the error and the statistical difference from the baseline model (a measure of how significant the finding is, e.g. +0.5 σ). For the 30 cm (2x sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.

While super-resolution is not a direct replacement for native imagery, our results indicate that SR techniques do provide an improvement at most resolutions. For YOLT, the greatest benefit is achieved at the highest resolutions: super-resolving native 30 cm imagery to 15 cm with VDSR yields a 2.7 sigma (+20%) improvement. VDSR 2x yields little improvement at lower resolutions, averaging a 3% gain for 60–480 cm. VDSR 4x combined with the YOLT model yields a +7% improvement averaged over all resolutions. For the YOLT model, RFSR 2x yields a +16% improvement at 30 cm and a +10% improvement on average at all lower resolutions; RFSR 4x combined with YOLT yields a +9% improvement averaged over all resolutions.

For SSD the VDSR model performs at least 2.3 sigma better than native imagery at all but 60 cm resolution. The improvement at 480 cm is statistically quite significant, though this is primarily because native imagery scores an mAP of 0.0 there. Performance increases significantly once objects are greater than ~20 pixels in extent, a trend that holds across object classes, as shown in the per-class performance curves. Also apparent is that a smaller field of view is preferred for detecting densely packed objects such as trucks, as evidenced by the sharp uptick at 15 cm and at the 2x-sampled native 30 cm.
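The ~20 pixel threshold is easy to connect back to ground sample distance with a back-of-envelope calculation (our own illustration; the 6 m truck length is an assumed, typical value):

```python
def extent_pixels(object_size_m, gsd_m):
    """Number of pixels an object spans at a given ground sample distance (GSD)."""
    return object_size_m / gsd_m

# A ~6 m truck spans about 20 pixels at 30 cm GSD, but only 5 pixels at
# 1.2 m GSD, which helps explain why detection collapses at coarse resolutions.
truck_at_30cm = extent_pixels(6.0, 0.3)   # ~20 pixels
truck_at_120cm = extent_pixels(6.0, 1.2)  # ~5 pixels
```

Super-resolving 30 cm imagery to 15 cm pushes the same truck to roughly 40 pixels, comfortably above the threshold where performance improves.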

In our final blog post on this topic, we will discuss our results further, look into class specific performance, and discuss the conclusions that can ultimately be drawn from this study. Be sure to check out our previous posts on this topic [1,2] and our codebases for RFSR, VDSR, and the SIMRDWN object detection framework.