RarePlanes - Training our Baselines and Initial Results

Jake Shermeyer
Jan 27, 2020 · 9 min read
Example outputs automatically detecting aircraft role with Mask R-CNN. [1]

Preface: This blog is part 2 in our series titled RarePlanes, a new machine learning dataset and research series focused on the value of synthetic and real satellite data for the detection of aircraft and their features. To read more, check out our introduction here: link.


Our ultimate goal when building the RarePlanes dataset was to test if augmenting our training data with synthetic data could improve our ability to detect aircraft and their features. We were particularly interested in improving performance for rarely observed aircraft with limited observations in the training dataset. To best test the value of synthetic data, we first needed to build several baseline models using only our real data from Maxar’s WorldView-3 satellite.

This blog post will explore three different increasingly-complex experiments with computer vision algorithms:

  1. How well can we detect any aircraft?
  2. How well can we detect an aircraft’s role? (e.g. a civil transport or a military aircraft)
  3. Could we even detect an aircraft’s specific make?

This post will also provide some more specifics on the dataset and our computer vision models, examine the effects of object size and class prevalence, and test how well detectors generalize to new locations.

The Dataset

Before diving into the baselines and computer-vision world, it is helpful to provide some background on the dataset. In this section we cover the annotation process, observation locations, imagery specifics, as well as the distributions of aircraft types contained within RarePlanes.

The Annotation

Each aircraft is labeled in a diamond style, with annotators instructed to label the nose, left wing, tail, and right wing in order. This annotation style has the advantage of being simple, easily reproducible, and convertible to a bounding box, and it ensures that aircraft are consistently annotated, as other formats can often lead to imprecise labeling. Furthermore, this annotation style enables us to pull out two valuable features of aircraft: their length and wingspan. With a bit of geopandas and Solaris magic, we can measure from the first node to the third, and from the second node to the fourth, giving us a cross section of the plane with good estimates of both length and wingspan.
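The node-to-node measurement is straightforward to sketch. Below is a minimal, library-free version of the idea (our actual pipeline uses geopandas and Solaris; the function name and the example coordinates are ours, and it assumes the diamond is already in a projected, meter-based coordinate system):

```python
from math import hypot

def length_and_wingspan(diamond):
    """Estimate aircraft length and wingspan from a diamond annotation.

    `diamond` holds the four vertices in labeling order:
    nose, left wing, tail, right wing, each an (x, y) pair in meters.
    """
    nose, left_wing, tail, right_wing = diamond
    # Length: first node (nose) to third node (tail).
    length = hypot(tail[0] - nose[0], tail[1] - nose[1])
    # Wingspan: second node (left wing) to fourth node (right wing).
    wingspan = hypot(right_wing[0] - left_wing[0], right_wing[1] - left_wing[1])
    return length, wingspan

# Hypothetical diamond for a large airliner (coordinates in meters):
print(length_and_wingspan([(0, 35), (-30, 0), (0, -35), (30, 0)]))  # (70.0, 60.0)
```

The same two distances fall out of geopandas for free once the annotation is stored as a polygon in a projected CRS.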

Figure 1: An example of the annotation process. [1]

Furthermore, several attributes are labeled, including: wing shape, wing position, propulsion, number of engines, number of vertical stabilizers, presence of canards, and aircraft role. When combined with the length and wingspan, such attribution is particularly helpful when attempting to identify the make and role of an aircraft. A detailed taxonomy document describing all of these attributes and the labeling process will be included with the open-sourcing of the data (targeting late spring 2020).

Locations and Imagery

We performed a stratified random sample using OSM’s search API to find airfields of area ≥ 1 sq. km, using climate zone as our stratification layer. We then selected imagery, attempting to maximize diversity in the dataset across multiple seasons and weather conditions (clear skies, snow, haze). The dataset ultimately contains 621 WorldView-3 satellite images spanning 231 locations in 31 countries.

Figure 2: The locations of all civil airfields in the RarePlanes dataset.

Before training we split our data into a training (75%) and test set (25%). We carefully stratify our data into military and civilian airfields, then stratify once more by country. As many aircraft are country specific, we want to ensure our model has seen enough information in each country to be able to make reliable predictions. Many of our locations have a time-series component, meaning that there is more than one satellite image (from different dates) labeled at that specific location. This means we can train our model on a location and then test on a different date at that same location (a seen AOI). Conversely, we can test our model on areas it has not previously been trained on (an unseen AOI). We split our data to ensure that both cases occur fairly evenly, enabling a robust test of generalizability to new, previously unseen areas.

Distribution of Aircraft Makes and Roles

Understanding our dataset distribution is important to assess our performance metrics. In Figures 3 and 4 we showcase our training and testing set distributions for aircraft make and role.

Figure 3: The distribution of unique makes of military aircraft present within the RarePlanes dataset.

The RarePlanes dataset follows a long-tail distribution with many few-shot and even zero-shot learning examples of different makes of aircraft. This allows us to evaluate how performance varies based upon the number of training examples of each aircraft type.

Figure 4: The distribution of unique aircraft roles present within the RarePlanes dataset.

The role of aircraft also follows a similar long-tail distribution with seven prevalent classes and two rare classes.

Baseline Models, Experiments and Results

This section covers the models we elected to work with, how we evaluated performance, and describes the results of the experiments we ran with the dataset.

The Neural Networks

To baseline our performance metrics we first chose two industry-standard computer vision models: YOLOv3 for object detection, and Mask R-CNN for both object detection and instance segmentation. We modified some of the basic parameters for both of these models, mainly how they decide which regions could potentially contain objects of interest. This is called modifying the “anchors”; we do this so that both models can more easily detect smaller objects, which are much more common in overhead imagery than in traditional photography. Finally, we built custom data loaders to easily read in geospatial imagery and vector labels, convert them to a model-friendly format, and then begin training. We train and test on RGB pan-sharpened imagery. Specific details such as hyperparameters and optimizers will be described in an upcoming paper.
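To make the anchor modification concrete, here is a minimal sketch of how anchor templates are typically generated from scales and aspect ratios. The specific scale values below are illustrative, not the ones used in our experiments; the point is that adding small scales (e.g. 8 and 16 pixels) biases the region proposals toward the tiny objects common in overhead imagery:

```python
def make_anchors(scales, aspect_ratios):
    """Generate (width, height) anchor templates.

    Each anchor keeps an area of roughly scale**2 while its shape
    varies with the aspect ratio (width / height).
    """
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((round(w, 1), round(h, 1)))
    return anchors

# Smaller scales suited to aircraft in satellite imagery (assumed values):
print(make_anchors(scales=[8, 16, 32, 64], aspect_ratios=[0.5, 1.0, 2.0]))
```

Stock natural-image configurations often start their smallest anchor at 32 pixels or more, which leaves small aircraft with no well-matched proposal at all.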

Evaluation Metrics

In terms of metrics we analyze a few specific values of interest:

  • The overall mean average precision (mAP).
  • How mAP varies based upon object size.
  • How mAP varies based upon the number of training examples for each aircraft make.
  • How mAP varies in previously seen vs. previously unseen AOIs (defined above).

Note that we report bounding-box metrics only. A true-positive is defined with an IOU threshold ≥ 0.5. Instance segmentation scores were < 1% lower than bounding-box scores for Mask R-CNN.

Overall Results

We showcase a few results of detection (YOLOv3 and Mask R-CNN) and instance segmentation (Mask R-CNN only) in Figure 5.

Figure 5: Various detection results from Mask R-CNN and YOLOv3 for three separate tasks from nine different locations. Top Row: Detecting aircraft make. Middle Row: Detecting aircraft role. Bottom Row: Detecting all aircraft. [1]

Visually, performance is quite strong, particularly for simply detecting aircraft. However, our models still struggle with complex backgrounds, tree cover, and snowy scenes. The tasks of identifying aircraft role and make are much more challenging, but the models are reasonably successful, particularly for classes with many training examples, something we will dive into a bit later in this post.

Table 1: Overall performance with a threshold for positive detection with an IOU of 0.5 with 1 std. deviation bootstrap errors. We also examine the effects of object size on model performance. Size metrics are derived from MS-COCO calculation for mAP. Small: ≤ 32² pixels. Medium: >32² and ≤ 96² pixels. Large: >96² pixels.

Overall performance metrics show that as task difficulty increases, performance decreases. The overall mAP @ 0.5 IOU for the task of detecting aircraft is quite high at 94.5–96.5%. The recall score (97.3% for YOLOv3) is nearly equivalent to the human detection rate, which we observed to be approximately 98% during the labeling campaign. For the task of detecting role, performance declines to 65–70.7%. The YOLOv3 model manages to detect the military recon aircraft successfully, buoying its mAP score by ~11%. Mask R-CNN fails to detect the rarely observed roles, but performs better than YOLO on classes with many examples. Finally, for detection of aircraft make, Mask R-CNN (39.2%) outperforms YOLOv3 (31.7%) by ~8%.

Figure 6: (Small : Left) — Note the multiple missed small aircraft in this image. (Medium : Center) — The same type of aircraft being classified two different ways, and missed once. (Large : Right) — The make of all large aircraft being classified correctly. Overall performance is strongly correlated to object size.

As we’ve seen in some of our previous work, smaller objects are often much more difficult to detect than larger objects. Performance drops off for all tasks as objects become smaller. The most notable drop occurs for Mask R-CNN on the task of identifying aircraft make, falling from 64.3% for large objects down to 28.3% for small aircraft. We note that pre-processing the data using super-resolution techniques or gross oversampling (4x or greater) may help to improve performance, but will unfortunately slow training and inference time.
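The MS-COCO size buckets used in Table 1 can be computed directly from a bounding box’s pixel area. A small sketch of the bucketing rule (the function name and box format are ours):

```python
def coco_size_bucket(box):
    """Assign the MS-COCO size bucket for a bounding box.

    `box` is (x1, y1, x2, y2) in pixels.
    Small: area <= 32**2; Medium: 32**2 < area <= 96**2; Large: area > 96**2.
    """
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area <= 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

# At 30 cm resolution, a ~10 m general-aviation plane is "small":
print(coco_size_bucket((0, 0, 30, 30)))  # small
```

Note that at WorldView-3’s ~30 cm pan-sharpened resolution, a whole class of light aircraft falls into the “small” bucket, which is exactly where the metrics degrade most.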

How Number of Examples in Training Set Affects Performance

Based on the long-tail distribution of the dataset, we can test how performance varies based upon the number of objects in the training set. mAP rises precipitously for both detection architectures as the number of training examples of each make increases.

Figure 7: The number of instances of certain makes of aircraft in the training set (x-axis) vs. the change in mAP (y-axis) for YOLOv3 (Top) and Mask R-CNN (Bottom). Standard error bars are plotted in blue.

We find a more gradual rise for YOLOv3, and a more scattered, yet still evident, rise for Mask R-CNN. Mask R-CNN performs better than YOLOv3 on classes with fewer training examples. Of note, with ≥ 10 examples mAP rises much faster for YOLO, reaching a respectable ~67%, versus ~60% for Mask R-CNN. With ≥ 100 examples we observe an increase in performance to ~80% mAP for YOLO and ~70% for Mask R-CNN. Some of our previous work has also shown that YOLO performs better with moderate to low amounts of training data (37 examples) than Faster R-CNN-style models.

Testing Generalizability

As previously stated we test generalizability by scoring model performance on both seen and unseen AOIs (defined in the “Locations and Imagery” section).

Table 2: Performance differences for locations where another image has appeared in the training set (Seen AOIs) vs. locations the model has not been trained on (Unseen AOIs). We report scores only for classes that appear in both the seen and unseen AOIs. Again, 1 std. deviation bootstrap errors are reported.

In Table 2 we show that performance drops across the board when transitioning from seen to previously unseen AOIs. However the performance decline is much less significant when simply detecting aircraft vs. detecting the specific make or role of an aircraft. This suggests that it may be quite valuable to label another image for your areas of interest and include it with your training data, instead of attempting to train a large model that generalizes well to multiple locations.

In Figure 8 we demonstrate this, showing how the simpler network succeeds in an unseen location at finding >90% of the aircraft, whereas the more complex network lacks the confidence to even make predictions for 9 of the 12 aircraft. As classification loss is just one component of the overall detection loss, it may be overwhelmed by the other components (region-proposal object/no-object, region-proposal localization, bounding-box localization, and segmentation loss (Mask R-CNN only)). This suggests that decoupling detection and classification into two different tasks may be a valuable approach, and could potentially greatly improve the ability to detect aircraft make (something we will investigate in the future).
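The multi-task loss described above can be sketched as a weighted sum of its components. The component values and weighting scheme below are purely illustrative (not the models’ actual implementation), but they show how a classification term can be drowned out by the localization terms, and how up-weighting it is one possible remedy:

```python
def detection_loss(components, weights=None):
    """Total multi-task detection loss as a weighted sum of its parts.

    `components` maps loss names to their current values; equal weights
    are used by default. Up-weighting 'classification' is one way to
    keep it from being overwhelmed by the other terms.
    """
    weights = weights or {k: 1.0 for k in components}
    return sum(weights[k] * v for k, v in components.items())

# Illustrative per-component values for a Mask R-CNN-style model:
losses = {
    "rpn_objectness": 0.3,
    "rpn_localization": 0.4,
    "box_localization": 0.5,
    "classification": 0.2,       # a small fraction of the total
    "mask": 0.6,                 # Mask R-CNN only
}
print(detection_loss(losses))
print(detection_loss(losses, {**{k: 1.0 for k in losses}, "classification": 3.0}))
```

Decoupling goes one step further than reweighting: a dedicated classifier on detected chips optimizes the classification objective alone, with nothing to overwhelm it.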

Figure 8: Simple vs. sophisticated. The YOLOv3 model trained to find planes finds all but one aircraft (a casualty of Non-Max Suppression) in this scene, however its counterpart trained to find the make of aircraft lacks the confidence to predict on nine of the aircraft. [1]


This blog laid out all of the baseline experiments we ran on the real dataset. We observe that:

  1. Performance declines as tasks become more challenging.
  2. Even with prior optimization, detectors still struggle with small objects.
  3. Adding even 10 labels of your objects of interest to your training dataset could rapidly improve performance.
  4. Labeling historical images in your area of interest can greatly improve performance metrics.
  5. Simpler detectors generalize more easily to unseen areas.

In our next posts in this series we will begin to test the value of synthetic data, and whether it can help us improve our detection results for different makes of aircraft.


[1] All imagery courtesy of Radiant Solutions, a Maxar Company.

The DownLinQ

Welcome to the archived blog of CosmiQ Works, an IQT Lab


As of March 2021, CosmiQ Works has been folded into IQT Labs. An archive will remain here to showcase historical work from CosmiQ Works that took place July 2016 — March 2021.

Written by Jake Shermeyer

Data Scientist at Capella Space. Formerly CosmiQ Works.
