Aircraft detection with YOLOv3 in Tremblay-en-France, France

RarePlanes — Exploring the Value of Synthetic Data: Part 2

Jake Shermeyer
Published in The DownLinQ
10 min read · Apr 22, 2020


Preface: This blog is part 4 in our series titled RarePlanes, a new machine learning dataset and research series focused on the value of synthetic and real satellite data for the detection of aircraft and their features. To read more, check out our introduction here: link, our blog exploring baseline performance: link, or part 1 of our experiments leveraging synthetic data: link.


This blog is the second and final post of a two-part conclusion to the experiments in RarePlanes. In these posts we explore the value of synthetic data from AI.Reverie and test how it could improve our ability to detect aircraft and their features. In particular, we were interested in improving performance for rarely observed aircraft with limited observations in the training dataset. Ultimately, the experiments in these two posts test whether synthetic data can improve detection of an aircraft’s specific make and model (e.g., the Cessna A-37 or Douglas C-47 Skytrain).

Our task: Detecting the make and model of an aircraft. In this figure we successfully detect Dornier Do 228s with YOLOv3.

Recall that in part 1 we conducted experiments and reported results with the RarePlanes dataset to gain insights into the following questions:

  1. How do learning rates affect performance when your synthetic and real datasets have limited overlap in classes? — Best to maintain initial learning rate throughout training.
  2. How long should we train on synthetic data versus our real data? — Better to maximize training time on real data.
  3. Should we isolate synthetic and real data during training or train on a blend of synthetic and real concurrently? — Best to isolate real and synthetic data into separate portions of the training regimen.

The previous post and our post on baselines discuss our metrics and models; please revisit those posts if you have questions.

In part 2 we introduce a new synthetic dataset and discuss:

  1. After training a model using only real data, can we use a smaller synthetic dataset at the end of our training to attempt to boost our performance for detection of specific aircraft makes?
  2. Does synthetic data improve the ability to detect rarely observed aircraft?

Throw Out the Extras: A Targeted Dataset

A combination of AI.Reverie synthetic data (center) overlaid upon real Maxar WorldView-3 satellite imagery (background) at the Hartsfield-Jackson Atlanta Airport. AI.Reverie software enables the generation of synthetic data from variable look angles and spatial resolutions.

For this portion of the study, we generate a second, smaller “targeted” dataset that features 200–300 examples per make (~10,000 total aircraft) of the 41 makes contained within our real dataset. We again generate data using all 20 of the airfields present within our simulator. This time we use only clear-sky weather conditions between 8 and 11 AM and again set the spatial resolution to 30 cm.
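
The generation parameters above can be summarized in a small configuration sketch. This is purely illustrative (a plain dictionary, not the AI.Reverie API, whose actual interface is not shown in this post):

```python
# Illustrative configuration for the targeted synthetic dataset described
# above. Keys and structure are assumptions for readability only.
targeted_dataset_config = {
    "airfields": 20,                  # all airfields present in the simulator
    "makes": 41,                      # every make in the real dataset
    "examples_per_make": (200, 300),  # ~10,000 total aircraft
    "weather": "clear",               # clear-sky conditions only
    "local_time_range": ("08:00", "11:00"),
    "gsd_cm": 30,                     # spatial resolution of 30 cm
}

print(targeted_dataset_config["examples_per_make"])  # (200, 300)
```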

Figure 3: The frequency of unique makes appearing in the real RarePlanes training dataset and the targeted synthetic training dataset.

Working with the Targeted Synthetic Dataset

In this section we walk through our experiments with the targeted synthetic dataset. Our training schema here differs markedly from the traditional approach: we apply synthetic data only at the end of the training process to attempt to boost scores in our classes of interest. To do this we use a double transfer learning method and train our models as follows:

  1. Train on the entire real dataset maximizing mAP performance across all classes.
  2. Train on our targeted synthetic dataset.
  3. Train on a subset of the real dataset, this time training on images that contain only our military classes of interest.
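
The three-stage schedule above can be sketched as follows. The `train_stage` helper is a stand-in (an assumption, not the authors' code) that merely records which dataset each stage uses; in practice it would invoke your framework's training loop (e.g., Darknet for YOLOv3 or a Detectron-style trainer for Mask R-CNN) at a constant learning rate, and the epoch counts shown are placeholders:

```python
def train_stage(model, dataset_name, epochs):
    # Placeholder: log the stage instead of actually training.
    model["schedule"].append((dataset_name, epochs))
    return model

def double_transfer_learning(model):
    # Stage 1: all real data, maximizing mAP across every class.
    model = train_stage(model, "real_full", epochs=80)
    # Stage 2: the targeted synthetic dataset (200-300 examples per make).
    model = train_stage(model, "synthetic_targeted", epochs=10)
    # Stage 3: only real images containing the military classes of interest.
    model = train_stage(model, "real_military_subset", epochs=10)
    return model

model = double_transfer_learning({"schedule": []})
print([name for name, _ in model["schedule"]])
# ['real_full', 'synthetic_targeted', 'real_military_subset']
```

The key design choice is ordering: synthetic data is sandwiched between full real training and a final fine-tune on the real subset, so the model's last exposure is always to real imagery of the classes of interest.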

We also construct a new baseline for these experiments to test if the synthetic data or the training on the real subset of classes actually boosts performance. Consequently, for our ‘new baseline’ trained on only real data we perform steps 1 and 3 as listed above.

Figure 4: Performance comparisons between the new baseline (orange) trained with all real data and the new training schema using the targeted synthetic dataset (green). The previous baseline as described in the last blog is also plotted with a dashed line. We report one-standard deviation bootstrap errors for each experiment. We plot mAP on the y-axis and show results for the full testing dataset (All) and results for only classes that appear in both real and synthetic datasets (Overlapping) on the x-axis.

The results of this work were intriguing on two fronts. First, this synthetic data training schema provided a valuable boost in performance (~11%) for YOLOv3; the YOLOv3 model trained with the targeted synthetic dataset also provided the highest performance of any experiment in the study (mAP of 0.74 in overlapping classes). Second, Mask R-CNN also received a small boost (~2%) from training on synthetic data.

Of note, the new baseline for Mask R-CNN, which is trained on the full real dataset and then on a subset of real data containing only our classes of interest (right-most orange bar in Figure 4), is significantly better than the old baseline (trained only on the full real dataset). Adding this second small round of training on the smaller real imagery subset improved mAP scores in classes of interest by 15%, rising from 0.62 to 0.71. Furthermore, mAP improved for all classes by 12%, rising from 0.39 to 0.44. Conversely, this technique does not improve YOLO’s baseline performance at all.

Can synthetic data help with the detection of rarely observed aircraft?

One of the main research goals of this study was to test the value of synthetic data for detecting rarely observed objects with few examples in the training data. We quantify this by grouping classes into buckets based on the number of training examples of each class in the real dataset. We do this both to increase the size of our test sets and to ensure that these results are representative across multiple types of aircraft.
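
The bucketing step can be sketched as below. The thresholds mirror the cumulative ≤10 / ≤30 / ≤100 groupings reported later in this post; the class names and counts in the example are hypothetical:

```python
def bucket_classes(counts, thresholds=(10, 30, 100)):
    """Group classes into cumulative buckets by training-example count.

    A class with n real training examples lands in every bucket whose
    threshold is >= n, so rare classes appear in all buckets.
    """
    return {t: sorted(c for c, n in counts.items() if n <= t)
            for t in thresholds}

# Hypothetical per-class training-example counts from a real dataset.
counts = {"C-47": 4, "A-37": 25, "Do228": 60, "737": 500}
print(bucket_classes(counts))
# {10: ['C-47'], 30: ['A-37', 'C-47'], 100: ['A-37', 'C-47', 'Do228']}
```

Per-bucket mAP is then computed over only the classes in each bucket, which is how Figures 5 and 6 separate rare from common aircraft.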

Figure 5: The effect of the number of training samples on overall performance in our classes of interest (Overlapping) for YOLOv3. As in previous figures, we plot the new baseline (orange) trained with all real data and the new training schema using the targeted synthetic dataset (green). We report one-standard deviation bootstrap errors for each experiment. We plot mAP on the y-axis and bucket the number of training examples for each type of aircraft on the x-axis.

The results for YOLOv3 are intriguing and show that synthetic data improves performance regardless of the number of training examples. The increase in performance is most notable in the rarest of aircraft with ≤10 training examples. These results suggest that for this model synthetic data can be used to augment and improve performance for the detection of rarely observed aircraft.

Figure 6: The effect of the number of training samples on overall performance in our classes of interest (Overlapping) for Mask R-CNN. As in previous figures, we plot the new baseline (orange) trained with all real data and the new training schema using the targeted synthetic dataset (green). We report one-standard deviation bootstrap errors for each experiment. We plot mAP on the y-axis and bucket the number of training examples for each type of aircraft on the x-axis.

The results for Mask R-CNN tell a different story, showing much more even performance regardless of whether synthetic data was used for augmentation. Overall, these results highlight that a deeper dive is required to break down what’s actually happening and gain a more thorough understanding.

Alternative Metrics: mF1, mPrecision, mRecall

Although Mean Average Precision (mAP) is the gold standard metric for object detection, some researchers feel it’s still too black-boxy and overly complicated. A more direct approach is simply averaging the precision, recall, and F1 scores across all of our classes of interest. As with mAP, all classes in the test set are equally weighted, and these metrics can be much more informative about the types of error present in each experiment. Of these metrics, mRecall may be the most informative: it is likely more valuable to over-detect an object and be confident you’ve actually found it than to miss it entirely. With some post-processing or a human-in-the-loop, one could likely discard the false positives easily.
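
Macro-averaging these scores is straightforward. A minimal sketch, assuming per-class true positive, false positive, and false negative counts are already tallied from matched detections (the counts in the example are hypothetical):

```python
def macro_metrics(per_class):
    """Macro-average precision, recall, and F1 from per-class
    (tp, fp, fn) counts; every class is weighted equally, as with mAP."""
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in per_class.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(per_class)
    return {"mPrecision": sum(precisions) / n,
            "mRecall": sum(recalls) / n,
            "mF1": sum(f1s) / n}

# Hypothetical counts: class -> (true positives, false positives, false negatives)
print(macro_metrics({"C-47": (8, 2, 4), "A-37": (20, 5, 5)}))
```

Because every class counts equally, a rare make with ten test examples moves these averages just as much as a common one with hundreds.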

Table 1: Differences in mF1, mPrecision, and mRecall scores for different experiments when looking at all classes of interest.

When investigating overall results (Table 1) for each of these metrics, we find that, interestingly, the mF1 gains are about even for both YOLOv3 and Mask R-CNN: looking at every class of interest, mF1 improves by 1.8 to 2.5 points when adding synthetic data. However, the precision and recall differences are even more intriguing and give insights into how these models leverage synthetic data. YOLOv3 uses the synthetic data almost exclusively to lower the false negative rate (the number of missed aircraft): recall increases by 9 points, but this comes at the cost of precision, which drops by 3 points. Conversely, both recall and precision increase by a small amount for Mask R-CNN.

Tables 2 and 3: Percentage change (increase or decrease) in performance based upon the number of objects in the real training dataset. Positive numbers indicate a performance boost from synthetic data, whereas negative indicate that synthetic data can harm performance.

Investigating the value for rarely observed objects was the primary focus of this research; we break down the performance gains once more in Tables 2 and 3. For both models, when looking at rare object classes with ≤10 examples, precision and recall are improved. YOLOv3 in particular sees the most significant improvement across the board, most notably in mF1 (329% improvement) and mRecall (1091% improvement). When the number of training examples is ≤30 or ≤100, the trend of small but valuable precision and recall boosts continues for Mask R-CNN. For YOLOv3, recall is again greatly improved at the cost of some precision.

How do networks learn from synthetic data?

We wanted to investigate and at least estimate how each model is using synthetic data. Recall that object detection really comes down to a few pieces:

  1. Creating a proposal of what could be a potential object of interest.
  2. Classifying that proposal to a specific class.
  3. Refining the shape of the bounding box to precisely match the shape of the object.

Although all of these components work in concert, we can start to pick apart which network pieces are most affected by the synthetic data for our two models. We construct a few tests to do this and report the results below. First, we calculate the maximum possible performance of the network if our classifier were perfect. This is directly correlated with the strength of the proposal network, and we call this metric “Proposal mF1.” Given the Proposal mF1 and our Overall mF1, we can estimate the fraction of proposals the classifier actually gets right by dividing the Overall mF1 by the Proposal mF1; this is a measure of classifier strength. Finally, we can calculate the mIoU for correctly classified aircraft to test whether synthetic data affects how accurately bounding boxes are created.
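
The classifier-strength ratio reduces to one division. A sketch with illustrative numbers (not the study's actual scores):

```python
def classifier_strength(overall_mf1, proposal_mf1):
    # Fraction of good proposals the classifier resolves correctly:
    # Overall mF1 divided by the mF1 achievable with a perfect classifier.
    return overall_mf1 / proposal_mf1

# Illustrative: a proposal network that could reach 0.90 mF1 with a
# perfect classifier, but scores 0.63 overall, implies the classifier
# gets 70% of good proposals right.
print(round(classifier_strength(0.63, 0.90), 2))  # 0.7
```

A rise in Proposal mF1 after adding synthetic data points at the proposal network improving; a rise in this ratio points at the classifier.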

We will start with mIoU, as it is the least interesting. The addition of synthetic data changes mIoU negligibly, as both networks are incredibly precise when creating bounding boxes. Out of a test set of 6,471 aircraft, the YOLOv3 baseline (mIoU = 0.88) misses only 11 aircraft due to proposal bounding boxes not meeting the 0.5 IoU threshold, and the Mask R-CNN baseline (mIoU = 0.85) misses only 12. When augmenting with synthetic data, those numbers rise from 11 to 12 for YOLOv3 and from 12 to 18 for Mask R-CNN.
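
For reference, the IoU computation behind that 0.5 matching threshold is the standard one for axis-aligned boxes; the boxes in the example are made up:

```python
def iou(box_a, box_b):
    """Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection only counts as a hit if IoU with ground truth >= 0.5.
pred, truth = (10, 10, 50, 50), (12, 12, 52, 52)
print(iou(pred, truth) >= 0.5)  # True
```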

Tables 4 and 5: Percentage change (increase or decrease) when training on real data only vs. a combination of real and synthetic. These numbers help us to understand what synthetic data improves the most for each network. Positive numbers indicate a performance boost from synthetic data, whereas negative indicate that synthetic data can harm performance.

We report changes in proposal method strength vs. classifier network strength in Tables 4 and 5. The differences in how each of these detection networks uses synthetic data are apparent from this breakdown. YOLOv3 uses the synthetic data to optimize its proposal method, whereas Mask R-CNN does the opposite, maximizing classifier performance. We again see the greatest value for synthetic data in the rarest of classes with ≤10 training examples: classifier strength is improved for both networks, and proposal strength is augmented by nearly 200% for YOLOv3.

These differences ultimately come down to how the networks are designed. For example, YOLO uses a less sophisticated region proposal method than Mask R-CNN. In the original YOLO paper, the authors argue, with quantitative results, that the region proposal method in R-CNN-style models can overfit to the training data. When synthetic data is added to the training process, YOLO adapts quickly and learns to find more objects that could be of interest. However, its classifier does not improve much, and performance actually declines when looking at the full overlapping test set. On the other hand, Mask R-CNN uses the synthetic data to refine its classifier. The bulkier proposal network’s performance declines for rare classes but remains about even overall. Instead, the classifier learns the most interesting features to optimize performance.


The results of these experiments highlight that your choice of model may affect how you leverage synthetic data. The targeted approach described above is effective in different ways for both models. As noted in the previous blog, Mask R-CNN also received a large performance boost (~11%) in our classes of interest by first training on the large synthetic dataset for 33% of the time and then training on real data for the remaining 67% of the time. YOLOv3 saw less benefit from a similar training approach.

Our final conclusions from the RarePlanes study:

  1. Various metrics indicate that synthetic data provides a valuable performance boost in the detection of rare objects with limited training examples. This study indicates that the most notable boosts occur for objects with ≤10 training examples in your dataset.
  2. A double transfer-learning approach with a smaller targeted synthetic dataset can be an effective method to boost performance in classes of interest after you have already trained your model to maximum performance on real data.
  3. Synthetic data can affect performance in different ways that are specific to your model architecture. Our research suggests that synthetic data is most beneficial to YOLO’s object proposal step, whereas Mask R-CNN sees the most benefit to classification performance. Understanding these nuances is important to maximize the benefits of synthetic data.

From the previous blog:

  1. The traditional transfer-learning approach of training on a large synthetic dataset and then lowering the learning rate and fine-tuning on real data actually harms performance in our classes of interest. This is likely due to the limited overlap (~16%) between our real and synthetic classes.
  2. Given the limited overlap between the synthetic and real datasets, our experiments indicate that it is best to pre-train a model on synthetic data for a short portion (25–33%) of the total training time and then train on real data at the same learning rate. This boosts our performance by 5% (YOLOv3) to 11% (Mask R-CNN).
  3. Our experiments indicate that it is best to isolate synthetic and real datasets during the training phase. Training on blends of synthetic and real data appears to be less effective at boosting performance.

What’s Next?

In our next post, look out for a big announcement about the release of a portion of both the synthetic and real datasets. Special thanks to AI.Reverie and the rest of the CosmiQ team for making this work possible.


[1] All imagery courtesy of Radiant Solutions, a Maxar Company.



Jake Shermeyer
The DownLinQ

Data Scientist at Capella Space. Formerly CosmiQ Works.