Detecting aircraft make with Mask R-CNN — Carolina, Puerto Rico. [1]

RarePlanes — Exploring the Value of Synthetic Data: Part 1

Jake Shermeyer
Apr 14, 2020

Preface: This blog is part 3 in our series titled RarePlanes, a new machine learning dataset and research series focused on the value of synthetic and real satellite data for the detection of aircraft and their features. To read more, see our introduction to the series and our blog exploring baseline performance.


This blog is the start of a two-part conclusion to our experiments with synthetic data in the RarePlanes research study. In these posts we explore the value of synthetic data from AI.Reverie and test how it could improve our ability to detect aircraft and their features. In particular, we were interested in improving performance for rarely observed aircraft with limited observations in the training dataset. In our previous post, we built several baseline models using only real data from Maxar’s WorldView-3 satellite and hand-annotated aircraft labels. Here we take the lessons learned from that post, along with the baselines trained only on real data, and begin to investigate the best ways to apply synthetic data to augment our training dataset and improve performance in our classes of interest. Ultimately, our experiments in these two posts will test the ability to improve detection of an aircraft’s specific make and model (e.g. Cessna A-37 or Douglas C-47 Skytrain) using synthetic data.

In this post (Part 1) we attempt to answer these questions:

  1. How do learning rates affect performance when your synthetic and real datasets have limited overlap in classes?
  2. How long should we train on synthetic data versus our real data?
  3. Should we isolate synthetic and real data during training or train on a blend of synthetic and real concurrently?

In Part 2 we will examine:

  1. After training a model using only real data, can we use a smaller synthetic dataset at the end of training to boost performance for detection of specific aircraft makes?
  2. Does synthetic data improve the ability to detect rarely observed aircraft?

We preface these blogs by stating that this is just the tip of the spear for experimentation with synthetic geospatial data. The use of synthetic data is still at a nascent stage in the broader computer vision world, and applying it effectively to challenging geospatial problems is ambitious. The results of these experiments are promising, but further research is required to unlock the true potential of synthetic data.

Synthetic Data: The Large Dataset

We ultimately built one large synthetic dataset (the large dataset) and a much smaller targeted dataset containing only specific makes of aircraft (the targeted dataset, covered in Part 2). We create all of our synthetic data with AI.Reverie’s aircraft simulator, which leverages the Unreal Engine. The simulator enables users to create hundreds of different aircraft types across 20 simulations of real-world airfields. Users can specify a variety of variables and heavily customize each synthetic dataset they create, including the types and number of aircraft to generate, the weather conditions and their intensities, the time of day for captures, and the range of resolutions for the imagery. Combined, these options allow the simulator to produce both highly randomized and targeted datasets for specific use cases. Once all options are specified, the simulator begins to generate synthetic images and aircraft.

We first developed a large synthetic dataset consisting of ~32,000 images and ~225,000 aircraft with 122 unique makes of aircraft. Of these 122, only 44 overlap with the real dataset and 35 overlap with the real test set. We will ultimately quantify performance changes primarily on these 35 aircraft makes. On the whole, this dataset features six unique weather conditions: Clear Skies, Snow, Cloudy, Fog/Haze, Rain, and Dust Storm. Clear Skies conditions represent ~40% of the dataset, with each of the other weather conditions receiving ~12.5%. The simulated intensity of the weather varies between 50 and 100%, and all captures fall between 8 and 11 AM local time to mirror the sun-synchronous orbits present within the real dataset. Ground Sample Distance (GSD) is set to 30 cm to closely mirror the GSD of our real data.

The Evaluation Metrics

We ultimately evaluate our performance using the Mean Average Precision (mAP) metric at an intersection-over-union (IoU) threshold of 0.5 on two subsets of data:

  1. Overlapping mAP — We calculate mAP on 35 classes that appear in both the real test set and synthetic dataset only. This is our primary metric as we are most interested in boosting performance only in our classes of interest.
  2. All mAP — We calculate mAP on all 166 classes that appear in the real test dataset, some of which have corresponding synthetic examples, but many of which do not. This is a secondary metric to evaluate how synthetic data may affect overall performance — even in classes we are not interested in.
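As a minimal sketch of how the two subsets above relate, assume per-class average precision (AP) values at IoU 0.5 have already been computed by some detection evaluator. The function name `subset_map`, the class names, and the AP values below are hypothetical and purely illustrative; they are not results from this study.

```python
from statistics import mean


def subset_map(ap_by_class, classes=None):
    """Mean of per-class AP over the requested classes (all classes if None)."""
    if classes is None:
        classes = ap_by_class.keys()
    selected = [ap_by_class[c] for c in classes if c in ap_by_class]
    if not selected:
        raise ValueError("no classes to average over")
    return mean(selected)


# Hypothetical per-class AP on the real test set (illustrative values only).
ap_by_class = {"make_a": 0.80, "make_b": 0.60, "make_c": 0.40, "make_d": 0.20}
# Hypothetical set of makes present in the synthetic dataset.
synthetic_classes = {"make_a", "make_b", "make_x"}

# Overlapping mAP: only classes found in both the real test set and the synthetic data.
overlapping = sorted(set(ap_by_class) & synthetic_classes)
overlapping_map = subset_map(ap_by_class, overlapping)

# All mAP: every class found in the real test set.
all_map = subset_map(ap_by_class)
```

In this toy example the overlapping subset contains only `make_a` and `make_b`, so a change that helps those two classes can raise Overlapping mAP even while All mAP falls.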

We also report one standard-deviation bootstrap error bars. We generate bootstraps by randomly sampling class specific average precisions with replacement. For each bootstrap we calculate the mean average precision. After 100,000 bootstraps, we calculate the standard deviation of bootstrap mean average precisions and report one standard deviation as our error bar.
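The bootstrap procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact code used in the study: the function name `bootstrap_map_error`, the fixed seed, and the use of the population standard deviation are our own choices, and the example AP values are hypothetical.

```python
import math
import random


def bootstrap_map_error(class_aps, n_boot=100_000, seed=0):
    """Return (mAP, one-standard-deviation bootstrap error bar)."""
    aps = list(class_aps)
    if not aps:
        raise ValueError("need at least one class AP")
    rng = random.Random(seed)
    n = len(aps)
    # Each bootstrap resamples the per-class APs with replacement and averages them.
    boot_maps = [math.fsum(rng.choices(aps, k=n)) / n for _ in range(n_boot)]
    centre = math.fsum(boot_maps) / n_boot
    # Assumption: population standard deviation (divide by n_boot, not n_boot - 1).
    variance = math.fsum((m - centre) ** 2 for m in boot_maps) / n_boot
    return math.fsum(aps) / n, math.sqrt(variance)


# Hypothetical per-class APs (illustrative only); fewer bootstraps to keep the demo fast.
example_aps = [0.9, 0.7, 0.5, 0.3, 0.1]
map_value, error_bar = bootstrap_map_error(example_aps, n_boot=10_000)
```

Because the resampling is over classes rather than over images, the error bar reflects how much the mAP depends on which classes happen to be included, and it shrinks as the number of classes grows.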

A Note on Training

For all experiments we standardize each model so that it sees an equal number of objects (synthetic + real) during training, and we report maximum performance on our test set. For example, when we state that we “trained a model for 75% of the time on synthetic data and then trained on real data,” this means that the first 75% of the objects a model sees during training are synthetic aircraft and the final 25% are real aircraft. We report bounding box scores for both YOLOv3 and Mask R-CNN throughout. Note that instance segmentation mAP scores for Mask R-CNN are typically 1% lower than bounding box mAP scores.
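To make the object-count convention concrete, here is a small sketch of how a fixed training budget might be split between synthetic and real aircraft. The helper name `split_object_budget` and the budget of 1,000,000 objects are hypothetical; the post does not state the actual budget.

```python
def split_object_budget(total_objects, synthetic_fraction):
    """Split a fixed budget of training objects into (synthetic, real) counts."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    synthetic = round(total_objects * synthetic_fraction)
    # The remainder of the budget is spent on real aircraft.
    return synthetic, total_objects - synthetic


# Hypothetical budget: 75% of the objects seen are synthetic, then 25% are real.
synthetic_objects, real_objects = split_object_budget(1_000_000, 0.75)
```

Holding the total fixed means that any difference between training schemas comes from how the budget is divided and ordered, not from one model simply seeing more objects.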

A Traditional Approach with Synthetic Data

Many papers on this topic [2, 3, 4, 5] suggest a simple transfer learning approach. First, a model is trained on synthetic data for the majority of the training time; then the model is fine-tuned on the real data at a lower learning rate. Consequently, this was our first attempt: we train on the synthetic data for 66–75% of our total training time, then fine-tune on real data for the remaining portion, dividing the learning rate first by 10 and then by 100.
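The schedule described above might look something like the sketch below. The function name `transfer_schedule` and the base learning rate of 0.001 are hypothetical, and the post does not say when the second learning rate drop occurs, so splitting the real-data phase in half is purely an assumption for illustration.

```python
def transfer_schedule(progress, base_lr=0.001, synthetic_fraction=0.75):
    """Return (data source, learning rate) for a training progress value in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress < synthetic_fraction:
        # Pre-training phase: synthetic data at the full learning rate.
        return "synthetic", base_lr
    # Assumption: the real-data phase is split evenly between the /10 and /100 steps.
    midpoint = synthetic_fraction + (1.0 - synthetic_fraction) / 2.0
    if progress < midpoint:
        return "real", base_lr / 10
    return "real", base_lr / 100


# Sample the schedule at a few points in training.
for p in (0.0, 0.5, 0.8, 0.95):
    source, lr = transfer_schedule(p)
    print(f"progress={p:.2f} source={source} lr={lr:.6f}")
```

Under this schedule the model only ever sees real data at one tenth of the base learning rate or less.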

Unfortunately, this method doesn’t work very well for this dataset. In Figure 4 we can see that training on synthetic data causes performance declines for both models in Overlapping and All mAP. Ultimately, the low learning rate imposes too stringent a limit on the models, and they are unable to learn enough about the real dataset for the synthetic data to benefit performance. Other studies that have had success with this approach often have perfect overlap between the classes that appear in the real and synthetic datasets. Recall that the RarePlanes dataset is significantly different, with only ~16% of classes overlapping between these two datasets. Overall, these results showed that we need to train on real data with a higher learning rate, potentially for a longer portion of our training time.

How Long to Train on Synthetic Data?

As noted above, we next wanted to test how long to train on synthetic data relative to real data. Consequently, we constructed three experiments to test how varying training times on synthetic data could affect overall performance. In each of these experiments we keep the learning rate at the same level throughout training on both the real and synthetic datasets.

In Figure 5 we show the performance changes relative to the amount of training time on synthetic data: Short: 25–33%, Moderate: 50%, Long: 66–75%. We ultimately find that training on synthetic data for the shortest amount of time and then training on real data for the majority of the training time provides a performance boost in our classes of interest (overlapping) of ~4% for YOLOv3 and ~11% for Mask R-CNN. We also find that mAP for all classes declines in each of these training schemas. We theorize again that the small proportion of overlap between real and synthetic classes is playing a role here. Ultimately, if one is interested in improving performance for all classes, corresponding synthetic data may be required for each of these classes.

Should we train on real and synthetic blends?

Given our previous experiments, we hypothesized that there could be some performance improvement if we were to first train on blends of real and synthetic data before switching to only real data. Because training on synthetic data for a short time and then switching to real data for the majority of training yielded the best results in our previous experiments, we mimic this approach here. We train on variable blends of synthetic and real data for 25–33% of the total training time and then switch to real data using the same learning rate.

In Figure 6, our first blend is 10:1 synthetic to real (dark red) and our second blend is 1.25:1 synthetic to real (light red). We test different blend ratios to quantify performance changes if one were to use a smaller synthetic dataset. Regardless of the blend ratio, this training schema performs worse than training on synthetic data alone in the early phase, and our experiments show little to no performance boost from using blends of synthetic and real data early in training. Of note, the larger proportion of synthetic data does provide a small (statistically insignificant) boost in performance relative to the smaller proportion. We find that the blends limit the models’ ability to converge and identify the features that could be most valuable for the discrimination of real aircraft.
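A blended schedule of this kind could be sketched as a simple sampler that decides which dataset each training example comes from. The function name `pick_source` is hypothetical, and interpreting the 10:1 ratio as a ratio of sampled examples (rather than of images or of dataset sizes) is an assumption made for illustration.

```python
import random


def pick_source(progress, rng, blend_fraction=0.25, synthetic_to_real=10.0):
    """Choose which dataset the next training example is drawn from."""
    if progress >= blend_fraction:
        # After the blended phase, train on real data only.
        return "real"
    # During the blended phase, sample so that synthetic:real is about N:1.
    p_synthetic = synthetic_to_real / (synthetic_to_real + 1.0)
    return "synthetic" if rng.random() < p_synthetic else "real"


rng = random.Random(0)
# Early in training (10% progress) the blend applies; late (90%) only real data is used.
early = [pick_source(0.1, rng) for _ in range(11_000)]
late = [pick_source(0.9, rng) for _ in range(1_000)]
print(early.count("synthetic") / early.count("real"))
```

Setting `synthetic_to_real=1.25` gives the second blend, in which a little over half of the early examples are synthetic.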


Based on our initial experiments we can draw several conclusions from this work:

  1. The traditional transfer-learning approach of training on a large synthetic dataset and then lowering the learning rate and fine-tuning on real data actually harms performance in our classes of interest. This is likely due to the limited overlap (~16%) between our real and synthetic classes.
  2. Given the limited overlap between the synthetic and real datasets, our experiments indicate that it is best to pre-train a model on synthetic data for a short portion (25–33%) of the total training time and then train on real data at the same learning rate. This boosts our performance by 5% (YOLOv3) to 11% (Mask R-CNN).
  3. Our experiments indicate that it is best to isolate synthetic and real datasets during the training phase. Training on blends of synthetic and real data appears to be less effective at boosting performance.

What’s Next?

In our next post we will dig deeper and investigate a new approach that maximizes training on real data and our classes of interest by using a new targeted synthetic dataset. Finally, we will examine how training on synthetic data changes performance in our classes of interest and if it improves the detection of rarely observed aircraft.


[1] All imagery courtesy of Radiant Solutions, a Maxar Company.

[2] Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S. and Birchfield, S., 2018. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 969–977).

[3] Kortylewski, A., Schneider, A., Gerig, T., Egger, B., Morel-Forster, A. and Vetter, T., 2018. Training deep face recognition systems with synthetic data. arXiv preprint arXiv:1802.05891.

[4] Peng, X., Sun, B., Ali, K. and Saenko, K., 2015. Learning deep object detectors from 3d models. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1278–1286).

[5] Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M.J., Laptev, I. and Schmid, C., 2017. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 109–117).
