Label-efficient self-supervised learning: a case study on satellite imagery

Industrial application of a research project: how to solve vehicle detection with less labeled data?

Thomas
Preligens Stories
9 min read · Jun 29, 2023


Co-authored by Jules Bourcier and Gohar Dashyan.

Detections of rocket launchers on satellite imagery. Source: Preligens

Introduction

At Preligens, we invest a lot of time and effort in research, pioneering AI for a safer world. In recent years, with the multiplication of satellites and space companies, the amount of satellite imagery has grown much faster than the number of experts able to exploit it. It has therefore become more necessary than ever to develop algorithms capable of assisting image analysts. Our company's mission is to make sure no information is missed in the abundant flow of remote sensing data, by helping image analysts save precious time with automated geospatial data analysis tools.

Mainstream machine learning algorithms for vision tasks rely on supervised models, i.e. they learn a function that maps an input to an output based on example input-output pairs. Such models need a lot of carefully labeled images in order to achieve good performance. However, labeling is expensive and time-consuming, and this is especially true in the defense sector for three main reasons:

  • Annotating military observables of interest requires highly trained annotators.
  • Some military observables are very rare.
  • Labeling can be extremely time-intensive, as it usually involves processing areas several kilometers wide with a very dense presence of observables such as vehicles or vessels.

It would be a shame if this limited labeling capability prevented the use of available data and constrained the model's performance. At Preligens, we have therefore turned to a promising deep learning paradigm called self-supervised learning, in order to make the most of unlabeled images and to maximize what the model can learn from few labeled images. This is why we focused on analyzing label efficiency, i.e. the capacity of the model to generalize from few labeled samples.

Example of Preligens data for the task of vehicle detection. One can easily see how difficult it is to label large amounts of images like these. Source: Adapted from satellite images ©Maxar

In a recent paper presented at the European Cyber Week Conference on AI for Defense, we study self-supervised pretraining on satellite imagery through the lens of label efficiency and, to our knowledge for the first time, evaluate this approach on an operational object detection use case.
More precisely, we show that in-domain self-supervised pretraining is competitive with the traditional out-of-domain supervised pretraining on ImageNet, and can even outperform it in low-label regimes.

In this story, we briefly introduce self-supervised learning. Then, we describe the self-supervised framework we used, before presenting our promising results on vehicle detection on Preligens proprietary data.

What is self-supervised learning?

The main purpose of self-supervised learning is to learn useful features (also called representations) from unlabeled images. But in order to learn anything, an algorithm needs to solve some form of a prediction problem. Since we do not have ground-truth labels for the task of interest (e.g. for image classification), we define what is called a pretext task, where the supervision signal is provided from the input data itself.

For example, a common method is contrastive learning. From a given image, two random augmentations are produced (e.g. through a composition of random flips, rotations and color distortions), and these two augmented views are fed to a neural network encoder f. When given a pair of images, the model is trained to tell whether it is a positive pair (augmentations produced from the same image) or a negative pair (augmentations produced from different images).

By learning this pretext task, the model is taught relevant representations, which are very useful for recognizing similar objects regardless of their orientation or color.

Scheme of the contrastive learning framework. Source: Preligens
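To make the idea more concrete, here is a minimal PyTorch-style sketch of such a contrastive pretext task. It is an illustration only, not our production code: the augmentation recipe and the simple in-batch loss are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet50

# Random augmentations used to create two views of each image.
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
])

# Encoder f: a ResNet-50 whose classification head is replaced by the identity.
encoder = resnet50(weights=None)
encoder.fc = torch.nn.Identity()

def contrastive_loss(images, temperature=0.07):
    """InfoNCE-style loss on a batch of image tensors: the two augmentations
    of the same image form the positive pair; all other images in the batch
    act as negatives."""
    v1 = torch.stack([augment(img) for img in images])  # first view
    v2 = torch.stack([augment(img) for img in images])  # second view
    z1 = F.normalize(encoder(v1), dim=1)                # embeddings of view 1
    z2 = F.normalize(encoder(v2), dim=1)                # embeddings of view 2
    logits = z1 @ z2.t() / temperature                  # pairwise similarities
    targets = torch.arange(len(images))                 # positives on the diagonal
    return F.cross_entropy(logits, targets)
```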

Then, these pre-learned (also called pretrained) representations can be used for the task we are really interested in, called the downstream task (e.g. vehicle detection and classification). To do this, we transfer the pretrained weights to the downstream task and fine-tune them on the available labeled data. The downstream task is supervised, but it requires fewer examples than a supervised model trained from scratch, because the model has already learned good representations.

Improving self-supervised methods by making the most of remote sensing specificities

The self-supervised algorithm we have chosen is called MoCoTP (MoCo with Temporal Positives). It improves a famous contrastive learning model, MoCo, by taking advantage of the specificities of remote sensing. Indeed, in remote sensing, we often have multiple images of the same site taken at different time points.

Temporal positives without augmentations. Source: Ayush et al.

Thus, on top of producing two augmentations of the same image to form a positive pair, MoCoTP also considers two views of the same location taken at different times as a positive pair (hence the term Temporal Positives).

Framework of MoCoTP. Source: Adapted from Ayush et al.
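As a simplified illustration of how such a positive pair could be sampled, here is a short sketch. The `views_by_location` mapping is a hypothetical data structure, not the actual MoCoTP pipeline; in practice augmentations are still applied on top of the temporal views.

```python
import random

def sample_positive_pair(views_by_location, location, augment):
    """Return a positive pair of views for contrastive learning.

    views_by_location: dict mapping a location id to a list of image tensors
    of that location acquired at different dates (hypothetical structure).
    """
    views = views_by_location[location]
    if len(views) >= 2:
        # Temporal positives: two acquisitions of the same site, different dates.
        img_a, img_b = random.sample(views, 2)
    else:
        # Fall back to the classical setting: two augmentations of one image.
        img_a = img_b = views[0]
    return augment(img_a), augment(img_b)
```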

Pretraining on a large public dataset

Self-supervised learning consists of two steps: a first phase of self-supervised pretraining on the pretext task, followed by a second phase of supervised transfer to the downstream task.

For the first step, we performed the pretraining with the self-supervised learning method MoCoTP on the public dataset Functional Map of the World (fMoW), without using its existing labels for functional land-use classification. For the encoder of MoCoTP, we chose the widely used ResNet-50 architecture.

Examples of fMoW images. Source: Christie et al.
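For reference, the main ingredient that distinguishes MoCo (and therefore MoCoTP) from the simpler scheme sketched earlier is a momentum encoder, updated as an exponential moving average of the query encoder. The snippet below is a minimal sketch of that mechanism, not our exact training code.

```python
import copy
import torch
from torchvision.models import resnet50

# Query encoder (the network being trained) and its momentum copy.
encoder = resnet50(weights=None)
encoder.fc = torch.nn.Identity()
momentum_encoder = copy.deepcopy(encoder)
for p in momentum_encoder.parameters():
    p.requires_grad = False  # the momentum encoder is never updated by backprop

@torch.no_grad()
def update_momentum_encoder(m=0.999):
    """Update the momentum encoder as an exponential moving average
    of the query encoder's weights (the core trick behind MoCo)."""
    for q_param, k_param in zip(encoder.parameters(),
                                momentum_encoder.parameters()):
        k_param.data = m * k_param.data + (1.0 - m) * q_param.data
```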

Transferring to vehicle detection and classification

For the second phase, the supervised transfer, our downstream task is the detection and classification of vehicles in satellite images. The detection and classification network that we chose for these experiments is the widely used RetinaNet with a ResNet-50 backbone, which we initialize with the self-supervised weights pretrained on fMoW. Then, we fine-tune the whole network end-to-end (i.e. we do not freeze any layer).
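As an illustration, here is roughly how such an initialization can be done with torchvision's RetinaNet, assuming the pretrained checkpoint is a plain ResNet-50 state dict (the file name `fmow_mocotp.pth` is made up for the example):

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Build RetinaNet with a ResNet-50 FPN backbone and no default pretraining.
# 8 vehicle classes + background (exact class handling depends on the codebase).
model = retinanet_resnet50_fpn(weights=None, weights_backbone=None, num_classes=9)

# Load the self-supervised ResNet-50 weights into the backbone trunk.
# strict=False ignores keys that do not exist in the detector (e.g. the fc head).
checkpoint = torch.load("fmow_mocotp.pth", map_location="cpu")
missing, unexpected = model.backbone.body.load_state_dict(checkpoint, strict=False)

# Fine-tune end-to-end: no layer is frozen.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```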

Target to beat: the traditional supervised ImageNet initialization

Today, a widely used deep learning technique for industrial use-cases is to initialize networks with supervised weights pretrained on the very large ImageNet dataset. Since the images of ImageNet are very different from the satellite images of our dataset used for the downstream task, the supervised pretraining on ImageNet is characterized as out-of-domain, while the self-supervised pretraining on fMoW is called in-domain. This is summarized in the following figure.

Figure presenting the domain gap between ImageNet on one side, and fMoW and our downstream dataset on the other. Source: Preligens

We therefore compare our self-supervised method, called fMoW-MoCoTP init, with two baselines (a code sketch of the three initializations is given after the figure below):

  • IN-sup init, where the backbone is initialized with ImageNet weights.
  • Random init, where the backbone is initialized randomly (i.e. no pretraining).

Schematic outline of our method. The top block represents the pretraining phase. The bottom block represents the pretrained weights that are injected into the downstream task model. Source: Preligens, Bourcier et al.
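In practice, the three settings only differ in how the backbone is initialized. With torchvision this is roughly the following (again a sketch, not our exact training code):

```python
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import retinanet_resnet50_fpn

# IN-sup init: backbone pretrained with supervision on ImageNet.
in_sup_model = retinanet_resnet50_fpn(
    weights=None, weights_backbone=ResNet50_Weights.IMAGENET1K_V1, num_classes=9)

# Random init: no pretraining at all.
random_model = retinanet_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=9)

# fMoW-MoCoTP init: random backbone, then load the self-supervised
# checkpoint into model.backbone.body as shown in the previous snippet.
```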

Our datasets for vehicle detection

The datasets used for the downstream task are Preligens proprietary datasets containing 8 vehicle categories (shown in the class distribution figure further below). In order to study the label efficiency of our approach, we subsample our largest dataset “L” into smaller datasets “M”, “S”, “XS” and “XXS”, targeting respectively 50%, 10%, 5% and 1% of the vehicles present in L (while keeping the original class distribution).
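As an illustration of this subsampling, here is a simplified sketch that preserves the class distribution by sampling a fixed fraction of annotations per class. The `annotations` format and the `subsample` helper are hypothetical; the real procedure has to handle image-level constraints as well.

```python
import random
from collections import defaultdict

def subsample(annotations, fraction, seed=0):
    """Keep roughly `fraction` of the annotations of each class.

    annotations: list of dicts like {"image": ..., "bbox": ..., "label": ...}
    (hypothetical format). Sampling per class preserves the class distribution.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["label"]].append(ann)

    subset = []
    for label, anns in by_class.items():
        k = max(1, round(fraction * len(anns)))  # keep at least one example
        subset.extend(rng.sample(anns, k))
    return subset

# Datasets M, S, XS and XXS target 50%, 10%, 5% and 1% of the vehicles in L, e.g.:
# subsets = {name: subsample(annotations_L, f)
#            for name, f in [("M", 0.5), ("S", 0.1), ("XS", 0.05), ("XXS", 0.01)]}
```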

Self-supervised pretraining outperforms supervised pretraining in low-label regimes

For a better understanding of the results, let’s refer to the results on the task of vehicle detection (regardless of the class) as level 1 results, and on the task of joint detection and classification as level 2 results. We computed the commonly used metrics F1-score and average precision (AP) in level 1, and the level 2 mean average precision (mAP), which is the average of the per-class APs.
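As a quick reminder of how these metrics relate (a minimal sketch; in practice we rely on standard detection evaluation tooling):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (level 1 metric)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mean_average_precision(ap_per_class):
    """Level 2 mAP: the unweighted average of per-class APs, so every class
    counts equally regardless of how many vehicles it contains."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Example with hypothetical per-class APs:
# mean_average_precision({"Civilian": 0.80, "Military": 0.70, "Armored": 0.65})
```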

The smaller the dataset, the greater the performance gain

Level 1 F1-score and level 2 mAP. Source: Preligens

In level 1, what stands out is the label efficiency of fMoW-MoCoTP init, as shown in the figure above. The F1-score of fMoW-MoCoTP init is higher than that of IN-sup init or Random init. Moreover, the smaller the dataset, the larger the gap between fMoW-MoCoTP init and IN-sup init or Random init. On dataset S, fMoW-MoCoTP init’s F1-score is 5.20 points better than Random init, and also 0.40 point better than IN-sup init, whereas on the XXS dataset, fMoW-MoCoTP init’s F1-score is 39 points better than Random init, and 3.7 points better than IN-sup init.

These results show that self-supervised in-domain pretraining can be competitive with supervised pretraining on ImageNet, and even give better results in low-label regimes.

Dominant vs rare classes

Level 2 AP per class. Source: Preligens

In level 2, once again, fMoW-MoCoTP init is better than Random init on all classes. It also achieves higher AP than IN-sup init on the three “dominant classes” (Civilian, Military and Armored), which cover ~96.5% of the vehicles in our datasets. However, IN-sup init outperforms fMoW-MoCoTP init on the other classes, the “rare classes”, which cover only ~2.2% of the vehicles in our datasets, as presented in the figure below. Since mAP gives equal importance to all classes, the mAP scores of fMoW-MoCoTP init and IN-sup init end up much closer than their level 1 F1-scores.

As the size of the dataset increases, the mAP improves faster than the F1-score: performance saturates on the dominant classes, whereas on the rare classes the AP improves significantly with the number of training examples. This suggests that a sufficient number of training examples per class is critical to achieve good performance.

Number of observables per class. Civilian, Military and Armored vehicles cover ~96.5% of the vehicles in our datasets. Note that the y-axis is in log scale. Source: Preligens

Is our self-supervised model biased?

One hypothesis is that fMoW-MoCoTP init might be more biased towards the dominant classes. Indeed, the fMoW dataset used for pretraining contains a long-tailed distribution of semantic categories. This might lead to representations being more skewed towards over-represented visual concepts, compared to ImageNet representations, as the latter contains a balanced set of categories. Consequently, such bias may negatively impact the transfer to under-represented classes downstream. However, further work is needed to provide ground for this hypothesis.

Visualization of vehicle detection with IN-sup init and fMoW-MoCoTP init on dataset S. Source: Preligens

How to capitalize on these results?

In order to further explore the bias of fMoW-MoCoTP init, one possibility would be to use a dataset that is more balanced in terms of classes (or stripped of its dominant classes).

These promising results obtained with an in-domain pretraining dataset encourage us to perform a pretraining on our own datasets, because they are larger than fMoW and in the exact same domain as the labeled images used for the downstream task.

When we presented our results to the other teams of the company, many people expressed interest in testing our self-supervised weights. Thus, we integrated our framework into our AI Factory, and the production teams can now initialize their networks with our pretrained weights in a single click. At Preligens, whenever the research team's results are conclusive, the production teams do not have to wait months before being able to use the tools we have developed.

Conclusion

We studied the value of self-supervised learning on a real use case of satellite image analysis for the defense industry. We have shown that self-supervised pretraining is competitive with supervised ImageNet initialization in most label regimes and outperforms it in low-label regimes. This approach is particularly suitable for industry use cases, where labeling data is often very challenging.

If you want to learn more about how we explore the state-of-the-art in AI and related fields to be at the forefront of technology in our industry, don’t forget to follow Preligens stories on Medium!

References

  • Ayush et al., “Geography-Aware Self-Supervised Learning”, ICCV 2021.
  • Christie et al., “Functional Map of the World”, CVPR 2018.
  • Bourcier et al., paper presented at the European Cyber Week Conference on AI for Defense (CAID), 2022.
