Few Shot Geospatial Deep Learning — Part 2

Karthik Dutt · Published in GeoAI · Sep 1, 2022 · 7 min read

In part 1 of this blog series, we saw how few-shot learning can benefit us when the number of labelled training samples is very limited. However, few-shot learners still need a considerable amount of labelled data during the pre-training phase, where a base model is trained on a labelled base dataset. Their value is realized in the fine-tuning step, where they can be trained using a very limited number of samples belonging to the novel classes.

This approach works very well when we are interested in classifying or detecting objects that are similar to the objects present in the base dataset. For example, consider the task of classifying the species of an animal captured in camera traps, when we have a very limited number of samples belonging to each species. Few-shot learning works well in such cases because the object we are interested in is not too dissimilar to what the model had seen during the training phase.

However, consider cases where we are interested in classifying a medical condition based on X-ray images, or classifying objects seen in the Sentinel imagery shown below. X-rays and satellite images are visually very different from the images of real-world objects that the model was pretrained on. In such cases, the few-shot learner might not perform as well as we would expect.

In such cases, we would need to pre-train the few-shot learner using a base dataset that consists of labelled images belonging to a similar domain. The problem is that it is not easy to obtain labelled datasets in these domains, and with that we are back to square one.

To get over this problem, we need to ensure that labelled training data is not required in the pre-training stage of few-shot learning. Self-supervised learning solves this problem!

Self-Supervised Learning

To recall, we saw in part 1 how few-shot learners use supervised learning to train the feature extractor (albeit on a base dataset that does not include our novel classes) during the first phase, before fine-tuning. We need to replace this supervised learning phase with an approach that does not need labelled data, while ensuring that the feature extractor can still quantify how similar two images are. Such an approach would give us few-shot detectors that work equally well on satellite imagery and other domains where the data is dissimilar to Imagenet data.

Having understood the motivation for, and the advantages of, a learner that does not need a labelled dataset even during the pretraining step, let us now dig deeper and understand how self-supervised learners actually work.

The central idea in self-supervised learning is to derive the supervision for learning from the data itself, rather than relying on manual labels. Labels are generated from the unlabelled data, and during learning the model learns to predict these labels back. This is also termed solving a pretext task.

Some examples of pretext tasks are:

1. Learning to predict whether two random crops come from the same image or from different images.

2. Learning to predict the relative position of two random crops taken from the same image (see the sketch below).
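To make the second pretext task concrete, here is a minimal Python sketch of how such self-generated labels can be produced, along the lines of the classic relative-position task. The helper function, patch size, and sampling scheme are our own illustrative choices, not code from any specific paper:

```python
import random

def _patch(image, cy, cx, size):
    # Cut a size x size patch centred at (cy, cx).
    half = size // 2
    return image[cy - half:cy + half, cx - half:cx + half]

def relative_position_pair(image, size=96):
    """Sample a centre patch and one of its 8 neighbours from `image`
    (an H x W x C array). The neighbour's index (0-7) becomes the label
    the network must predict, so the supervision is generated for free."""
    h, w = image.shape[:2]
    cy, cx = h // 2, w // 2
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    label = random.randrange(8)
    dy, dx = offsets[label]
    anchor = _patch(image, cy, cx, size)
    neighbour = _patch(image, cy + dy * size, cx + dx * size, size)
    return anchor, neighbour, label
```

A network that solves this task must recognize object parts and their layout, which is exactly the kind of knowledge a feature extractor needs.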

The type of pretext task used during self-supervised learning depends on the downstream task which the final model is intended to perform. It has been found that certain types of pretext tasks that work well for a downstream classification task do not work as well for downstream object detection tasks.

Let us now look at some common pretext task approaches that are used for downstream classification tasks:

a. Contrastive learning approach: in this approach, the objective of learning is to maximize the similarity between two augmented views of the same image and reduce the similarity between views of different images.

Some of the more popular implementations that use this approach are MoCo and SimCLR.
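As an illustration of the contrastive objective, below is a minimal PyTorch sketch of the NT-Xent loss used by SimCLR. The function name and temperature value are our own choices, and a production implementation would also handle distributed batches:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR) loss. z1 and z2 hold the embeddings of two
    augmented views of the same batch of images, shape (batch, dim)."""
    batch = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2*batch, dim)
    sim = z @ z.T / temperature                         # cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))          # exclude self-pairs
    # The positive for each sample is its other augmented view.
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(0, batch)]).to(z.device)
    return F.cross_entropy(sim, targets)
```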

b. Clustering-based approach: in this approach, image representations are learnt and clustered, and the predicted clusters are in turn used to learn better representations. Generating the image representations and predicting the clusters are alternated during the learning phase.

Swapping Assignments between multiple Views of the same image (SwAV) is a popular and recent implementation that relies on this approach.
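To give a flavour of the swapped-prediction idea, here is a simplified PyTorch sketch. Note that SwAV computes the target codes with the Sinkhorn-Knopp algorithm and keeps the prototypes as trainable, periodically normalized parameters; the plain softmax below is a crude stand-in used only to keep the sketch short:

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    """Simplified SwAV-style loss. z1, z2: L2-normalised embeddings of two
    views, shape (batch, dim); prototypes: cluster centres, (K, dim)."""
    scores1 = z1 @ prototypes.T / temperature
    scores2 = z2 @ prototypes.T / temperature
    # SwAV derives the target codes with Sinkhorn-Knopp; a plain softmax
    # is used here only for brevity.
    q1 = F.softmax(scores1, dim=1).detach()
    q2 = F.softmax(scores2, dim=1).detach()
    # "Swapped" prediction: view 2 predicts view 1's code and vice versa.
    loss = -0.5 * ((q1 * F.log_softmax(scores2, dim=1)).sum(dim=1).mean()
                   + (q2 * F.log_softmax(scores1, dim=1)).sum(dim=1).mean())
    return loss
```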

Now, let us look at some of the more popular, state-of-the-art implementations of self-supervised learning for object detection.

1. DETReg: the detection heads are pretrained by re-predicting the positions of crops that are generated automatically using selective search, which groups regions with similar local characteristics such as texture or color (the sketch after this list shows one way to generate such regions).

2. DenseCL: DenseCL trains the detection heads of Faster R-CNN to output representations such that aligned sliding-window regions across two views of the same image have similar representations, while unaligned regions have dissimilar representations.
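As mentioned for DETReg above, the region priors come from selective search, which needs no labels at all. The snippet below sketches how such proposals can be generated with OpenCV; it requires the opencv-contrib-python package, and the image path is a placeholder:

```python
import cv2

# Selective search groups pixels by low-level cues such as color and
# texture, yielding the unlabelled region "priors" whose positions
# DETReg learns to re-predict during pretraining.
image = cv2.imread("satellite_scene.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
boxes = ss.process()  # N x 4 array of (x, y, w, h) region proposals
print(f"{len(boxes)} candidate regions; first proposal: {boxes[0]}")
```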

To summarize, various techniques are available for training a model in a low-data regime. The figure below gives an overview.

With so many approaches available, it is important to understand which one is best suited for the data we are working with and the downstream task we intend to perform. The table below has some recommendations.

The approach to use for training is clear in cases where the objects to be detected or classified are either very similar to Imagenet data (e.g., identifying animal species) or completely dissimilar to it (e.g., medical images).

However, the approach to take when working with images that are not completely different from Imagenet data (e.g., satellite images in RGB bands with visible buildings, roads and trees) is ambiguous. We performed some experiments to try and figure out which approach works well.

Some of the questions we will try to answer are:

1. When using supervised learning, does a backbone trained on domain-specific data using self-supervision perform better than a backbone trained on Imagenet?

2. Does a few-shot learner with a backbone trained on Imagenet using self-supervision perform better than one with a backbone trained on Imagenet using supervised learning?

To answer the first question, we trained two self-supervised ResNet-50 backbones on a million RGB satellite images at resolutions of 0.3m, 0.6m and 1m. Of the two backbones, one was trained using SimCLR and the other using SwAV. We used these backbones as feature extractors for a FeatureClassifier intended to classify buildings that were damaged after a fire.

We trained three feature classifiers using the arcgis.learn API. Two of these had the ResNet-50 backbones trained as described above, and the third had a ResNet-50 backbone trained on Imagenet data.
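For reference, below is a minimal sketch of how such a classifier can be set up with arcgis.learn. The data path is hypothetical, and loading a custom SimCLR/SwAV-pretrained backbone would additionally require passing in the pretrained weights, which we omit here:

```python
from arcgis.learn import prepare_data, FeatureClassifier

# Hypothetical path: exported image chips of damaged and
# undamaged buildings.
data = prepare_data(r"/data/building_damage_chips", batch_size=32)

# Between runs we swap the backbone: the stock Imagenet-pretrained
# 'resnet50' shown here versus the SimCLR/SwAV-pretrained checkpoints.
model = FeatureClassifier(data, backbone="resnet50")
model.fit(epochs=20)
model.save("damage-classifier-resnet50-imagenet")
```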

We found that the ResNet-50 backbone trained on Imagenet data significantly outperformed the other two models. The take-home message for us was that, at least for RGB satellite imagery, there is no benefit in training the backbones on satellite imagery; backbones trained on Imagenet data are good enough. The screenshot below shows a side-by-side comparison of the performance of two of these classifiers.

To answer the second question, we trained a couple of object detectors to detect palm trees as shown below.

We chose the DETReg object detector for this exercise. The first of the two detectors had a ResNet-50 backbone trained using self-supervision on Imagenet, while the other had a backbone trained using supervised learning on the same dataset. (For those wondering why we chose DETReg, please refer to this sample notebook, where we show how few-shot object detectors like DETReg, which is built into the arcgis.learn API, help us get accurate models even when we have very little data.) We found that the DETReg detector using the self-supervised ResNet-50 backbone trained with the SwAV approach performed better than the one using the supervised ResNet-50 backbone, albeit by a very small margin.
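For completeness, below is a rough sketch of how such a DETReg model can be trained with arcgis.learn, assuming it follows the same pattern as the library's other detectors; the data path is hypothetical and exact keyword arguments may differ across versions:

```python
from arcgis.learn import prepare_data, DETReg

# Hypothetical path to exported palm-tree chips with bounding-box labels.
data = prepare_data(r"/data/palm_tree_chips", batch_size=8)

# We assume DETReg follows the standard arcgis.learn detector pattern;
# the keyword for choosing the pretrained backbone may differ by version.
model = DETReg(data)
model.fit(epochs=30)
model.average_precision_score()  # evaluate on the validation split
```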

The message we take away from this exercise is that, when dealing with limited satellite imagery training data, few-shot detectors with backbones trained using self-supervised learning approaches give the most bang for the buck.

References and recommended reading:

https://developers.arcgis.com/python/api-reference/arcgis.learn.toc.html
