What if one could use limited labelled data from a small region (e.g. Slovenia, highlighted) and train a classifier that generalises to the whole continent using unlabelled data? Pictured: sample locations from EuroSAT [5].

Semi-supervised learning in satellite image classification

Experimenting with MixMatch and Earth observation data

Jernej Puc
Feb 18, 2020

Introduction

Machine learning is appreciably useful in Earth observation (EO), enabling land cover classification, cloud detection and more.

Focusing on the learning itself, the most straightforward option is to rely fully on supervised learning, in which an algorithm gradually improves its approximation of a general input-output mapping based on a given sample set of input-output pairs.

In the age of “Big Data”, raw data (the inputs) is often easily obtained in very large quantities for a specific sample set, while making sense of it (procuring the desired outputs) can be laborious and expensive. With petabytes of Sentinel-2 image data widely available, but no efficient way to produce accurate corresponding labels in meaningful amounts, EO faces the same situation.

This can be a debilitating problem, since modern deep learning algorithms typically require a lot of labelled data in order to generalise properly. However, one is not entirely helpless in this predicament, for there is an entire subfield that aims to achieve “more with less” and make the most of the data that is at hand.

In the absence of (labelled) data

Tapping into vast unlabelled resources was not always feasible, as existing techniques could even harm the asymptotic performance of training. Nonetheless, more recent semi-supervised learning (SSL) methods allege substantial improvements over the fully-supervised baseline [1].

Looking beyond gains in accuracy alone, an intuitive and potentially remarkable possibility presents itself for EO in particular. With image classification as an example: if we have different sets of raw data for different geographic regions, but labels for only one of them, can we introduce the rest in an unsupervised manner and expect our trained model to be accurate for those as well?

This was a major theme behind the following experiments.

On the methods in consideration

Despite the growing interest in SSL within computer vision, its application to EO is currently limited to specific use-cases [2]. Taking a broader look, it can be observed that many recent SSL methods leverage unlabelled data in the same way, by adding a loss term to the objective function, although they vary in approach and alleged benefits [1].

Typically, they fall into one of three classes:

  • consistency regularisation, encouraging the model to produce the same output distribution when the inputs are perturbed using data augmentation,
  • entropy minimisation, encouraging the model to output confident predictions on unlabelled data, and
  • traditional regularisation, discouraging the model from overfitting to the training data.

Several methods were evaluated and compared by Oliver et al. [1] and unified in MixMatch [3], an algorithm that seeks to combine all of the aforementioned categories. The team at Google Research has made the code for MixMatch publicly available to use and test, and since the algorithm was shown to be greater than the sum of its parts [3], it seemed the most sensible state-of-the-art method to focus on.

MixMatch

MixMatch operates on a per-batch basis. Given a batch of labelled samples X and an equally-sized batch of unlabelled samples U, MixMatch produces a processed batch of labelled samples X’ and a processed batch of unlabelled samples with guessed labels U’.

X’ and U’ then determine their respective loss terms: the typical cross-entropy loss between labels and model predictions for X’ and the squared L_2 loss between guessed labels and predictions for U’. The latter corresponds to the multi-class Brier score, which is bounded and less sensitive to completely incorrect predictions, making it useful for the case of unlabelled data.
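
As defined in [3], the two terms are:

\mathcal{L}_X = \frac{1}{|X'|} \sum_{(x',\, p) \in X'} H\left(p,\; p_{\text{model}}(y \mid x'; \theta)\right)

\mathcal{L}_U = \frac{1}{L \, |U'|} \sum_{(u',\, q) \in U'} \left\lVert q - p_{\text{model}}(y \mid u'; \theta) \right\rVert_2^2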

Here, x’ and p refer to a labelled sample with its one-hot encoded label (representing one of L possible classes), u’ and q to an unlabelled sample with a guessed label, p_model(y|x; θ) refers to a generic model, which produces a distribution over class labels y for an input x with parameters θ, and H(·,·) is the cross-entropy between two distributions.

The combined loss function is a weighted sum:
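
\mathcal{L} = \mathcal{L}_X + w_{\text{match}} \cdot \mathcal{L}_U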

where w_match is an experimentally determined hyper-parameter with the purpose of bringing the two terms in line with one another.

The example (left) depicts how, with an appropriate value of the hyper-parameter w_match, the supervised and unsupervised loss values vary in a comparable fashion. When unaided by the unsupervised loss term, the loss value quickly drops towards zero (right), making the model more prone to overfitting.

Consistency regularisation

MixMatch achieves consistency regularisation through stochastic transformations of the input data, while the associated labels remain unchanged. Applying this data augmentation to labelled samples can artificially expand the size of the training set by generating a stream of new, modified data, enforcing model invariance to the employed transformations.

In computer vision, this is commonly achieved by random spatial translation, rotation, elastic deformation, flipping, cropping, brightness or contrast adjustment, and adding noise.

Examples of data augmentation, typical in computer vision, on a satellite image.
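
As a minimal sketch of such a pipeline in TensorFlow, using the flip, rotation and brightness transformations that the experiments below employ (the parameter values here are illustrative assumptions):

```python
import tensorflow as tf

def augment(image):
    """Random flips, 90-degree rotation and brightness shift for one image tile."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    # Rotate by a random multiple of 90 degrees (satellite tiles have no canonical "up").
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image
```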

Similarly, label guessing is introduced through the notion that an unlabelled sample should be classified the same even after it has been augmented. Each unlabelled sample in U is thus augmented K times. These K samples u_a,k are then fed through the classifier and the predictions are averaged to produce a guessed label, which applies to all K augmented versions of the initial unlabelled sample:
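
\bar{q} = \frac{1}{K} \sum_{k=1}^{K} p_{\text{model}}(y \mid u_{a,k};\, \theta)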

Entropy minimisation

In generating a label guess, MixMatch performs an additional step to reduce the entropy of the label distribution. It does this through the use of a “temperature sharpening” function:
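
\text{Sharpen}(p, T)_i = \frac{p_i^{1/T}}{\sum_{j=1}^{L} p_j^{1/T}}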

where indices i and j denote class components for distribution over L classes. As the T hyperparameter approaches 0, the output will approach a Dirac (one-hot) distribution. Lowering the “temperature” thus encourages the model to produce lower-entropy predictions.

Diagram of the label guessing process used in MixMatch from the original paper [3].
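
Putting the averaging and sharpening together, a minimal sketch of the guessing step might look as follows (assuming model maps an augmented sample to a vector of class probabilities; T = 0.5 is the default used in [3]):

```python
import numpy as np

def guess_label(model, u_augmented, T=0.5):
    """Average predictions over the K augmented versions of one unlabelled
    sample, then sharpen the averaged distribution with temperature T."""
    q = np.mean([model(u) for u in u_augmented], axis=0)  # average over K augmentations
    q = q ** (1.0 / T)                                    # temperature sharpening
    return q / q.sum()                                    # renormalise to a distribution
```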

Traditional regularisation

Regularisation of model weights is commonly used to increase the stability of the network and its ability to generalise to unseen data, since “large weights tend to cause sharp transitions in the node functions and thus large changes in output for small changes in the inputs, rendering them susceptible to noise”.

MixMatch employs two regularisation methods. The first is an explicit weight decay, exponentially decaying the weight values towards zero. For the n-th iteration:
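
\theta_n = (1 - w_{\text{decay}}) \, \theta_{n-1}, \quad \text{i.e.} \quad \theta_n = (1 - w_{\text{decay}})^n \, \theta_0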

The second method, MixUp [4], encourages the model to exhibit strictly linear behaviour “between” samples, so that the model’s output for a convex combination of two inputs is close to the convex combination of their respective outputs.

Specifically, previously augmented labelled samples X_a and unlabelled samples with guessed labels U_a are mixed, i.e. combined and shuffled, to form W. W is then split into W_X and W_U of sizes |X_a| and |U_a|, respectively (note that U_a is K times the size of U due to multiple augmentations per each unlabelled sample). For each pair of samples (x,y) with labels (p,q) in X_a, W_X and U_a, W_U, the following is performed:
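
\lambda \sim \text{Beta}(\alpha, \alpha), \qquad \lambda' = \max(\lambda,\, 1 - \lambda)

x' = \lambda' x + (1 - \lambda')\, y, \qquad p' = \lambda' p + (1 - \lambda')\, q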

i.e. a number is sampled from the beta distribution parameterised by α and ensured to lie between 0.5 and 1 (giving true labels more credence), followed by computing the convex combination of the two samples and their labels. Thus, the processed batches X’ and U’ are formed.

Beta distribution sampling for different values of alpha. Higher values lead to more “mixing”, i.e. the convex combination of two samples will be less likely to lean towards either of them, but will instead be closer to their average.
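
To make the mixing step concrete, here is a minimal NumPy sketch (the function name and batch handling are illustrative, not the published implementation; α = 0.45 matches the value used in the experiments below):

```python
import numpy as np

def mixup(x1, p1, x2, p2, alpha=0.45):
    """MixMatch-style MixUp: convex combination of two samples and their
    labels, with lambda forced above 0.5 so the result leans towards x1."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1.0 - lam) * x2, lam * p1 + (1.0 - lam) * p2
```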

Experiment set-up

To estimate the feasibility of SSL in EO, MixMatch was applied to image classification of satellite imagery and compared to fully-supervised and transfer learning baselines.

For the experiments, the EuroSAT data set [5] was used, providing 27,000 single-labelled and geo-referenced Sentinel-2 satellite images in RGB or multi-spectral form at 10m resolution (with interpolated lower-resolution bands).

Data set preparation

The images, of size 64 by 64 pixels, are divided into 10 distinct and relatively balanced classes, each comprising 2,000, 2,500 or 3,000 samples.

Sample locations for each of the 10 classes in EuroSAT, colour-separated by latitude into 10 equally-sized splits.

When dividing the data set into smaller splits for training and later validation, it was important to preserve this balance; as can be seen from the figure above, however, the classes differ greatly in their regional distribution.

Usually, the classifier is trained on data covering all of the regions in which it is to be used. Splits are then constructed by random sampling within each class, so that each split approximates the regional distribution of the full data set.

Sample locations for each of the 10 equally-sized splits, where samples of different classes were distributed across splits randomly.

In an attempt to simulate regional variety for the purposes discussed in the introduction, the samples within each class were not distributed across splits randomly but rather separated by latitude. Due to the differences in regional distributions between classes, however, balanced class representation could not be achieved without some degree of regional overlap.

Sample locations for each of the 10 equally-sized splits, where samples of different classes were distributed across splits based on latitude.

The data set was divided into 100 splits of 270 samples each (20 to 30 samples per class) for each method of split construction. The latter also dictated how the splits were grouped:

  • Random sampling: the splits could be grouped arbitrarily for any purpose.
  • Latitudinal separation: the labelled set was formed by the northern/southern-most splits, the test set by the southern/northern-most splits, while the rest formed the unlabelled set.

The number of splits that were used to form the labelled and unlabelled training sets varied according to the objective, while the test set always consisted of 10 such splits (10% of all available data).
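
As a sketch of the latitudinal construction (the function name and inputs are assumptions; latitudes and labels are taken to be NumPy arrays indexed by sample):

```python
import numpy as np

def latitudinal_splits(latitudes, labels, n_splits=10):
    """Assign samples to splits by latitude, separately per class,
    so that each split keeps the class balance of the full set."""
    split_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        order = idx[np.argsort(latitudes[idx])]  # south to north within the class
        # Cut the sorted class samples into n_splits equal chunks.
        for split, chunk in enumerate(np.array_split(order, n_splits)):
            split_ids[chunk] = split
    return split_ids
```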

Details of implementation

In their experiments, the authors of MixMatch used a deep residual network model, specifically the Wide ResNet-28-2. With approximately 1.5 million total parameters, that is already much simpler than many standard models. Since high accuracy on EuroSAT was achieved at even lower complexities, the model was restricted to a regular ResNet-28, with approximately 370,000 parameters, to hasten the learning process, as the experiments were conducted on a low-cost single-GPU instance.

Starting from the published MixMatch implementation, the following hyperparameter values were used:

  • filters (related to ResNet width): 16
  • scales (related to ResNet depth): 3
  • beta (α): 0.45
  • w_match: 100
  • wd (w_decay): 0.01

Other parts of the configuration were mostly left at their defaults, with a few important notes:

  • The data was rescaled from the value range of [0, 255] to [-1, 1] and further standardised by subtracting the mean and dividing by the standard deviation (per each split); see the sketch after this list.
  • A sequence of flipping, rotation and brightness adjustment was used to augment the data.
  • The weights of the network were initialised using the He uniform variance scaling initialiser [6].
  • The model was trained for 48 epochs consisting of 256 batches with 64 images per batch.
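
A minimal sketch of the preprocessing step (assuming images is a NumPy uint8 array holding one split):

```python
import numpy as np

def preprocess(images):
    """Rescale from [0, 255] to [-1, 1], then standardise with the
    split's own mean and standard deviation."""
    x = images.astype(np.float32) / 127.5 - 1.0
    return (x - x.mean()) / x.std()
```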

Standardisation and initialisation were shown to be especially important; without them, convergence of the loss was delayed and a peculiar artifact emerged in the early stage of training, as can be seen in the accuracy trend below:

Model performance on a split with 2,700 samples, trained via fully-supervised learning.

Fine-tuning of a pre-trained model

A popular alternative to SSL is transfer learning, where models pre-trained on large collections of natural images are fine-tuned on a relatively small set of labelled data. This experiment was aimed at comparing the performance of the two methods.

Settling on a particular pre-trained model turned out to be difficult, as most publicly available models are larger than ResNet-28. Of those, MobileNetV2 [7] is among the smallest and most comparable in this regard. It comes pre-trained on ImageNet for various input sizes, but not for 64x64 pixels, the size of EuroSAT images. Therefore, the data was upscaled to 128x128 pixels, for which the corresponding model weights were at hand. With approximately 2.3 million parameters, the model is still about 6 times larger than ResNet-28; if the latter were to perform better with MixMatch than the larger model without it, the implications should remain clear.

In this experiment, the data was augmented and standardised as well. Fine-tuning similarly consisted of 48 epochs, but divided into two consecutive phases: as is typical, the pre-trained component was first “frozen” (set to untrainable) and only the top/head of the model was trained, after which a few more layers were released for the second phase of fine-tuning.

Details of fine-tuning phases. The learning rate applies to the Adam optimiser.
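
In Keras terms, the two-phase procedure might be sketched as follows. The learning rates, epoch split and number of released layers here are illustrative assumptions (the values actually used were given in the table above), and train_ds stands in for a dataset of (image, one-hot label) batches:

```python
import tensorflow as tf

# Pre-trained backbone without its ImageNet classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights='imagenet')

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),  # 10 EuroSAT classes
])

# Phase 1: freeze the backbone, train only the new head.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, epochs=24)

# Phase 2: release the last few backbone layers and continue
# with a lower learning rate.
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, epochs=24)
```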

Note: When it comes to transfer learning with Keras and TensorFlow, there is a persistent issue pertaining to pre-trained models that include batch normalisation layers, which might have significantly influenced the final results. Although frequently reported and elaborated on, it does not appear to have been adequately resolved at the time of writing. A band-aid solution forces the model to use batch statistics during inference; the reported test scores were therefore obtained by considering the whole test set as a single batch.
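
In code, the band-aid amounts to something like the following (test_images here stands for the full test set held as one array; training=True makes the batch normalisation layers use the current batch's statistics instead of the frozen moving averages):

```python
# Evaluate with batch statistics rather than frozen moving averages;
# the whole test set is passed as a single batch so the statistics
# are computed over all test samples at once.
predictions = model(test_images, training=True)
```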

Results

Convergence

The primary experiment was to observe how each learning method behaves as the number of unique labelled samples in training is increased:

Overall accuracy with regard to the number of unique labelled samples for each learning method. The plots are semi-logarithmic for the purpose of clarity.

In the random distribution case, it can be seen that MixMatch performs consistently well, even when labelled data is scarce, while fully-supervised and transfer learning slowly converge towards a similar score. This is an important insight, as it means that the amount of labelled data can be greatly reduced without significantly affecting the model’s performance.

In the latitudinal separation case, both transfer learning and MixMatch generalise to unseen regions much better than the fully-supervised baseline, with MixMatch retaining the edge. Gains in performance are substantial and can mean the difference between a usable and an unusable model.

Deviation

To assess how this optimistic result varies with the choice of splits that form the labelled, unlabelled and test sets, 5-fold validation was performed for the random distribution case and 4-fold for the latitudinal separation case (due to its polarisation). The labelled and unlabelled sets consisted of 10 and 80 splits, respectively.

Overall accuracy and the corresponding deviation bands as they emerge during training.

In the random distribution case, the deviations are small, but they grow considerably in the latitudinal separation case. Interestingly enough, transfer learning appears to be the most consistent in this regard.

Note that MixMatch, even at its worst in the latitudinal separation case, performs as well as transfer learning does on average. Thus, its standing among the methods is reinforced.

Notes

  • Transfer learning reached saturation much faster, despite being applied to a larger model.
  • Transfer learning achieving lower scores in the random distribution case could perhaps be attributed to the aforementioned batch normalisation issue. Alternatively, the pre-training itself could be limiting the asymptotic performance, as tends to happen with auto-encoders.

Conclusion

Due to the time frame and the naturally limited scope of the comparisons, many methods in the literature were not closely considered (e.g. unsupervised data augmentation [8]).

Nonetheless, the results indicate that models trained via semi-supervised learning, specifically MixMatch, are at least comparable to those trained with solely-supervised methods using the same amount of labelled data, and notably outperform them when very few labels are available. Performance on latitudinally separated data, in particular, seems to hint at MixMatch’s generalisation capabilities, which has significant implications for Earth observation.

Since the conclusion of the presented experiments, the MixMatch team at Google Research has already followed up on their publication with ReMixMatch [9] and FixMatch [10]. These new iterations are alleged to perform even better in low-label scenarios, approaching the domain of few-shot learning, which demands further attention.

Many more experiments can be done to uncover the potential of this family of powerful methods in the field of EO. Whether trying out a different data set (e.g. BigEarthNet [11]) or extending towards a different application (e.g. semantic segmentation), our experiments with semi-supervised learning are bound to continue.

Eager to experiment with SSL? If you are interested in our work or in Earth observation in general, you can join us and contribute to our research endeavours.

References

[1] Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, Ian J. Goodfellow. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[2] Prem Shankar Singh Aydav and Sonajharia Minz. Semi-Supervised Learning for the Classification of Remote Sensing Images: A Literature Review. Krishi Sanskriti Publications, Advances in Computer Science and Information Technology (ACSIT). Volume 4, Issue 1; January-March, 2017, pp. 10–15.

[3] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, Colin Raffel. MixMatch: A Holistic Approach to Semi-Supervised Learning. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[4] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. arXiv preprint, Machine Learning (cs.LG).

[5] Patrick Helber, Benjamin Bischke, Andreas Dengel, Damian Borth. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv preprint, Computer Vision and Pattern Recognition (cs.CV).

[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.

[8] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le. Unsupervised Data Augmentation for Consistency Training. arXiv preprint, Machine Learning (cs.LG).

[9] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, Colin Raffel. ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv preprint, Machine Learning (cs.LG).

[10] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, Colin Raffel. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. arXiv preprint, Machine Learning (cs.LG).

[11] G. Sumbul, M. Charfuelan, B. Demir, V. Markl. BigEarthNet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding. IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 5901–5904, Yokohama, Japan, 2019.
