Self-Supervised Classification: Semantic Clustering by Adopting Nearest Neighbors

A 2020 alternative to the orthodox classification paradigm

Mahima Modi
VisionWizard
6 min read · Jul 5, 2020


Ironically, neural networks, which claim to reduce manual hardship, themselves require manually annotated, supervised datasets. This manual annotation eats up most of the hours and days spent on the training process.

The paper “Learning To Classify Images Without Labels” proposes a solution to this tedious problem. In this article, we will try to break down the method proposed by the paper. The article will also cover the different experiments performed and their observations.

Table of Contents

  1. Introduction
  2. Proposed Algorithm
  3. Experiments and Results
  4. Conclusion
  5. References

Introduction

  • Image classification models are typically trained on a supervised dataset, where input images have labels assigned to them for the network to learn features from. But some self-supervised approaches to classification have recently emerged, namely (i) two-stage pipeline methods and (ii) end-to-end methods.
  • In the two-stage pipeline method, the first stage uses representation learning to extract features with a neural network, whose quality is usually checked by fine-tuning the network with supervision. Clustering (K-means) is then applied to those features, optimizing its own criterion independently of the learned representation (a minimal sketch of this baseline follows the list).
  • The end-to-end approach combines feature extraction and clustering in one pipeline. The issue with these methods is that the clustering is vulnerable to the initially learned (low-level) features.
  • This paper’s method removes the dependency on labeled data for training a semantic classification model. It also eliminates the need to know the number of classes beforehand. Furthermore, the authors show that the other parameters used in the method do not significantly affect the model.
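
To make the two-stage baseline above concrete, here is a minimal sketch, assuming a frozen self-supervised `encoder` and a `dataloader` of unlabeled images (both are placeholder names, not the paper's code): features are extracted first, and K-means is run on them afterwards, with the clustering criterion optimized independently of the representation.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def extract_features(encoder, dataloader, device="cpu"):
    """Stage one: run a frozen, self-supervised encoder over the data."""
    encoder.eval()
    feats = []
    for images, _ in dataloader:            # labels are present but ignored
        feats.append(encoder(images.to(device)).cpu())
    return torch.cat(feats).numpy()

# Stage two: cluster the features with K-means. K-means optimizes its own
# criterion, independently of how the features were learned, which is the
# weakness of this baseline that the paper points out.
# Usage (names assumed to exist):
# features = extract_features(encoder, dataloader)
# cluster_ids = KMeans(n_clusters=10).fit_predict(features)  # 10 is illustrative
```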

Proposed Algorithm

0. Model Overview

In contrast to the current trend of end-to-end models, this paper advocates a two-stage method.

Fig-0: Model pipeline
  1. Representation learning, where a pretext task is used to learn a feature embedding. Based on the pretext output, semantically meaningful nearest neighbors are mined for each image.
  2. Then a neural network is trained with a dedicated loss function (section 3 explains it). Instead of ground-truth labels, each image together with its mined neighbors is used for training.

1. Representation learning: A pretext task

In representation learning, a pretext task learns an embedding function Φ_θ, parameterized by a neural network with weights θ, that maps images into feature representations in a self-supervised fashion.

Pretext tasks are auxiliary tasks, such as image colorization, affine transformation prediction, or instance discrimination, that a neural network is trained to solve in order to learn useful features. A good pretext task produces high-level features that are invariant to low-level properties of the images (e.g. color, contrast, texture).

How to pick a pretext task? A pretext task that minimizes the distance between the feature embeddings of an image X_i and its augmented version T[X_i] (random cropping, image flipping, etc.), i.e. min_θ d(Φ_θ(X_i), Φ_θ(T[X_i])) as shown in Fig-2, can be used before semantic clustering.

Fig-2: Equation for Pretext; Source[Link]

This paper suggests using instance discrimination [2] as the pretext task for semantic clustering.

It is beneficial to select a pretext task that imposes invariance between images and their augmentations.
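
To illustrate the criterion in Fig-2, below is a minimal sketch (not the full instance-discrimination loss of [2]) that minimizes the cosine distance between the embedding of an image and that of its augmentation. `backbone` and `augment` are assumed, hypothetical names; a real pretext task would additionally contrast against other images so the embedding does not collapse.

```python
import torch.nn.functional as F

def pretext_invariance_loss(backbone, images, augment):
    """Toy version of the Fig-2 criterion: minimize d(phi(X_i), phi(T[X_i])).
    `augment` is assumed to apply a random transformation T to a batch."""
    z1 = F.normalize(backbone(images), dim=1)            # phi(X_i)
    z2 = F.normalize(backbone(augment(images)), dim=1)   # phi(T[X_i])
    # Cosine distance between each image and its augmented view.
    # On its own this objective can collapse; instance discrimination [2]
    # additionally pushes different images apart to prevent that.
    return (1.0 - (z1 * z2).sum(dim=1)).mean()
```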

2. Mining nearest neighbors

In representation learning, the model Φ_θ is trained to solve the pretext task. Then, for every X_i in the dataset, its nearest neighbors N_Xi are mined based on the embeddings from the pretext task. Refer to Fig-3 for the neighbor-mining flow (a minimal code sketch follows Fig-3).

Fig-3: Semantic Clustering Dataset based on pretext embedding
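
Here is a minimal sketch of the mining step in Fig-3, assuming the pretext embeddings have already been computed as a tensor `embeddings` of shape (N, D). A brute-force cosine similarity is used for clarity; it is fine for illustration but quadratic in memory for large datasets.

```python
import torch
import torch.nn.functional as F

def mine_nearest_neighbors(embeddings, k=5):
    """Return, for every sample, the indices of its k nearest neighbors
    in the pretext embedding space (excluding the sample itself).
    k = 5 matches the value discussed around Fig-9; treat it as a knob."""
    z = F.normalize(embeddings, dim=1)          # work in cosine-similarity space
    similarity = z @ z.t()                      # (N, N) pairwise similarities
    _, indices = similarity.topk(k + 1, dim=1)  # top hit is the sample itself
    return indices[:, 1:]                       # drop the self-match

# neighbors[i] then holds the dataset indices of the k images closest to image i.
```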

The data mined by this process will look like the samples shown in Fig-4.

Fig-4: Images (first column) and their nearest neighbors (other columns) sampled under pretext task; Source[Link]

Fig-5 shows how often the mined nearest neighbors belong to the same semantic class as the query image.

Fig-5: Neighboring samples tend to be instances of the same semantic class; Source[Link]

3. Clustering: A semantic clustering loss

Now that we have X_i and its mined neighbors N_Xi, the aim is to train a neural network Φ_η that classifies them (X_i and N_Xi) into the same cluster. The weights of Φ_η are updated by minimizing the loss function given in Fig-6.

Fig-6: Semantic Clustering by Adopting Nearest neighbors- Loss Updation

In the loss function in Fig-6, <·> in the first term denotes the dot product operator.

Every clustering method mainly focuses on minimizing the intra-cluster distance and maximizing the inter-cluster distance.

  • Hence the first term tries to minimize the intra-cluster distance, i.e. to make consistent predictions that classify X_i and N_Xi into the same class.
  • To avoid classifying all inputs into a single cluster, a second term, an entropy term, is introduced. It makes sure that predictions are spread uniformly across the clusters (a minimal sketch of this loss follows the list).
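
The snippet below is a rough sketch of the two-term loss in Fig-6: the first term rewards a high dot product between the cluster predictions of a sample and its mined neighbor, and the second term maximizes the entropy of the average prediction so that the clusters stay balanced. The tensor names and the `entropy_weight` default are illustrative, not the paper's exact settings.

```python
import torch

def scan_loss(anchor_probs, neighbor_probs, entropy_weight=5.0):
    """anchor_probs, neighbor_probs: (B, C) softmax outputs of the network
    for a batch of images and one mined neighbor per image."""
    eps = 1e-8
    # Consistency term: maximize <p(X_i), p(neighbor)> so that an image and
    # its neighbor land in the same cluster (minimizes intra-cluster distance).
    dot = (anchor_probs * neighbor_probs).sum(dim=1)
    consistency = -torch.log(dot + eps).mean()
    # Entropy term on the mean prediction of the batch: maximizing it keeps
    # the assignments spread over all clusters instead of collapsing into one.
    mean_probs = anchor_probs.mean(dim=0)
    entropy = -(mean_probs * torch.log(mean_probs + eps)).sum()
    return consistency - entropy_weight * entropy
```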

4. Fine Tuning: Self-Labeling

The clustering network still produces some false positives, but these typically come with low certainty. The paper therefore performs a self-labeling step to make the network more certain.

  • During training, confident samples are selected by thresholding the probability at the output, i.e. p_max > threshold (only the most certain samples are considered).
  • The selected samples are given pseudo-labels (the class they were classified into), and strongly augmented versions of these confident samples (which help avoid overfitting) are generated for further training.
  • A cross-entropy loss is used to update the weights of the network (a minimal sketch of this step follows the list).
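
Below is a minimal sketch of the self-labeling step, assuming a trained clustering `model`, a batch of `images`, and a `strong_augment` function that works on batched tensors; the 0.99 threshold is an illustrative choice for "confident enough", not necessarily the paper's value.

```python
import torch
import torch.nn.functional as F

def self_label_step(model, images, strong_augment, threshold=0.99):
    """Pseudo-label the confident samples in a batch and compute the
    cross-entropy loss on their strongly augmented versions."""
    with torch.no_grad():
        probs = F.softmax(model(images), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = confidence > threshold           # keep only certain samples
    if mask.sum() == 0:
        return None                             # nothing confident in this batch
    strong = strong_augment(images[mask])       # strong views help avoid overfitting
    logits = model(strong)
    return F.cross_entropy(logits, pseudo_labels[mask])
```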

Experiments and Results

Datasets and Augmentations

The model was tested on various datasets:
→CIFAR10
→STL10
→CIFAR100–20
→ImageNet-1000

The augmentation process:

  1. Standard data augmentations are random flips, random crops, and jitter.
  2. Strong augmentations are composed of four randomly selected transformations from AutoAugment.

Fig-7: Influence of the augmentation strategy on model accuracy; Source[Link]

Fig-7 shows that applying strong augmentations to the samples and their nearest neighbors further improves the performance of the model and imposes invariance on the dataset.
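
As an illustration, the torchvision pipelines below approximate the two augmentation regimes described above; the crop size, jitter strengths, and the use of torchvision's AutoAugment CIFAR10 policy are assumptions rather than the paper's exact settings.

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Standard augmentations: random flips, random crops and jitter.
standard_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),     # CIFAR-sized images assumed
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

# Strong augmentations: AutoAugment-style transformations on top of the basics.
strong_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    AutoAugment(AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
])
```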

Pretexts

For the pretext task, the paper considers the following self-supervised feature learning methods:

  1. Noise-Contrastive Estimation (NCE): instance discrimination
  2. RotNet: Trained to predict image rotations.
  3. Feature Decoupling: Jointly tackles instance discrimination and rotation prediction

Rotation prediction also discriminates between samples and their augmentations, which in turn increases the distance between their embeddings. Hence NCE is used as the pretext task. Fig-8 shows the accuracy results for each pretext task.

Fig-8: Pretext task accuracy results; Source[Link]

K-Nearest Neighbors

Since we are using K-nearest neighbours, the obvious question is what value of K is appropriate for building the clustering dataset.
→ K = 0 means clustering only samples and their augmentations together.
→ K ≥ 1 captures more of the cluster’s variance, but also risks increasing noise, i.e. not all samples and their neighbors belong to the same cluster.

Fig-9: Influence of the used number of neighbors K on model accuracy; Source[Link]

The experiments shown in Fig-9 indicate that the classification model is not very sensitive to the value of K, but for K = 5 the performance improves significantly, even at the cost of including some noise.

OverClustering

The paper sets the number of clusters according to the ground-truth dataset. However, this is not possible when no prior number of classes is given. Hence an experiment was performed where the number of clusters was set higher than the number of ground-truth classes (overclustering).

Fig-10: Influence of the number of clusters C on model accuracy

The paper attributes the increased performance on STL10 and CIFAR100–20 to their higher intra-class variance.

Conclusion

  • The proposed method eliminates prior knowledge requirements of:
    (a) ground-truth semantic labels at training time and
    (b) the number of classes.
  • Strong augmentation of the data helps improve model performance.
  • The pretext task should focus on reducing the intra-cluster distance (bringing samples and their augmentations close together) rather than discriminating samples from their augmentations.
  • The model does not depend strongly on other factors such as K or the estimated number of clusters. However, K ≥ 1 is recommended, as it helps capture the variety within a class instead of rigidly grouping only a sample with its augmentations.
  • Ambiguity in the data distribution, for example discriminating between different primates (chimpanzee, baboon, langur, etc.), can be a bit more tricky. At the same time, the model can deal with various backgrounds and scenarios and still classify pretty well.

Thank you for reading the article. I hope that, as a writer, I was able to convey the topic with the utmost clarity. Please leave a comment if you have any feedback or doubts.
