Self Labeling Via Simultaneous Clustering and Representation Learning

Vedant Ghate
Machine Intelligence and Deep Learning
10 min read · Apr 26, 2022

Blog by Vedant Ghate and Ashutosh Avadhani

A video explanation of this blog is also available.

In the world of Deep Neural Networks, learning from the available data is a primary and crucial step. The learning phase can be of three types, viz. supervised, semi-supervised, and unsupervised learning. Unsupervised learning has been widely studied in the Machine Learning (ML) community, and algorithms for clustering, dimensionality reduction, or density estimation are regularly used in Computer Vision (CV) applications. Unsupervised learning works with unlabeled data, which can dramatically reduce the cost of deploying machine learning algorithms to new applications and thus amplify their impact on the real world. It is possible to adapt unsupervised methods such as clustering, density estimation, or dimensionality reduction to deep models, but these methods perform poorly when scaled to larger datasets. Clustering methods in particular were designed for linear models on top of fixed features, and they do not work well when the features have to be learned at the same time.

Representation Learning

The process of learning a deep neural network while discovering the data labels can be termed simultaneous clustering and representation learning. We can cluster the data with an unsupervised clustering algorithm such as k-means and use the resulting cluster assignments as pseudo-labels for training the network (by minimizing the cross-entropy). But naively combining representation learning with clustering can lead to degenerate solutions: trivial optima in which, for example, all data points are assigned to a single cluster and the learned representation collapses. DeepCluster [1] introduced a method that achieved excellent results in unsupervised representation learning. However, it does not optimize a single well-defined objective; instead, it avoids degenerate solutions through particular implementation choices. To overcome these problems, a paper by Asano et al. [2] introduces a new method that provides a principled formulation for simultaneous clustering and representation learning.

In this approach, we proceed as follows:

  1. Minimize the cross-entropy loss to learn the deep network and estimate the data labels. This is exactly what is done in (semi-)supervised learning; if we apply the same idea in the unsupervised setting, however, we can end up with degenerate solutions. To solve this issue, we proceed to step 2.
  2. Add the constraint that the labels must induce an equipartition of the data. This maximizes the information between data indices and labels and turns the label estimation into an assignment problem, which is an instance of the optimal transport problem and can therefore be solved with linear programming. The difficulty is that for large datasets, with millions of data points and thousands of labels, standard transport solvers do not scale. To solve this issue, we proceed to step 3.
  3. Use a fast version of the Sinkhorn-Knopp algorithm to solve the transport problem for a large number of data points. It relies only on simple matrix-vector algebra, which keeps each iteration cheap.

Why do we need a self-labeling algorithm?

We use a model in which class scores are mapped to class probabilities via the softmax operator, and the model and head parameters are learned by minimizing the average cross-entropy loss. This kind of training requires a labeled dataset. When labels are not available, we need to produce them ourselves, and that is where a self-labeling algorithm comes in.

Let’s walk through the method in detail

Self-labeling:

We start by defining a Deep Neural Network as follows:

[Figure: Model Representation]
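In the notation of [2], the model has two parts:

Φ : xᵢ ↦ Φ(xᵢ)  (a deep network mapping each image to a feature vector)
h : Φ(xᵢ) ↦ K class scores  (a classification head on top of the features)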

Using the softmax operator, the class scores are mapped to class probabilities, and the network is trained by minimizing the average cross-entropy loss. Our objective thus becomes:
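For a labeled dataset x₁, …, x_N with labels y₁, …, y_N, this is the standard average cross-entropy objective, written in [2] as:

E(p | y₁, …, y_N) = −(1/N) Σᵢ log p(yᵢ | xᵢ)

where the sum runs over the N training samples.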

In supervised and semi-supervised learning, minimizing this objective requires labeled data; labeling millions of images by hand is not practical, which is why we turn to unsupervised learning.

Naively minimizing the cross-entropy loss in the unsupervised setting leads to a degenerate solution: the optimum is trivially achieved by assigning all data points to a single arbitrary label.

To address this issue, the labels are encoded as a posterior distribution q(y|xᵢ). Using this posterior distribution, our objective can be rewritten as:
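With the posterior in place, the objective becomes a cross-entropy between the assigned distribution q and the predicted distribution p; this is the form used in [2]:

E(p, q) = −(1/N) Σᵢ Σᵧ q(y | xᵢ) log p(y | xᵢ)

where i runs over the N data points and y over the K possible labels.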

If we allow q to be optimized freely, the degeneracy persists, because optimizing q is the same as reassigning the labels arbitrarily. To avoid this, we make the posterior deterministic, so that each data point is assigned exactly one label y, and add the constraint that the N data points are grouped uniformly into the K clusters*. Our learning objective is then to minimize E(p, q) under these constraints, as follows:
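In symbols, the constrained problem of [2] reads:

minimize E(p, q) over p and q,  subject to  q(y | xᵢ) ∈ {0, 1}  and  Σᵢ q(y | xᵢ) = N/K for every label y

so that each data point receives exactly one label and every label is used equally often.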

Now that we have formalized the learning objective, notice that the assignment variables q are discrete: for each data point we must pick exactly one of K labels, subject to the equipartition constraint. This makes the problem combinatorial and, at first sight, very difficult to optimize. To make it tractable, we transform the learning objective into an instance of an optimal transport problem.

*This is the equipartition condition: the label assignment must partition the data into equally-sized subsets of ⌊N/K⌋ or ⌊N/K⌋+1 elements.

The Optimal Transport Problem

The optimal transport problem (OTP) provides a framework in which one pays a cost for transporting one measure to another. In layman's terms, the OTP asks how to move the probability mass of a measure A onto a measure B while minimizing the total cost, given a cost c for transporting one unit of mass from A to B.

In our learning objective, we can view the matrix Pᵧᵢ = p(y|xᵢ)·(1/N) of joint estimated probabilities as our measure 'A' and the matrix Qᵧᵢ = q(y|xᵢ)·(1/N) of joint assigned probabilities as our measure 'B'. Using the notion of the OTP (presented in [3]), Q is constrained to lie in the set of all transport tables that satisfy a given pair of margins. Our objective is then transformed to:
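Concretely, with U(r, c) denoting the transportation polytope, i.e. the set of non-negative K×N matrices whose rows sum to r = (1/K)·1 and whose columns sum to c = (1/N)·1, the objective of [2] becomes:

minimize 〈Q, −log P〉 over Q ∈ U(r, c)

with the logarithm applied element-wise.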

where 〈·,·〉 is the Frobenius dot product (the element-wise inner product of two matrices).

By using the OTP formulation, our objective has become a linear program; however, traditional solvers do not scale to large datasets. Therefore, to solve the transport problem at scale, we use a fast version of the Sinkhorn-Knopp algorithm, which adds a regularization term based on the Kullback-Leibler (KL) divergence:
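Following [2] and [3], the regularized problem reads:

minimize 〈Q, −log P〉 + (1/λ) KL(Q ‖ rcᵀ) over Q ∈ U(r, c)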

where KL is the Kullback-Leibler divergence and rcᵀ can be interpreted as a K×N probability matrix. The advantage of the KL term is that the minimizer of this regularized objective has a simple form, which the Sinkhorn-Knopp iterations converge to; it can be written as:
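Q = diag(α) · P^λ · diag(β)

where the exponentiation is element-wise and α, β are the scaling vectors; this is the matrix-scaling form derived in [3].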

The value of λ trades off convergence speed against closeness to the exact optimal transport solution. Since we only care about obtaining good clusters rather than solving the transport problem exactly, λ is kept fixed. The scaling vectors α and β are chosen so that the resulting matrix Q is a probability matrix (i.e. lies in U(r, c)). In practice, the values of α and β are retained from one self-labeling cycle to the next, which warm-starts the iterations and speeds up convergence.

The core algorithm:

Finally, the algorithm learns the model and assigns the labels by solving the optimization problem:
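In the transport notation introduced earlier, this is:

minimize E(p, Q) over the model p and the assignment Q,  subject to  Q ∈ U(r, c)

which, up to an additive constant, is the same as minimizing 〈Q, −log P〉 with respect to both the model and the assignment.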

It is achieved by alternating between the following two steps:

  1. Given the current label assignment Q, we update the model (h ◦ Φ) by minimizing our objective with respect to the model parameters, i.e. ordinary cross-entropy training on the pseudo-labels
  2. Given the current model (h ◦ Φ), we compute the probability matrix P and then recompute the label assignment Q by iterating the Sinkhorn-Knopp updates below:
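One standard way to write the Sinkhorn-Knopp scaling updates, with ⊘ denoting element-wise division, is:

α ← r ⊘ (P^λ β)
β ← c ⊘ ((P^λ)ᵀ α)

Each pass alternately rescales the rows and columns of P^λ so that the marginals of Q = diag(α) · P^λ · diag(β) approach r and c.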

Each update involves a single matrix-vector multiplication with complexity O(N K), so it is relatively quick even for millions of data points and thousands of labels.
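As a concrete illustration, here is a minimal NumPy sketch of this self-labeling step; the function name, the log-domain shift used for numerical stability, and the default iteration count are our own choices rather than details taken from [2]:

```python
import numpy as np

def sinkhorn_self_label(log_p, lam=25.0, n_iters=100):
    """Compute a soft label assignment Q from model log-probabilities.

    log_p : (K, N) array of log p(y|x_i) produced by the current model.
    Returns Q of shape (K, N) with rows summing to ~1/K and columns to 1/N,
    i.e. an (approximately) equipartitioned assignment. Hard pseudo-labels
    can be read off with Q.argmax(axis=0).
    """
    K, N = log_p.shape
    r = np.full(K, 1.0 / K)                  # target row marginals
    c = np.full(N, 1.0 / N)                  # target column marginals

    # Shift each column by its max before exponentiating; the shift is
    # absorbed by beta, so Q is unchanged but underflow is reduced.
    log_p = log_p - log_p.max(axis=0, keepdims=True)
    P_lam = np.exp(lam * log_p)              # element-wise P^lambda

    beta = np.ones(N)
    for _ in range(n_iters):
        alpha = r / (P_lam @ beta)           # rescale rows to match r
        beta = c / (P_lam.T @ alpha)         # rescale columns to match c

    # Q = diag(alpha) P^lambda diag(beta), written with broadcasting
    Q = alpha[:, None] * P_lam * beta[None, :]
    return Q
```

In the alternating scheme above, the Q returned by this step is then used as the target distribution for the next round of cross-entropy training.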

Implementation of the Self-Labeling (SeLa) technique and optimal configuration

Architecture setup

  • Linear Probes: a technique in which a linear classifier is trained on top of a pre-trained feature representation that is kept fixed (see the sketch after this list)
  • Data: training data consists of ImageNet LSVRC-12 and other smaller datasets of a similar kind
  • Architecture: the base encoder used for these experiments is AlexNet, and the probes are inserted after each ReLU activation. With five convolutional layers denoted conv1 to conv5, the features of the first two layers (conv1, conv2) can largely be matched by simple data augmentation, so the three deeper layers are the most informative for the performance analysis.
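A rough PyTorch sketch of such a linear probe; the class name, the pooling layer, and the argument names are our own illustrative choices, not taken from the paper:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Train a linear classifier on top of a frozen, pre-trained backbone."""

    def __init__(self, backbone, feat_dim, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # representation stays fixed
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse conv feature maps
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        # backbone is assumed to return conv feature maps of shape (B, C, H, W)
        with torch.no_grad():                # no gradients flow into the backbone
            feats = self.backbone(x)
        feats = self.pool(feats).flatten(1)
        return self.classifier(feats)
```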

Optimal configuration and ablations

  • Number of clusters: a larger number of clusters (K) slightly decreases performance
  • Number of heads: increasing the number of heads (T) yields a performance gain, ranging from roughly 2% (for AlexNet) to 10% (for ResNet)
  • Number of self-labeling steps: the self-labeling steps are essential for good performance; varying how many times the self-labeling algorithm is run during training, the best results occurred at around 80
  • Architectures: SeLa works well with various standard architectures such as AlexNet and ResNet

Implementation Results

All models are trained using SGD with an initial learning rate of 0.05 for 400 epochs, reducing the learning rate at epochs 150, 300, and 350. The value of λ is kept at 25.
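In PyTorch terms, that schedule looks roughly like the following; the stand-in model, the momentum, the weight decay, and the decay factor are assumptions on our part and not values given above:

```python
import torch
import torch.nn as nn

# Stand-in model; in the actual experiments this would be AlexNet or ResNet.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 3000))

optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=1e-4)   # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 300, 350], gamma=0.1)          # decay factor assumed

for epoch in range(400):
    # ... one epoch of cross-entropy training on the current pseudo-labels ...
    scheduler.step()
```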

The following table compares clustering metrics obtained with the trained models, evaluated against the ground-truth labels of the ImageNet validation set. The metrics include the adjusted Normalized Mutual Information (aNMI) and the Adjusted Rand Index (ARI); top-1 accuracies are obtained via ImageNet linear probing.

[Table: Comparison of clustering metrics]

The following graph shows how the distribution of the cross-entropy between the pseudo-labels and the ImageNet labels changes with training time. It helps us understand how well the input images are being clustered.

[Figure: Cross-entropy of pseudo-labels]

We observe that at the start of training the cross-entropy is high, close to that of random labels, but towards the end of training it reaches a very low value. The following figures show random samples of images associated with the lowest-entropy pseudo-labels.

The following are visualizations of validation-set images, sampled from the pseudo-classes that had the lowest entropy on the training set.

The next two sets of figures visualize random pseudo-classes: the first is obtained by randomly sampling ImageNet training images associated with random pseudo-classes, and the second by randomly sampling validation-set images in the same way.

How does Self Labeling (SeLa) solve the problem?

When clustering feature vectors extracted from the input data by the neural network, the natural objective is to minimize the reconstruction error of the clustering algorithm (for example, the k-means objective). The problem is that this minimum can be reached trivially by pushing all data points into a single cluster, which forces all the cluster means to coincide with a single feature vector. DeepCluster [1] uses this kind of objective and avoids degenerate solutions by carefully interleaving the representation-learning and clustering steps: the feature vectors are kept fixed while the k-means centroids are computed, so they cannot collapse onto a single point, and the label assignments are kept fixed during the classification step. In contrast to this method, whose convergence cannot be guaranteed because there is no single well-defined objective, SeLa uses one well-defined objective for both representation learning and self-labeling by reducing the label-assignment step to an optimal transport problem.

The learning formulation of SeLa can be interpreted as maximizing the mutual information between data indices and labels while explicitly enforcing the equipartition condition (which maximizing the information would imply in any case). Compared to minimizing the entropy alone, maximizing the information avoids degenerate solutions, since degenerate solutions carry no mutual information between labels y and indices i. DeepCluster also avoids degenerate solutions, but it cannot guarantee convergence, which is why SeLa has an advantage over DeepCluster.

The Conclusion

Self Labeling (SeLa) is State-Of-The-Art

  • A self-supervised feature learning method that is based on clustering
  • Optimizes the same objective during feature learning and clustering
  • Resulting self-labels can be used to quickly learn features for new architectures using simple cross-entropy training
  • Outperformed the other self-supervised feature learning approaches of its time and achieved state-of-the-art results

References:

[1] Deep Clustering for Unsupervised Learning of Visual Features, Caron et al., 2018 https://doi.org/10.48550/arXiv.1807.05520

[2] Self-Labelling via Simultaneous Clustering and Representation Learning, Asano et al., 2020 https://doi.org/10.48550/arXiv.1911.05371

[3] Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances, Marco Cuturi, 2013 https://doi.org/10.48550/arXiv.1306.0895
