Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Self-supervised learning, semi-supervised learning, pretraining, self-training, robust representations, etc. are some of the hottest terms right now in Computer Vision and Deep Learning. The recent progress in self-supervised learning is astounding. To push it further, researchers at FAIR have come up with a new paper that introduces a method to learn robust image representations.


One of the most important goals of self-supervised learning is to learn robust representations without using labels. Recent works try to achieve this goal by combining two elements: a contrastive loss and image transformations. Basically, we want our model to learn robust representations rather than just high-level features, and to achieve a certain level of invariance to image transformations.

The contrastive loss explicitly compares pairs of image representations. It pushes apart representations that come from different images while pulling together representations that come from different transformations, or views, of the same image.

Computing all the pairwise comparisons on a large dataset is not practical. There are two ways to overcome this constraint. First, instead of comparing all pairs, approximate the loss by restricting the comparisons to a fixed number of random images. Second, instead of approximating the loss, approximate the task, e.g. instead of discriminating between individual images, discriminate between groups of images with similar features.

Clustering is a good example of this kind of task approximation. However, clustering alone can’t solve the problem, as it has its own limitations. For example, the clustering objective doesn’t scale well with dataset size, since it requires a pass over the entire dataset to compute the image codes used during training.


To overcome the limitations listed above, the authors proposed the following:

  1. Online Clustering Loss: The authors propose a scalable loss function that works with both large and small batch sizes and doesn’t require extras like a memory bank or a momentum encoder. In principle, it can scale to an unlimited amount of data.
  2. Multi-Crop Strategy: A new augmentation technique that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much.
  3. Combining the above two into a single model that outperforms other self-supervised methods and even surpasses supervised pretraining on multiple downstream tasks.


The ultimate goal of this exercise is to learn visual features in an online manner without supervision. To achieve this, the authors propose an online clustering-based self-supervised learning method.

But how is it different from typical clustering approaches?

Typical clustering-based methods like DeepCluster are offline, as they generally rely on two steps. In the first step, we cluster the image features of the entire dataset; in the second step, we predict the cluster assignments, or codes, for different image views. The fact that these methods require multiple passes over the dataset makes them unsuitable for online learning. Let us see how the authors tackle these problems step by step.

Online Clustering

  1. We have an image transformation set T. Each image xn is transformed into an augmented view xnt by applying a transformation t sampled from T.
  2. The augmented view, xnt, is then mapped to a vector representation by applying a non-linear mapping.
  3. This feature vector is then projected onto the unit sphere, which is essentially just an L2 normalization step.

  4. We then compute a code qnt for this vector znt by mapping it to a set of K trainable prototype vectors {c₁, c₂, …, c_K}. The matrix whose columns are these prototype vectors is denoted by C.
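Putting steps 1–4 together, here is a minimal NumPy sketch. The single-layer ReLU encoder and all dimensions are placeholders for illustration, not the paper’s architecture (which uses a ResNet backbone with an MLP projection head):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: D-dim features, K prototypes.
D, K = 128, 30
C = rng.standard_normal((D, K))          # K trainable prototype vectors (columns of C)

def encode(x, W):
    """Stand-in for the non-linear encoder (step 2); a single ReLU layer."""
    return np.maximum(x @ W, 0.0)

def forward(x, W, C):
    z = encode(x, W)                                  # step 2: non-linear mapping
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # step 3: project onto unit sphere
    scores = z @ C                                    # step 4: dot products z^T c_k
    return z, scores

W = rng.standard_normal((784, D))
x = rng.standard_normal((4, 784))        # step 1 would produce these augmented views
z, scores = forward(x, W, C)
print(z.shape, scores.shape)             # (4, 128) (4, 30)
```

The scores z @ C are what later get turned into codes q via the softmax and the Sinkhorn-Knopp procedure described below.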

Swapped Prediction Problem

We talked about image transformations, the feature vector projection, and code computation (q), but we haven’t discussed why we are doing it this way. As said earlier, one of the goals of this whole exercise is to learn visual features online without any supervision. We want our models to learn robust representations that are consistent across different views of an image.

The authors propose to enforce consistency between codes from different augmentations of the same image. This is inspired by contrastive learning, but the difference is that instead of directly comparing feature vectors, we compare the cluster assignments for different image views. How?

Once we have computed the features zt and zs from two different augmentations of the same image, we compute the corresponding codes qt and qs by mapping these feature vectors to the K prototypes. The authors then set up a swapped prediction problem with the following loss function:
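In the notation of the SwAV paper, this swapped prediction loss is:

```latex
L(\mathbf{z}_t, \mathbf{z}_s) = \ell(\mathbf{z}_t, \mathbf{q}_s) + \ell(\mathbf{z}_s, \mathbf{q}_t)
```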

Each term on the right-hand side of this equation is a cross-entropy loss that measures the fit between a feature z and a code q. The intuition is that if two features capture the same information, it should be possible to predict either code from the other feature. This is similar to contrastive learning, except that here we compare the codes instead of comparing the features directly. If we expand one of the terms on the right-hand side, it looks like this:
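From the paper:

```latex
\ell(\mathbf{z}_t, \mathbf{q}_s) = -\sum_{k} \mathbf{q}_s^{(k)} \log \mathbf{p}_t^{(k)},
\quad \text{where} \quad
\mathbf{p}_t^{(k)} = \frac{\exp\!\left(\frac{1}{\tau}\,\mathbf{z}_t^\top \mathbf{c}_k\right)}{\sum_{k'} \exp\!\left(\frac{1}{\tau}\,\mathbf{z}_t^\top \mathbf{c}_{k'}\right)}
```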

Here the softmax operation is applied to the dot product of z and C. The term τ is a temperature parameter. Taking this loss over all the images and all pairs of data augmentations leads to the following loss function for the swapped prediction problem:
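Written out as in the paper:

```latex
L = -\frac{1}{N}\sum_{n=1}^{N} \sum_{t,s \sim T}
\left[
\frac{1}{\tau}\,\mathbf{z}_{nt}^\top \mathbf{C}\,\mathbf{q}_{ns}
+ \frac{1}{\tau}\,\mathbf{z}_{ns}^\top \mathbf{C}\,\mathbf{q}_{nt}
- \log \sum_{k=1}^{K} \exp\!\left(\frac{\mathbf{z}_{nt}^\top \mathbf{c}_k}{\tau}\right)
- \log \sum_{k=1}^{K} \exp\!\left(\frac{\mathbf{z}_{ns}^\top \mathbf{c}_k}{\tau}\right)
\right]
```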

This loss function is minimized jointly with respect to the prototypes C and the parameters θ of the image encoder f used to produce the features znt.

Online Codes computation

When we started this discussion, we talked about offline vs online clustering, but we haven’t looked at how this method is online.

To make this method online, the authors propose to compute the codes using only the image features within a batch. The codes are computed using the prototypes C such that all the examples in a batch are partitioned equally among the prototypes. This equipartition constraint is crucial: it ensures that the codes for different images in a batch are distinct, preventing the trivial solution where every image gets the same code.

Given B feature vectors Z = [z₁, z₂, …, z_B], we are interested in mapping them to the prototypes C = [c₁, …, c_K]. This mapping, i.e. the codes, is represented by Q = [q₁, …, q_B], and Q is optimized to maximize the similarity between the features and the prototypes, i.e.
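In the paper’s notation this objective reads:

```latex
\max_{\mathbf{Q} \in \mathcal{Q}} \; \operatorname{Tr}\!\left(\mathbf{Q}^\top \mathbf{C}^\top \mathbf{Z}\right) + \varepsilon H(\mathbf{Q})
```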

where H(Q) is the entropy function and ε is a parameter that controls the smoothness of the mapping. The above expression is an optimal transport problem (more about it later). We have the features and the prototypes, and with these we want to find the optimal codes. The entropy term on the right-hand side encourages the equipartition (please correct me in the comments section if I am wrong).

Also, since we are working with mini-batches, the constraint is imposed on the mini-batch and looks something like this:
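From the paper, the constraint set (the transportation polytope) is:

```latex
\mathcal{Q} = \left\{ \mathbf{Q} \in \mathbb{R}_{+}^{K \times B} \;\middle|\;
\mathbf{Q}\,\mathbf{1}_B = \tfrac{1}{K}\,\mathbf{1}_K, \;
\mathbf{Q}^\top \mathbf{1}_K = \tfrac{1}{B}\,\mathbf{1}_B \right\}
```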

where 1_K denotes the vector of ones in dimension K. These constraints enforce that on average each prototype is selected at least (B / K) times in the batch.

Once a solution Q* to (3) is found, there are two options. First, we can use the soft codes directly. Second, we can obtain discrete codes by rounding the solution. The authors found that discrete codes work well when codes are computed offline on the full dataset. In the online setting with mini-batches, however, discrete codes perform worse than continuous codes. An explanation is that the rounding needed to obtain discrete codes is a more aggressive optimization step than a gradient update: it makes the model converge rapidly, but to a worse solution. The soft codes Q* take the form of a normalized exponential matrix.
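From the paper:

```latex
\mathbf{Q}^{*} = \operatorname{Diag}(\mathbf{u}) \, \exp\!\left(\frac{\mathbf{C}^\top \mathbf{Z}}{\varepsilon}\right) \operatorname{Diag}(\mathbf{v})
```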

Here u and v are renormalization vectors, computed with a small number of matrix multiplications via the iterative Sinkhorn-Knopp algorithm.
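A small NumPy version of these iterations, following the structure of the algorithm (ε = 0.05 and 3 iterations are the values used in the paper; the random scores are just a stand-in for C^T Z):

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp iterations computing soft codes Q* from the
    K x B score matrix C^T Z, alternating row and column normalization."""
    Q = np.exp((scores - scores.max()) / eps)  # normalized exponential matrix
    Q /= Q.sum()                               # (max subtracted for stability)
    K, B = Q.shape
    for _ in range(n_iters):
        # rows: each prototype should receive 1/K of the total mass
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # columns: each sample should receive 1/B of the total mass
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= B
    return Q * B                               # each column sums to 1

rng = np.random.default_rng(0)
scores = rng.standard_normal((30, 64))         # K=30 prototypes, B=64 samples
Q = sinkhorn(scores)
print(Q.shape)                                 # (30, 64); columns are soft codes
```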

Side note: Thanks to Amit Chaudhary for pointing out the relevant resources for the transportation polytope and Sinkhorn-Knopp algorithm. You can read about these two in detail here and here

Working with small batches

When the number B of features in a batch is too small compared to the number of prototypes K, it is impossible to partition the batch equally among the K prototypes. Therefore, when working with small batches, the authors use features from previous batches to augment the size of Z in (3), while only the codes of the current batch’s features are used in the training loss.

The authors propose to store around 3K features, i.e., in the same range as the number of code vectors. This means they only keep features from the last 15 batches with a batch size of 256, whereas contrastive methods typically need to store the last 65K instances, obtained from the last 250 batches.
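As a rough sketch of this feature store (the sizes follow the numbers above; the deque-based storage is my own simplification, not the paper’s implementation):

```python
import collections
import numpy as np

feat_dim, queue_len, batch = 128, 3840, 256   # ~15 batches of 256 ≈ 3K features
queue = collections.deque(maxlen=queue_len)   # oldest features fall off automatically

def codes_input(z_batch):
    """Concatenate stored features with the current batch so the equipartition
    problem in (3) sees enough samples; only the codes for the current batch
    (the first len(z_batch) entries) would be used in the training loss."""
    if queue:
        stored = np.array(queue)
    else:
        stored = np.empty((0, feat_dim))
    Z = np.concatenate([z_batch, stored], axis=0)
    queue.extend(z_batch)                     # add the newest features
    return Z

rng = np.random.default_rng(0)
Z = codes_input(rng.standard_normal((batch, feat_dim)))
print(Z.shape)                                # (256, 128) on the very first batch
```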

All of the above is about online clustering only. We haven’t discussed the new augmentation strategy anywhere. Trying to keep the blog post short, huh?

Multi-crop: Augmenting views with smaller images

It is well known that random crops help (in supervised as well as self-supervised learning). Comparing random crops of an image plays a central role, as it captures information about the relations between parts of a scene or an object.

Perfect. Let’s take crops of sizes 4x4, 8x8, 16x16, 32x32, ….. Enough data to make the bloody network learn, ha!

Well, you can do that, but increasing the number of crops quadratically increases the memory and compute requirements. To address this, the authors propose a new multi-crop strategy that uses:

  1. Two standard resolution crops.
  2. V additional low-resolution crops that cover only small parts of the image.
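A toy sketch of the two kinds of crops (the crop sizes follow the paper’s 224×224 standard and 96×96 low-resolution crops, but the plain cropping below is a stand-in for the RandomResizedCrop-style augmentation actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Take a random size x size crop from an H x W x C image
    (no resizing here, to keep the sketch short)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def multi_crop(img, std_size=224, low_size=96, num_low=4):
    crops = [random_crop(img, std_size) for _ in range(2)]        # 2 standard crops
    crops += [random_crop(img, low_size) for _ in range(num_low)] # V low-res crops
    return crops

img = rng.standard_normal((256, 256, 3))
crops = multi_crop(img)
print([c.shape for c in crops])   # 2 crops of 224x224, then 4 of 96x96
```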

The loss function in (1) is then generalized as:
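From the paper, with V the number of additional low-resolution crops:

```latex
L(\mathbf{z}_{t_1}, \mathbf{z}_{t_2}, \ldots, \mathbf{z}_{t_{V+2}})
= \sum_{i \in \{1, 2\}} \; \sum_{\substack{v = 1 \\ v \neq i}}^{V+2} \ell(\mathbf{z}_{t_v}, \mathbf{q}_{t_i})
```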

The codes are computed using only the standard-resolution crops. Intuitively, computing codes for all the crops would increase the computation time. Also, crops taken over a very small area carry only limited, partial information, which can degrade the overall performance if used to compute codes.


The authors performed a bunch of experiments. I won’t list all the training details here; you can read them directly in the paper. One important thing to note is that most of the hyperparameters were taken directly from the SimCLR paper, along with the LARS optimizer with a cosine learning rate schedule and the MLP projection head. I am listing some of the results here.

Transfer learning on downstream tasks


I liked this paper a lot. IMHO, this is one of the best papers on SSL to date. Not only does it address the problems associated with the instance discrimination task and contrastive learning, but it also proposes a very creative solution to move forward. The biggest strength of this method is that it is online.

