Paper explained — Exploring Simple Siamese Representation Learning [SimSiam]

Nazim Bendib
4 min read · Aug 4, 2022


SimSiam, short for Simple Siamese, was introduced in the paper Exploring Simple Siamese Representation Learning, published in 2020 by Facebook AI Research (FAIR).

Siamese networks have recently become quite popular in unsupervised visual representation learning. Such models minimize the distance between augmented views of the same sample and, in contrastive variants, push views of different samples apart. Under certain conditions that prevent collapse, this results in the different classes forming clusters in the representation space. In this paper, the authors show that Siamese networks can achieve impressive results without negative samples, large batches, dynamic dictionaries, or momentum encoders.

This method uses an encoder model, which is a convolutional neural network (a ResNet-50 in the paper), together with a prediction MLP head.

Animated illustration of SimSiam training

We start by uniformly sampling an input image x from the dataset and creating two augmented views of it, let's call them x1 and x2. Both views are passed to the encoder model f, which outputs z1 and z2, the representations of x1 and x2 respectively. These representations are then passed to the prediction MLP (noted h), which transforms the representation of one view to match the other view's representation: for example, it takes the representation z1, transforms it into p1 = h(z1), and compares it to z2. This distance is minimized using the following loss function:
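L = 1/2 · D(p1, z2) + 1/2 · D(p2, z1)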

where D(p1, z2) is the negative cosine similarity:
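D(p1, z2) = - (p1 / ||p1||₂) · (z2 / ||z2||₂)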

To calculate the loss, the authors treat z1 as a constant by using stop-gradient (detaching the variable from the computation graph):
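L = 1/2 · D(p1, stopgrad(z2)) + 1/2 · D(p2, stopgrad(z1))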

Here, the encoder on x1 won't receive any gradient from z1 (in the second term), but it receives gradients from p1 (in the first term).

The following is PyTorch-like pseudocode of SimSiam training:

SimSiam PyTorch-like Pseudocode
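A minimal sketch of that training loop in runnable PyTorch form, assuming f is the encoder (backbone + projection MLP), h is the prediction MLP, and aug, loader, and optimizer are defined elsewhere:

```python
import torch.nn.functional as F

def D(p, z):
    # negative cosine similarity, with stop-gradient applied to z
    z = z.detach()                         # treat z as a constant (no gradient)
    p = F.normalize(p, dim=1)              # l2-normalize the prediction
    z = F.normalize(z, dim=1)              # l2-normalize the representation
    return -(p * z).sum(dim=1).mean()

# f: encoder (backbone + projection MLP), h: prediction MLP
for x in loader:                           # load a minibatch of images
    x1, x2 = aug(x), aug(x)                # two random augmented views
    z1, z2 = f(x1), f(x2)                  # representations
    p1, p2 = h(z1), h(z2)                  # predictions
    loss = D(p1, z2) / 2 + D(p2, z1) / 2   # symmetrized loss
    optimizer.zero_grad()
    loss.backward()                        # back-propagate
    optimizer.step()                       # SGD update of f and h
```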

Encoder model:

The encoder is composed of:

  • A ResNet-50 backbone.
  • A projection MLP with 3 layers of size 2048; Batch Normalization is applied to each layer, and ReLU to every layer except the output (see the sketch below).
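A minimal PyTorch sketch of this encoder, assuming the standard torchvision ResNet-50 as the backbone (module and variable names are illustrative, not the paper's reference code):

```python
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        # ResNet-50 backbone with its classification head removed
        backbone = models.resnet50()
        in_dim = backbone.fc.in_features       # 2048 for ResNet-50
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # 3-layer projection MLP: BN on every layer, ReLU on all but the output
        self.projector = nn.Sequential(
            nn.Linear(in_dim, dim, bias=False), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim),
        )

    def forward(self, x):
        return self.projector(self.backbone(x))
```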

Prediction MLP:

The prediction MLP is composed of 2 layers; Batch Normalization and ReLU are applied to the hidden layer, while the output layer has neither. The input and output size is 2048 and the hidden layer has size 512, so the prediction MLP has a bottleneck structure.
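A matching sketch of the prediction MLP under the same assumptions:

```python
import torch.nn as nn

class Predictor(nn.Module):
    def __init__(self, dim=2048, hidden_dim=512):
        super().__init__()
        # 2-layer bottleneck MLP: 2048 -> 512 -> 2048
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim, bias=False),  # hidden layer
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, dim),              # output: no BN, no ReLU
        )

    def forward(self, z):
        return self.net(z)
```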

Note: don’t confuse the projection MLP, which belongs to the encoder, with the prediction MLP, which transforms representations.

Stop gradient:

The authors show in their paper that stop-gradient is a critical component of the architecture, as removing it leads to extremely bad results. Their training curves show that without stop-gradient, the optimizer quickly finds a degenerate solution that reaches a loss of -1 (the minimum achievable loss).

To show that this is indeed a collapsed solution, the authors studied the standard deviation of the l2-normalized outputs: the standard deviation over all samples is zero for each channel, which means the output vector is constant.
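As a rough illustration of that check (with placeholder tensors standing in for real encoder outputs), the per-channel standard deviation of the l2-normalized outputs stays near 1/√d for a healthy encoder and drops to zero when the outputs collapse to a constant:

```python
import torch
import torch.nn.functional as F

d = 2048
healthy = F.normalize(torch.randn(256, d), dim=1)                 # varied outputs
collapsed = F.normalize(torch.randn(1, d).repeat(256, 1), dim=1)  # constant output

print(healthy.std(dim=0).mean())    # ~ 1/sqrt(2048) ≈ 0.022
print(collapsed.std(dim=0).mean())  # 0.0
```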

Batch size:

Unlike SimCLR and SwAV, SimSiam performs decently with small batch sizes of 64 and 128; the results are very similar to those obtained with batch sizes from 256 to 2048. Both SimCLR and SwAV require large batch sizes (e.g., 4096) to work well.

Similarity function:

The similarity function used in the SimSiam loss is the cosine similarity. The authors also tried replacing it with a cross-entropy similarity:
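D(p1, z2) = - softmax(z2) · log softmax(p1), with the softmax taken along the channel dimension.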

This variant also converges to reasonable results without collapsing, which suggests that the collapse problem is not directly related to the cosine similarity function.
