SLIP: Self-supervision meets Language-Image Pre-training — Paper Summary
Paper: SLIP: Self-supervision meets Language-Image Pre-training
Link: https://arxiv.org/abs/2112.12750
Authors: Norman Mu, Alexander Kirillov, David Wagner, Saining Xie
Tags: Vision Transformers, CLIP, Self-supervision, Language pre-training
Code: https://github.com/facebookresearch/SLIP
Misc. info: From the same folks as MoCo v3
What?
CLIP [1] loss + SimCLR [2] loss for pre-training a Vision Transformer.
Why?
CLIP essentially pre-trains with a contrastive loss between language and image embeddings of the same concept. SLIP explores whether adding an additional loss, contrastive pre-training on images alone, adds any value to the pipeline. (Spoiler: it does!)
How?
So how does it work? The pseudo-code snippet on the left makes the idea behind the paper easy to understand. For a given (image, text) pair, we generate three views of the image: two views feed the SimCLR loss, and the third view, together with the text, feeds the CLIP loss. The final loss is the sum of the SimCLR and CLIP losses with some weighting. In this algorithm, f_i and f_t are the image and text encoders, so what are h_i, h_t, and h_s? They are projectors from high- to low-dimensional spaces: h_i and h_t are learned linear projections, while h_s is a 3-layer MLP head (with ReLU and synchronized BatchNorm). The middle code snippet shows how the CLIP loss works, and the right code snippet is the implementation of the SimCLR loss.
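Since the paper's snippets aren't reproduced here, below is a minimal PyTorch sketch of the combined objective. The module names, head dimensions (4096 hidden / 256 output for the SSL head, 512 for the CLIP embedding), fixed temperatures, and the ssl_scale weight are my own assumptions rather than the paper's exact configuration, and plain BatchNorm stands in for the synchronized BatchNorm used under multi-GPU training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE between matched image/text embeddings (the CLIP loss).
    # CLIP learns the temperature; it is fixed here to keep the sketch short.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (N, N) similarities
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def simclr_loss(z1, z2, temperature=0.1):
    # NT-Xent loss over two augmented views of the same images (the SimCLR loss).
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)          # (2N, D)
    sim = z @ z.t() / temperature                                # (2N, 2N)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    # The positive for view i is the other view of the same image: i + n (or i - n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


class SLIP(nn.Module):
    # f_i / f_t are the encoders; h_i / h_t are linear projections for the CLIP branch,
    # h_s is the 3-layer MLP head for the SimCLR branch (hidden/output dims assumed).
    def __init__(self, image_encoder, text_encoder, vis_dim, txt_dim,
                 embed_dim=512, ssl_dim=256):
        super().__init__()
        self.f_i, self.f_t = image_encoder, text_encoder
        self.h_i = nn.Linear(vis_dim, embed_dim)
        self.h_t = nn.Linear(txt_dim, embed_dim)
        self.h_s = nn.Sequential(                                # BatchNorm1d stands in for the
            nn.Linear(vis_dim, 4096), nn.BatchNorm1d(4096),      # synchronized BatchNorm used
            nn.ReLU(inplace=True),                               # in multi-GPU training
            nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, ssl_dim),
        )

    def forward(self, img_clip, img_v1, img_v2, text_tokens, ssl_scale=1.0):
        # CLIP branch: one lightly augmented view paired with its caption.
        loss_clip = clip_loss(self.h_i(self.f_i(img_clip)), self.h_t(self.f_t(text_tokens)))
        # SimCLR branch: two strongly augmented views of the same image.
        loss_ssl = simclr_loss(self.h_s(self.f_i(img_v1)), self.h_s(self.f_i(img_v2)))
        return loss_clip + ssl_scale * loss_ssl                  # weighted sum of the two losses
```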
Implementation details are as follows.
Architecture: It is similar to CLIP, with one image encoder and one language encoder. The authors explored ViT-S/16, ViT-B/16, and ViT-L/16 as image encoders. For the text encoder, they used the smallest text model from the CLIP paper.
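For concreteness, here is one hypothetical way to instantiate these encoders for the sketch above, using timm for the ViT and a toy Transformer as a stand-in for CLIP's small text encoder; all dimensions and the pooling choice are simplifications, not the authors' implementation.

```python
import timm
import torch
import torch.nn as nn

# Image encoder: any of the ViTs the authors explore; num_classes=0 makes timm
# return pooled features instead of classification logits.
image_encoder = timm.create_model("vit_base_patch16_224", num_classes=0)
vis_dim = image_encoder.num_features  # 768 for ViT-B/16

# Toy stand-in for CLIP's small text Transformer. Real CLIP uses byte-pair-encoded
# tokens, a causal mask, and the end-of-text token as the sentence embedding;
# this placeholder simply takes the last position.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size=49408, width=512, layers=12, heads=8, ctx_len=77):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, width)
        self.pos_emb = nn.Parameter(torch.zeros(ctx_len, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.ln_final = nn.LayerNorm(width)

    def forward(self, tokens):                       # tokens: (N, ctx_len) int64
        x = self.token_emb(tokens) + self.pos_emb
        return self.ln_final(self.blocks(x))[:, -1]  # (N, width)

model = SLIP(image_encoder, TextEncoder(), vis_dim=vis_dim, txt_dim=512)
```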
Augmentations: The CLIP loss branch uses only random resized cropping (keeping between 50% and 100% of the image). The SimCLR loss branch, however, uses the augmentations from MoCo v3 [3]: random resized cropping, horizontal flipping, color jittering, grayscale conversion, blurring, and solarization.
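Here is a sketch of the two augmentation pipelines with torchvision, assuming the usual MoCo v3 probabilities and crop scales for the SSL branch (the exact values aren't restated here, so treat them as placeholders); normalization is omitted for brevity.

```python
from torchvision import transforms

# CLIP branch: only a random resized crop keeping 50-100% of the image.
clip_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ToTensor(),
])

# SimCLR branch: MoCo v3-style augmentations (probabilities/scales are the usual
# MoCo v3 values, assumed here rather than copied from the SLIP paper).
ssl_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomSolarize(threshold=128, p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```

In the data loader, each (image, caption) sample would then yield clip_aug(image) for the CLIP branch plus two independent ssl_aug(image) views for the SimCLR branch, matching the three views described above.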
Results
The models are evaluated on 3 different tasks: 1) zero-shot transfer (no fine-tuning at all), 2) linear classification (adding a classification layer and tuning only it on the new dataset), and 3) end-to-end fine-tuning (the whole model is tuned on the new dataset). All models are pre-trained on the YFCC15M dataset (a 15M-image subset of YFCC100M with English-only titles and descriptions [1]) and evaluated on the ImageNet dataset.
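For reference, zero-shot transfer works the same way as in CLIP: class names are turned into text prompts, and each image is assigned to the class whose text embedding it is most similar to. Below is a sketch against the SLIP module defined earlier; `tokenize` is a hypothetical tokenizer, and the single prompt template is a simplification (CLIP-style evaluation averages over many templates).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, images, class_names, tokenize):
    # Embed one prompt per class and pick the closest class for each image.
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = F.normalize(model.h_t(model.f_t(tokenize(prompts))), dim=-1)  # (C, D)
    img = F.normalize(model.h_i(model.f_i(images)), dim=-1)             # (N, D)
    return (img @ txt.t()).argmax(dim=-1)                               # class index per image
```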
The figure above shows that SLIP does better on all 3 tasks than the CLIP model or a ViT pre-trained with just MoCo v3 or SimCLR.
The authors then run several ablations to assess the effectiveness of the additional self-supervision term. In Table 6, they try different SSL methods for the image branch, and SimCLR performs the best.
In Table 8, the authors check whether adding stronger augmentations to the CLIP model can reach the same level of performance as SLIP; the results show that it cannot. The additional pre-training objective is clearly improving the encodings learned by the image encoder!
In Table 7, the authors compare taking an image encoder pre-trained with SimCLR and then training a CLIP model on top of it against training SLIP directly; SLIP does better in this case too.
In Table 4, the authors report the zero-shot performance of the CLIP and SLIP models on various image datasets; SLIP outperforms CLIP on average.
Comments:
- Overall, it is a great engineering paper. The idea is simple, but the evaluation is quite thorough. I wish the authors had a better illustration of how the method works; the one they have is not quite clear. The algorithm snippet is quite helpful, though. (I love these PyTorch pseudo-code snippets in papers! 😁)
- Also, I wish the authors had given better intuition for why SLIP's image representations are better than CLIP's, for example by visualizing them. Table 4 is actually quite surprising to me! With CLIP being trained on so much more data, I expected it to generalize better!
Bibliography:
[1] — Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” arXiv preprint arXiv:2103.00020 (2021).
[2] — Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” International conference on machine learning. PMLR, 2020.
[3] — Chen, Xinlei, Saining Xie, and Kaiming He. “An empirical study of training self-supervised vision transformers.” arXiv preprint arXiv:2104.02057 (2021).