CLIP: a revolution in zero-shot learning

Egor Voron
Published in Product AI
Nov 23, 2021

Solution from:

OpenAI

Goal

These days, machine learning dominates computer vision. However, current models have a number of limitations that significantly narrow the scope of their application: training conventional classifiers and generators requires large amounts of labeled data for every class of images, as well as careful selection of an architecture for each specific task. A promising way around these problems is zero-shot learning, or "learning without training." With this approach, the model is pre-trained once on a large dataset so that afterwards it can handle classes that were absent from that dataset without any additional training. The challenge is to build a model of this kind that can classify and generate images with high quality.

Solution

CLIP is a pair of neural networks that accurately matches images with their textual descriptions. Unlike other classifiers, CLIP does not require additional training on new datasets: to classify any object, one simply provides a list of candidate classes from which the correct one must be selected.

Under the hood, CLIP contains two neural networks: an Image Encoder (IE) and a Text Encoder (TE). The first takes an image, the second takes text, and both return vector representations of their input. To bring these representations into a shared embedding space of the same dimension, a linear projection layer is added on top of both IE and TE. For IE the authors use a ResNet or a Vision Transformer (ViT); for TE, a Transformer.
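To make the layout concrete, here is a minimal PyTorch sketch of the two-encoder setup with linear projections into a shared embedding space. The class name `CLIPLike` and the arguments `image_dim`, `text_dim`, and `embed_dim` are illustrative, not part of the official implementation; any image backbone and text backbone returning fixed-size feature vectors could be plugged in.

```python
import torch.nn as nn


class CLIPLike(nn.Module):
    """Minimal two-tower sketch: an image encoder and a text encoder,
    each followed by a linear projection into a shared embedding space."""

    def __init__(self, image_encoder, text_encoder,
                 image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ResNet or ViT backbone
        self.text_encoder = text_encoder     # e.g. a Transformer over token ids
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images, tokens):
        # Encode each modality, then project to the same dimension
        image_features = self.image_proj(self.image_encoder(images))
        text_features = self.text_proj(self.text_encoder(tokens))
        return image_features, text_features
```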

For pre-training, 400 million (image, text) pairs are used and fed to IE and TE. A similarity matrix is then computed whose element (i, j) is the cosine similarity between the normalized vector representation of the i-th image and that of the j-th textual description, so the correct pairs end up on the main diagonal. Finally, by minimizing the cross-entropy along both the rows and the columns of this matrix, we maximize its values on the main diagonal.
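A simplified sketch of this symmetric contrastive loss is shown below. In the released model the temperature is a learned parameter; here it is fixed for brevity, and the feature tensors are assumed to come from the two encoders after their projection layers.

```python
import torch
import torch.nn.functional as F


def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_features, text_features: (N, d) tensors produced by IE and TE
    after the linear projection layers.
    """
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix; correct pairs lie on the main diagonal
    logits = image_features @ text_features.t() / temperature

    # Cross-entropy along rows (image -> text) and columns (text -> image)
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2
```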

Now that CLIP has been pre-trained, it can classify images from any set of classes: simply feed the set of classes, written out as textual descriptions, to TE and the image to IE, then pick the class whose description has the highest cosine similarity with the image.
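A zero-shot classification call might look like the following, assuming the openai/clip package from the official repository is installed; the class descriptions and the `example.jpg` path are placeholders.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes, phrased as short descriptions
classes = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(classes).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image holds the scaled cosine similarities image vs. each class
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(classes[probs.argmax().item()])
```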

Applications and prospects

Initially, CLIP was presented together with DALL-E, a model that generates images from text descriptions; there, CLIP is used to rank the generated images and select the best ones. CLIP can also be paired with classic GANs, for example with VQGAN, and several CLIP models can be combined to improve the quality of the results, as in this colab.

Technologies used:

Programming languages: Python

Framework: PyTorch
