Two minutes NLP — Semantic search of images with CLIP and Unsplash
CLIP’s text and image encoders, the Unsplash dataset, and cosine similarity
Thanks to multimodal models like OpenAI’s CLIP and open datasets like the Unsplash Dataset, it’s possible to perform a semantic search for open-source images using natural language descriptions. Let’s see how we can do it!
CLIP
CLIP (Contrastive Language–Image Pre-training) is a neural network model which efficiently learns visual concepts from natural language supervision.
CLIP is trained on a dataset composed of pairs of images and their textual descriptions, abundantly available across the internet. Given an image, the model predicts which text snippet, out of a set of randomly sampled candidates, was actually paired with it in the dataset. This simple contrastive task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets.
The model is composed of a text encoder and an image encoder, both based on the Transformer architecture.
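As a rough sketch of how the two encoders can be used in practice (this assumes OpenAI’s `clip` Python package and a placeholder local image file, not the actual Unsplash data):

```python
import torch
import clip          # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Load a public CLIP checkpoint and its matching image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and one text snippet into the shared embedding space.
# "photo.jpg" is a placeholder path, not a file from the Unsplash dataset.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["two dogs playing in the snow"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # shape (1, 512) for ViT-B/32
    text_embedding = model.encode_text(text)     # shape (1, 512)
```

Because both encoders map into the same embedding space, the image and text vectors can be compared directly with cosine similarity, which is exactly what the search pipeline below relies on.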
Unsplash Dataset
The Unsplash Dataset is a high-quality open image dataset released in 2020, free for anyone to use to further research in machine learning, image quality, search engines, and more. It is built from the work of 250,000+ contributing photographers and from data on billions of searches across thousands of applications, uses, and contexts.
There’s a Lite version of the dataset with 25,000 images, available for both commercial and non-commercial use, and a Full version with 3,000,000+ images, available for non-commercial use only.
Semantic search with CLIP and Unsplash
The pipeline is made of the following steps (a code sketch of the search steps follows the list):
- Download the CLIP model and the Unsplash dataset.
- Use CLIP’s image encoder to encode all the images in the Unsplash dataset and store the resulting embeddings.
- Use CLIP’s text encoder to encode a text query.
- Compute the cosine similarity between the query embedding and all the image embeddings.
- Retrieve the top N images with the highest similarity and show them to the user.
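Here is a minimal sketch of steps 3–5, assuming the image embeddings have already been computed and loaded as a tensor `image_embeddings` with a parallel list `photo_ids` (both names are placeholders for whatever storage format you use):

```python
import torch
import clip

# Load the text encoder side of CLIP (same checkpoint used to encode the images).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def search(query, image_embeddings, photo_ids, top_n=5):
    """Return the photo IDs of the top_n images most similar to the text query.

    image_embeddings: float32 tensor of shape (num_images, 512), produced
        earlier with model.encode_image and kept on the CPU.
    photo_ids: list of Unsplash photo IDs, parallel to image_embeddings.
    """
    # Step 3: encode the text query.
    with torch.no_grad():
        text_embedding = model.encode_text(clip.tokenize([query]).to(device)).float().cpu()

    # Step 4: cosine similarity = dot product of L2-normalized vectors.
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
    similarities = (image_embeddings @ text_embedding.T).squeeze(1)

    # Step 5: indices of the top N most similar images.
    best = similarities.topk(min(top_n, len(photo_ids))).indices.tolist()
    return [photo_ids[i] for i in best]
```

Normalizing both sides first means the matrix product gives cosine similarities directly, so the ranking reduces to a single `topk` call.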
You can find precomputed embeddings for the full Unsplash dataset in the repo, along with instructions on how to compute them from scratch. You can try the complete pipeline (skipping the image encoding step, since the embeddings can be downloaded) with this Colab.
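For completeness, here is one way precomputed embeddings might be loaded and fed into the search sketch above; the file names and column name are placeholders, so check the repo for the actual layout before running this:

```python
import numpy as np
import pandas as pd
import torch

# Placeholder file names -- the repo documents the real format of the
# precomputed embeddings and photo ID list.
image_embeddings = torch.from_numpy(np.load("features.npy")).float()
photo_ids = pd.read_csv("photo_ids.csv")["photo_id"].tolist()

results = search("two dogs playing in the snow", image_embeddings, photo_ids, top_n=3)
print(results)  # a list of Unsplash photo IDs
```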
Here are some example text queries with image outputs.