Two minutes NLP — Semantic search of images with CLIP and Unsplash
CLIP’s text and image encoders, the Unsplash dataset, and cosine similarity
Thanks to multimodal models like OpenAI’s CLIP and open datasets like the Unsplash Dataset, it’s possible to perform a semantic search for open-source images using natural language descriptions. Let’s see how we can do it!
CLIP
CLIP (Contrastive Language–Image Pre-training) is a neural network model which efficiently learns visual concepts from natural language supervision.
CLIP is trained on a dataset composed of pairs of images and their textual descriptions, abundantly available across the internet. Given an image, the model predicts which text snippet, out of a set of randomly sampled candidates, was actually paired with it in the dataset. This simple contrastive task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets.
The model is composed of a text encoder and an image encoder, both based on the Transformer architecture.
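As a rough sketch of how the two encoders can be used in practice (this assumes OpenAI’s `clip` Python package and a placeholder local image file, not the actual Unsplash data):

```python
import torch
import clip          # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Load a public CLIP checkpoint and its matching image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and one text snippet into the shared embedding space.
# "photo.jpg" is a placeholder path, not a file from the Unsplash dataset.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["two dogs playing in the snow"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # shape (1, 512) for ViT-B/32
    text_embedding = model.encode_text(text)     # shape (1, 512)
```

Because both encoders map into the same embedding space, the image and text vectors can be compared directly with cosine similarity, which is exactly what the search pipeline below relies on.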
Unsplash Dataset
The Unsplash Dataset is a high-quality open image dataset released in 2020, free for anyone to use to further research in machine learning, image quality, search engines, and more. It is built from the work of 250,000+ contributing photographers and from data on billions of searches across thousands of applications, uses, and contexts.
There’s a Lite version of the dataset with 25,000 images, available for both commercial and non-commercial use, and a Full version with 3,000,000+ images, available for non-commercial use only.
Semantic search with CLIP and Unsplash
The pipeline is made of the following steps (a code sketch of the search steps follows the list):
- Download the CLIP model and the Unsplash dataset.
- Use CLIP’s image encoder to encode all the images in the Unsplash dataset and store the resulting embeddings.
- Use CLIP’s text encoder to encode a text query.
- Compute the cosine similarity between the query embedding and all the image embeddings.
- Retrieve the top N images with the highest similarity and show them to the user.
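Here is a minimal sketch of steps 3–5, assuming the image embeddings have already been computed and loaded as a tensor `image_embeddings` with a parallel list `photo_ids` (both names are placeholders for whatever storage format you use):

```python
import torch
import clip

# Load the text encoder side of CLIP (same checkpoint used to encode the images).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def search(query, image_embeddings, photo_ids, top_n=5):
    """Return the photo IDs of the top_n images most similar to the text query.

    image_embeddings: float32 tensor of shape (num_images, 512), produced
        earlier with model.encode_image and kept on the CPU.
    photo_ids: list of Unsplash photo IDs, parallel to image_embeddings.
    """
    # Step 3: encode the text query.
    with torch.no_grad():
        text_embedding = model.encode_text(clip.tokenize([query]).to(device)).float().cpu()

    # Step 4: cosine similarity = dot product of L2-normalized vectors.
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
    similarities = (image_embeddings @ text_embedding.T).squeeze(1)

    # Step 5: indices of the top N most similar images.
    best = similarities.topk(min(top_n, len(photo_ids))).indices.tolist()
    return [photo_ids[i] for i in best]
```

Normalizing both sides first means the matrix product gives cosine similarities directly, so the ranking reduces to a single `topk` call.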
You can find precomputed embeddings for the full Unsplash dataset in the repo, along with instructions on how to compute them from scratch. You can try the complete pipeline (skipping the image encoding step, since the embeddings can be downloaded) with this Colab.
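For completeness, here is one way precomputed embeddings might be loaded and fed into the search sketch above; the file names and column name are placeholders, so check the repo for the actual layout before running this:

```python
import numpy as np
import pandas as pd
import torch

# Placeholder file names -- the repo documents the real format of the
# precomputed embeddings and photo ID list.
image_embeddings = torch.from_numpy(np.load("features.npy")).float()
photo_ids = pd.read_csv("photo_ids.csv")["photo_id"].tolist()

results = search("two dogs playing in the snow", image_embeddings, photo_ids, top_n=3)
print(results)  # a list of Unsplash photo IDs
```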
Here are some example text queries with image outputs.