Building a powerful Image Search Engine for your pictures using Deep Learning

Manuel Faysse · Published in CodeX · Jul 22, 2021

A few days ago, I felt the desire to look back at an old picture I had a vivid memory of but had no idea where to find… Since the picture in question was taken, I have changed phones twice and laptops once, and I was pretty sure I had sent it through Messenger to someone at the time, but to whom? How convenient it would be to search through all my pictures with a simple descriptive query and locate it…

Recent advances in Computer Vision have improved the relevance of image embeddings (dense vector representations), and with the recent CLIP model, implementing a Google-like image search over my local pictures is now easily within reach.

Without diving into details (refer to the blog post and paper for more information: https://openai.com/blog/clip/), CLIP is a neural network built to learn image features through natural language supervision. Using publicly available internet images and their associated captions, it embeds the text with a BERT-like language model and the image with a vision transformer; note that the technique can be applied with other NLP and CV model architectures. Given a batch of image/text embedding pairs, both the vision and text encoders can be trained jointly through in-batch negative contrastive learning, similarly to what is done in the NLP field of Information Retrieval: the goal is for each image embedding to have a high dot product with its associated text embedding, and a low one with the captions of all the other images in the batch (1).

CLIP training and inference scheme (https://github.com/openai/CLIP)
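
For intuition, here is a minimal PyTorch sketch of this in-batch negative contrastive objective. It is a simplification, not the actual CLIP training code; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarities: logits[i, j] = sim(image_i, text_j).
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching caption for image i sits on the diagonal (index i);
    # every other caption in the batch acts as a negative.
    targets = torch.arange(len(logits), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # images -> texts
    loss_texts = F.cross_entropy(logits.t(), targets)   # texts -> images
    return (loss_images + loss_texts) / 2
```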

CLIP is generally intended to be used for “zero-shot” classification: given an image and a list of captions, it infers which caption best describes the image. In the above example (2), “a photo of a dog” is the best caption for the image, compared to “a photo of a plane”, “a photo of a bird”, “a photo of a car”, etc.
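
For reference, zero-shot classification with the pre-trained model only takes a few lines; this sketch uses the Hugging Face transformers implementation, and the image path is a placeholder.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a photo of a dog", "a photo of a plane",
            "a photo of a bird", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one score per caption
print(captions[probs.argmax().item()])            # e.g. "a photo of a dog"
```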

My idea for the image search engine (no novelty here) was to flip this around: instead of classifying captions based on an image, classify images based on a text query. The process would be as follows (a code sketch follows the list):

  • Locate all images in a given directory
  • Use the pre-trained CLIP vision transformer to compute the embeddings of each image and store them for future reference, along with the image path.
  • At runtime, convert the user query into a text embedding using the CLIP text transformer.
  • Compute the dot product of the text embedding with all of the stored image embeddings, sort all images by their obtained score, and return the paths of the N highest ranked images.
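
Below is a simplified sketch of this pipeline. The real implementation lives in the repository linked below, so the file names and helper functions here are purely illustrative.

```python
import os
import pickle
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def index_directory(directory, index_file="index.pkl"):
    """Embed every image under `directory` and store (paths, embeddings)."""
    paths, embeddings = [], []
    for root, _, files in os.walk(directory):
        for name in files:
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(root, name)
            image = Image.open(path).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            with torch.no_grad():
                emb = model.get_image_features(**inputs)
            embeddings.append(torch.nn.functional.normalize(emb, dim=-1))
            paths.append(path)
    with open(index_file, "wb") as f:
        pickle.dump((paths, torch.cat(embeddings)), f)

def search(query, index_file="index.pkl", top_k=5):
    """Embed the text query and return the paths of the top_k best matches."""
    with open(index_file, "rb") as f:
        paths, image_embeds = pickle.load(f)
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embed = model.get_text_features(**inputs)
    text_embed = torch.nn.functional.normalize(text_embed, dim=-1)
    scores = (image_embeds @ text_embed.t()).squeeze(1)  # dot products
    best = scores.argsort(descending=True)[:top_k]
    return [paths[i] for i in best]
```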

This process, along with some extra features, is implemented in my Github repository: https://github.com/ManuelFay/ImageSearcher.

During the indexing phase, the code uses the os library to find all pictures in a given directory and its subdirectories, then embeds and stores the vectorized representations using the transformers and pickle libraries. At run time, the pickled embeddings are loaded, matched against the embedded query, and the N best-ranked images are returned. A Flask / Gunicorn API is provided so the search engine can be used efficiently from an external interface, along with a simple Google Image Search-like web interface built with Vue.js.
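
As an illustration of how such an API can be wired up, here is a minimal Flask sketch; the module name, endpoint and parameters are hypothetical and do not reflect the repository's actual interface.

```python
from flask import Flask, jsonify, request
from image_searcher import search  # hypothetical module exposing the search() sketched above

app = Flask(__name__)

@app.route("/search")
def search_endpoint():
    # e.g. GET /search?query=a+photo+of+a+cute+kitten&n=10
    query = request.args.get("query", "")
    n = int(request.args.get("n", 10))
    return jsonify({"query": query, "results": search(query, top_k=n)})

# In production, serve it behind Gunicorn, e.g.:
#   gunicorn -w 2 "api:app"
```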

Examples

“A photo of a cute kitten”. Here, images that exist twice on my computer appear as duplicates.

To obtain a large number of pictures, I downloaded my Messenger archives from Facebook, obtaining about 10,000 pictures I had sent and received over the past few years.

“A photo of a guy with a sweatshirt and a cap in the mountains”: not every top result matches the query exactly, but the search is very effective overall.

The search engine handles very descriptive queries, with the best matches returned first. Note that these images all come from my ~10,000 local pictures, so the pool of options is limited.

Meta-queries are also possible. Here we request pictures taken by a drone:

“A drone shot”: 5 of the top 6 pictures were indeed taken by a drone, while the remaining picture could easily have been shot during a drone fly-by.

This was a quick afternoon project, but I was impressed by the precision of the CLIP model. To test it yourself, use the code from https://github.com/ManuelFay/ImageSearcher. Contributions for improvements and extra features are welcome!
