Search Engine for clothes using CLIP

Julianpegoraro
Published in Crayon Data & AI
Jun 28, 2023 · 8 min read

A few months ago I went to a conference for retail. The weeks before, I was very into OpenAI’s CLIP model, which combines Computer Vision models with Natural Language models (don’t worry, I’ll explain it later).

After brainstorming for a while about what we should show at this conference, I came up with a rough idea for finding the best matches between clothes and text. Unfortunately, we did not give it a fancy name like Tinder for clothes, or maybe the ChatGPT of clothing, although now that I think about it, that would have been a very nice marketing move. We called it: The Clothes Search Engine.

So what were we going to do with it? The idea was that we could find some similar clothes in a dataset containing a pool of clothes, just by doing a similarity search.

Similarity Search

When we talk about similarity search, we are talking about finding similar objects in a pool of objects. In the field of Computer Vision, we like to run a Neural Network over the whole dataset of images and compute a so-called embedding for each of them. I assume that you have some basic knowledge about Neural Networks; otherwise, this article would be much longer.

An example of an image embedding with Convolutional Neural Networks

As you can see in this image, we run a Convolutional Neural Network on an input image and get an n-dimensional vector as output. If we wanted to do some kind of classification on the image, we would need an additional head on top, which is just a Fully Connected Layer with N output channels, each representing a class.

We now assume that this n-dimensional embedding is representative of our input image, and that if we used similar images (like the same object from a different perspective) we would get similar embeddings. So how would we train something like this? Fortunately, we can do it in several ways. First, as mentioned above, we can simply train a model for classification (e.g., a ResNet on ImageNet) and use the output of the last layer before the final Fully Connected layer. Another approach is Contrastive Learning (e.g., SimCLR) or Self-Supervised Learning (e.g., DINO), where the model is trained so that the embeddings of (augmented versions of) the same image are as similar as possible (OK, this is very simplified, but check the papers if you are interested in this stuff; it is awesome what they are doing there).
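
To make this concrete, here is a minimal sketch of turning a pretrained ResNet into an image embedder by dropping its classification head. I am using torchvision here as an assumption on my side, not necessarily what the demo used:

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-50 with the final Fully Connected (classification) layer removed
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1])
embedder.eval()

# Standard ImageNet preprocessing
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("some_dress.jpg").convert("RGB")  # hypothetical example image
with torch.no_grad():
    x = preprocess(image).unsqueeze(0)     # shape: 1 x 3 x 224 x 224
    embedding = embedder(x).flatten(1)     # shape: 1 x 2048, our image embedding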

We just talked about Convolutional Neural Networks, but we are not limited to this kind of model. Other models like Vision Transformers do the same thing. We compute an embedding, add a head to it, and train an image classifier on it. We can use this embedding like we used the convolutional Neural Network embedding.

But what can we do with these embeddings? As already mentioned, we assume that these embeddings are very similar to each other if they represent the same object, so the idea is to compute the similarity between two embeddings to see how similar they are. Most often we use cosine similarity:
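
For two embedding vectors a and b, the cosine similarity is the dot product divided by the product of their lengths:

cosine_similarity(a, b) = (a · b) / (‖a‖ ‖b‖)

It is 1 when the two vectors point in exactly the same direction and gets smaller the more they diverge, which is why, after normalizing the embeddings to unit length, a simple dot product is all we need.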

So if we can compute image embedding with a Vision Transformer, why can’t we compute text embedding with a Transformer Encoder? Well, it turns out that we can (paper).

CLIP

Now that we know how to compute the embeddings and how similar the two embeddings are, we can start to understand how the CLIP model works in general. The idea is very simple, and to be honest, sometimes I think, “Why did I not think of this first” :) But then I remember that even if I had thought of it first, I could never train the underlying model with my personal money.

Example of CLIP training by OpenAI: https://openai.com/research/clip

So how do we do this? The idea is to compute a set of image embeddings on one side and a set of matching text embeddings on the other. Now, if we normalize all our embeddings (remember the cosine similarity formula), we can multiply the two lists of embeddings and we get a matrix. Let’s do the quick math on this:

Let’s say we have 8 image-text pairs and the image and text embeddings are of size 512 (yes, both have to be of the same size, we’ll discuss this later). We get two tensors of size 8×512 (a=8×512, b=8×512). If we now matrix multiply these two by transposing the second tensor (a×bᵗ), we get a matrix of size 8×8. The entry in row i and column j of this matrix is the similarity between image i and text j, so the diagonal holds the matching image-text pairs. This also means that we want the diagonal entries to be as high as possible and all other entries as low as possible. By maximizing the diagonal and minimizing the rest, we can train the image and text embedders simultaneously to project the embeddings into the same Latent Space.
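
In PyTorch, this objective can be written down in a few lines. This is only a rough sketch with random tensors standing in for the real embeddings, not the actual CLIP training code (which additionally scales the logits by a learned temperature):

import torch
import torch.nn.functional as F

# Stand-ins for the real, already normalized image and text embeddings (8 pairs, 512-d)
image_embeddings = F.normalize(torch.randn(8, 512), dim=-1)
text_embeddings = F.normalize(torch.randn(8, 512), dim=-1)

logits = image_embeddings @ text_embeddings.T   # 8x8 similarity matrix
targets = torch.arange(8)                       # matching pairs sit on the diagonal

# Symmetric cross-entropy: pushes the diagonal up and everything else down
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2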

How can we be sure that the two embedders will output embeddings of the same dimension (512-dimensional in our example)? Well, we can’t. That’s why we add an additional Fully Connected Layer to each of the two models to project the embeddings into the same Latent Space.
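
In code, these projection heads are nothing more than one linear layer per modality; the feature sizes below (2048 for the image backbone, 768 for the text backbone) are just assumed example values:

import torch
import torch.nn as nn
import torch.nn.functional as F

image_projection = nn.Linear(2048, 512)   # image backbone feature size -> shared 512-d space
text_projection = nn.Linear(768, 512)     # text backbone hidden size   -> shared 512-d space

image_embedding = F.normalize(image_projection(torch.randn(1, 2048)), dim=-1)
text_embedding = F.normalize(text_projection(torch.randn(1, 768)), dim=-1)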

The Clothes Search Engine

The first big problem was to get some data. Luckily, there is always a free dataset out there. In our case, it was a Zalando dataset, Feidegger [0], which has ~8,000 image-text pairs: pictures of dresses together with descriptions in German. To be honest, this was a lucky shot, since the conference was in Germany; that made it an easy decision to use a multilingual CLIP model, which I found on Hugging Face.

An example image from the dataset with two description texts and their English translations [0]

The first order of business was to compute the embeddings of all the images and store them in a database so that I could easily access them. Since I am not the first person to do this kind of search, there is already a library called faiss that can be used as both a vector store and a search engine. So once we have calculated all the image embeddings, we can add them to a faiss index. Using cosine similarity, we can then find the top K best-matching images for any embedding we pass in. faiss also returns the similarity scores of the matches (but we do not need them for our purpose).

And that’s it. That’s all the magic needed to make it work. Or at least it would be, if there weren’t a few problems and some extra features I wanted to add. But let’s start with some pseudocode first, to understand more clearly how everything is structured:

# Load necessary data

import faiss  # similarity-search library used as our index

image_embedder = CLIP.load_image_embedder_model("path_to_weights")
text_embedder = CLIP.load_text_embedder_model("path_to_weights")

def get_embeddings(model, data, preprocess=None):
    if preprocess is not None:
        data = preprocess(data)
    embeddings = model(data)
    # Normalize to unit length so the inner product equals the cosine similarity
    embeddings /= embeddings.norm(dim=-1, keepdim=True)
    return embeddings

def load_images(path):
    ...

def load_text(path):
    ...

images = load_images("path_to_images")
texts = load_text("path_to_texts")
preprocess_image = ...

# Compute embeddings

image_embeddings = get_embeddings(image_embedder, images, preprocess_image)
text_embeddings = get_embeddings(text_embedder, texts)

# Compute the prediction matrix (only needed when training)

prediction_matrix = image_embeddings @ text_embeddings.T

# Create the faiss index with dim=512

index = faiss.IndexFlatIP(512)  # Flat Inner Product
index.add(image_embeddings)

# Search for the 5 closest matches to a new image ...

one_new_image = load_images("path_to_one_new_image")
image_one_new_embedding = get_embeddings(image_embedder,
                                         one_new_image,
                                         preprocess_image)

distances, indices = index.search(image_one_new_embedding, 5)

# ... or to a new text query

one_new_text = load_text("path_to_one_new_text")
text_one_new_embedding = get_embeddings(text_embedder, one_new_text)

distances, indices = index.search(text_one_new_embedding, 5)

As you saw in the pseudocode, we can find the closest matching dress by searching with either an image or a text embedding. This means that we can build our demo UI (I would highly recommend Gradio for creating quick demo applications, it is very easy to use, but I will not go into details here).
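
Just to give an idea of how little wiring this needs, here is a minimal Gradio sketch. The helper reuses the names from the pseudocode above and is a hypothetical simplification, not the exact demo code:

import gradio as gr

def search_by_image(query_image):
    # Embed the uploaded image and look up the 12 closest dresses in the faiss index
    query_embedding = get_embeddings(image_embedder, [query_image], preprocess_image)
    _, indices = index.search(query_embedding, 12)
    return [images[i] for i in indices[0]]

demo = gr.Interface(fn=search_by_image,
                    inputs=gr.Image(type="pil"),
                    outputs=gr.Gallery(label="Closest matches"))
demo.launch()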

This interface allows you to search by uploading an image (or using a sample image). When we click on “Search with Image!” we can, well, search with the image and return the closest matching image or images. In my case, it returned the top 12 images.

You can also search using text only, as you can see in the following image (I searched for “Flowers”):

Now for a nice addition. Remember how we talked about how the embeddings are trained so that the text and image embeddings are projected into the same Latent Space? Well, we can use that fact to combine both embeddings. We just compute a linear interpolation between them, which should return some combination of the image and the text.
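
Since both embeddings live in the same space, the combination is only a couple of lines, again reusing the variables from the pseudocode (alpha is a mixing weight you could expose as a slider in the UI):

# Mix the image and the text embedding and re-normalize before searching
alpha = 0.5
combined = alpha * image_one_new_embedding + (1 - alpha) * text_one_new_embedding
combined /= combined.norm(dim=-1, keepdim=True)

distances, indices = index.search(combined, 12)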

Yay! It does what we hoped. We can clearly see that the style and sometimes color of the dress are kept similar but with additional flowers.

But wait, I mentioned some problems with this approach. As good as the pre-trained CLIP model is, it had some trouble combining German and dresses. Here is an example where I searched for “Punkte” (English: “dots”) and got some really bad results:

The reason for this is that the original CLIP model, and the multilingual variant as well, was trained on a huge amount of general data, but learning that the word “Punkte” should mean “dots on a dress” is a very domain-specific association we should not expect it to have picked up. So how did I get better results? By fine-tuning the multilingual CLIP model on the given dataset. And luckily, the text was already in German, which was exactly what we wanted.
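
The fine-tuning itself is just the contrastive objective from above applied to the Feidegger pairs. A rough sketch, with the dataloader, number of epochs, and learning rate as placeholder assumptions:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(list(image_embedder.parameters()) +
                              list(text_embedder.parameters()), lr=1e-5)

for epoch in range(num_epochs):                       # placeholder epoch count
    for batch_images, batch_texts in dataloader:      # batches of Feidegger image-text pairs
        image_emb = get_embeddings(image_embedder, batch_images, preprocess_image)
        text_emb = get_embeddings(text_embedder, batch_texts)

        logits = image_emb @ text_emb.T
        targets = torch.arange(len(batch_images))
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()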

After a few epochs, we got better results:

Still not really satisfying. But after some more training, the results got better:

While not perfect, the results improved after only a few epochs of training. To improve them further, I could have used a more descriptive query than just “Punkte”, but we can already see how easily the results get better just by fine-tuning the model on domain-specific data.

Conclusion

We learned something new today (I hope); at least I did while building this demo. But the most important takeaway is that it is sometimes easier than we expect to create quick demos that can be reused at different conferences (I could build the new tennis-racket Tinder at another one) or with customers, where being able to quickly pull out and present a demo can be the game changer.

References

[0] Lefakis, Leonidas, Alan Akbik, and Roland Vollgraf. “Feidegger: A multi-modal corpus of fashion images and descriptions in German.” Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018.
