Embarking on a Quest for the Best On-Site Search Experience: Leveraging Multimodal AI at Boozt

Gábor Jezerniczky
Published in Boozt Tech · Nov 9, 2023

Discover how Boozt’s Data Science team is pioneering a Proof of Concept (PoC) project that uses Google’s multimodal embeddings, built on a Vision Language Model, to map text and images into a shared vector space, aiming to greatly enhance on-site product search for a tailored customer experience.


The importance of search

In the world of e-commerce, finding the right product quickly is key to customer satisfaction. At Boozt, the challenge grows as our product range expands. We know that having an intuitive, accurate search feature is more than just a nice-to-have; it’s a necessity, ensuring customers can easily navigate our growing selection and find exactly what they’re looking for.

The AI Paradigm Shift

Recent advancements in Artificial Intelligence, especially the introduction of Large Language Models (LLMs), have created new opportunities to improve how users interact with digital platforms. A notable development is Google’s approach to multimodal generative AI search, which bridges the semantic gap between text and images and which we found to be a game-changer.

Our Quest for Excellence

Being at the forefront of employing data science for better customer engagement, our team at Boozt is ceaselessly exploring novel methods to refine the shopping experience while bolstering business efficiency.

Unfolding of the PoC

Recently we decided to revisit our on-site search architecture to see if we could apply machine learning to improve its quality. After some initial research, we found a fresh and promising blog post from Google describing the possibilities of using multimodal embeddings for search tasks.

(The blog post that initiated the PoC: https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search)

But what are multimodal embeddings?

In multimodal search, a deep learning model is trained on pairs of images and texts to learn the relationships between them, thus forming a Vision Language Model (VLM) that understands and organizes both textual and visual information. This facilitates the creation of a shared embedding space where both images and texts are mapped based on their semantic similarities: similar items are placed closely together in the embedding space, enabling efficient similarity searches across images and text. For instance, you could search for images using text queries or vice versa, making information retrieval more robust and intuitive across different data modalities.

Multimodal embeddings space — image-text pairs as vectors

In the example above only 3 dimensions are visualized in the plot, but the model actually uses 1408 dimensions. Imagine the level of detail and the relationships such a high-dimensional space can capture.
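To make the idea of a shared embedding space more concrete, here is a toy sketch in plain NumPy. The vectors are random stand-ins rather than real model outputs, but the comparison logic is the same: once text and images live in the same 1408-dimensional space, a single similarity metric can rank them against each other.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for 1408-dimensional embeddings produced by the same VLM:
# one vector for a text query, one per product image.
rng = np.random.default_rng(42)
query_embedding = rng.standard_normal(1408)
product_image_embeddings = {
    "green_striped_dress": rng.standard_normal(1408),
    "black_leather_jacket": rng.standard_normal(1408),
}

# Because text and images share the embedding space, the same metric
# ranks product images against a text query (and vice versa).
ranked = sorted(
    product_image_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
print(ranked[0][0])  # the product whose image is closest to the query
```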

To test the capabilities of a VLM, we first used two small subsets of product data: one consisting of 2000 products from the posters category, and another containing 3000 products from the women’s dresses category. We used Google’s Contrastive Captioner (CoCa) VLM to generate the multimodal embeddings from the product images.
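For illustration, generating such embeddings through the Vertex AI Python SDK looks roughly like the sketch below. This is the multimodal embedding API described in the Google blog post above; the project, location and file path are placeholders, and the exact SDK surface may change over time.

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project and region, not our actual setup.
vertexai.init(project="my-gcp-project", location="europe-west1")

# The multimodal embedding model returns 1408-dimensional vectors
# for images and text, mapped into the same embedding space.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    image=Image.load_from_file("product_images/dress_12345.jpg"),  # placeholder path
    contextual_text="pastel green striped dress",                  # optional text
)

image_vector = embeddings.image_embedding  # 1408 floats
text_vector = embeddings.text_embedding    # 1408 floats
print(len(image_vector), len(text_vector))
```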

The Contrastive Captioner (CoCa) model employs an encoder-decoder structure that combines a contrastive loss and a captioning loss to create aligned image and text embeddings. It performs well in zero-shot learning scenarios, particularly in image classification and cross-modal retrieval.
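At a high level, the CoCa paper describes the training objective as a weighted sum of those two losses: the contrastive term aligns the image and text embeddings, while the captioning term trains the decoder to generate the paired text from the image.

```latex
% High-level CoCa training objective: weighted contrastive + captioning losses
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}
```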

The model generated a 1408-dimensional vector for each product, which we indexed using Google’s Vertex AI Vector Search.
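For reference, creating and deploying such an index with the Vertex AI SDK goes roughly along these lines; the display names, bucket path and tuning parameters below are placeholders rather than our actual configuration.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")  # placeholders

# The embeddings are first written to Cloud Storage as JSONL, one datapoint
# per line: {"id": "<product_id>", "embedding": [ ...1408 floats... ]}

# Create an approximate nearest neighbor (ANN) index over the vectors.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="product-image-embeddings",
    contents_delta_uri="gs://my-bucket/embeddings/",  # placeholder bucket
    dimensions=1408,
    approximate_neighbors_count=50,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

# Deploy the index behind an endpoint so it can be queried.
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="product-search-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index, deployed_index_id="product_search_v1")
```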

Indexing enables fast information retrieval. Then it was time to test the new search! The following steps happen under the hood after we type in a search query and hit the search button (a rough code sketch of this flow follows the list):

  1. The search query is converted into a multimodal embedding vector.
  2. This vector is compared with the already indexed product embeddings using approximate nearest neighbors (ANN).
  3. The results are displayed, sorted by similarity.
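A minimal sketch of that query-time flow, reusing the embedding model and the deployed index from the sketches above (the endpoint resource name is a placeholder):

```python
from google.cloud import aiplatform
from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    "projects/my-gcp-project/locations/europe-west1/indexEndpoints/1234567890"  # placeholder
)

def search(query: str, top_k: int = 10) -> list[str]:
    # 1. Convert the search query into a multimodal embedding vector.
    query_vector = model.get_embeddings(contextual_text=query).text_embedding

    # 2. Compare it with the indexed product embeddings using ANN.
    response = endpoint.find_neighbors(
        deployed_index_id="product_search_v1",
        queries=[query_vector],
        num_neighbors=top_k,
    )

    # 3. The matches come back sorted by similarity; return the product ids.
    return [neighbor.id for neighbor in response[0]]

print(search("pastel green stripes"))
```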

After thorough testing, and knowing that only the product images were used to produce these results (no product metadata at all), we were quite happy that we could, for example:

Search for attributes effectively:

Search query: “pastel green stripes”

Search for styles pretty accurately:

Search query: “goth style”

Search for abstract meanings:

Search query: “happy childhood”

Search for text, and the model actually understands text printed on the image:

Search query: “copenhagen”

Although we were quite amazed by the capabilities of such a model, we also found the weaknesses of relying solely on multimodal embeddings of the product images:

Searching for specific brands, product names or categories did not yield the best results:

Search query: “kolekto” (no hits in the top 3)

Searching in different languages sometimes returned unwanted products:

Search query: “geltonai taškuotas” (yellow dotted in Lithuanian)

Current Challenges

Based on the experience gained from testing, the next step seems obvious: how can we overcome the weaknesses of the model?

In the current stage of the project we are experimenting with effectively combining the product metadata (brand names, product names, categories, etc.) with the multimodal embeddings. For this to work, we need to identify whether the search query contains any of this metadata, so that we can filter and rank the products accordingly. A possible approach could be a NER (Named Entity Recognition) model fine-tuned on the product metadata.
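As a toy illustration of the hybrid idea, the sketch below uses a simple gazetteer lookup in place of the fine-tuned NER model we are considering, and a plain post-filter in place of proper filtering and ranking; all names and data are made up:

```python
# Hypothetical gazetteers and metadata, standing in for our product catalogue.
KNOWN_BRANDS = {"kolekto"}
KNOWN_CATEGORIES = {"dress", "poster"}
PRODUCT_BRANDS = {"12345": "kolekto", "67890": "otherbrand"}

def extract_metadata(query: str) -> dict:
    """Naive entity extraction; a fine-tuned NER model would replace this."""
    tokens = query.lower().split()
    return {
        "brands": {t for t in tokens if t in KNOWN_BRANDS},
        "categories": {t for t in tokens if t in KNOWN_CATEGORIES},
    }

def rerank_with_metadata(query: str, candidate_ids: list[str], top_k: int = 10) -> list[str]:
    """Filter the semantic candidates by any brand detected in the query."""
    meta = extract_metadata(query)
    if meta["brands"]:
        candidate_ids = [
            pid for pid in candidate_ids if PRODUCT_BRANDS.get(pid) in meta["brands"]
        ]
    return candidate_ids[:top_k]

# The candidate ids would come from the vector search sketch earlier.
print(rerank_with_metadata("kolekto poster", ["67890", "12345"]))  # -> ['12345']
```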

While the model performs relatively well in widely spoken and well-known languages, it struggles with less commonly spoken languages, such as Lithuanian. Overcoming this issue might be challenging. The simplest approach would be to detect these cases and translate the search query before converting it to embeddings. Fine-tuning the model could be a very resource-intensive task, as we would need to create many more image-text pairs, but it might be worth exploring its feasibility.
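A sketch of that simple translate-first approach, using the Cloud Translation client (one possible option; confidence thresholds and error handling are left out):

```python
from google.cloud import translate_v2 as translate

client = translate.Client()

def normalize_query(query: str, target: str = "en") -> str:
    """Translate queries in less common languages to English before embedding."""
    detection = client.detect_language(query)
    if detection["language"] == target:
        return query
    return client.translate(query, target_language=target)["translatedText"]

# e.g. "geltonai taškuotas" -> roughly "yellow dotted", which the VLM handles better
embedding_input = normalize_query("geltonai taškuotas")
```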

Conclusion

The capabilities of new AI models are rapidly advancing. At Boozt, our focus isn’t just on experimenting with these tools, but on implementing them effectively. It’s important to note that while these models offer a strong foundation, they aren’t plug-and-play solutions. Each business and application has its unique demands, nuances, and challenges that require a customized touch. Our machine learning developers play a crucial role in this. They adapt and refine ‘off-the-shelf’ AI products to ensure they fit our specific requirements and maintain our high standards. At Boozt, we believe in harnessing the power of state-of-the-art AI, but more importantly, we believe in the human expertise that refines and perfects it.

About the author

Gábor Jezerniczky | Data Scientist, with a passion for Machine Learning and NLP.
