Enhancing Vector Search with Gemini as Multimodal Re-Ranker

Why Vertex AI Vector Search and Gemini’s Multimodal Capabilities are a Perfect Match

Sascha Heyer
Google Cloud - Community
5 min read · Jul 11, 2024


Traditional vector search techniques primarily focus on visual, textual, or, more generally, similarity within the embedding space.

This approach often overlooks the contextual richness that can significantly enhance search relevance.

This article explores how multimodal models like Gemini can act as powerful re-rankers, refining search results by integrating diverse data sources, including images and text.

Fully Scalable, Production-Ready Solution

The full code for this article is pushed to GitHub and ready to be used. If you have questions, don’t hesitate to contact me via LinkedIn.

Why multimodal models as rankers are a good idea, and the main problem without them

Imagine we have many product images stored in a vector database. Some of those products have multiple photos from different angles.

Let's assume we want to find all images for one specific product, like the BEATS SOLO3 Headphones.

During retrieval from the vector database, we also get a similarity score along with potentially similar product images. The similarity score isn’t a calibrated probability, meaning we get different distributions for different use cases. Therefore, we cannot simply define a fixed threshold or measure the quality to ensure those headphones are the same.
In other words, you end up in a situation where a few products always slip through this threshold.

  • If the similarity threshold is > 0.7, we will include wrong products.
  • If the similarity threshold is > 0.8, we will miss products.

It’s hard to find the perfect threshold. A threshold of > 0.75 would fit this example, but remember, the threshold varies for different images, and you cannot always get it right.
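To make the threshold dilemma concrete, here is a minimal sketch with made-up similarity scores and file names; any single cutoff either admits a wrong product or drops a correct photo:

```python
# Hypothetical similarity scores from a vector search; the boolean marks
# whether the candidate really is the same product (made-up data).
candidates = [
    ("beats_solo3_front.jpg", 0.92, True),
    ("beats_solo3_side.jpg", 0.78, True),   # same product, lower score
    ("beats_studio_pro.jpg", 0.74, False),  # different product, close score
    ("generic_headphones.jpg", 0.61, False),
]

def filter_by_threshold(candidates, threshold):
    """Keep only candidates whose similarity score exceeds the threshold."""
    return [name for name, score, _ in candidates if score > threshold]

# A loose threshold admits the wrong product ...
print(filter_by_threshold(candidates, 0.7))
# ... while a strict one drops a correct photo of the same product.
print(filter_by_threshold(candidates, 0.8))
```

No cutoff separates the 0.78 true match from the 0.74 false one for every image, which is exactly why a re-ranking step helps.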

Some use cases can tolerate this fuzziness, but others require a more accurate solution. To solve this challenge, we use the output of our vector search as input for our multimodal ranker, which leads us to exact search results.

Important to understand
While this was an image-related example, it also works for text or any other modality. There are no limitations to the re-ranker. Anything you can describe as a prompt can be used.

Stay with me 🦄. We cover the actual implementation, including the prompt, now.

Understanding the Concept of Multimodal Models as a Ranker

Initial Vector Search

The initial vector search retrieves a set of candidate results based on similarity. This foundational step generates a broad spectrum of relevant items.
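Conceptually, this first step is a nearest-neighbor lookup in embedding space. A minimal NumPy sketch of that lookup, with made-up 2D vectors standing in for real image embeddings:

```python
import numpy as np

def top_k_neighbors(query, index, k=2):
    """Return indices of the k most cosine-similar vectors in `index`."""
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ query            # cosine similarity per stored vector
    return np.argsort(-scores)[:k]    # highest similarity first

# Tiny made-up "database" of 2D embeddings and a query close to the first two.
index = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
print(top_k_neighbors(query, index))  # indices of the two nearest vectors
```

Vertex AI Vector Search does the same thing at scale with approximate nearest-neighbor indexes instead of a brute-force scan.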

Multimodal Refinement

A multimodal model like Gemini subsequently processes the retrieved results. This model employs a combination of visual features, text analysis, metadata, and contextual information to refine the search results further.

The multimodal model re-evaluates the initial search results (candidates) using a prompt describing how we want to rank. We can also include additional information, such as images, which provide visual features alongside related text descriptions, tags, and contextual metadata. For text, it considers semantic meaning, contextual relevance, and associated visual data if applicable.

By considering these additional factors, the multimodal model reorders the results to better match the user’s query and intent.

This might sound complex, but in the end, it’s just a carefully crafted prompt that defines what the ranker should do.

Usage

Usage is easy and flexible to adapt to different re-ranking use cases.

  1. We retrieve the vector search results.
  2. We add the retrieved results as parts of our multimodal model request. In this case those are images, but they could also be text, video, or audio. Any modality supported by Gemini can be used.
  3. We write the prompt that defines the re-ranker. In our case we want to re-rank based on product similarity. You can add additional information to your re-ranking prompt; for example, if you want to re-rank based on user information, you could provide that to the model as well.
  4. Additionally, we use Controlled Generation to ensure the Gemini model always returns valid JSON. Check out my article about Controlled Generation.

With this well-defined four-step process, we can re-rank the vector search results with the help of a multimodal model.

# Assumes vertexai.init(...) has been called and that index_endpoint,
# embeddings, gen_model, and safety_settings are created beforehand
# (see the full code on GitHub).
from vertexai.generative_models import GenerationConfig, Part

# 1. Retrieve the vector search results.
matches = index_endpoint.find_neighbors(
    deployed_index_id="product_similarity",
    queries=[embeddings],
    num_neighbors=10,
)
print(matches)

# 2. Add the retrieved images as parts for the multimodal model.
#    find_neighbors returns one list of MatchNeighbor objects per query.
limit = 5  # maximum number of candidates to send to the re-ranker
parts = []

for idx, match in enumerate(matches[0]):
    if idx >= limit:
        break

    image_uri = f"gs://doit-image-similarity/{match.id}"
    parts.append(f"({image_uri})")
    parts.append(Part.from_uri(uri=image_uri, mime_type="image/jpeg"))

# 3. The prompt that defines the re-ranker.
prompt = """We have a product database and we need to find similar products.
Given the following product images, return the ones that are the same.
"""

parts.append(prompt)

# 4. Controlled Generation: constrain the model output to this JSON schema.
response_schema = {
    "type": "object",
    "properties": {
        "matching_product_urls": {
            "type": "array",
            "items": {
                "type": "string"
            }
        }
    },
    "required": ["matching_product_urls"]
}

generation_config = GenerationConfig(
    temperature=1.0,
    max_output_tokens=8192,
    response_mime_type="application/json",
    response_schema=response_schema,
)

responses = gen_model.generate_content(
    parts,
    generation_config=generation_config,
    safety_settings=safety_settings,
    stream=False,
)

multimodal_result = responses.candidates[0].content.parts[0].text
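Because Controlled Generation constrains the output to the schema above, the response text can be parsed directly as JSON. A small sketch using a sample response string (made-up URLs) in place of a live Gemini call:

```python
import json

# Example of what `multimodal_result` could look like for our schema
# (made-up URLs; a real value comes from the Gemini response above).
multimodal_result = (
    '{"matching_product_urls": ['
    '"gs://doit-image-similarity/beats_solo3_front.jpg", '
    '"gs://doit-image-similarity/beats_solo3_side.jpg"]}'
)

result = json.loads(multimodal_result)
matching = result["matching_product_urls"]
print(matching)  # the re-ranked subset the model judged to be the same product
```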

Limitations

While multimodal and LLM models improve almost weekly, they still add latency to this approach. The latency depends on your use case, the model used, and the number of input tokens you send to the model.

Having images as part of your ranker adds significantly to the latency.

Using only text as input for the re-ranker can lower the latency. Make sure your application is suitable for this additional latency and measure it.
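To check whether your application can absorb this overhead, time the re-ranking call itself. A sketch with a stand-in function simulating the Gemini call:

```python
import time

def rerank(parts):
    """Stand-in for the Gemini re-ranking call; simulates model latency."""
    time.sleep(0.05)  # a real multimodal call typically takes far longer
    return {"matching_product_urls": []}

start = time.perf_counter()
rerank(["image parts go here"])
elapsed = time.perf_counter() - start
print(f"re-ranking took {elapsed * 1000:.0f} ms")
```

Measure with your real prompt and images; latency grows with the number and size of the input parts.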

Conclusion

The refinement step improves the relevance of the search results. The model ranks the items not solely on visual or textual similarity but also on how well they fulfill a ranking prompt.

This approach ensures that the most contextually appropriate and semantically relevant results appear higher in the list and can act as a more advanced filter.

The ranker enhances the user experience by providing more relevant and accurate results. Users are more likely to find what they seek quickly and efficiently, whether searching via images or text.

Thanks for reading

I appreciate your feedback and questions. You can find me on LinkedIn. Even better, subscribe to my YouTube channel ❤️.
