Making groceries search results more relevant at Mr D

Stephen Hunter
Takealot Engineering
7 min read · Jul 23, 2024

Introduction

Mr D started offering customers the ability to order groceries for delivery in late 2022. When we first launched groceries, we served search results in a relatively “vanilla” fashion: lexical matching to product label and section title.

What this means is that one or more words in the search phrase need to match the product label or the name of the category to which the item belongs.

The result of this is that some items might not get served in the search results when they should be. For example, searching for “toilet paper” might leave out products with the label “toilet tissue” and vice versa.

In addition, sometimes no results are found for a relatively straightforward search. One of my favourite examples of this is the search “craft beer”. Even though many stores sell a selection of craft beers, no results would be returned for that search.

An unsuccessful search for “Craft beer” with lexical term matching

We knew this was something we wanted to improve upon. As anybody who has been following advancements in AI recently can tell you, the ability of machine learning models to learn from, process, and generate natural language has progressed by leaps and bounds. A key part of this is the concept known as an embedding.

Embeddings are high-dimensional vectors that represent words, phrases, or even entire sentences. Each dimension captures some aspect of the meaning, usage, or context of the word(s) involved. This allows models such as Large Language Models (LLMs) to essentially “understand” the semantics contained in a piece of text.

To add to this, it is important to know that embedding vectors of similar things will tend to cluster together in the vector space. In the context of groceries, things like fruit (“bananas”, “apples”, “strawberries”, etc.) will cluster together in one part of the space, while cleaning materials (“detergent”, “soap”, etc.) will cluster in a different part of the space.
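To make this concrete, here is a tiny illustrative sketch (not part of our pipeline) using the open-source sentence-transformers library and a small general-purpose text embedding model. The model name here is just an example; any reasonable text embedding model shows the same clustering effect:

```python
# Illustrative only: embeddings of related grocery terms end up close together.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example general-purpose text embedding model

terms = ["bananas", "apples", "strawberries", "detergent", "soap"]
vectors = model.encode(terms)  # one embedding vector per term

# Fruit terms score much higher against each other than against cleaning products.
print(util.cos_sim(vectors[0], vectors[1]))  # bananas vs apples     -> relatively high
print(util.cos_sim(vectors[0], vectors[3]))  # bananas vs detergent  -> relatively low
```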

We wanted to harness this power of embeddings in our groceries search offering. Luckily, a key part of the solution was a feature that had already been introduced to OpenSearch, the search engine we were using in the background. This feature is known as “k-NN search”, which is short for k-nearest neighbours. It allows us to search for the k-nearest neighbours to a query point across an index of vectors [1].

Putting it another way, with k-NN search we assign each product in our catalogue a vector, and when searching, the search term (query) is translated to a vector in the same space. The nearest points to the query are the results most closely matching the query. When the vectors involved represent the essential idea of what the items and queries are about, this can be called Semantic Search.
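For intuition, this is what k-NN search boils down to, sketched here as a brute-force cosine-similarity scan in NumPy. In production the search engine uses an approximate index rather than scanning every vector (more on that below):

```python
import numpy as np

def top_k_neighbours(query: np.ndarray, catalogue: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k catalogue vectors most similar to the query (cosine similarity)."""
    query = query / np.linalg.norm(query)
    catalogue = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    scores = catalogue @ query           # similarity of every product vector against the query
    return np.argsort(-scores)[:k]       # indices of the k highest-scoring products

# Hypothetical tiny vectors; real CLIP vectors have hundreds of dimensions.
catalogue_vectors = np.random.rand(1000, 4)
query_vector = np.random.rand(4)
print(top_k_neighbours(query_vector, catalogue_vectors, k=3))
```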

This technology not only addresses the “craft beer” search issue but allows customers to use natural language to search for a broad set of items, such as “cleaning stuff”, “healthy snacks”, “cold cuts”, and so on.

Embeddings do not have to be purely text-based. In the context of a grocery catalogue, there are cases where an item’s name could be too specific. For example, a craft beer might only be called by its brand name, and contain none of the words “craft”, “beer”, “ale” or “lager” in its name. In addition, the brand name might be too obscure for any pre-trained model to be able to make the connection to craft beer.

For this reason, using the image of the item could be a powerful way of transcending the limitations of purely text-based semantic search. Luckily for us, CLIP, a powerful open-source multi-modal embedding model, was already available for us to test and potentially use for vector-based search.

As the saying goes, “a picture is worth a thousand words”!

CLIP for semantic search

What is CLIP?

In 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training) [2]. This is a multi-modal deep learning model that works on both images and text. It can act as a zero-shot visual classification model, only requiring a user-supplied list of categories alongside the image to be classified.

It consists of two encoders, one for images, and one for text, which are used to embed images and text into a common vector space.
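As an illustration of the two encoders, here is a minimal sketch using the open-source Hugging Face implementation of CLIP. The checkpoint name and image file are examples rather than the exact ones we use:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("craft_beer_product.jpg")  # hypothetical product image
text_inputs = processor(text=["craft beer"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_vec = model.get_text_features(**text_inputs)     # text encoder output, shape (1, 512)
    image_vec = model.get_image_features(**image_inputs)  # image encoder output, shape (1, 512)

# Both vectors live in the same space, so their cosine similarity is meaningful.
print(torch.nn.functional.cosine_similarity(text_vec, image_vec).item())
```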

For those interested, OpenAI’s blog post on CLIP is very informative [2].

How we used CLIP for search

So how does CLIP extend to semantic search? Put simply, we used the pre-trained CLIP model to produce vectors for the full product catalogue. This resulted in a database of items and their respective vectors, where these vectors represented CLIP’s “understanding” of four essential attributes of each item:

  • Image, e.g. the product photo
  • Label, e.g. “Apples 1.5kg”
  • Aisle name, e.g. “Fresh Fruit and Vegetables”
  • Sub-aisle name, e.g. “Fresh Fruit”

As depicted in the figure below, the image vector is obtained from CLIP’s image encoder, while the text vectors are obtained from CLIP’s text encoder. The four vectors are stored in a database for future retrieval.

Before we index these products into OpenSearch, we add the vectors together to produce a single vector that is the sum of the four vectors. This allows us to capture both the image and text attributes of the model in one stand-alone vector. We can do this because both the text and image embeddings of CLIP operate in the same vector space.

The specific weight given to each vector was chosen through a trial-and-error approach in which we compared retrieval accuracy across various test cases.

Our vector production pipeline
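As a sketch, combining the four vectors might look like the snippet below. The weights are purely illustrative (a plain sum uses equal weights); as mentioned, ours were tuned by trial and error:

```python
import numpy as np

def combine_vectors(image_vec, label_vec, aisle_vec, sub_aisle_vec,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four attribute vectors, re-normalised to unit length."""
    stacked = np.stack([image_vec, label_vec, aisle_vec, sub_aisle_vec])
    combined = (np.array(weights)[:, None] * stacked).sum(axis=0)
    return combined / np.linalg.norm(combined)
```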

Finally, the resulting vectors were indexed into our OpenSearch cluster. This paved the way for us to use OpenSearch’s built-in vector search functionality to conduct searches and serve semantic search results.

Finding nearest products

Indexing the products into OpenSearch requires choosing which nearest neighbour algorithm will be used for conducting searches. The indexing process makes it faster to search through a very large number of items, and typically means a small compromise in retrieval accuracy compared to an exact brute-force search.

Based on our reading of the OpenSearch documentation and our use case, the best option was the Hierarchical Navigable Small World (HNSW) algorithm coupled with the Lucene engine [3].
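Creating such an index could look roughly like this, using the opensearch-py client. The index name, field names, vector dimension and HNSW parameters shown here are illustrative rather than our production values:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="groceries-semantic",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "label": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 512,  # CLIP ViT-B/32 produces 512-dimensional vectors
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                },
            }
        },
    },
)
```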

Serving Search Vectors

An essential part of vector search is being able to translate a search term, such as “craft beer” into a vector, so that the search engine can take that vector and perform k-NN search against it.

For Mr D, this is achieved by hosting the CLIP model’s text encoder in a dedicated service that can serve these vectors in real time. Each time we want to execute a search, we fire off a request to a SageMaker endpoint, providing it with the word or phrase for which we want a vector, such as “Bananas” or “Craft beer”. The endpoint then responds with the vector within about 100ms.

Serving vectors for search
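Calling the endpoint is a simple request-response round trip. Here is a sketch with boto3, where the endpoint name and the request/response payload shapes are assumptions for illustration:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="clip-text-encoder",        # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"text": "craft beer"}),
)
query_vector = json.loads(response["Body"].read())["vector"]  # assumed response format
```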

Searching for items

Upon receiving the desired vector, the search service submits two parallel search requests using msearch(), as sketched in the snippet after the list below:

  1. Normal lexical search against the search term, e.g. “Bananas”
  2. Vector search against the vector that was received, e.g. [-0.13, 0.29, 0.76, …]
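Continuing the sketches above (and reusing the client and query_vector from earlier), the two requests could be bundled into a single msearch() call like this. Field names, result sizes and the score threshold are illustrative:

```python
body = [
    # 1. Lexical search against the raw search term
    {"index": "groceries-semantic"},
    {"size": 20, "query": {"match": {"label": "craft beer"}}},
    # 2. Vector search against the query vector returned by the text encoder
    {"index": "groceries-semantic"},
    {
        "size": 20,
        "min_score": 0.75,  # similarity threshold to keep out irrelevant items
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 20}}},
    },
]

responses = client.msearch(body=body)["responses"]
lexical_hits = responses[0]["hits"]["hits"]
knn_hits = responses[1]["hits"]["hits"]
```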

When performing k-NN search, we get a similarity score for each item as measured against the target vector. We can specify a threshold for this score when submitting the search, which becomes important if we want to reduce the presence of irrelevant items in our search results. The effect is most noticeable when submitting obscure queries for things that do not exist in the catalogue. To arrive at this threshold, we did extensive testing to balance the trade-off between retrieval and relevance.

The result of each of the above search requests will be a ranked list of items, from most to least relevant as determined by the settings of the lexical search and k-NN search functions of our OpenSearch cluster.

Since we need to serve a single result set to the customer who performed the search, we have to combine the ranked lists somehow. For that, we use a simple algorithm called Reciprocal Rank Fusion (RRF). RRF combines the two lists into a single list, while still accounting for the position of each item in both lists. [4]
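RRF itself fits in a few lines. Here is a minimal sketch, using the constant k=60 that is commonly used in the literature (our exact constant and item identifiers may differ):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of item IDs into a single list, best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse the lexical and k-NN result lists by product ID
fused = reciprocal_rank_fusion([["sku1", "sku2", "sku3"], ["sku3", "sku1", "sku4"]])
```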

Conclusion

And there you have it! While not every technical aspect of our Semantic Search implementation was covered here, we hope this article gives a better understanding of the interesting technology used to translate search queries into results on the Mr D app.

To finish off, search results for some natural language searches are shown below. Importantly, take note of how the labels of the resulting items often do not contain any of the words used in the search phrase.

Semantic search for “cleaning stuff”
Semantic search for “healthy snacks”

And to go full circle on the Craft Beer search, this is what the result looked like soon after we deployed the solution to production:

Finally, a successful search for Craft Beer!

References

  • [1] OpenSearch k-NN documentation
  • [2] OpenAI CLIP blog post
  • [3] OpenSearch approximate k-NN documentation
  • [4] RRF article
