MILVUS HNSW Exploration

FARFETCH Technology
9 min read · Apr 12, 2023

By Pedro Moreira Costa

Introduction

FARFETCH Chat R&D aims to deliver a conversational recommender system that helps users interact with the FARFETCH product catalogue through natural conversation. It is positioned as a chatbot that allows users to communicate via text and product images. Users can initiate a conversation and refer to a product they’re interested in. For example, a user can request products similar to a picture of a jacket they submitted. The chatbot answers with a selection of the jackets most similar to the user’s request. See Figure 1 for a screenshot.

Figure 1: FARFETCH Chat show similar showcase

Products are retrieved based on the similarity between the user query embedding and the products’ embedding, using a Vector Search Engine (VSE).

FARFETCH Chat R&D currently employs Milvus as a VSE to support the product catalogue embeddings and uses Hierarchical Navigable Small Worlds (HNSW) [3] as the index to deliver product retrievals. In this article, we will research the best possible HNSW configuration in Milvus. The main goal is to study the execution times during indexation and search for a wide range of parameter configurations.

HNSW methodology

HNSW is a graph-based Approximate Nearest Neighbour (ANN) algorithm based on Navigable Small World (NSW) graphs. HNSW incrementally builds a multi-layer structure with clustered subsets of the added data points. The topmost layer at which a new data point is added is drawn from an exponentially decaying probability distribution. The newly inserted data point is then re-inserted into all lower layers down to layer 0. By populating the upper layers with less dense clusters, HNSW gains a performance boost over NSW.

Once the data point position is calculated, the algorithm runs a two-phase index creation process. For an in-depth analysis, please refer to [1].
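To make the layer-assignment step concrete, here is a minimal Python sketch of the level-selection rule described in [3]; the normalisation factor mL = 1/ln(M) is the value suggested in the paper, and the function name is ours:

import math
import random

def random_level(M: int) -> int:
    # Draw the insertion layer from the exponentially decaying
    # distribution used by HNSW, with mL = 1 / ln(M).
    mL = 1.0 / math.log(M)
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    return int(-math.log(1.0 - random.random()) * mL)

# With M = 16, the vast majority of points land on layer 0
# and only a small fraction reaches the upper layers.
levels = [random_level(16) for _ in range(100_000)]
print({level: levels.count(level) for level in sorted(set(levels))})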

Experimental Setup

Machine

CPU: Intel Xeon E5–2690 v4
RAM: 112 GB
Disk: 1024 GB HDD
Operating System: Linux 16.04-LTS
Environment: Anaconda 4.8.3 with Python 3.8.12
VSE: Milvus 2.1.2 Standalone

Product Catalogue

We are currently using a snapshot of the FARFETCH Product Catalogue containing a total of 750,578 scrubbed products, spanning:

3,484 brands;
243 categories.

Encoder model

We leveraged a state-of-the-art model called CLIP [2] to map product descriptions and product images to a common embedding space. CLIP learns multimodal representations through contrastive training on image-text pairs, aligning the visual embeddings of images with the text embeddings of the corresponding descriptions. This enables querying for images with a text input (and vice-versa), since the corresponding text embedding can be used to retrieve the closest visual embeddings in the catalogue.

By fine-tuning the pre-trained CLIP model on the FARFETCH product catalogue, we obtained a model capable of retrieving products whether the input query is an image or text. The resulting embeddings have a dimension of 512.
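As an illustration of how a text query maps into this embedding space, here is a minimal sketch using the public pre-trained CLIP checkpoint from Hugging Face as a stand-in for our internal fine-tuned model (the checkpoint name and library choice are assumptions):

import torch
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint whose text projection is also 512-dimensional.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=["Ralph Lauren polo shirts red"],
                       return_tensors="pt", padding=True)
    text_embedding = model.get_text_features(**inputs)

print(text_embedding.shape)  # torch.Size([1, 512])

The image side follows the same pattern with the processor’s images argument and model.get_image_features.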

The product representation consists of the average of the image and text embeddings (a minimal sketch of this averaging step follows the list below). The textual information used to produce the text embedding is a combination of the following product attributes:

Short description;
Brand;
Category tree;
Gender;
Colours.
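As a concrete illustration, here is a minimal sketch of the averaging step, assuming the 512-dimensional image and text embeddings have already been produced by the fine-tuned encoder; the L2 normalisation shown here is our assumption, not a confirmed detail of the production pipeline:

import numpy as np

def product_embedding(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    # Average the 512-d image and text embeddings into a single product vector.
    avg = (image_emb + text_emb) / 2.0
    # L2-normalise so that inner-product and cosine similarity coincide
    # (an assumption for this sketch; the production pipeline may differ).
    return avg / np.linalg.norm(avg)

# Hypothetical usage with pre-computed CLIP embeddings.
image_emb = np.random.rand(512).astype(np.float32)
text_emb = np.random.rand(512).astype(np.float32)
print(product_embedding(image_emb, text_emb).shape)  # (512,)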

HNSW parameter configurations

HNSW’s index creation unfolds in two phases. In the first phase, the algorithm uses a probability distribution to select the top layer at which the new data point will be inserted. During the second phase, each data point’s nearest neighbours are gathered and trimmed to M. This process is repeated from the insertion layer down to the bottom layer.

To identify how the algorithm parametrization impacts the results, we will tune the values for the construction and search parameters in HNSW.

hnswlib provides its own description of the parameters, as well as suggestions for their values.

The configurations listed below were inspired by [1], which explores HNSW, while considering Milvus’ limitations. Below is a brief explanation of the HNSW parameters:

M: number of selected nearest neighbours from the candidate list ef which will be bidirectionally linked in the second phase of a data point q insertion;
efConstruction: number of nearest neighbours candidates gathered during the second phase of a data point q insertion;
efSearch: number of nearest neighbour candidates gathered from greedily traversing each HNSW layer until finding a local minimum in the lowest layer (layer 0).

For a detailed explanation on these parameters and their employment during search and also indexing, please refer to [1].

Index configurations

Config 1:

  • M = 16;
  • efConstruction = 16;

Config 2:

  • M = 32;
  • efConstruction = 32;

Config 3:

  • M = 64;
  • efConstruction = 64;

Config 4:

  • M = 64;
  • efConstruction = 128;

Config 5:

  • M = 64;
  • efConstruction = 256;

Config 6:

  • M = 64;
  • efConstruction = 512;
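To make the mapping to Milvus concrete, below is a minimal pymilvus sketch of how one of these configurations (Config 5) could be applied. The collection name, the vector field name, and the inner-product metric are assumptions for illustration, not the exact values used in our system.

from pymilvus import connections, Collection

# Connect to the standalone Milvus instance (default host and port assumed).
connections.connect(host="localhost", port="19530")

# The collection is assumed to already exist and to be populated with the
# 512-d product embeddings under a vector field called "embedding".
collection = Collection("products")

# Config 5: M = 64, efConstruction = 256. The metric type ("IP") is an
# assumption and depends on how the CLIP embeddings are compared.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",
        "params": {"M": 64, "efConstruction": 256},
    },
)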

Search Configurations

Considered efSearch values: [16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

Number of returned results (topK): 10
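Continuing the pymilvus sketch above, a single search configuration could be issued as follows; note that Milvus exposes HNSW’s efSearch as the search parameter ef, and the query vector below is a stand-in for the CLIP-encoded text query:

import numpy as np

# Load the indexed collection into memory before searching.
collection.load()

# Stand-in for the encoded text query (dimension 512).
query_vector = np.random.rand(512).astype(np.float32).tolist()

results = collection.search(
    data=[query_vector],
    anns_field="embedding",                              # assumed field name
    param={"metric_type": "IP", "params": {"ef": 128}},  # efSearch = 128
    limit=10,                                            # topK = 10
)
print(results[0].ids, results[0].distances)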

Search types

We shall use the following nomenclature for clarity and readability:

Pure search: uses the extracted and encoded text query as input;
Hybrid search: uses the encoded text query along with the brand as a facet (when present in the query) as search input. This narrows down the search results by applying a brand filter (a minimal sketch follows below). When the brand is not available, it reverts to the pure search described above.
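Continuing the same sketch, hybrid search adds a boolean expression filter on a scalar field; the field name brand is an assumption about the collection schema:

# Hybrid search: identical vector query, restricted to a single brand.
# The filter narrows the candidate set before the vector comparisons.
hybrid_results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 128}},
    limit=10,
    expr='brand == "Ralph Lauren"',
)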

Observed queries

The observed queries vary in complexity. Some queries contain only a brand, while others also include categories and colours. They range from one of the simplest queries the system expects to a more complex one detailing the brand, category, and colour of the expected results.

“Ralph Lauren”;
“Ralph Lauren polo shirts red”;
“versace”;
“Versace Shoulder bags”;
“Dolce & gabbana”;
“Dolce & gabbana white t-shirt”.

Results Analysis

Indexation

During indexation, as the efConstruction parameter increases, more comparisons between nearest neighbours are necessary, leading to longer index creation times. Figure 5 demonstrates just that: each increase in the complexity of the index parametrization impacts the execution time.

Figure 5: Index Execution Time for all of the 6 considered index configurations

Do note that the collected execution time refers exclusively to the index creation after the collection has been populated. Table 2 presents the time required to upload the entire product catalogue into the collection and the total execution time of the indexing process. While Configuration 1 was affected by the first data upload, the remaining upload times differ less significantly. The total execution time was also impacted by the more complex configurations.

Table 2: Upload and index time required by each index configuration

Search

Figures 6 and 7 summarise all index and search configuration combinations used to search with the queries ‘Ralph Lauren’ and ‘Ralph Lauren polo shirts red’.

In Figure 6, it is noticeable that hybrid search is faster than pure search for all indexation and search configurations. Hybrid search execution times vary between 0.02 and 0.03 seconds for all configurations, while pure search ranges between 0.05 and 0.065 seconds. By using hybrid search, we can cut the execution time roughly in half compared to pure search. The execution times of all search configurations are consistent and always follow the same pattern across all indexation configurations.

Digging into the reasoning behind the execution time differences between hybrid and pure search, we tried to understand why this behaviour was happening for this query. Looking at our catalogue, we found that only 1% of the products belong to the brand Ralph Lauren. Hybrid search will therefore first perform a faceted search to select all Ralph Lauren products (a total of 7,833 products), and then perform a pure search over that small subset.

When comparing heavier and lighter indexing configurations, there are no significant differences in terms of execution time.

Looking at the execution times for the more complex query ‘Ralph Lauren polo shirts red’ (see Figure 7), it is again noticeable that hybrid search is faster than pure search for all indexation and search configurations. Hybrid search execution times vary between 0.02 and 0.035 seconds for all configurations, while pure search ranges between 0.05 and 0.065 seconds. Given that this hybrid search approach uses only the extracted brand facet, we narrow the index to Ralph Lauren products, similarly to what happens in the previous query. Thus, the search is carried out on a smaller subset of products, effectively reducing the comparisons necessary to return relevant products.

The remaining queries showed the same behaviour as the previous queries presented. It is noticeable that hybrid search is faster than pure search for all indexation and search configurations (please refer to Appendix A).

We can also observe consistency between search types, with minor variations in the overall execution time between index and search configurations, given Milvus’ query node mechanisms to speed up search queries. This consistency can also be attributed to the similar presence of the queried brands within the product catalogue and to the fact that we query Milvus sequentially (one query at a time). One additional note regarding the captured execution times concerns the inferred QPS (queries per second). The observed values do not seem to follow either Milvus’ own benchmark or ANN-benchmarks. However, we have previously found ANN-benchmarks to run an outdated Milvus version. They also strip VSEs and algorithms of background routines to ensure that the studied processes run single-threaded. Milvus’ own benchmark presents crucial hardware differences from our own machine and runs its tests on a machine allocated specifically for Milvus. In our case, our encoders shared resources with the Milvus instance.

The remainder of the queries present a similar behaviour (see Appendix B).

Figure 6: Global Execution time box plots for query “Ralph Lauren”
Figure 7: Global Execution time box plots for query “Ralph Lauren polo shirts red”

Table 3 presents the total occurrence of products for the brands present in the observed queries. All of the observed brands have a high product count, meaning that the efSearch value can be maxed out without compromising the number of returned results. Throughout this experiment, topK is fixed at 10, but, given this insight, we would be able to retrieve up to 234 relevant products before impacting the results’ quality (of all the more complex queries, “Versace Shoulder bags” is the one with the lowest presence in the catalogue).

Table 3: Product Catalogue brand and query product frequency

To further analyse the execution times of all indexation and search configurations, we computed paired t-tests (significance level of 0.05) for all index and search configuration combinations. The paired t-tests show that most configurations share the same mean execution time. However, configurations using index config 1 and an efSearch of 16, 64, or 128 almost never match the mean of any other configuration (theirs is higher), except among themselves (see Figure 8).
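As an illustration of the statistical procedure, here is a minimal sketch of a paired t-test between two configurations using SciPy; the execution times below are illustrative placeholders, not measured values:

import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-query execution times (seconds) for two configurations,
# measured over the same sequence of queries so the samples are paired.
times_config_a = np.array([0.051, 0.055, 0.049, 0.060, 0.052])
times_config_b = np.array([0.028, 0.031, 0.027, 0.033, 0.029])

t_stat, p_value = ttest_rel(times_config_a, times_config_b)

# At a significance level of 0.05, the null hypothesis of equal mean
# execution times is rejected whenever p_value < 0.05.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")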

Another interesting finding relates to the more complex configurations. They seldom present a statistical difference from the remaining configurations, except for the configuration using index config 6 and efSearch 2048. When observing all configurations whose mean differs from this one, we find that this config has the lowest median, one of the lowest spreads, and no outliers.

Figure 8: Paired t-test heatmap for query “Ralph Lauren polo shirts red”, using hybrid search only

Final Remarks

The presented experiment and its results were fundamental in allowing us to fit an efficient ANN algorithm into the system and serve the entire product catalogue better. From the previous section, we could identify configurations whose means differ from all others, as well as complex configurations whose means did not. By studying the distribution of those configurations, we found the median, the IQR, and the min and max values to hold little significance for our use case. FARFETCH Chat R&D aims to deliver relevant products based on the user’s text and image inputs and, as such, places a premium on result quality, assuming the VSE performance meets the project requirements.

HNSW remains one of the top-performing ANN algorithms in terms of speed and quality, both of which can be tuned to benefit one another. While this analysis focused specifically on execution time, it is imperative also to consider the results’ quality. Moving forward, we will be exploring these configurations regarding the relevance of the retrieved products to the user queries.

References

[1] Hierarchical Navigable Small Worlds (HNSW) | Pinecone

[2] Learning Transferable Visual Models From Natural Language Supervision

[3] Efficient and robust approximate nearest neighbour search using Hierarchical Navigable Small World graphs

Originally published at https://www.farfetchtechblog.com on April 12th, 2023.
