Semantic Vector Search: Tales from the Trenches

Marco Angel Bertani-Økland
Published in Compendium · 10 min read · Sep 8, 2020

This blog post describes our recent experience implementing semantic vector search in a customer case. Semantic vector search is a way to spice up your search with some machine learning magic. We report our findings on the advantages and disadvantages of this technique, and the accuracy gains you can expect.

TL;DR

Semantic vector search (also known as neural search or neural information retrieval) is a good technique to have in your toolbox. It will definitely help you squeeze out more performance points from your working solution, but it comes at a cost. Make sure you carefully analyze the trade-offs for your use case to decide whether this solution is worth the additional complexity.

Combining machine learning with traditional information retrieval algorithms seems to work better than each approach alone.

The customer

Our customer is one of the largest ship chandlers/ship suppliers in the world and coordinates global activities through regional centers in Europe, the Far East, the Middle East and North America. Their core business is to provide a 24/7/365 service for every marine, offshore and navy operation, including land operations. They are a full-service provider, handling owners’ goods, shipping, airfreight and related marine services. Their mission is to make it easy for their customers to receive their supplies, wherever they are needed, efficiently and at the best possible price.

The problem

The company has a product catalogue with several thousand products. Users submit a “request for quote” (RFQ), which is essentially a shopping list: each line item has a quantity and a textual description. Given an RFQ, the company has to match the textual description of each item to a product in its catalogue. Is there a way we could automate this process?

The promise of semantic vector search

So why would we want to implement semantic vector search at all? Well, classic search implementations start with keyword search. To improve search relevance, one starts to manually fine-tune the search ingestion pipeline (analyzers, token filtering, synonyms) or the search queries (boosts, query expansion, domain specific business rules, etc). But these approaches have their shortcomings, and scaling the manual process can become difficult.

Furthermore, one of the hardest challenges of implementing a good search solution is to understand what your users mean (query intent) and expect when searching. Advanced techniques like query classification, semantic query parsing, knowledge graphs, and personalization can help you. Powering your search with AI techniques can help you both automate the relevance tuning process (so that it scales) and better understand your users.

Semantic vector search is an example of AI powered search. It can enable you to encode documents (even in different languages), pictures, and videos in the same space, and let you search across these types. By encoding queries in the same space, it allows you to search by what you mean, and not only by what you type.

The approach

There are several ways to approach the problem, but the simplest baseline we could produce is to index the product catalogue in a search engine like Elasticsearch, define a good query function and look at the top result. From the historical data, we could build a benchmark dataset to measure the accuracy of the top result, using the users’ RFQs as queries. Building this dataset is a complex process and deserves its own blog post.

Once we have that baseline, there are several ways to build on and improve the solution, like building a better signal from your data, using synonyms specialized for the domain in question, trying more complex queries in Elasticsearch and so on. But what we really wanted to find out was the best way to use Semantic Vector Search for this use case. More concretely:

  • Is it better to just use Semantic Vector Search without the traditional BM25/TF-IDF algorithms?
  • Or should we use the traditional BM25 and use Semantic Vector Search for rescoring the top X results for a query?
  • What are the performance gains for each approach? What are the costs of implementing each solution?

What is Semantic Vector Search?

Search, or information retrieval, is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Typically, a user enters a query, defined by keywords, that represents their information need. The query does not (usually) uniquely identify a single document; it matches several documents, with different degrees of relevance.

Documents are indexed into a searchable database, in order to retrieve them at query time. One way of indexing documents is by using a data structure called the inverted index.

Figure 1. Inverted index.
© https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
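
In code, a toy inverted index is just a mapping from each term to the set of documents that contain it. Here is a minimal Python sketch with made-up documents:

```python
from collections import defaultdict

# Toy corpus: document id -> text (hypothetical example data)
docs = {
    1: "radiator paint brush 50mm",
    2: "angle brush 25mm wooden handle",
    3: "radiator white paint 1l",
}

# Build the inverted index: term -> set of ids of documents containing that term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Querying is a lookup per term plus a set intersection across the query terms
query_terms = "radiator brush".split()
hits = set.intersection(*(inverted_index[t] for t in query_terms))
print(hits)  # {1}
```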

Another way to do this is by embedding the documents into a vector space. You convert a document into a sequence of numbers (a vector) in a space with a certain structure, in a way that preserves the semantics of the data. The important bit is that documents with similar meaning are mapped to vectors close to each other. The Illustrated Word2vec is a nice reference if you want to grasp this transformation in more detail. We call these vectors the semantic vector representation of the given documents. There are of course several ways to produce embeddings from documents: you can encode subwords, words, sentences, paragraphs and whole documents. Choosing the right embedding technique is important and depends on your use case.

Semantic Vector Search is simply looking for the semantic vectors closest to a given query in this vector space. This requires converting the query into a vector as well, with the same embedding, before you can do the search.
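
As a minimal sketch (using the sentence-transformers library and the roberta-large-nli-stsb-mean-tokens model described below; the product descriptions are made up for illustration), the whole idea fits in a few lines:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

# Hypothetical product descriptions standing in for the catalogue
documents = [
    "brush radiator angle (dog leg) 50 mm jordan qa",
    "shampoo head & shoulders 200 ml",
    "paint white marine grade 1 l",
]
doc_vectors = model.encode(documents)      # shape: (n_docs, 1024)

query = "jordan radiator paint brush 50mm"
query_vector = model.encode([query])[0]

# Cosine similarity between the query vector and every document vector
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])   # the closest document by meaning
```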

Testing semantic search

To test our hypothesis, we need to:

  1. Create a benchmarking dataset Benchmark from the historical records.
  2. Clean the data (lowercasing, removing stop words, removing clutter words).
  3. Remove duplicates. We ended up with 588,500 records. The fields are:
    - query: the textual description of the item from the user
    - category: the category of the item, Food or Technical.
  4. Add an embeddings field computed from the query field. We used the sentence-transformers roberta-large-nli-stsb-mean-tokens language model to transform the query field into a vector. This process can be time consuming (around 10 hours on a Linux 64-bit OS with 8 CPU cores, 32 GB RAM and kernel version 5.3.0-62-generic).
  5. Create the product catalogue ProductCatalogue dataset.
  6. Clean the product catalogue (lowercasing, removing stop words, removing duplicates). The fields are: product_id, description, category.
  7. Add an embeddings field computed from the description field. We used the same sentence-transformers roberta-large-nli-stsb-mean-tokens model to transform the description field into a vector.
  8. Index the product catalogue into Elasticsearch. To have some control over the reproducibility of the scores, you need to set the number of primary shards to 1 and the number of replicas to 0, and sort the search results by the description field. You will still observe variations in the score if you reindex your dataset, but these will be smaller. (A sketch of the embedding and indexing steps follows this list.)
  9. Define the queries for each approach: Semantic Vector Search, the baseline and baseline with semantic vector search rescoring.
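
To make steps 4, 7 and 8 concrete, here is a hedged sketch of how the catalogue could be embedded and indexed. The index name, field names and client setup are our own simplifications, not the exact production code:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

# One primary shard and no replicas, for (more) reproducible scoring
es.indices.create(
    index="product-catalogue",
    body={
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "product_id": {"type": "keyword"},
                "description": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},  # used as a tie-breaking sort key
                },
                "category": {"type": "keyword"},
                "embeddings": {"type": "dense_vector", "dims": 1024},
            }
        },
    },
)

def actions(products):
    """Yield bulk-index actions, computing the embedding for each product."""
    for p in products:
        p["embeddings"] = model.encode(p["description"]).tolist()
        yield {"_index": "product-catalogue", "_id": p["product_id"], "_source": p}

# Hypothetical catalogue entry; in reality this is the cleaned ProductCatalogue dataset
products = [{"product_id": "A1", "description": "brush radiator angle (dog leg) 50 mm jordan qa", "category": "Technical"}]
bulk(es, actions(products))
```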

Semantic Vector Search (SVS)

Here we use the script_score query from Elasticsearch to implement the Semantic Vector Search. With the Benchmark dataset, we use the following parameters:
_query_embeddings: Benchmark.embeddings
_query_sort_column: Column to sort results by.

Script score to implement semantic search. Use sort for reproducibility of the ranking.
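
A sketch of such a query, continuing from the indexing sketch above (the field names and the sort column are illustrative): the cosineSimilarity function requires a dense_vector field, and the + 1.0 keeps scores non-negative.

```python
# es and model are the Elasticsearch client and SentenceTransformer from the indexing sketch
query_text = "jordan radiator paint brush 50mm -2\""
query_embeddings = model.encode(query_text).tolist()   # _query_embeddings

svs_query = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_embeddings, 'embeddings') + 1.0",
                "params": {"query_embeddings": query_embeddings},
            },
        }
    },
    # Secondary sort on a keyword field (_query_sort_column) makes ties reproducible
    "sort": ["_score", {"description.keyword": "asc"}],
}
results = es.search(index="product-catalogue", body=svs_query, size=10)
```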

Baseline

To implement the baseline search, we use the multi_match query from Elasticsearch. With the Benchmark dataset, we use the following parameters:
_query: Benchmark.query
_query_sort_column: Column to sort results by.

Multi_match query to implement the baseline search. Use sort for reproducibility of the ranking.
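
A sketch of the baseline query, again with illustrative field names and boosts (as noted below, it is a boolean query wrapping a multi_match query):

```python
# query_text and es come from the earlier sketches
baseline_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": query_text,                      # _query = Benchmark.query
                        "fields": ["description^2", "category"],  # illustrative fields and boosts
                    }
                }
            ]
        }
    },
    # Tie-break on a keyword field (_query_sort_column) for reproducibility
    "sort": ["_score", {"description.keyword": "asc"}],
}
results = es.search(index="product-catalogue", body=baseline_query, size=10)
```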

This is in fact a boolean query wrapping a multi_match query. In our final implementation, we use several fields to build the relevance signal for search. This query can be rewritten as a simple multi_match query.

Baseline with SVS rescoring: baseline_svs

Here we use a multi_match query with a rescore clause to recalculate the ranking of the baseline query using semantic vector search.
Parameters:
_query: Benchmark.query
_query_embeddings: Benchmark.embeddings to do the rescoring.

Rescoring with Semantic Vector Search. Unfortunately it is not possible to sort when using a rescore query in Elasticsearch. Items with the exact same score will not have a reproducible ranking.
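
Roughly, the query looks like the sketch below (field names are ours; the window size of 50, the rescore_query_weight of 1.5 and the multiply score mode are the values discussed next). Note that there is no sort clause, for the reason just mentioned.

```python
# query_text, query_embeddings and es come from the earlier sketches
baseline_svs_query = {
    "query": {
        "multi_match": {
            "query": query_text,                      # _query = Benchmark.query
            "fields": ["description^2", "category"],  # illustrative fields and boosts
        }
    },
    "rescore": {
        "window_size": 50,  # rescore only the top 50 baseline hits
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": "cosineSimilarity(params.query_embeddings, 'embeddings') + 1.0",
                        "params": {"query_embeddings": query_embeddings},  # _query_embeddings
                    },
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 1.5,   # boost the semantic score ...
            "score_mode": "multiply",      # ... and multiply it with the baseline score
        },
    },
}
results = es.search(index="product-catalogue", body=baseline_svs_query, size=10)
```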

We apply the baseline query, take the top 50 results and apply a rescore based on vector similarity search (this is not part of the open source functionality of Elasticsearch; you need at least a basic license to use this feature). You will notice the “rescore_query_weight” parameter: we scale the score of the rescore query by a factor of 1.5 and multiply it with the previous score (via the “score_mode” parameter). The reason is that we want a big gap in the scores among the top results. One could definitely do a hyper-parameter search to find better values for these parameters, but this requires a more complex setup where you split your benchmark dataset into train/test parts in order to evaluate the results correctly.

Example queries

Let’s look at an example query to get a feeling for the results we can expect from each technique. We perform the query:

jordan radiator paint brush 50mm -2"

Where the correct hit is:

brush radiator angle (dog leg) 50 mm jordan qa

Here we present the top 10 results by the three techniques:

Semantic Vector Search (SVS). The correct result has rank 47 and is thus not shown in this list.
Baseline search. The correct result has rank 7. Observe that items 7 and 8 have the same score.
Rescore with SVS. The correct result has rank 7. Observe that items 7 and 8 have the same score.

What we see from the results is that SVS failed to retrieve the correct document within the top 10. Baseline and baseline_svs behaved similarly for this query, with a few items changing the order in the top 5.

The results

The following table summarizes the results of the experiments. It was obtained by running each experiment five times and averaging the results. Each experiment consists of indexing the data in Elasticsearch, running the queries for each algorithm and calculating the statistics. We also show the standard deviation to give an idea of the spread of the values. We tried to account for reproducibility (reproducible sorting, merging all Lucene segments into one, testing across different machines), but in the end we still observe different results when reindexing the data into Elasticsearch. It is a known issue that scores are not reproducible.

Table 1. Average results with standard deviation for each technique. The results represent the percentage of queries where the correct answer was in the top k result set, for k = 1, 3, 5 and 10. We present the average “avg” and the standard deviation “stddev” for 5 runs of each experiment.
Table 2. Average runtime on a Linux 64-bit OS with 8 CPU cores, 32 GB RAM and kernel version 5.3.0-62-generic. We used a single-node Elasticsearch 7.6.1 with a 3 GB heap in a Docker container on an SSD disk, with 1 primary shard and 0 replicas for the index.

From the tables we can make the following remarks:

  1. The variations between runs are small, and they diminish as the size k of the top-k result set grows. We believe this is because few documents get equal scores for a given query in our benchmark dataset.
  2. Semantic Vector Search (SVS) alone performs worse than the simple baseline (Baseline). Empirical experience shows that semantic search has a higher false positive probability than BM25. This effect is more extreme when the dataset consists of lots of unrelated documents, like the product catalogue we are working with.
  3. Adding a rescoring step with Semantic Vector Search (Baseline_svs) gives a solid increase in precision at k, for k = 1, 3, 5 and 10.
  4. SVS is quite costly, as currently implemented in Elasticsearch. Using annoy, faiss or milvus may improve performance (at the cost of functionality, since these tools only offer similarity search, and their APIs are not as rich as Elasticsearch’s); a minimal sketch of this alternative follows this list. There is also a considerable cost for doing the rescoring over the baseline, depending on the size of the rescoring window, as Table 2 shows.
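
For reference, here is a minimal sketch of what a dedicated similarity-search library gives you, using faiss with an exact inner-product index over normalized vectors (i.e. cosine similarity). It reuses the doc_vectors and query_embeddings from the earlier sketches:

```python
import faiss
import numpy as np

# doc_vectors: the (n_docs, 1024) matrix of catalogue embeddings from the earlier sketch
vectors = np.array(doc_vectors, dtype="float32")
faiss.normalize_L2(vectors)                  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
index.add(vectors)

query = np.array([query_embeddings], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)        # top 10 nearest catalogue vectors
```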

We can conclude that, for this dataset at least, rescoring with Semantic Vector Search gives the best results.

One might wonder whether such performance gains carry over when building a more complex solution. To test this, we also benchmarked the final solution we delivered to the customer against the same solution with an SVS rescoring step added. Here are the results:

Table 3. Average results for the final version. The results represent the percentage of queries where the correct answer was in the top k result set, for k = 1, 3, 5 and 10. We present the average “avg” and the standard deviation “stddev” for 5 runs of each experiment.
95% Confidence interval for “% top 1” final version: [60.228, 60.809]
95% Confidence interval for “% top 1” final version with SVS: [61.055, 61.854]

As we can see, there is a significant increase in performance (the confidence intervals do not overlap), but the gains are not as dramatic as in the baseline case. We hypothesize that this is due to the optimizations already present in the final version. It is worth mentioning that we have not done any fine-tuning of the language model for this domain; this is something one could have tested with more time available.
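
For completeness, a 95% confidence interval like the ones above can be obtained from the five runs of an experiment with a t-interval; this is a sketch of one standard way to do it (the per-run values below are hypothetical, not our measurements):

```python
import numpy as np
from scipy import stats

# Hypothetical per-run "% top 1" values for one experiment (5 runs)
runs = np.array([60.4, 60.6, 60.5, 60.5, 60.6])

mean = runs.mean()
sem = stats.sem(runs)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(runs) - 1, loc=mean, scale=sem)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```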

A final remark about the benchmarking dataset: it is not possible to build a solution with 100% precision, because the historical data is noisy. An example is a query where “hygiene product nr 2” has “head & shoulders shampoo 200 ml” as its correct match. Further work is needed to curate and clean the dataset so that it only contains queries that are meaningful, in the sense that they carry enough information to find the correct match. Unfortunately, this requires a lot of manual work and doesn’t scale.

Conclusions

Semantic Vector Search is a new and exciting technique which is starting to mature for production scenarios. Here are some of our lessons learned through this customer case:

  • A simple search query goes a long way. Don’t overcomplicate your search queries from the start. Start with a simple baseline and add complexity gradually.
  • Create a benchmark dataset. It will help you be data driven in your performance optimizations. But still, always keep in mind what part of the experience your benchmark is not covering.
  • Reproducibility is hard. Always reindex your data when benchmarking with Elasticsearch.
  • Semantic Vector Search is costly. It takes time to build the vector embeddings, and it takes time to serve the results (since you also need to convert the incoming queries to vectors). You need a good understanding of your latency constraints to scale serving accordingly. Changes in your ingestion pipeline will also take more time, since the embeddings must be recalculated.
  • Relevance tuning is hard with Semantic Vector Search. At the moment, it is hard to explain the ranking results from Semantic Vector Search, which matters when you are trying to fix relevance issues. For plain BM25, by contrast, Elasticsearch can explain how a score was computed (see the sketch after this list).
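
As an illustration (index and field names as in the earlier sketches), the explain API breaks a BM25 score down into its per-term components:

```python
# es is the Elasticsearch client from the earlier sketches
explanation = es.explain(
    index="product-catalogue",
    id="A1",  # the document whose ranking we want to inspect
    body={"query": {"match": {"description": "radiator brush"}}},
)
# The response details the tf/idf components behind the BM25 score
print(explanation["explanation"])
```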

From our experiments with this dataset, the usual BM25 combined with Semantic Vector Search rescoring was the best-performing solution. This approach is relatively straightforward to add to your current search implementation. But should you do it? Well, as always, it depends! I guess it all comes down to questions like:

  • How many expected RFQs will you win by using this technique?
  • How much of a performance gain do you need to justify the increased complexity of your search pipeline?
  • How important is it for you to be able to explain the results of your search engine?
  • What kind of control do you need over the search relevancy for your users?
  • Can your users tolerate the increase in query execution time?

Well, that sums it up. If you want to connect with me and discuss more about search or machine learning, don’t hesitate to reach out.

Acknowledgments

I would like to thank Josephine Honoré for helping me to reproduce and discuss the results. I also want to thank Morten Forfang, David Skålid Amundsen and an anonymous manager from our customer, for proof-reading and commenting on earlier drafts of this document.
