Scalable Semantic Search (S3)

Subhadip Paul
Walmart Global Tech Blog
Aug 22, 2023

A lightweight library for carrying out real-time semantic search


Introduction

Text search can be of two types: syntactic and semantic. Semantic search captures the meaning of a given text (the query) and finds contextually similar results. Searching for semantically matching results in a large corpus of text is incredibly challenging because it cannot be achieved by keyword search alone; it requires computing similarities between high-dimensional vector embeddings. Doing this in real time for millions of documents is algorithmically complex and computationally expensive, and the amount of memory required to store such a huge embedding corpus is an additional obstacle. Consequently, we developed a Python library in-house to solve both the scale and the efficiency problems of semantic search. In this blog we present "S3", a lightweight package that can search across millions of documents in real time. It also keeps memory usage in check by applying an efficient dimensionality reduction technique to the vector embeddings while preserving the contextual meaning of the documents. This blog discusses the algorithmic and mathematical approach of the library in depth.

Algorithm

Fig 1: Algorithmic flowchart of the library

Vectorize the Data — The raw text of the corpus needs to be converted to vectors before a semantic search can be performed. The input corpus is cleaned by removing leading and trailing whitespace and punctuation and converting the text to lower case. The cleaned text is then converted to fixed-length vectors using a transformer-based language model such as SBERT, with each document encoded as a vector of length 384. Users can leverage any model from Hugging Face for this task by specifying its name in the model_path parameter.
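To make the step concrete, here is a minimal sketch of the vectorisation using the sentence-transformers package; the model name and the cleaning rules are illustrative assumptions, not the library's exact internals.

import string
from sentence_transformers import SentenceTransformer

def clean(text: str) -> str:
    # Strip surrounding whitespace, drop punctuation, lower-case
    return text.strip().translate(str.maketrans('', '', string.punctuation)).lower()

# all-MiniLM-L6-v2 is one SBERT model that outputs 384-dimensional vectors
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
corpus = ["Track my recent order", "Where is my package?"]
embeddings = model.encode([clean(t) for t in corpus])  # shape: (len(corpus), 384)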

Random Projection — The corpus size can run into millions, and since each document embedding has 384 dimensions, storing and processing the data becomes challenging. To reduce this overhead we apply a dimensionality reduction technique that shrinks the embeddings from 384 dimensions to a specified number of dimensions d. We initialise a Gaussian random projection matrix of shape 384 x d, and each embedding is multiplied by this matrix to get a new vector of length d. In random projection (RP), the original high-dimensional data is projected onto a low-dimensional subspace by a random matrix whose columns have unit length. RP has been found to be accurate yet computationally light: the dimensions and distribution of the projection matrix are controlled so as to approximately preserve the pairwise distances between any two samples of the dataset (the Johnson-Lindenstrauss lemma), which makes it a suitable approximation technique for distance-based methods. The projected_dim parameter specifies the desired number of reduced dimensions; the trade-off is between the memory required to store the embeddings and how faithfully their contextual meaning is preserved. The projection matrix can be stored using the projection_matrix_path parameter, and a single projection matrix must be created for each index.
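As a sketch of what this step computes (the seeding and the exact scaling here are our own illustrative choices, not the library's internals): a Gaussian matrix with entries drawn from N(0, 1/d) preserves pairwise distances in expectation.

import numpy as np

d = 64                                    # projected_dim
rng = np.random.default_rng(seed=42)      # fix the seed so the same matrix can be reused

# Entries ~ N(0, 1/d) so that projected pairwise distances are preserved in expectation
projection_matrix = rng.normal(0.0, 1.0 / np.sqrt(d), size=(384, d))

# Stand-in for the 384-dimensional SBERT embeddings from the previous step
embeddings = rng.normal(size=(1000, 384))

low_dim = embeddings @ projection_matrix  # (n, 384) @ (384, d) -> (n, d)
np.save('random_projection_matrix.npy', projection_matrix)  # reuse for every query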

Indexing the Data — We use Annoy to index our text embeddings. Annoy builds a large number of trees, each of which partitions the vector space into roughly homogeneous subspaces. This is done by taking two random data points and generating the hyperplane equidistant from them, which divides the space into two subspaces; the partitioning continues recursively until a stopping criterion is met, such as a minimum number of data points per node. The main idea is that similar items end up in the same node or nearby in the same subspace, and the recursive partitioning yields a rooted hierarchical binary tree. The n_trees parameter specifies the number of trees to construct; the trade-off here is accuracy versus the response time of a query. index_save_path provides the path at which the created index file is saved.
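The following sketch shows the Annoy calls this step corresponds to; the distance metric and file names are assumptions for illustration.

import numpy as np
from annoy import AnnoyIndex

d = 64
low_dim = np.random.default_rng(0).normal(size=(1000, d))  # stand-in for the projected vectors

index = AnnoyIndex(d, 'angular')     # angular distance approximates cosine similarity
for i, vec in enumerate(low_dim):
    index.add_item(i, vec)
index.build(100)                     # n_trees; more trees raise recall but slow the build
index.save('test_index.ann')

# At query time: ids of the 5 nearest neighbours of a projected query vector
neighbour_ids = index.get_nns_by_vector(low_dim[0], 5)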

Fig 2: Tree-based Annoy index of a corpus

Mapping of Metadata — The index file created in the previous step contains only the low-dimensional projected vectors. In order to return the actual text as a search result, we need a mapping between index positions and text. The library takes care of this by creating a serialised mapping object between each vector's index and the corresponding text. The map_dict_path parameter specifies the location at which the mapping object is stored.
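A plain dict serialised with pickle is one way to realise such a mapping; the library's actual serialisation format may differ, so treat this as a sketch.

import pickle

corpus = ["Track my recent order", "Where is my package?"]
map_dict = {i: text for i, text in enumerate(corpus)}  # Annoy item id -> original text

with open('map_dict', 'wb') as f:
    pickle.dump(map_dict, f)

# At search time, translate the integer ids returned by the index back into text:
# results = [map_dict[i] for i in neighbour_ids]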

Building Index — Builds an index from the raw text files by converting them to vectors, and creates the mapping file between raw text and index positions.

Adding Items to Existing Index — New texts can be added to an already created index. Once added, the texts become searchable for any incoming query.

Searching the Index — The created index can be used to search for semantically similar items in real time.

Installation:

pip install https://repository.walmart.com/content/repositories/pangaea_releases/com/ebs/nextech/cerebro/semantic_search/1.0.20-py3-none-any/semantic_search-1.0.20-py3-none-any.whl

Specify the configuration file:

The configuration file is a schema for specifying the parameters needed to vectorise the data and build an index:

from semantic_search.algorithm.schema import SearchConfiguration

search_conf = SearchConfiguration(
    model_protocol = 'file',              # 'file' for a local setup, 'abfs' for an Azure run
    model_credentials = {},               # Dictionary of Azure blob credentials {account_name: account_key}
    model_path = 'ms-marco-MiniLM-L-6-v2.tar.gz',  # Sentence Transformer model name for text encoding
    model_type = 'biencoder',             # Type of model; accepts 'biencoder' and 'crossencoder'
    projection_matrix_path = '/random_projection_matrix',  # Path to save the generated random projection matrix
    projected_dim = 64,                   # Dimensions of the random projection matrix
    input_file_path = 'input.csv',        # File path of the input data in CSV format
    index_file_protocol = 'file',         # 'file' for a local setup, 'abfs' for an Azure run
    index_file_credentials = {},          # Dictionary of Azure blob credentials {account_name: account_key}
    index_save_path = 'test_index',       # Path to save the indexed data
    map_dict_path = 'map_dict',           # Path to save the mapping of data to its respective indexes
    top_n = 5,                            # Number of search results to retrieve
    n_trees = 100                         # Number of Annoy index trees to create (default 100)
)

Building Index:

import semantic_search.algorithm.utilities as utils

utils.create_index(search_conf)  # Pass the search configuration object

Search Index:

import semantic_search.algorithm.utilities as utils

query = "Placeholder for search query"
utils.search(query, search_conf)  # Pass the query to be searched and the search configuration object

Adding Items to Existing Index:

import semantic_search.algorithm.utilities as utils

add_str = "Add this text to index"
utils.add_items(add_str, search_conf)  # Pass the string to be added to the index and the search configuration object

Performance of the Library

To measure the performance of our semantic search library, we deployed it as a microservice and used Locust, a scalable Python load-testing framework, to benchmark the API's performance for the parameters listed in the table below. As per these tests, the API can handle roughly 34 requests/sec on a machine with 6 cores and 6 GB of memory.

Table 1: Load-testing metrics for the deployed API
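For reference, a Locust test for such a service can be as small as the sketch below; the endpoint path and payload are hypothetical, not the deployed API's actual contract.

from locust import HttpUser, task, between

class SearchUser(HttpUser):
    wait_time = between(0.1, 0.5)  # pause between simulated requests per user

    @task
    def search(self):
        # Hypothetical search endpoint and payload
        self.client.post('/search', json={'query': 'order status'})

Running locust -f with this file and --host pointed at the service then drives concurrent users against the API.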

Accuracy Benchmarking

The study was done using the Quora duplicate-question dataset, which contains 400,000 question pairs (see references below). We created an index of all "question1" values and then searched every "question2", expecting the algorithm to return the corresponding "question1" when a duplicate exists. We achieved an overall accuracy of 69%.
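A sketch of how such a check can be scripted, assuming the Kaggle column names (question1, question2, is_duplicate), an illustrative file name, and that utils.search returns the matched texts:

import pandas as pd
import semantic_search.algorithm.utilities as utils

pairs = pd.read_csv('quora_train.csv')       # Kaggle Quora question-pairs file
dupes = pairs[pairs['is_duplicate'] == 1]

hits = 0
for _, row in dupes.iterrows():
    results = utils.search(row['question2'], search_conf)  # top_n mapped texts (assumed)
    if row['question1'] in results:
        hits += 1
print(f"Accuracy: {hits / len(dupes):.2%}")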

Conclusion

The library provides a cheap and efficient way of implementing semantic search. There is still room for improvement in memory usage when storing the metadata, and we are working on this by experimenting with different data structures. Real-time indexing is not currently possible with this library, since it would require concurrent writes to the same index file. Overall, S3 provides an efficient framework for building a semantic search system quickly and with minimal development effort.

Acknowledgements

  • @Arun Sasidharan Nair for contributing positive reviews and envisioning the deployment architecture.
  • @Panga Mahita for helping with processing enhancements to the library and putting a well-defined structure in place.

References

Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. Proceedings of KDD '01.

https://www.kaggle.com/c/quora-question-pairs
