How to Realize a Practical Similarity Search with Elasticsearch

The previous post described a latency issue with the kNN feature of the Amazon Elasticsearch Service. Thanks to support from the Amazon ES team, the practical configurations and constraints have become clear.

Takuma Yamaguchi (Kumon)
Apr 14, 2020 · 4 min read

Introduction

In the previous post, I put around 1M vectors with 1,280 dimensions into elasticsearch. However, the query latency was far from practical: roughly 15 seconds per query request. The Amazon ES team advised me on how to configure elasticsearch properly for this purpose.

For software engineers who are familiar with elasticsearch, this may all be simple and basic. Since I had never used elasticsearch for commercial purposes, the advice was very helpful for me.

Segments

Looking at the beginning of the response from elasticsearch in the previous post, the number of hits is 1203. This means there were roughly 120 segments in the elasticsearch index: for each segment, k candidate vectors are extracted, and then the most similar k of those candidates are chosen as the final result. In the experiment, k was 10.

{
  "took": 15471,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1203,
      "relation": "eq"
      ...

More precisely, an elasticsearch index is divided into shards, each of which is a Lucene index. Each Lucene index in turn consists of multiple segments. Since the segments are scanned sequentially during a search, the fewer segments there are, the better the performance.

By default, the number of shards for each index is 5, so 5 segments in total, one per shard, is the ideal number.
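The current segment count can be checked with the _cat segments API, which prints one row per segment (assuming the index is named imsearch, as in the rest of this post):

GET /_cat/segments/imsearch?v

Counting the rows per shard shows how far the index is from the ideal of one segment per shard.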

Configurations

To improve indexing performance, the following settings were added.

  • index.refresh_interval = -1 (default: 1 sec)
  • index.translog.flush_threshold_size = '10gb' (default: 512mb)
  • index.number_of_replicas = 0 (default: 1)

In addition to these index settings, increasing the number of indexing threads is also recommended.
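As a concrete illustration, the settings above and the kNN indexing thread count (knn.algo_param.index_thread_qty, which appears again later in this post) might be applied as follows with the Python elasticsearch client. This is only a sketch: the endpoint is a placeholder, authentication is omitted, and I am assuming the thread count is exposed as an ordinary dynamic cluster setting.

from elasticsearch import Elasticsearch

# Placeholder endpoint; replace with the actual Amazon ES domain endpoint (auth omitted).
es = Elasticsearch("https://my-es-domain.example.com:443")

# Index-level settings from the list above, applied to an existing index named "imsearch".
es.indices.put_settings(index="imsearch", body={
    "index.refresh_interval": "-1",
    "index.translog.flush_threshold_size": "10gb",
    "index.number_of_replicas": 0,
})

# Cluster-level kNN setting: the number of threads used to build the HNSW graphs.
# 2 matches the number of cores of the r5.large instance used at this point.
es.cluster.put_settings(body={
    "persistent": {"knn.algo_param.index_thread_qty": 2}
})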

The following is the updated code for indexing vectors.

https://gist.github.com/kumon/cd9efa1826a8a18e850219ac932a68fd#file-data_insert-py
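Since the gist is only linked here, the following is a rough sketch of what such indexing code can look like, assuming the Open Distro kNN mapping (knn_vector), the index name imsearch, a vector field simply called vector, dimension 1,280, and M = 48. The ef_construction value and the random vectors are purely illustrative.

import numpy as np
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-es-domain.example.com:443")  # placeholder endpoint, auth omitted

dim, m, ef_construction = 1280, 48, 512  # M from the experiment; ef_construction is illustrative

# Create the index with a knn_vector field; the indexing-speed settings
# from the previous section can also be set here at creation time.
es.indices.create(index="imsearch", body={
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.m": m,
            "knn.algo_param.ef_construction": ef_construction,
            "refresh_interval": "-1",
            "translog.flush_threshold_size": "10gb",
            "number_of_replicas": 0,
        }
    },
    "mappings": {
        "properties": {
            "vector": {"type": "knn_vector", "dimension": dim}
        }
    },
})

# Bulk-insert the vectors (random here; image feature vectors in the real setup).
def gen_actions(n):
    for i in range(n):
        vec = np.random.rand(dim).astype("float32")
        yield {"_index": "imsearch", "_id": i, "_source": {"vector": vec.tolist()}}

helpers.bulk(es, gen_actions(1_000_000), chunk_size=500)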

Merge segments

After running the code, the number of segments was 27, not 5. We have to merge the segments with a force merge:

POST /imsearch/_forcemerge?max_num_segments=1

The merge takes time, so we keep checking every few minutes until the request finally returns a 200 response. Ultimately, the number of segments will be 5, one per shard.
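Because the force merge can easily outlive an HTTP request timeout, one way to script this is to send the request with a generous timeout and then poll the segment count, roughly like this (placeholder endpoint again):

import time
from elasticsearch import Elasticsearch

# A long client timeout so the force-merge request itself is not cut off.
es = Elasticsearch("https://my-es-domain.example.com:443", timeout=3600)

# Merge the segments of each shard down to one.
es.indices.forcemerge(index="imsearch", max_num_segments=1)

# Poll until the whole index reports 5 segments (one per shard).
while True:
    segments = es.cat.segments(index="imsearch", format="json")
    print(f"current segment count: {len(segments)}")
    if len(segments) <= 5:
        break
    time.sleep(60)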

Refresh

Since refresh is disabled by index.refresh_interval = -1,

POST /imsearch/_refresh

has to be called. Then, finally, the similarity search is available.
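With the Python client this is a one-liner; restoring the default refresh interval afterwards is optional but reasonable once bulk indexing is finished:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-domain.example.com:443")  # placeholder endpoint

# Make the merged segments visible to search.
es.indices.refresh(index="imsearch")

# Optionally restore the default refresh interval now that bulk indexing is done.
es.indices.put_settings(index="imsearch", body={"index.refresh_interval": "1s"})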

However, even with these updates, the response time was still over 7 seconds. Although that is better than 15 seconds, it is still not practical. As before, warm-up requests did not help. As we can see in the response below, the number of hits is 50, which means each shard now has only one segment (5 segments × 10 candidates each = 50 hits).

{
  "took": 7269,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 50,
      "relation": "eq"
    },
    ...

Memory

The instance type for the elasticsearch cluster was r5.large, which has 2 CPU cores and 16GB of memory. 50% of the memory (up to 32GB) is allocated to elasticsearch itself, and part of the remaining memory is available for the similarity search. To keep the system running stably, a circuit breaker limits the memory the kNN feature can use; it is configurable via knn.memory.circuit_breaker.limit and the default value is 60%. So only 16GB * 50% * 60% = 4.8GB was available for the search.

I tried to index one million vectors with a dimension of 1,280. HNSW requires 4 * d + 8 * M bytes per vector just to store the graph, and the Amazon ES team recommended having 1.5 times more memory than that amount. Since M was 48 in the experiment, (4 * 1,280 + 8 * 48) * 1.5 * 1M ≈ 7.7GB is needed for this dataset.
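The same arithmetic in code form, with the 1.5 factor being the headroom recommended by the Amazon ES team:

# HNSW graph memory estimate: 1.5 * (4 * d + 8 * M) bytes per vector.
d, m, n = 1280, 48, 1_000_000
required_gb = 1.5 * (4 * d + 8 * m) * n / 1024**3
print(f"required:  {required_gb:.1f} GB")   # ~7.7 GB

# Memory available for kNN on r5.large:
# 16 GB total * 50% left after the elasticsearch heap * 60% circuit breaker limit.
available_gb = 16 * 0.5 * 0.6
print(f"available: {available_gb:.1f} GB")  # 4.8 GB -> not enough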

I created another elasticsearch cluster with r5.xlarge instances (4 CPU cores and 32GB of memory), i.e. 32GB * 50% * 60% = 9.6GB is available for the similarity search. Since the number of cores is 4, knn.algo_param.index_thread_qty was updated to 4 as well.

After inserting the vectors and updating the index with the same procedure as above, 1,000 query requests were sent to the cluster. Finally, the similarity search ran with a practical response time: 14ms on average!
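For reference, a benchmark along those lines can be written as below; the knn query body follows the Open Distro kNN plugin syntax, and the field name vector matches the indexing sketch above (the query vectors are random placeholders):

import time
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-domain.example.com:443")  # placeholder endpoint

dim, k, n_queries = 1280, 10, 1000
latencies = []
for _ in range(n_queries):
    query_vec = np.random.rand(dim).astype("float32").tolist()
    body = {"size": k, "query": {"knn": {"vector": {"vector": query_vec, "k": k}}}}
    start = time.time()
    es.search(index="imsearch", body=body)
    latencies.append(time.time() - start)

print(f"average latency: {1000 * sum(latencies) / len(latencies):.1f} ms")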

ANN with nmslib

In the previous experiment, I used faiss for the HNSW benchmark, since that is the library I usually use for ANN. This time I also used nmslib directly to compare search performance. Because an r5.xlarge instance was needed to meet the elasticsearch memory requirement, the same instance type was used for this benchmark as well.

[Figure: search latency with faiss and nmslib]

Although the average response time with nmslib was about 2 times faster than with the elasticsearch kNN, it is not an apples-to-apples comparison. In many cases, the elasticsearch latency would still be acceptable.
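For completeness, the nmslib side of such a benchmark looks roughly like the following: an HNSW index with cosine similarity and M = 48 to stay comparable to the elasticsearch setup. The data here is random and the efConstruction/efSearch values are illustrative (the real experiment used about 1M image feature vectors).

import numpy as np
import nmslib

dim, n, k = 1280, 100_000, 10
data = np.random.rand(n, dim).astype("float32")  # placeholder for the real image features

# Build an HNSW index with parameters comparable to the elasticsearch setup.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"M": 48, "efConstruction": 512}, print_progress=True)
index.setQueryTimeParams({"efSearch": 512})

# Batch query: each result is a (ids, distances) pair for one query vector.
queries = np.random.rand(1000, dim).astype("float32")
results = index.knnQueryBatch(queries, k=k, num_threads=4)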

Conclusion

Thanks to the Amazon ES team, I finally achieved similarity search with a practical response time by reducing the number of segments and allocating sufficient memory. HNSW is a wonderful ANN algorithm. However, since it keeps the original vectors, its memory consumption is higher than that of other algorithms such as Inverted File with Product Quantization (IVFPQ). If the vector dimension is large, say over a thousand, it seems better to consider dimensionality reduction before putting such vectors into elasticsearch. HNSW is known as one of the best ANN algorithms for million-scale data, and using an elasticsearch cluster that consists of multiple servers may enable practical ANN search even for billion-scale data.
