Similarity Search and Similar Image Search in Elasticsearch

There are some benefits if a managed search engine supports efficient high-dimensional vector search. In this article, the kNN search feature in Amazon Elasticsearch Service is evaluated.

Takuma Yamaguchi (Kumon)
4 min read · Mar 28, 2020

Introduction

Recently, AWS published the blog post Build k-Nearest Neighbor (k-NN) similarity search engine with Amazon Elasticsearch Service, announcing support for lightweight similarity search based on the Non-Metric Space Library (NMSLIB), which provides an implementation of HNSW. The plugin that enables kNN is open sourced: k-NN proximity algorithm for Open Distro for Elasticsearch.

Built using the lightweight and efficient Non-Metric Space Library (NMSLIB), k-NN enables high scale, low latency nearest neighbor search on billions of documents across thousands of dimensions with the same ease as running any regular Elasticsearch query.

There are some benefits if a managed search engine supports efficient high-dimensional vector search. Even though the similarity search score cannot be merged with the scores of other field types, such as text and categorical fields, in the scoring function, a post filter on the documents returned from the similarity search is easy to implement.
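For instance, a kNN query followed by a standard post_filter clause might look roughly like the sketch below; the knn query shape follows the Open Distro k-NN plugin, and the field names feature and category are assumptions, not fields from this article's dataset.

# Sketch: similarity search followed by a post filter on a categorical field (field names assumed).
query_vector = [0.0] * 1280  # placeholder for a real feature vector

body = {
    'size': 10,
    'query': {
        'knn': {
            'feature': {'vector': query_vector, 'k': 10}
        }
    },
    # The kNN score cannot be blended with text/categorical scores,
    # but the returned neighbors can still be filtered afterwards.
    'post_filter': {
        'term': {'category': 'dress'}  # hypothetical categorical field
    }
}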

If the kNN search feature works well, I want to consider using it for some real-world applications.

The following are the conditions I hope to meet:

  • Vector dimensions: 1,000+
  • # of vectors: 1,000,000+
  • Server spec: AWS EC2 r5.large (2 Cores CPU, 16 GB Memory)
  • # of servers: 1
  • Latency: 10–20 ms

Data

DeepFashion and DeepFashion2 are used in this article. Some images are duplicated, but that is not a big issue for this evaluation. The total number of images is around 1M (991,257).

Feature Extraction

MobileNetV2 pre-trained on ImageNet is used as a simple feature extractor. The vector dimension is 1,280.

https://gist.github.com/kumon/4efb7133dc0e2474f05738bb6c895f2e#file-feature_extraction-py
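The gist itself is not embedded here, so as a rough, hedged sketch: feature extraction with Keras' MobileNetV2 and global average pooling (which yields the 1,280-dimensional vectors mentioned above) might look like the following. The image path is a hypothetical placeholder.

# Sketch of MobileNetV2 feature extraction (an assumption; the linked gist is the reference).
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing import image

# Global average pooling over the last convolutional block yields 1,280-d vectors.
model = MobileNetV2(weights='imagenet', include_top=False, pooling='avg')

def extract_feature(img_path):
    # Load and preprocess a single image to the network's expected input size.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return model.predict(x)[0]  # shape: (1280,)

feature = extract_feature('deepfashion/000001.jpg')  # hypothetical path
print(feature.shape)  # (1280,)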

Prior Evaluation

Faiss, a well-known similarity search library, also has an HNSW implementation, so let's check its performance and use it for parameter selection. Since Amazon Elasticsearch Service returns cosine similarity, the vectors are normalized so that their L2 norm is 1, and the returned L2 distance is transformed into cosine similarity in this code.

https://gist.github.com/kumon/eda990b7f41e0ed505bf3be2538ebded#file-pre_eval_faiss-py
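Again as a hedged sketch rather than the exact code in the gist: a Faiss HNSW evaluation along these lines, assuming the 1,280-d features are already stored in a NumPy file, could look like the following. The file name is a placeholder, and the parameter values mirror those reported in the next section.

# Sketch of the Faiss HNSW pre-evaluation (assumed layout; see the linked gist for the original).
import numpy as np
import faiss

d = 1280
features = np.load('features.npy').astype('float32')  # hypothetical file holding ~1M x 1280 vectors

# Normalize to unit L2 norm so that squared L2 distance maps to cosine similarity:
# ||a - b||^2 = 2 - 2 * cos(a, b) for unit vectors.
faiss.normalize_L2(features)

index = faiss.IndexHNSWFlat(d, 48)        # M = 48
index.hnsw.efConstruction = 128
index.add(features)

index.hnsw.efSearch = 256
query = features[:1]                      # use the first vector as a query
distances, ids = index.search(query, 10)  # top-10 nearest neighbors

# Faiss reports squared L2 distances, so cos = 1 - d/2 for unit vectors.
cosine_similarities = 1 - distances / 2
print(ids[0], cosine_similarities[0])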

Search Latency

In the experiment, very practical search latency was achieved. Although I didn't carry out a rigorous recall evaluation, the parameters M=48, efConstruction=128, and efSearch=256 returned practical results. Only one r5.large instance was used in this experiment.

(Figure: search latency with Faiss and NMSLIB)

If Amazon ES can deliver similar results, I really want to use it in some applications.

Similar Image Search Result

The leftmost column shows the query images; the other columns show the search results sorted by similarity. The results are very interesting and impressive. Even though the ImageNet pre-trained model is not optimized for fashion images, the results seem practical.

Amazon Elasticsearch Service

Elasticsearch instances can be launched with a few clicks through the AWS console, and there is no special option needed to enable the kNN feature. It takes 10–15 minutes for the service to become available. As in the prior evaluation above, only one r5.large instance was used.

Data Insert

It takes around one hour to insert the 1M vectors into the Elasticsearch server.

https://gist.github.com/kumon/cd9efa1826a8a18e850219ac932a68fd#file-data_insert-py
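The gist is not reproduced here; a minimal sketch of what the index mapping and bulk insert presumably look like, using the k-NN plugin's knn_vector field type and the official Python client, is shown below. The endpoint, feature file, and batch size are assumptions; the index name imsearch matches the search response shown later.

# Sketch of index creation and bulk insert (assumed; the linked gist is the reference).
import numpy as np
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('https://your-es-domain.example.com')  # hypothetical endpoint

# k-NN plugin mapping: the feature is stored as a 1,280-d knn_vector field.
mapping = {
    'settings': {'index': {'knn': True}},
    'mappings': {
        'properties': {
            'feature': {'type': 'knn_vector', 'dimension': 1280}
        }
    }
}
es.indices.create(index='imsearch', body=mapping)

features = np.load('features.npy').astype('float32')  # hypothetical ~1M x 1280 feature matrix

def gen_actions(features):
    # Yield one bulk action per image vector.
    for i, vec in enumerate(features):
        yield {
            '_index': 'imsearch',
            '_id': str(i),
            'feature': vec.tolist(),
        }

helpers.bulk(es, gen_actions(features), chunk_size=500)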

We can check the number of searchable documents with the following code.

print(es.cat.indices(v=True))

Search

https://gist.github.com/kumon/2b85746cdd2185b3bfd8d57675735278#file-search_es-py
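As above, only a hedged sketch of the kind of request the gist presumably issues, using the plugin's knn query clause; the endpoint and the k/size values are assumptions.

# Sketch of the kNN search request (assumed shape; see the linked gist for the original).
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch('https://your-es-domain.example.com')  # hypothetical endpoint

query_vector = np.load('features.npy')[0].astype('float32').tolist()  # query with the first stored vector

body = {
    'size': 10,
    'query': {
        'knn': {
            'feature': {
                'vector': query_vector,
                'k': 10
            }
        }
    }
}
response = es.search(index='imsearch', body=body)
for hit in response['hits']['hits']:
    print(hit['_id'], hit['_score'])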

I confirmed that the returned documents/images were the same as those from the prior evaluation above.

However, every request came back with unbelievable latency, around 15 seconds. I tried warming the index up by sending some queries, increasing the number of servers, and so on.

Unfortunately, the latency was always extremely long.

{
  "took": 15471,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1203,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "imsearch",
        "_type": "_doc",
        "_id": "340460",
        "_score": 1.0
      },
      {
        "_index": "imsearch",
        "_type": "_doc",
        "_id": "355432",
        "_score": 0.6760856
      },
      ...

Conclusion

If this evaluation is correct, similarity search with Elasticsearch currently seems to have some restrictions, such as on the vector dimension. For image search, dimensionality reduction such as PCA should work reasonably well as a workaround; for some other kinds of embeddings, it may not.
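As a hedged illustration of that workaround (not something evaluated in this article), reducing the 1,280-d features with scikit-learn's PCA before indexing could look like this; the target dimension of 256 is an arbitrary assumption.

# Sketch of PCA dimensionality reduction before indexing (assumed workaround, not evaluated here).
import numpy as np
from sklearn.decomposition import PCA

features = np.load('features.npy').astype('float32')  # hypothetical ~1M x 1280 feature matrix

pca = PCA(n_components=256)   # 256 is an arbitrary target dimension
reduced = pca.fit_transform(features)

print(reduced.shape)                          # (n_images, 256)
print(pca.explained_variance_ratio_.sum())    # variance retained by the 256 components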

I hope the similarity search feature in Elasticsearch becomes practical soon. If billion-scale data including high-dimensional vector fields could be handled properly in the service, many real-world applications would rely on it.

Update

(Apr. 12, 2020)
The Amazon Elasticsearch Service team helped me realize practical similarity search with the kNN feature in the service. I will share the details in another blog post soon.

(Apr. 14, 2020)
How to Realize a Practical Similarity Search with Elasticsearch
