Building a k-NN Similarity Search Engine using Amazon Elasticsearch and SageMaker
A Step-by-Step Guide to Building an Efficient and Scalable Document Similarity Search Engine
Amazon Elasticsearch Service recently added support for k-nearest neighbor (k-NN) search. It lets you run high-scale, low-latency k-NN search across thousands of dimensions with the same ease as running any regular Elasticsearch query.
k-NN similarity search is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch.
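To make "the same ease as a regular query" concrete, here is a minimal sketch of the two request bodies a k-NN workflow typically needs: one to create an index with a vector field, and one to search it. It assumes the Open Distro k-NN plugin conventions (the `index.knn` setting, the `knn_vector` field type, and the `knn` query clause) on an Elasticsearch 7.x domain. The helper names, the field name `question_vector`, and the tiny 4-dimensional vector are hypothetical and chosen only for illustration.

```python
import json


def knn_index_body(field: str, dimension: int) -> dict:
    """Build an index-creation body that enables k-NN on one vector field.

    Assumes the Open Distro k-NN plugin: `index.knn` turns the feature on
    and `knn_vector` declares a fixed-length vector field.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension}
            }
        },
    }


def knn_query_body(field: str, vector: list, k: int) -> dict:
    """Build a search body that asks for the k nearest neighbors of `vector`."""
    return {
        "size": k,  # number of hits returned to the caller
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }


if __name__ == "__main__":
    # Hypothetical field name and toy vector, for illustration only.
    print(json.dumps(knn_index_body("question_vector", 4), indent=2))
    print(json.dumps(knn_query_body("question_vector", [0.1, 0.2, 0.3, 0.4], 3), indent=2))
```

These dictionaries are plain JSON bodies, so they can be sent with any Elasticsearch client or HTTP library; the vector length in the query has to match the `dimension` declared in the mapping.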
In this post, I’ll show you how to build a scalable similar-question search API using Amazon SageMaker, Amazon Elasticsearch Service, Amazon Elastic File System (EFS), and Amazon ECS.
What we’ll cover in this example:
- Deploy and run a SageMaker notebook instance in a VPC.
- Mount EFS to the notebook instance.
- Download the Quora Question Pairs dataset, then map the variable-length questions in the dataset to fixed-length vectors using a DistilBERT model.
- Create a downstream task to reduce the embedding dimensions, and save the sentence embedder to EFS.
- Transform the question text to vectors, and index all vectors to…