Building a k-NN Similarity Search Engine using Amazon Elasticsearch and SageMaker

A Step-by-Step Guide to Building an Efficient and Scalable Document Similarity Search Engine

Yi Ai
Towards Data Science
4 min read · Mar 25, 2020

Photo by NeONBRAND on Unsplash

Amazon Elasticsearch Service recently added support for k-nearest neighbor (k-NN) search. It lets you run high-scale, low-latency k-NN search across thousands of dimensions with the same ease as running any regular Elasticsearch query.

k-NN similarity search is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch.
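
To make the "regular Elasticsearch query" point concrete, here is a minimal sketch of the request bodies the Open Distro k-NN plugin expects: an index with `index.knn` enabled and a `knn_vector` field, plus a `knn` query. The helper names, the field name `question_vector`, and the dimension of 4 are illustrative assumptions, not values from this post. The code only builds the JSON bodies and does not contact a cluster.

```python
import json


def build_knn_index_body(field: str, dimension: int) -> dict:
    """Build the create-index body for a k-NN enabled index (hypothetical helper)."""
    return {
        # Turns on k-NN support for this index.
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                # A knn_vector field stores one fixed-length vector per document.
                field: {"type": "knn_vector", "dimension": dimension}
            }
        },
    }


def build_knn_query(field: str, vector: list, k: int) -> dict:
    """Build a search body that asks for the k nearest neighbors of `vector`."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }


if __name__ == "__main__":
    # Field name and dimension are placeholders chosen for the example.
    print(json.dumps(build_knn_index_body("question_vector", 4), indent=2))
    print(json.dumps(build_knn_query("question_vector", [0.1, 0.2, 0.3, 0.4], 3), indent=2))
```

These bodies would be sent with an ordinary create-index request and an ordinary `_search` request. The vector in the query must have the same number of dimensions as the mapped field.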

In this post, I’ll show you how to build a scalable question-similarity search API using Amazon SageMaker, Amazon Elasticsearch Service, Amazon Elastic File System (EFS), and Amazon ECS.

What we’ll cover in this example:

  • Deploy and run a SageMaker notebook instance in a VPC.
  • Mount EFS on the notebook instance.
  • Download the Quora Question Pairs dataset, then map variable-length questions from the dataset to fixed-length vectors using a DistilBERT model.
  • Create a downstream task to reduce the embedding dimensions, and save the sentence embedder to EFS.
  • Transform question text into vectors, and index all vectors to…
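
The step of mapping variable-length questions to fixed-length vectors can be illustrated with a small sketch. It assumes mean pooling over per-token vectors, which is one common choice; this excerpt does not say which pooling method is used. The tiny 3-dimensional token vectors below are made-up stand-ins for real DistilBERT outputs, and `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


# Toy stand-ins for model outputs: one vector per token (assumed values).
short_question = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]           # 2 tokens
long_question = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0],
                 [3.0, 3.0, 3.0], [2.0, 2.0, 2.0]]            # 4 tokens

if __name__ == "__main__":
    # Both results have length 3 even though the questions differ in length.
    print(mean_pool(short_question))
    print(mean_pool(long_question))
```

Because every question ends up with the same number of dimensions regardless of its token count, the resulting vectors can be stored in a single vector field and compared directly.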
