Add Similarity Search to DynamoDB with Faiss

Ioannis Tsiokos · The Startup · Nov 26, 2020 · 10 min read

How about setting up a scalable semantic similarity search engine for your website or app, and paying only for the capacity you consume?

What are we making?

By the end of this tutorial, you will be able to search your DynamoDB text entries semantically using a distilled version of BERT. The semantic index will be updated automatically by a microservice to reflect new additions to your DynamoDB table. A second microservice will be responsible for querying the index.

Whether you want your users to search for text, images, or audio, the index-building and search process is roughly the same. For brevity, we will implement search for text only; however, you can easily replace the text embedding library with one suited to your type of data.

What will it take?

I will show you how to set this up using DynamoDB, Lambda, EFS, and S3. Note that all four cloud services are on-demand, i.e. you only pay for what you consume. No idle running services and no need to design a scaling architecture.

Our search engine will be based on vector representations of text, using Facebook’s Faiss library and its Python bindings. This will make our engine “smarter” than a keyword-based engine. If you are not interested in semantics and prefer the old-school keyword solution, then you can replace Faiss with a BM25 library, or use AWS’s ElasticSearch (requires a running instance).

The tutorial is split into two parts. In the first part, we will set up the backend resources and code the index building function. In the second part, we will code a client-facing function to run searches on the index and return the results through API Gateway.

You should have some familiarity with AWS and Python. If you don't, be ready for some googling and for reading the external sources linked throughout this article.

The code for this project is available at https://github.com/ioannist/dynamodb-faiss-builder

DynamoDB setup

Create the AWS DynamoDB table that will store your text data. Use whatever primary key makes sense for your case. I will assume that you have a partition key named “content_group” (string), a sort key “content_id”, and one text attribute named “content” (string).

If you want to avoid a flat monthly fee, make sure you uncheck “Use default settings” and select “On-demand” capacity.

After your table is created, go to Indexes and create an index with the primary key “faiss_pending” of type number. We will use the “faiss_pending” attribute to build a sparse index of all table entries that have not yet been added to the faiss index.
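
To make the mechanics concrete, here is a minimal boto3 sketch (the table name my-content-table is a placeholder) of how an item enters the sparse index: as long as a numeric faiss_pending attribute is present, the item shows up in the index; the builder function we write later removes the attribute once the item is embedded.

import time
import boto3

table = boto3.resource("dynamodb").Table("my-content-table")  # placeholder name
table.put_item(Item={
    "content_group": "articles",
    "content_id": "article-001",
    "content": "How do I add semantic search to DynamoDB?",
    # any number works; its presence is what marks the item as pending
    "faiss_pending": int(time.time() * 1000),
})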

VPC setup

Our AWS Lambda function will have to run inside a VPC to be able to access an EFS volume. This is not optional, for reasons that will become obvious later. Our goal here is to set up the network infrastructure that will host and connect all the pieces for both index building (private subnet) and index searching (public subnet).

Create an AWS VPC with an IPv4 CIDR of 10.0.0.0/16 in your favorite Amazon region (we will use us-east-1). Create three subnets for that VPC with CIDR blocks 10.0.0.0/18, 10.0.64.0/18, and 10.0.128.0/17. The first two subnets will be private subnets with no internet access, and the third one will be public. Use appropriate names so you can tell them apart.

Create two route tables for the VPC: one for the private subnets and one for the public subnet. Then edit the route table associations in your subnets so that the private route table is associated with both private subnets and the public route table with the public subnet.
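
If you prefer scripting to console clicks, the same setup can be sketched with boto3 (no error handling or tagging):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
private_a = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/18")["Subnet"]["SubnetId"]
private_b = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.64.0/18")["Subnet"]["SubnetId"]
public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.128.0/17")["Subnet"]["SubnetId"]

# one route table for the two private subnets, one for the public subnet
private_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
for subnet_id in (private_a, private_b):
    ec2.associate_route_table(RouteTableId=private_rt, SubnetId=subnet_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public)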

EFS setup

EFS will provide our lambda function with persistent, scalable storage.

“Why EFS and not S3?”

The code that is deployed on Lambda cannot exceed 250MB (unzipped). Our Python lambda function will require three libraries:

numpy

sentence_transformers

faiss-cpu

We can fit numpy and faiss-cpu in a Lambda layer (to avoid redeploying them every time), but sentence_transformers is over 800 MB. This means that we will have to download it and load it dynamically.

Now, you could download the library from S3 and try to store it locally inside Lambda’s /tmp directory. However, if you try that, you will find out that the library does not fit in /tmp, which has a limit of 512 MB. EFS to the rescue.

EFS is a virtual drive for your lambda functions. It scales automatically and it can be accessed by multiple functions at the same time. This raises issues of concurrent writing; however, to make our lives easier, we will have only one lambda (with concurrency = 1) build, update, and write the faiss index to EFS. Multiple Lambda functions will be able to read the index for searching. Problem solved.

To make an EFS file system, simply give it a name and select the VPC we made in the previous step. To finish your EFS setup, create an Access Point for the File System you just created with the following settings:

[Screenshot: EFS access point settings]

The access point will allow your EFS to attach to your Lambda function inside your VPC.
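
For reference, here is a hedged boto3 sketch of the same EFS setup. The POSIX user and permissions are assumed typical values for Lambda tutorials, not read from the screenshot, and the subnet and security group ids are placeholders for the ones you created:

import boto3

efs = boto3.client("efs", region_name="us-east-1")

fs_id = efs.create_file_system(CreationToken="faiss-efs")["FileSystemId"]
# wait until the file system state is "available" before the next calls

# one mount target per private subnet, with a security group that allows NFS
# (we create that group in the "VPC updates" section below)
for subnet_id in ("subnet-private-a", "subnet-private-b"):
    efs.create_mount_target(FileSystemId=fs_id, SubnetId=subnet_id,
                            SecurityGroups=["sg-xxxxxxxx"])

efs.create_access_point(
    FileSystemId=fs_id,
    PosixUser={"Uid": 1000, "Gid": 1000},  # assumed values
    RootDirectory={"Path": "/fs",          # assumed root path
                   "CreationInfo": {"OwnerUid": 1000, "OwnerGid": 1000,
                                    "Permissions": "777"}},
)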

S3 setup

“Wait, didn’t you say, we will use EFS instead of S3?”

I did, and I lied. We also need S3 so that our function can download some stuff to EFS. Stuff includes the sentence_transformers library and the embedding model that the library will use. Our lambda will only need to download this stuff from S3 to EFS the first time it runs. However, it’s good to leave the files in S3 in case we want to deploy more faiss indexes.

Create an S3 bucket in the same region as your VPC. Create the following folders in the bucket:

ml/libs/python/

ml/models/distilbert_base_nli_stsb_mean_tokens/

Download sentence_transformers from here and upload it to the first S3 directory as sentence_transformers_038.zip. You can also create your own zip file of sentence_transformers by pip installing it inside an Amazon Lambda Linux Docker image.

Download the embedding model and upload it under the second S3 directory as distilbert-base-nli-stsb-mean-tokens.zip. This is the model we will use to transform text into vectors.
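
Uploading the two zips is one boto3 call each (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
bucket = "my-faiss-bucket"  # placeholder name

s3.upload_file("sentence_transformers_038.zip", bucket,
               "ml/libs/python/sentence_transformers_038.zip")
s3.upload_file("distilbert-base-nli-stsb-mean-tokens.zip", bucket,
               "ml/models/distilbert_base_nli_stsb_mean_tokens/distilbert-base-nli-stsb-mean-tokens.zip")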

If you want to make a search engine for images or audio, you will need a different library and model. For example, for audio search, you could use the well-known wav2vec.

VPC updates

Now that we have S3 and DynamoDB all set up, we need to update our VPC to make sure our Lambda will be able to access both resources through the private network.

Since our lambda runs inside one of the private VPC subnets, it cannot reach S3 or DynamoDB over the internet. We could add internet access by attaching an Internet Gateway to the subnet; however, since this is a private subnet, we choose to access both services through VPC endpoints instead.

Go to your VPC dashboard and click on Endpoints. Create an endpoint for the service com.amazonaws.us-east-1.s3 and one for com.amazonaws.us-east-1.dynamodb. Both are gateway endpoints: associate each of them with the route table that serves your two private subnets. Leave everything else at default.
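
Scripted, the two gateway endpoints look like this (the VPC and route table ids are placeholders for the ones created earlier):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for service in ("com.amazonaws.us-east-1.s3",
                "com.amazonaws.us-east-1.dynamodb"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-xxxxxxxx",
        ServiceName=service,
        RouteTableIds=["rtb-xxxxxxxx"],  # the private route table
    )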

Finally, we have to create a security group that allows our Lambda function to exchange traffic with EFS. In the VPC dashboard, click on Security Groups and create a security group. Select the VPC you created. For outbound, allow all traffic to any destination (0.0.0.0/0); for inbound, allow NFS traffic (port 2049) from any source (0.0.0.0/0).

IAM

Let’s create an IAM role for our lambdas. The role will need three policies.

Attach the AWS managed policy named AWSLambdaBasicExecutionRole to enable logging to CloudWatch.

Next, create an inline policy granting DynamoDB access for your table. We will need Scan rights (to read the sparse index), UpdateItem rights (to clear faiss_pending), and any other rights your use case requires.
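
As a sketch, the inline policy could look like the following; the account id and table name are placeholders, and you may need more actions depending on your use case:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:UpdateItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YOUR_TABLE",
        "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YOUR_TABLE/index/*"
      ]
    }
  ]
}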

Finally, create and attach the policy below; it lets Lambda create the network interfaces it needs to join your VPC and mount EFS.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeNetworkInterfaces",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstances",
        "ec2:AttachNetworkInterface"
      ],
      "Resource": "*"
    }
  ]
}

Lambda

Create a new layer named faiss-cpu, for Python 3.6, 3.7, 3.8. You can download the file for the layer here. The zip file includes faiss-cpu and numpy. The zip file is 30MB and the uncompressed contents are under 250 MB so we are good. If you want to compile your own layer to use the latest versions or add additional libraries, follow the directions here.

Create a new Function with Python 3.7 runtime and the role you created in the previous step.

Attach the layer you created.

Set Lambda concurrency to 1.

Set the function’s memory to 2048MB and the timeout to 5 minutes (you can increase this later to 15 min; however, 5 min will save you time during debugging).

Edit the function’s VPC settings and select the VPC, the two private subnets, and the security group you created.

Add an EFS to your lambda by selecting the EFS you created, its access point, and entering /mnt/efs as the local mount path. If you are having permissions problems in this step, make sure the security group allows EFS inbound traffic through port 2049 from any source and allows all outbound traffic, that your route tables allow all traffic (default), and that your EFS endpoint has the right settings (refer to the image above).
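
The whole Lambda configuration can also be applied with boto3; here is a sketch with placeholder names, ids, and ARNs:

import boto3

lam = boto3.client("lambda")
FUNCTION = "faiss-index-builder"  # placeholder function name

lam.update_function_configuration(
    FunctionName=FUNCTION,
    MemorySize=2048,
    Timeout=300,  # 5 minutes
    VpcConfig={"SubnetIds": ["subnet-private-a", "subnet-private-b"],
               "SecurityGroupIds": ["sg-xxxxxxxx"]},
    FileSystemConfigs=[{
        "Arn": "arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-xxxxxxxx",
        "LocalMountPath": "/mnt/efs",  # Lambda requires a path under /mnt
    }],
)
lam.put_function_concurrency(FunctionName=FUNCTION,
                             ReservedConcurrentExecutions=1)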

Lambda CloudWatch trigger

Finish your lambda set-up by adding a trigger to have it run every X minutes or hours, depending on your requirements. For example, to have it run every day, add an EventBridge (CloudWatch Event) of rate(1 day).
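
Scripted, the schedule is a rule, a permission, and a target (names and ARNs below are placeholders):

import boto3

events = boto3.client("events")
lam = boto3.client("lambda")

rule_arn = events.put_rule(Name="faiss-index-daily",
                           ScheduleExpression="rate(1 day)")["RuleArn"]
lam.add_permission(FunctionName="faiss-index-builder",
                   StatementId="allow-eventbridge",
                   Action="lambda:InvokeFunction",
                   Principal="events.amazonaws.com",
                   SourceArn=rule_arn)
events.put_targets(Rule="faiss-index-daily",
                   Targets=[{"Id": "1",
                             "Arn": "arn:aws:lambda:us-east-1:123456789012:function:faiss-index-builder"}])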

Your function will normally run for only about 100 ms, enough to check whether new records have been added to DynamoDB. If no new records are found, it terminates immediately.

You also have the option to create a DynamoDB trigger to run the function on DynamoDB events. For most use cases, a search index that is at most X minutes out of date should be OK, so a DynamoDB trigger would add cost and complexity for no reason. Note that although the CloudWatch trigger fires even when there is nothing to do, the cost will be insignificant because of the short duration.

Python code for Lambda

Our index-building lambda function needs to do the following.

The first time the lambda runs, it must:

A. Download the BERT model, if the model is not available in EFS storage

B. Download the sentence_transformers library, if it is not available in EFS

C. Create a new faiss index if there is no serialized index to load from EFS

The function must also do the following when it initializes after a period of inactivity:

I. Make the sentence_transformers library available to Python by adding it to sys.path

II. Load the faiss index from EFS if it is available

The above processes take about 20 seconds. If you keep your Lambda “warm” by running it every 5 minutes, you will rarely experience this delay. However, if you run your function every ~20 minutes or more, you will experience the delay every time.

Finally, once all prep work is done, the function must do the following on every invocation:

1. Get new table records by scanning the DynamoDB sparse index that has the key faiss_pending

2. Calculate the vector embedding for each “content” string from the scan results.

3. Add the embeddings to the faiss index and save the index back to EFS

4. Update the DynamoDB table records by removing faiss_pending, to remove them from the sparse index and avoid reprocessing

Programming FaissIndexMaker Lambda

A. Download the BERT model, if the model is not available in EFS storage
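
A minimal sketch of step A; the bucket name is a placeholder, the S3 key matches the upload from the S3 setup, and we unzip directly on EFS because the file is too big for /tmp:

import os
import zipfile
import boto3

BUCKET = "my-faiss-bucket"  # placeholder name
MODEL_KEY = "ml/models/distilbert_base_nli_stsb_mean_tokens/distilbert-base-nli-stsb-mean-tokens.zip"
MODEL_DIR = "/mnt/efs/models/distilbert-base-nli-stsb-mean-tokens"

def ensure_model():
    # only the very first invocation pays for this download
    if os.path.isdir(MODEL_DIR):
        return
    os.makedirs(os.path.dirname(MODEL_DIR), exist_ok=True)
    zip_path = MODEL_DIR + ".zip"
    boto3.client("s3").download_file(BUCKET, MODEL_KEY, zip_path)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(MODEL_DIR)
    os.remove(zip_path)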

B. Download the sentence_transformers library, if it is not available in EFS, and

I. Make the sentence_transformers library available to Python by adding it to sys.path
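
Steps B and I together, sketched under the same assumptions (placeholder bucket; the zip is extracted straight onto EFS since it exceeds /tmp):

import os
import sys
import zipfile
import boto3

BUCKET = "my-faiss-bucket"  # placeholder name
LIB_KEY = "ml/libs/python/sentence_transformers_038.zip"
LIB_DIR = "/mnt/efs/libs/python"

def ensure_library():
    # step B: one-time download of the zipped library to EFS
    if not os.path.isdir(os.path.join(LIB_DIR, "sentence_transformers")):
        os.makedirs(LIB_DIR, exist_ok=True)
        zip_path = os.path.join(LIB_DIR, "lib.zip")
        boto3.client("s3").download_file(BUCKET, LIB_KEY, zip_path)
        with zipfile.ZipFile(zip_path) as z:
            z.extractall(LIB_DIR)
        os.remove(zip_path)
    # step I: runs on every cold start, so imports resolve from EFS
    if LIB_DIR not in sys.path:
        sys.path.insert(0, LIB_DIR)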

C. Create a new faiss index if there is no serialized index to load from EFS, and

II. Load the faiss index from EFS if it is available
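
Steps C and II fit in one helper. The index type here is my assumption: IndexIDMap over IndexFlatIP gives exact inner-product search with custom numeric ids; the repo may use a different index:

import os
import faiss  # provided by the Lambda layer

INDEX_PATH = "/mnt/efs/faiss/index.bin"
DIM = 768  # embedding size of distilbert-base-nli-stsb-mean-tokens

def load_or_create_index():
    # step II: reload the serialized index, if a previous run saved one
    if os.path.isfile(INDEX_PATH):
        return faiss.read_index(INDEX_PATH)
    # step C: otherwise start a fresh, empty index
    return faiss.IndexIDMap(faiss.IndexFlatIP(DIM))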

1. Get all new table records by scanning the DynamoDB sparse index that has the key faiss_pending

2. Calculate the vector embedding for each “content” string from the scan results.

3. Add the embeddings to the faiss index and save the index back to EFS

4. Update the DynamoDB table records by removing faiss_pending, to remove them from the sparse index and avoid reprocessing
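
And a sketch of the handler tying steps 1 to 4 to the helpers from the previous sketches. The GSI name and the use of faiss_pending as the numeric faiss id are assumptions; a real implementation needs a stable mapping from faiss ids back to DynamoDB keys:

import os
import boto3
import faiss
import numpy as np

table = boto3.resource("dynamodb").Table("my-content-table")  # placeholder name

def handler(event, context):
    ensure_model()
    ensure_library()
    from sentence_transformers import SentenceTransformer  # import only after step I
    model = SentenceTransformer(MODEL_DIR)
    index = load_or_create_index()

    # 1. scan the sparse index for records not yet embedded
    pending = table.scan(IndexName="faiss_pending-index")["Items"]  # assumed GSI name
    if not pending:
        return
    # 2. embed the "content" strings
    vectors = np.asarray(model.encode([i["content"] for i in pending]), dtype="float32")
    faiss.normalize_L2(vectors)  # normalized vectors: inner product == cosine similarity
    # 3. add to the index and persist it back to EFS
    ids = np.array([int(i["faiss_pending"]) for i in pending], dtype="int64")
    index.add_with_ids(vectors, ids)
    os.makedirs(os.path.dirname(INDEX_PATH), exist_ok=True)
    faiss.write_index(index, INDEX_PATH)
    # 4. clear faiss_pending so the items drop out of the sparse index
    for i in pending:
        table.update_item(
            Key={"content_group": i["content_group"], "content_id": i["content_id"]},
            UpdateExpression="REMOVE faiss_pending",
        )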

If you prefer to look at the full code, check out the GitHub repo.

You can zip and upload the .py files manually to Lambda, as all imports are taken care of by the Lambda layer and EFS.

If you would like to add extra libraries, you can add them to requirements.txt and they will be included in the zip file. In that case, you should install your imports inside Lambda’s native environment, which is provided by Docker. Check the Dockerfile, deploy.sh, and deploy-lambda.bat in the repo if you are curious.

Recap and next steps

Congrats! You have just upgraded your database with a semantic similarity index, and you don’t have a single server to manage! You’ve also managed to work around AWS’s limitations by integrating various technologies. Pat yourself on the back. It was no easy feat.

In the next article, we will set up a Lambda search function so we can query the index from a public API. We have already done most of the work, so this will be easier and more rewarding.

Food for thought

Our microservice can index one entry every 0.75 seconds using CPU only. What if our database grows faster than this on average? How can we manage concurrent write access to a single faiss index file? More importantly, should we even bother?

If our Lambda has to work 24/7 to catch up, wouldn’t it be much cheaper to run a server? If we do decide to run a server, would it be cheaper to attach a GPU, use faiss-gpu (instead of faiss-cpu), and run it only a few times per day?
