How to Use FAISS to Build Your First Similarity Search

Asna Shafiq · Loopio Tech · Sep 14, 2022

At Loopio, we use Facebook AI Similarity Search (FAISS) to efficiently search for similar text. Finding items that are similar is commonplace in many applications. Perhaps you want to find products in your store that match the description input by a customer. Or perhaps you want to find related questions within your ‘Frequently Asked Questions’ (FAQ) section to show customers. We therefore thought it would be helpful to show you how to set up your first working prototype for similarity search. Before we get into the implementation, let’s look at some definitions.

Word embedding or vector

A vector or an embedding is a numerical representation of text data. For example, using an embedding framework, text like ‘name’ can be transformed into a numerical representation like:

[-1.12725616e-01 -5.19371144e-02 -6.94938377e-02 2.93748770e-02 -7.56825879e-02 8.83060396e-02 -5.42510450e-02 -1.52141722e-02]

As humans, we understand the contextual meaning of words like ‘name’, but we need a way to represent this meaning to a Machine Learning (ML) model. A vector representation is just that — language the ML model can understand.

Euclidean distance (L2)

The distance between two points in Euclidean space is called the Euclidean distance. For example, a point in three-dimensional Euclidean space can be located by three coordinates. And the Euclidean distance would account for the distance between all three coordinates.

Tying into the example vector above, which has a dimension of eight, the Euclidean space is also eight-dimensional. The distance between two vectors p and q in this space is given by the Euclidean distance formula:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (p₈ − q₈)²)

That is, we find the difference between each of the eight coordinates of the two vectors, square each difference, sum all eight squares together, and take the square root of the summation.

If words like ‘name’ and ‘cat’ were being compared for similarity, the Euclidean distance between their vectors can be used to determine their closeness. The smaller the distance, the closer they are in meaning. This is just one way similarity distance can be calculated. There are other means, such as cosine distance, and FAISS even lets you define a custom distance metric.
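
To make this concrete, here is a quick sketch in NumPy with made-up two-dimensional vectors (real embeddings would have far more dimensions):

import numpy as np

# Toy two-dimensional "embeddings", made up purely for illustration
name = np.array([0.2, 0.9])
cat = np.array([0.8, 0.3])

# Square each coordinate difference, sum them, take the square root
distance = np.sqrt(np.sum((name - cat) ** 2))
print(distance)  # ~0.8485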

Normalization

Normalization is the process of transforming numerical data so that it uses a common scale. For example, Euclidean (L2) normalization scales a vector to a unit vector, meaning the sum of the squares of its elements equals 1. In other words, the magnitude of the vector becomes exactly 1.0, while its direction is preserved post-normalization.
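
Here is a quick NumPy sketch with toy values:

import numpy as np

v = np.array([3.0, 4.0])
v_normalized = v / np.linalg.norm(v)  # divide by the L2 norm (5.0)

print(v_normalized)                  # [0.6 0.8]
print(np.sum(v_normalized ** 2))     # 1.0 -- the squares sum to one
print(np.linalg.norm(v_normalized))  # 1.0 -- unit magnitude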

Building your first prototype

Problem statement: Identify which category a new text belongs to by calculating how similar it is to the existing texts within each category.

We are going to build a prototype in Python, and any libraries that need to be installed are covered in step 0.

Step 0: Setup

In a terminal, install the FAISS and Sentence Transformers libraries:

pip install faiss-cpu
pip install sentence-transformers

Step 1: Create a dataframe with the existing text and categories

Here we have a few sentences categorized into 3 unique labels: location, random, and networking.

Your dataframe should have two columns: the text itself and its category label.
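
A minimal sketch of this step, assuming pandas; only the ‘headquarters’ sentence is taken from this post, and the other sentences are illustrative stand-ins:

import pandas as pd

# Only the 'headquarters' sentence appears later in this post;
# the rest are hypothetical examples for each label.
data = [
    ('Where are your headquarters located?', 'location'),
    ('Our office is downtown', 'location'),
    ('I enjoy reading on weekends', 'random'),
    ('The weather was great yesterday', 'random'),
    ('How do I configure the router?', 'networking'),
    ('The VPN connection keeps dropping', 'networking'),
]
df = pd.DataFrame(data, columns=['text', 'category'])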

Step 2: Create vectors from the text

Using the text column of the dataframe, word embeddings (vectors) are generated for each row using the Sentence Transformers framework. This is just one of several libraries available for this transformation; others include doc2vec.
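
Continuing from step 1, here is a sketch; the specific model name is an assumption, chosen because it outputs the 768-dimensional vectors referenced in step 3:

from sentence_transformers import SentenceTransformer

# 'all-mpnet-base-v2' is one model that produces 768-dimensional
# vectors (an assumption, not necessarily the original setup)
model = SentenceTransformer('all-mpnet-base-v2')

# Encode every row of the text column into a (num_texts, 768) array
vectors = model.encode(df['text'].tolist())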

Step 3: Build a FAISS index from the vectors

Using the dimension of the vector (768 in this case), an L2 distance index is created, and L2 normalized vectors are added to that index. In FAISS, an index is an object that makes similarity searching efficient.
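
A sketch of this step, continuing with the vectors from step 2:

import faiss

dimension = vectors.shape[1]  # 768 in this case
index = faiss.IndexFlatL2(dimension)  # exact L2 (Euclidean) index

# L2-normalize the vectors in place, then add them to the index
faiss.normalize_L2(vectors)
index.add(vectors)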

Step 4: Create a search vector

Let’s say we now want to search for the sentence that is most similar to our search text ‘where is your office?’. Using the same method of vectorization from step 2, the search text is transformed into a vector. The vector is then normalized as well, because all the vectors within the search index are normalized.
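
Continuing the sketch, reusing the model from step 2:

search_text = 'where is your office?'

# encode() on a list returns a 2-D array, which FAISS expects
search_vector = model.encode([search_text])
faiss.normalize_L2(search_vector)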

Step 5: Search

In this case, we want to search for all nearest neighbours, so k is set to the total number of vectors within the index.
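
A sketch of the search call:

k = index.ntotal  # search against every vector in the index
distances, ann = index.search(search_vector, k)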

Step 6: Sort search results

In this dataframe, you can see that the distances are sorted in ascending order. The ann column holds the approximate nearest neighbour corresponding to each distance: ann 0 is the vector at position 0 in the index, ann 3 is the vector at position 3, and so on, based on the order of the text vectors from step 1.
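
A sketch of building that dataframe from the search output:

# search() returns one row per query; [0] selects our single query.
# FAISS already returns neighbours sorted by distance, ascending.
results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})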

Step 7: Get category for the search text

Similar vectors are defined as those that are nearby in Euclidean space. If we were to pair the distances and ann dataframe with the dataframe in step 1:
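
One way to pair them, as a sketch with pandas:

# Match each ann position against the row index of the step 1 dataframe
merge = pd.merge(results, df, left_on='ann', right_index=True)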

In the merged dataframe, each distance is paired with its original text and category.

We can see that the text “Where are your headquarters located?” has the shortest distance of 0.5848729 from our search text “Where is your office?”.

We can thus retrieve the category from the row with the lowest distance:
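
As a sketch, using the merged dataframe from step 7:

# The merge preserves the ascending-distance order of `results`,
# so the first row is the nearest neighbour
category = merge.iloc[0]['category']
print(category)  # 'location'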

This will give us the category of location for our search text “Where is your office?”.

Conclusion

What we have built here is what the MLOps community calls brute-force similarity search. It is good for getting started, but it would not scale to a production environment, where you could be comparing across millions or billions of texts. FAISS can definitely facilitate that. There are also performance improvements that can be made based on your use case and needs, so check out the FAISS GitHub wiki. You can also get a deeper understanding of how FAISS works under the hood.

Best of all, if you’re interested in working on this and other meaty problems, check out the career opportunities available across Loopio’s Engineering, Product, and Design teams.
