Use FAISS to Build Similarity Search

Pratik Goutam
6 min read · Mar 20, 2024

FAISS, short for “Facebook AI Similarity Search,” is an efficient and scalable library for similarity search and clustering of dense vectors. It is developed by Facebook AI Research and is widely used in various applications such as machine learning, natural language processing, computer vision, and information retrieval.

It is commonly applied in scenarios where large collections of vectors need to be efficiently indexed and searched for similarity, such as:

  1. Recommendation Systems: FAISS can be used to build recommendation engines where items or products are represented as vectors, and similar items are recommended based on user preferences.
  2. Image and Video Search: FAISS is used to index and search large databases of images or videos, enabling content-based retrieval where similar visuals are retrieved for a query image or video.
  3. Natural Language Processing (NLP): In NLP tasks, FAISS can index word or sentence embeddings to quickly find similar words or sentences, facilitating tasks like semantic search or document similarity analysis.

In this article, we will show how to use FAISS for similarity search over a database of books and their descriptions. Based on a user's query, we will return the most relevant matching books.

Before we get into the implementation, let’s look at some definitions.

Word embedding or vector

A vector or embedding serves as a numerical representation of textual data. For instance, with an embedding framework, a text such as ‘FAISS’ can be converted into a numerical representation like:

[0.12345, -0.67891, 0.23456, 0.78901, -0.34567]

While we, as humans, grasp the contextual significance of words like ‘FAISS’, we require a method to convey this meaning to a Machine Learning (ML) model. A vector representation accomplishes precisely that — providing language that the ML model can comprehend.
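As a quick illustration, here is a minimal sketch using the Sentence Transformers library (the same one the prototype below relies on) to turn text into a vector. The five-number vector above is shortened for readability; real embeddings are much longer.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-mpnet-base-v2")
vector = model.encode("FAISS")

print(vector.shape)  # (768,): one 768-dimensional embedding
print(vector[:5])    # first five of the 768 components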

Normalization

Vector normalization is a process used to scale vectors to have a unit length or magnitude of 1 while preserving their direction. It involves dividing each component of the vector by its length or magnitude. This technique is commonly employed in machine learning tasks, where normalized vectors simplify distance calculations.

Before normalization: [3.2, -1.5, 0.8, 2.1, -4.5, 2.9, 1.7, -0.3]

After normalization (each component divided by the vector's magnitude, roughly 7.013): [0.456, -0.214, 0.114, 0.299, -0.642, 0.414, 0.242, -0.043]
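Here is a quick NumPy sketch of the same calculation:

import numpy as np

v = np.array([3.2, -1.5, 0.8, 2.1, -4.5, 2.9, 1.7, -0.3])
magnitude = np.linalg.norm(v)  # L2 norm (length), ~7.013 for this vector
v_normalized = v / magnitude   # divide each component by the magnitude

print(v_normalized.round(3))         # [ 0.456 -0.214  0.114  0.299 -0.642  0.414  0.242 -0.043]
print(np.linalg.norm(v_normalized))  # 1.0: unit length, direction preserved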

Vector normalization offers several benefits in machine learning and data analysis tasks:

  • Standardization of Magnitudes: Normalizing vectors ensures that their magnitudes are consistent across different data points, making comparisons and distance calculations more reliable.
  • Simplified Distance Metrics: Normalized vectors simplify distance calculations, especially in algorithms like cosine similarity and dot product, where the length of the vectors directly affects the result. With normalized vectors, these calculations become more intuitive and efficient.
  • Improved Model Performance: Many machine learning models, such as k-nearest neighbors (KNN) and support vector machines (SVM), rely on distance metrics to make predictions. Normalizing vectors can lead to more accurate and stable model predictions by reducing the impact of variations in vector magnitudes.

Euclidean distance (L2)

Euclidean distance, also known as L2 distance, is a measure of the straight-line distance between two points in Euclidean space. It is the length of the line segment connecting the two points. Mathematically, the Euclidean distance between two points P and Q in n-dimensional space is calculated using the Pythagorean theorem:

d(P, Q) = √((q_1 − p_1)² + (q_2 − p_2)² + … + (q_n − p_n)²)

where p_i and q_i represent the i-th coordinates of points P and Q respectively, and n is the number of dimensions.
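As a quick worked example, here is the same formula in NumPy for two 3-dimensional points (sample values chosen purely for illustration):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Sum the squared coordinate differences, then take the square root
distance = np.sqrt(np.sum((q - p) ** 2))
print(distance)               # 5.0

# np.linalg.norm computes the same thing
print(np.linalg.norm(q - p))  # 5.0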

The Euclidean distance is widely used in various fields such as machine learning, computer vision, and data analysis as a measure of similarity or dissimilarity between data points.

When comparing words like ‘dog’ and ‘cat’ for similarity, one can use Euclidean distance to gauge their closeness. A smaller distance indicates a closer meaning between the words. However, this is just one method for calculating similarity; other approaches, like cosine distance, are also used. Additionally, FAISS supports several distance metrics, so you can pick the one that suits your specific requirements.

Building your first prototype

Problem statement: Given a database of books and their descriptions, suggest to a user the top 3 books that best suit their use case and requirements.

We are going to build a prototype in Python, and any libraries that need to be installed are mentioned in step 0.

Step 0: Setup

In a terminal, install the FAISS, Sentence Transformers, and pandas libraries.

pip install faiss-cpu
pip install sentence-transformers
pip install pandas

Step 1: Create a dataframe with books and their descriptions

I got the sample dataset from gigasheet. You can sign up for free and download the data as well. I am using the top 500 records from this dataset for this prototype. Here we have a list of books along with their descriptions (books.csv).

Now, we create a dataframe using this csv file.

import pandas as pd

df = pd.read_csv('books.csv')
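A quick sanity check on the load (assuming the Title and Description columns used later in this article):

print(df.shape)  # e.g. (500, n_columns) for the top-500 sample
print(df[['Title', 'Description']].head(3))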

Step 2: Create vectors from the text

Using the Description column of the dataframe, word embeddings or vectors are generated for each row using the Sentence Transformers framework. This is just one of the available embedding libraries; alternatives include Doc2Vec.

from sentence_transformers import SentenceTransformer

# Convert the pandas Series to a plain list of strings
book_descriptions = df['Description'].tolist()

# Load a pretrained model that maps text to 768-dimensional vectors
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
vectors = encoder.encode(book_descriptions)
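It can help to verify the shape and type of the result before indexing:

print(vectors.shape)  # (500, 768): one vector per book description
print(vectors.dtype)  # float32, which is what FAISS expects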

Step 3: Build a FAISS index from the vectors

An L2 distance index is created using the dimension of the vectors (768 in this case), and the L2-normalized vectors are added to this index. In FAISS, an index is an object designed to facilitate efficient similarity searching. Note that because the vectors are normalized to unit length, ranking by L2 distance produces the same ordering as ranking by cosine similarity.

import faiss

vector_dimension = vectors.shape[1]          # 768 for paraphrase-mpnet-base-v2
index = faiss.IndexFlatL2(vector_dimension)  # exact (brute-force) L2 index
faiss.normalize_L2(vectors)                  # scale each vector to unit length, in place
index.add(vectors)                           # store all book vectors in the index
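A couple of quick checks on the index:

print(index.ntotal)      # 500: number of vectors stored in the index
print(index.is_trained)  # True: a flat index needs no training step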

Step 4: Create a search vector

Suppose we aim to find the top 3 books that are most similar to our search query “recommend a book on vampires”. We follow a similar vectorization process as in step 2 to transform the search text into a vector. Subsequently, the vector is normalized to align with the normalized vectors within the search index.

query_description = input("Enter Search Query: ")

# Encode the query the same way the book descriptions were encoded
query_vector = encoder.encode([query_description])
# Normalize it to match the normalized vectors in the index
faiss.normalize_L2(query_vector)

Step 5: Search

In this case, we want to search for the top 3 nearest neighbours, so k is set to 3.

k = 3
# D holds the (squared) L2 distances, I holds the row indices of the matches
D, I = index.search(query_vector, k)
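Both return values are NumPy arrays with one row per query vector:

print(D.shape, I.shape)  # (1, 3) each: one query, k = 3 results
print(I[0])              # row indices of the 3 closest descriptions
print(D[0])              # their distances (smaller = closer)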

Step 6: Display Recommended Books

After completing the search, we present the top 3 recommended books along with their L2 distances. It’s important to note that a lower distance indicates a better match.

similar_books = []
for d, idx in zip(D[0], I[0]):
    title = df.iloc[idx]['Title']     # look up the title by row index
    similar_books.append((title, d))  # keep the title and its distance

# Print the titles and distances of the similar books
print("Similar books:")
for title, score in similar_books:
    print(f"Title: {title}, Distance: {score}")

Step 7: Run the Prototype

Run the Python file created above and you should see a prompt asking for a search query in the terminal. Enter your desired query and you should get 3 recommended books based on their matching descriptions.

Prototype Terminal Output

The description of the book Fangs reads:

A love story between a vampire and a werewolf by the creator of the enormously popular Sarah's Scribbles comics. Elsie the vampire is three hundred years old, but in all that time, she has never met . . . relatable relationship humor, Fangs has all the makings of a cult classic.

Conclusion

In conclusion, FAISS offers a powerful solution for similarity search tasks in large-scale text datasets. Its efficient indexing and search capabilities make it well suited for production environments where comparisons across millions or even billions of texts are needed.

Additionally, FAISS offers performance optimizations that can be tailored to your specific use case and requirements, so check out FAISS' GitHub wiki. You can also get a deeper understanding of how FAISS works under the hood.

You can find the Git repository for the above code here.
