NLP — Efficient Semantic Similarity Search with Faiss (Facebook AI Similarity Search) and GPUs

Gabriel Tardochi Salles
Published in Lett · 7 min read · Oct 19, 2021

Similarity search has never been more present in everyone’s lives. More often than ever, we face information repositories where objects don’t possess any natural order (e.g. large collections of images, text documents, or sounds), and the only available comparator is the similarity between any pair of objects. With Big Data, this type of search becomes a dramatically harder problem to solve, and one has to be equipped with the right techniques.

A few weeks ago, I stumbled upon the challenge of matching millions of e-commerce products together when they are essentially the same item but announced on different pages. That problem is not the scope here, but it’s a great one to illustrate the following use case: I couldn’t frame it as a traditional multiclass classification problem, since products come and go every day and there aren’t always many training samples for each one of them. For my MVP, the smartest way I found to go about it was to build an intelligent system capable of generating semantically meaningful vector representations of product titles, and to use some sort of vector similarity search to match the most similar ones. It all had to be scalable, and thus efficient. To achieve that, I had to dig deep into the world of semantic similarity search in the field of Natural Language Processing, acquiring a lot of valuable learnings that I would love to share in this article.

Firstly, I want to briefly present a way of representing the semantic content of text as numeric vectors, leveraging powerful state-of-the-art deep learning models and techniques. Then, I will compare Facebook’s Faiss Python library with a brute force similarity search approach, focusing on the cosine similarity measure. For that, we will explore a very cool dataset with tens of thousands of movie plots.

Similarity search is the most general term used for a range of mechanisms which share the principle of searching (typically, very large) spaces of objects where the only available comparator is the similarity between any pair of objects.

Faiss (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It’s written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU.

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.

The Wikipedia Movie Plots Dataset

This famous dataset is a friendly one to explore the concept of semantic similarity search, and big enough so that we can start to feel the extra optimization steps needed when dealing with Big Data. It contains descriptions of more than 30,000 movies from around the world. The original data can be found here, but we will be using this version that I’ve built by simply summarizing the plots so that they are limited to 128 tokens (the details of how I summarized it are on the dataset page).

The following code can be found on this GitHub repository.

A first look at the data

To start, let’s download the dataset, install the dependencies and import them:

With the dataset in hand, we are ready to read in the data with pandas:
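A minimal loading sketch; the filename below is a hypothetical stand-in for wherever you saved the summarized dataset:

```python
from pathlib import Path

import pandas as pd

# Hypothetical filename; adjust to match the downloaded file.
csv_path = Path("wiki_movie_plots_summarized.csv")
if csv_path.exists():
    movies = pd.read_csv(csv_path)
    print(f"Loaded {len(movies)} movies with columns {list(movies.columns)}")
```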

Now, we have a movies DataFrame of 33,869 movie titles and their respective summarized plots! Let’s visualize it with an example:
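A lookup helper like the one below can pull a single plot out of the DataFrame. The column names "Title" and "Plot" are assumptions; match them to your copy of the dataset:

```python
import pandas as pd

def plot_of(movies: pd.DataFrame, title: str) -> str:
    """Return the summarized plot of the movie with the given title."""
    return movies.loc[movies["Title"] == title, "Plot"].iloc[0]

# Usage, assuming the movies DataFrame loaded in the previous step:
# hp_movie_plot = plot_of(movies, "Harry Potter and the Philosopher's Stone")
```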

hp_movie_plot:

Generating meaningful representations for our movie plots

There is a huge variety of ways to encode sentences as numeric vectors, and this topic is an article in itself. For this practical example, we will be making use of a Sentence Transformer model. This type of neural network was first introduced in the 2019 Sentence-BERT paper and achieves state-of-the-art performance on various tasks.

Details aside, a Sentence Transformer uses a siamese network structure (the networks have tied weights, meaning they are essentially the same), composed of a transformer with a pooling operation on top of the contextualized word embeddings. It learns in a supervised fashion and expects pairs of sentences annotated with their respective cosine similarity for training. One at a time, both sentences are encoded, and the network is updated based on the error between the cosine similarity of the generated sentence embeddings and the provided label (usually with the MSE loss). Its final goal is to provide text embeddings in a vector space such that similar texts are close together and can efficiently be found using cosine similarity.

SBERT architecture from Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

The intuition behind the cosine similarity measure is that semantically similar vectors will point in the same direction in a multidimensional space. With that in mind, we can interpret the cosine of the angle between two vectors as their similarity.

Embeddings of the words “car”, “truck”, “apple”, and “orange” in two dimensions. Even though we are dealing with sentences, this illustration might help with building intuition.

To encode our plots, we will be using the pre-trained paraphrase-MiniLM-L6-v2 Sentence Transformer model, since it provides good-quality embeddings while still being a lightweight neural network. For more accurate embeddings, consider testing different architectures and fine-tuning on your specific dataset. There’s an overview of the most relevant pre-trained models here.

Our plots are now encoded as vectors of 384 dimensions and stored in the plot_embeddings variable.

Measuring the similarity between movies

To get a better feel for the quality of our embeddings, let’s use SciPy’s cosine distance function to verify the similarity between every movie from the Toy Story and Despicable Me franchises:
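One way to build that comparison table, sketched with random vectors standing in for the actual franchise embeddings (note that SciPy’s `cosine` is a distance, so similarity is 1 minus it):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

def pairwise_cosine_sims(embeddings, labels):
    """Build a labels x labels DataFrame of cosine similarities."""
    sims = [[1.0 - cosine(a, b) for b in embeddings] for a in embeddings]
    return pd.DataFrame(sims, index=labels, columns=labels)

# Toy example; in the article, these would be the embeddings of the
# Toy Story and Despicable Me plots.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 384))
cos_sims_df = pairwise_cosine_sims(
    vecs, ["Toy Story", "Toy Story 2", "Despicable Me"]
)
print(cos_sims_df.round(2))
```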

cos_sims_df:

Thankfully, movies from the same franchise are similar to each other. It makes a lot of sense!

Similarity search: a brute force approach

Let’s find the most similar movie in our dataset to the plot of 2021’s “Godzilla vs. Kong”:
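The brute force search can be sketched like this, with random vectors standing in for the query embedding and plot_embeddings:

```python
import numpy as np
from scipy.spatial.distance import cosine

def most_similar(query_emb, embeddings):
    """Brute force: compare the query against every single embedding."""
    sims = np.array([1.0 - cosine(query_emb, e) for e in embeddings])
    best = int(np.argmax(sims))
    return best, sims[best]

# Toy data; in the article, query_emb is the encoded "Godzilla vs. Kong"
# plot and embeddings is plot_embeddings.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(1_000, 384))
idx, sim = most_similar(embeddings[42], embeddings)
print(idx, round(sim, 4))  # 42 1.0 - the query matches itself
```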

I hadn’t heard about it, but there is actually a “Godzilla vs. Destoroyah” movie, and we accurately found it as the most similar to our query!

This search took ~0.05 seconds, which is OK on a small scale. We took the brute force path and compared our query against every single movie plot embedding in the database so that we could rank them by similarity and grab the highest ranked one.

Now, let’s suppose we have a use case where, for every movie, we want to find the most similar one in the dataset. One could try the following:
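One naive attempt, sketched below: build the full n × n cosine similarity matrix, which is exactly the kind of thing that blows up RAM (33,869² float64 entries alone are roughly 9 GB):

```python
import numpy as np

def most_similar_for_every_movie(embeddings: np.ndarray) -> np.ndarray:
    """Brute force: full pairwise cosine similarity matrix (O(n^2) memory)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T           # the n x n matrix is the memory hog
    np.fill_diagonal(sims, -np.inf)    # a movie shouldn't match itself
    return sims.argmax(axis=1)         # index of the best match per movie

# Tiny usage example with two obvious pairs:
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(most_similar_for_every_movie(vecs))  # [1 0 3 2]
```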

Damn! This piece of code took 7 minutes to run, with a RAM consumption peaking over 12 GB. There has to be a better way to do this!

Efficient similarity searches with Faiss

Faiss is built around an index type that stores a set of vectors and provides a function to search in them with L2 and/or dot product vector comparison, with GPU support. It implements indexing structures for a lot of different use cases, each one of them corresponding to various trade-offs concerning search time, search quality, memory used per index vector, training time, and the need for external data for unsupervised training. Also, it’s a very well-documented project, providing, among other things, good guidelines for choosing an index and a great getting started tutorial.

Faiss: Facebook AI Similarity Search

Before going back to solving our search problem, we need to remember that when comparing vectors, the dot product measure (a special case of an inner product) cares about both angle and magnitude differences, while cosine similarity only cares about the angle. With that in mind, we can perform an L2 normalization on both our query and our database, so that every vector has the same magnitude (the sum of the squared components equals 1) and cosine similarity becomes indistinguishable from the dot product.

Faiss has an IndexFlatIP (exact search for inner product) index that suits our needs here. Let’s try to use it on our task:

Cool! This search took 14 seconds to run, with RAM consumption always under 3 GB, way better than the 7-minute run with a 12 GB RAM peak.

Making it faster! Faiss with GPU indexes

We already got a lot of performance improvement on our semantic similarity search, but I would also like to present the power of GPU computing. By simply moving our already created index to a GPU resource, we can perform the very same search:

Same RAM usage, but only 0.1 seconds to run! That shows why leveraging GPU computing, whenever possible, is a great idea.

And that is it! Feel free to connect with me on LinkedIn and ask me any questions over there!

Machine Learning Engineer sharing practical insights and tutorials on data and AI. LinkedIn: www.linkedin.com/in/gabrieltardochisalles