A gentle introduction to Vector Search

Mikhail Korotkov
13 min read · Jun 21, 2023


Vector Search pipeline

· Introduction to Neural Network Embeddings
  Recap of basics
  What is Embedding?
  Intuitive Explanation
  Why embeddings?
  Various Applications of Embeddings
· Vector search
  Concept
  What if data is huge?
· HNSW gentle introduction
  Probability Skip List
  Navigable Small World Graphs
  Creating HNSW
  Graph Construction
· Vector databases and frameworks overview
· Conclusion
· Links

Introduction to Neural Network Embeddings

This article aims to shed some light on the “magic” behind vector search and may be helpful for anyone who is not yet well-versed in the current vector hype. It can be particularly useful for developers who work with this technology in their companies and wish to gain a better understanding of how it functions.

Recap of basics

The concept of embeddings and vector search heavily relies on the Neural Network concept and its terminology. To comprehend this topic, it is essential to have a basic understanding of vectors, matrix operations, model weights and biases, as well as loss functions. If these ideas are unfamiliar to you, I recommend reading a brilliant article that covers the basics of Neural Networks.

In the real world, this example would involve much more complex architectures and training strategies. Our data could be far less structured than a simple array of numbers, encompassing various formats such as images, text, audio, graphs, and multimodal structures. Not all layers in a Neural Network are as simple as the one given, the output data can also be very complex, and so are the loss functions for different tasks. The NN architecture itself may range from being as simple as the one mentioned above to being extremely large, involving billions of trainable parameters. However, the underlying principle remains the same.

In simple terms, training a Neural Network means finding the optimal weights that yield the best predictions for a given input.

What is Embedding?

Now, let’s assume that we have trained a Neural Network for our task. During the prediction stage, we feed data to our model, which passes through the network layer-by-layer. As the input data passes through each layer, it undergoes transformations and transitions into a new vector space. This transformed data is referred to as an embedding.

To provide a clearer definition, an embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

In general, any intermediate data representation within the neural network could be referred to as an embedding, but typically we focus on the last flattened layer of the network, excluding the task-specific prediction head (a classification head, for example). Therefore, if we perceive the network as a black box function, it operates in the following manner:

Input data → Black box NN as embedding model → low-dimensional vector (embedding) representing our input data
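To make this concrete, here is a minimal sketch of that “black box” view using the sentence-transformers package; the library and model name are just one possible choice of mine, not something the article prescribes.

```python
from sentence_transformers import SentenceTransformer

# Any pretrained encoder works here; "all-MiniLM-L6-v2" is just a small, popular example.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["a photo of a cat", "a photo of a dog", "quarterly revenue report"]
embeddings = model.encode(sentences)   # low-dimensional vectors, shape (3, 384)

print(embeddings.shape)
```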

Intuitive Explanation

Normally, embeddings are saved as vectors. Each vector has its own position in the vector space, and neighbouring points are more similar to each other and thus tend to fall into the same groups. This phenomenon, often referred to as “semantic proximity,” is a desirable property of embeddings: words, images or other input data with related meanings or similar information tend to be located close together in the embedding space. For example, the embeddings of “cat” and “dog” are expected to be closer to each other than the embeddings of “cat” and “car.” The reason similar embeddings represent similar objects is that the training algorithm captures and encodes the inherent relationships and similarities present in the training data. Therefore, we often use embeddings for community detection or clustering with the help of similarity measures such as cosine similarity, and they are widely used in classification tasks as well.

To illustrate the idea mentioned above, we can use dimensionality reduction techniques (PCA, UMAP, etc.) to reduce the high-dimensional vectors to 2D/3D and draw the points on a plot. Here is a blog that specifically shows you how to achieve that in a 3D space:
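A minimal 2D version of that idea, assuming some embedding matrix is already available (random numbers stand in for real embeddings here), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 100 points in a 384-dimensional space.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))

# Project down to 2D purely for plotting; UMAP or t-SNE would work the same way.
points_2d = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1], s=10)
plt.title("Embeddings projected to 2D with PCA")
plt.show()
```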

Why embeddings?

OK, now we see what embeddings actually are, but why do we need them? The most important point is that an embedding is a valid vector representation of your data.

First, current machine learning models still favour numerical values as inputs. However, when working on computer vision or voice recognition, we are unlikely to be able to collect purely numerical data for our features and targets/dependent variables.

Second, embeddings help reduce dimensionality. One may argue that one-hot encoding is the standard way to handle categorical variables; however, in today’s data science world it has proven to be much less effective than learned embeddings.

Now consider the following scenario: we are researching consumer feedback for three products. We would only have one variable for each observation, the review content. We can build a term-document matrix and then feed it into a classifier or some other algorithm. However, suppose we have 50 thousand reviews for each product, and the number of unique words in the corpus is one million. Then we will end up with a matrix of shape (150K x 1M). This is a ridiculously large input for any model, and that is when we need to bring in the idea of embeddings.
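To put rough numbers on that, here is a back-of-the-envelope sketch; the 300-dimensional embedding size is just an illustrative choice of mine.

```python
# Back-of-the-envelope memory comparison, assuming dense float32 storage (4 bytes per value).
n_reviews, vocab_size, emb_dim = 150_000, 1_000_000, 300

term_document_gb = n_reviews * vocab_size * 4 / 1e9   # dense term-document matrix: ~600 GB
embedding_gb = n_reviews * emb_dim * 4 / 1e9           # 300-dim embeddings: ~0.18 GB

print(f"dense term-document matrix: ~{term_document_gb:,.0f} GB")
print(f"300-dim embeddings:         ~{embedding_gb:.2f} GB")
```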

The third reason is to reduce complexity, which is essentially an extension of the second reason. Embeddings can also help translate very complex information into a vector of numbers. Here is an example of social network analysis:

Converting a graph to an embedding

Initially, we collected the data from social media and converted it into a social network analysis graph. In the graph, we can use the distance between the nodes and the color of the ties to interpret the similarities between the nodes. However, it is complicated and hard to read. Right now, we only have 9 nodes in the graph, and it is already a mess. Can you imagine what would happen if we were to investigate 100 nodes? This is referred to as complex (high-dimensional) data. However, by using certain techniques to aid in dimensionality reduction, we can transform the graph into a list of embeddings. As a result, instead of the jumbled graph, we now have a new, clean “dictionary” for the nodes. We can use the “dictionary” to make a human-readable visualisation.

Various Applications of Embeddings

Having discovered what embeddings are, you may be curious to see some of their popular applications. In real-world tasks, we usually have different NN architectures for different data types and tasks. Here are just a couple of basic network architectures that are widely used today, often as a backbone for more specific networks or for extracting embeddings from data:

Vector search

Concept

Now, let’s assume that we have a general understanding of how Neural Networks work and how to transform our input data into low-dimensional vectors. Let’s consider how we can apply this knowledge to a real-world problem, such as a text-image search task. To clarify, the objective is to provide our users with a list of the most relevant images based on a given text query.

The first and most obvious approach would be to treat this task as a classification problem. Each text query would represent a different class, and for each image, we would aim to predict the probability of it corresponding to a particular class (request).

If we have a limited and consistent categorisation map, this would be a simple and efficient solution. However, what if we have a huge number of classes? Or if we are unaware of all the potential classes? In some cases, a user’s request may be entirely unique, leaving us with the following options:

  • Retraining the model with a new class (which is impossible if there are numerous classes)
  • Utilising a “similar” class instead of the exact request (although this may result in a decrease in quality)
  • Using an alternative method

The idea is to move away from the setup of a classification task and instead focus on the last layer embeddings of the Neural Network. As we know, if properly trained, these embeddings can represent aggregated data information in a low-dimensional vector space. Therefore, the concept is to obtain image embeddings for the images and obtain text embeddings of the same size for the request text. The goal is to determine if they are similar or not.

However, a challenge arises because the vector spaces for these two types of embeddings are not the same. Even if two similar images have close image embeddings and two similar requests have close text embeddings, it does not necessarily mean that the image and text embeddings will be close when the request is relevant to the image. Hence, our task is to design a joint model that can be trained to transform both types of input data into a joint vector space. In this vector space, the input request should be close to a relevant image based on a similarity measure such as cosine distance.

To train such a NN, special loss functions like Contrastive Loss are utilised. This loss function is calculated for each request-image pair used in training. For a more in-depth understanding of Contrastive Loss and its alternatives, I invite you to refer to my previous article.
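As a rough illustration, here is a minimal sketch of a pairwise contrastive loss; real text-image models are often trained with batched variants such as InfoNCE, and the margin value below is arbitrary.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, label, margin=1.0):
    """label = 1 for a relevant (matching) text-image pair, 0 otherwise."""
    d = F.pairwise_distance(text_emb, image_emb)          # Euclidean distance per pair
    positive = label * d.pow(2)                            # pull matching pairs together
    negative = (1 - label) * F.relu(margin - d).pow(2)     # push mismatched pairs apart
    return (positive + negative).mean()
```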

Alright, let’s now consider the scenario where we have trained such a model. At this point, our embeddings are close for images that are relevant to users’ requests, based on a certain measure.

Now, if we have a database filled with numerous images, we could calculate the embeddings of all of these images. Additionally, we can calculate the embedding of an input request. By computing the cosine similarity between the input request embedding and the embeddings of each image, we can identify the most similar images based on the measure used. This approach would certainly work well for a small number of images.
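In code, this brute-force approach is just a cosine similarity against every stored embedding, something like the sketch below (array shapes and names are illustrative):

```python
import numpy as np

def top_k_images(query_emb, image_embs, k=5):
    """Exact (brute-force) nearest-neighbour search with cosine similarity."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ query_emb              # cosine similarity with every image
    return np.argsort(-scores)[:k]               # indices of the k most similar images
```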

However, what if we are dealing with a database containing billions of images and we require our recommendations to be real-time?

What if data is huge?

If you need to calculate the cosine similarity between a 128-dimensional request embedding and 4 billion 128-dimensional image embeddings, it will take roughly 256 operations per image, around 10¹² operations in total, or about 6 minutes on a hypothetical 3 GHz processor. That seems like a lot, and it is definitely not real time… So what can we do? Let’s find out how to avoid comparing the request embedding with every embedding in the database.
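Here is the arithmetic behind that estimate, as a quick sanity check (assuming one operation per clock cycle, which is of course a simplification):

```python
n_images = 4_000_000_000           # 4 billion stored image embeddings
dim = 128
ops_per_image = 2 * dim            # ~128 multiplications + ~128 additions per dot product
total_ops = n_images * ops_per_image
seconds = total_ops / 3e9          # hypothetical 3 GHz CPU doing one operation per cycle
print(f"{total_ops:.2e} operations, ~{seconds / 60:.1f} minutes")   # ~1.02e+12, ~5.7 minutes
```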

There are lots of different approximate search algorithms, but here we will briefly cover HNSW, the Hierarchical Navigable Small World algorithm, which now appears to be an industry standard for large-scale vector search. It has many different variations and is implemented in many search frameworks such as QDrant, Pinecone, Milvus and others.

HNSW gentle introduction

Probability Skip List

The probability skip list was introduced way back in 1990 by William Pugh. It allows fast search like a sorted array, while using a linked list structure for easy (and fast) insertion of new elements.

Skip lists work by building several layers of linked lists. On the first layer, we find links that skip many intermediate nodes/vertices. As we move down the layers, the number of skips by each link is decreased.

To search a skip list, we start at the highest layer with the longest skips and move along the edges towards the right (see the image below). If we find that the current node’s key is greater than the key we are searching for, we know we have overshot our target, so we go back to the previous node and drop down to the next layer.

Skip list (image by Pinecone)
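Here is a minimal sketch of that search procedure (insertion, which randomly assigns a level to each new node, is omitted; the node layout and function names are made up for illustration):

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)      # one "next" pointer per layer

def skiplist_search(head, key, max_level):
    node = head
    for level in range(max_level, -1, -1):       # start at the top layer (longest skips)
        while node.forward[level] is not None and node.forward[level].key < key:
            node = node.forward[level]           # move right until we would overshoot
        # overshot (or reached the end of this layer): stay here and drop down one layer
    candidate = node.forward[0]
    return candidate is not None and candidate.key == key
```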

HNSW inherits the same layered format with longer edges in the highest layers (for fast search) and shorter edges in the lower layers (for accurate search).

Navigable Small World Graphs

Vector search using Navigable Small World (NSW) graphs was introduced over the course of several papers from 2011 to 2014. The idea is that if we take a proximity graph but build it so that we have both long-range and short-range links, then search times are reduced to logarithmic complexity.

Each vertex in the graph connects to several other vertices. We call these connected vertices friends, and each vertex keeps a friend list, creating our graph.

When searching an NSW graph, we begin at a pre-defined entry-point. This entry point connects to several nearby vertices. We identify which of these vertices is the closest to our query vector and move there.

NSW graph example (image by Pinecone)

We repeat the greedy-routing search process of moving from vertex to vertex by identifying the nearest neighbouring vertices in each friend list. Eventually, we will find no nearer vertices than our current vertex — this is a local minimum and acts as our stopping condition.

Because our stopping condition is finding no nearer vertices in the current vertex’s friend list, we are more likely to hit a local minimum and stop too early when vertices have fewer links (a nearer vertex is simply less likely to be reachable).

To minimise the probability of stopping early (and increase recall), we can increase the average degree of vertices, but this increases network complexity (and search time). So we need to balance the average degree of vertices between recall and search speed.
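A sketch of that greedy routing might look as follows, with the graph stored as per-vertex friend lists (the data structures and names here are illustrative, not a reference implementation):

```python
import numpy as np

def greedy_search(vectors, friends, query, entry_point):
    """Greedy routing over an NSW-style graph.

    vectors: dict node_id -> np.ndarray (the stored embeddings)
    friends: dict node_id -> list of neighbouring node ids (friend lists)
    """
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in friends[current]:
            d = np.linalg.norm(vectors[neighbour] - query)
            if d < best_dist:
                best, best_dist = neighbour, d
        if best == current:          # local minimum: no friend is closer than we are
            return current
        current, current_dist = best, best_dist
```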

Creating HNSW

HNSW is a natural evolution of NSW, which borrows the idea of a hierarchical multi-layer structure from Pugh’s probability skip list.

Adding hierarchy to NSW produces a graph where links are separated across different layers. At the top layer, we have the longest links, and at the bottom layer, we have the shortest.

Spreading nodes of HNSW graph over different layers (image by Pinecone)

During the search, we enter the top layer, where we find the longest links.

We traverse edges in each layer just as we did for NSW, greedily moving to the nearest vertex until we find a local minimum. Unlike NSW, at this point, we shift to the current vertex in a lower layer and begin searching again. We repeat this process until finding the local minimum of our bottom layer — layer 0.

HNSW search algorithm visualisation (image by Pinecone)

Graph Construction

Nodes in HNSW are inserted sequentially, one by one. Every node is randomly assigned an integer l indicating the maximum layer at which this node can be present in the graph. For example, if l = 1, then the node can only be found on layers 0 and 1. The authors select l randomly for each node from an exponentially decaying probability distribution normalised by the non-zero multiplier mL (mL = 0 results in a single layer in HNSW and non-optimised search complexity). Normally, the majority of l values should be equal to 0, so most of the nodes are present only on the lowest level. Larger values of mL increase the probability of a node appearing on higher layers.

L: number of layers; l: the maximum layer at which a node can be present in the graph
Choosing a layer for a new node (image by Pinecone)
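In code, this layer assignment looks roughly like the sketch below; the default mL corresponds to the paper’s suggestion of 1/ln(M) with M = 16.

```python
import math
import random

def random_level(mL=0.36):
    """Draw the maximum layer l for a new node.

    Exponentially decaying distribution; the HNSW paper suggests mL = 1 / ln(M).
    """
    u = 1.0 - random.random()          # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * mL)
```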

Graph construction starts at the top layer. After entering the graph, the algorithm greedily traverses the edges, finding the ef nearest neighbours to our inserted vector q (at this point, ef = 1).

After finding the local minimum, it moves down to the next layer (just as is done during search). This process is repeated until reaching our chosen insertion layer (l). Here begins phase two of construction.

The ef value is then increased to ef_construction (a parameter we set), meaning more nearest neighbours will be returned. In phase two, these nearest neighbours serve as candidates for links to the newly inserted element q and as entry points to the next layer.

M neighbours are added as links from these candidates; the most straightforward selection criterion is to choose the closest vectors. As construction works through multiple iterations, two more parameters are considered when adding links: Mₘₐₓ, which defines the maximum number of links a vertex can have, and Mₘₐₓ₀, which defines the same for vertices in layer 0.

Rebuilding HNSW graph vertices after adding a new node (image by Pinecone)

The insertion stopping condition is reaching the local minimum in layer 0.
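To tie the pieces together, here is a rough usage sketch with hnswlib, one popular open-source HNSW implementation; the parameter values below are illustrative, not tuned recommendations.

```python
import hnswlib
import numpy as np

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # M ~ max links per node
index.add_items(data, np.arange(n))

index.set_ef(50)                          # search-time ef: higher = better recall, slower
labels, distances = index.knn_query(data[:1], k=5)
```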

Vector databases and frameworks overview

Most of today’s vector search solutions incorporate implementations of HNSW or similar algorithms. Here is a brief list of the most popular solutions and their algorithms:

  • QDrant: Custom HNSW implementation in Rust.
  • Milvus: Allows multiple ANN algorithm based indexes: FAISS, ANNOY, HNSW, RNSG.
  • Pinecone: Exact KNN powered by FAISS; ANN powered by proprietary algorithm. All major distance metrics are supported: cosine (default), dot product and Euclidean.
  • Vespa: HNSW (modified for realtime CRUD and metadata filtering); a suite of reranking and dense retrieval methods.
  • Weaviate: custom-implemented HNSW, tuned to scale, and to support full CRUD. The system supports plug-in ANN algorithms as long as they can do CRUD.
  • Vald: Based on NGT, which its authors report to be faster than many strong algorithms such as ScaNN and HNSW.

Take a look at a brilliant article written by Dmitri Khan comparing most of the current solutions.

Benchmarks could also be helpful:

Some vector databases like QDrant also allow you to combine vector search with additional filtering, which might be crucial if you need the best results not over the whole dataset but only over a part of it, for example, the most relevant images for a request among those posted during the last month.
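As a rough illustration only: a filtered search with the qdrant-client Python package might look like the sketch below. The collection name and payload field are hypothetical, and the exact client API may differ between versions, so treat this as an assumption rather than a reference.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, Range

client = QdrantClient(url="http://localhost:6333")
query_embedding = [0.1] * 128        # stand-in for the real request embedding

hits = client.search(
    collection_name="images",                        # hypothetical collection name
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="posted_at",        # hypothetical payload field (unix time)
                             range=Range(gte=1_685_577_600))]
    ),
    limit=10,
)
```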

Conclusion

I hope this article has helped to shed some light on the vector search concept, its algorithms, and their practical applications.

Stay safe!

Links
