Vector similarity search for backend engineers

Alexander Shmyga
12 min read · Nov 19, 2023


Introduction

For many back-end engineers, the world of vector similarity search might seem distant from their day-to-day software engineering tasks. However, as the popularity of AI/ML models continues to grow, an increasing number of back-end engineers find themselves involved in projects that leverage vector similarity search. The topic can be challenging, yet it has a unique charm, especially when dealing with vast datasets of multidimensional vectors.

This blog post aims to shed light on the mysteries of this often-overlooked subject, offering insights into the fundamental concepts that construct vector similarity search. Join us as we delve into the essential aspects and describe the basics of this intriguing field.

Limitations of traditional search algorithms

Before we jump into the vector search, let’s look into some of the limitations of traditional search:

  1. Scalability: Traditional search algorithms may struggle to scale effectively with the growing volumes of data generated in modern applications. As data sets expand, the time complexity of traditional algorithms can lead to poor performance and slow responsiveness in large-scale systems.
  2. Adaptability to Dynamic Data: In dynamic environments where data is constantly changing, traditional search algorithms may struggle to adapt quickly. Real-time updates or changes in the data may not be efficiently incorporated, impacting the freshness and accuracy of search results.
  3. Handling Ambiguity: Traditional search algorithms face difficulties with words that carry multiple meanings. Consider the term “jaguar,” which could signify the animal or the luxury car brand. A keyword-centric search might struggle to discern these contexts, resulting in mixed or inaccurate outcomes.
  4. Handling Synonyms: These algorithms often overlook pertinent results that incorporate synonyms of the query terms. For instance, a search for “vehicle” might neglect results that use the term “automobile”, “car”, “machine” or “motor”.
  5. Complex Queries: Traditional search mechanisms falter when confronted with sophisticated queries that require understanding the relationships between different terms. For instance, a query like “the best Shakespeare play for a first-time reader” demands an understanding of the user’s intent, a capability traditional keyword-based algorithms lack.

Vector similarity search and related techniques aim to address some of these limitations by providing a more flexible and scalable approach to finding similarities in high-dimensional data, making them particularly relevant in the context of modern data-driven applications.

But before we get to the search, let’s first explore the basics…

The Basics…

What is a vector? Vector Embeddings.

In mathematics and physics, a vector is a term that refers colloquially to quantities that cannot be expressed by a single number (a scalar), or to elements of certain vector spaces.

The AI definition of the same concept:

Vector embeddings are lists of numbers used to represent complex data like text, images, or audio in a numerical format enabling machine learning algorithms to process them.

In simple words, it is an array of numbers, for example [1, 77, 24, 3…] or [0.1245, 0.8599, 0.7712…]

The number of elements in one array is called its dimension. So a vector [0.1245, 0.8599, 0.7712] is a three-dimensional vector.
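A minimal sketch in Python, treating a vector as a plain list of numbers:

```python
# A vector is just an ordered list of numbers.
v = [0.1245, 0.8599, 0.7712]

# Its dimension is simply the number of elements.
dimension = len(v)
print(dimension)  # 3
```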

Vector classifications (based on dimensions).

Based on the number of dimensions, vectors are generally classified in the following way:

  • Low-Dimensional Vectors: typically, vectors in 2D, 3D, or up to around 10 dimensions might be considered low-dimensional.
  • Mid-Dimensional Vectors: vectors with dimensions ranging from around 10 to 100 might be considered mid-dimensional. This range provides more complexity than low-dimensional spaces while avoiding the computational challenges of very high-dimensional spaces.
  • High-Dimensional Vectors: vectors with dimensions exceeding 100, especially those in the hundreds or thousands, are often considered high-dimensional. In machine learning, datasets with features/dimensions in the order of hundreds or more are common.

Some examples:

Imagine creating a personal home library where you aim to organize your books on the shelves according to the similarity of their content. In this case, we can use the concept of embeddings (also referred to as encodings) to help with this. To facilitate this organization, you assign a unique three-digit code to each book, ensuring that books with similar content receive codes that share commonalities. For instance:

  • The Comedy of Errors by William Shakespeare: [1, 2, 3]
  • The Winter’s Tale by William Shakespeare: [1, 2, 4]
  • Statistical Consequences of Fat Tails by Nassim Taleb: [5, 6, 7]

Notice how the codes of the two Shakespeare books are very close to each other, while the book about statistics is completely different. That’s because Taleb’s book has very little in common with the other two.

Vector embeddings can also be used for words, pictures, sounds, and many other things.

It is also worth mentioning that within one context, for search or ordering, the number of dimensions is always the same for all vectors.

Let’s look at another example: some algorithm processes an image (it could be any other sort of data, such as audio or text) and represents it as a vector.

How do you get those embeddings?

The process of obtaining embeddings involves leveraging an embedding model or technique. In most cases, you’re provided with a model or API. You feed your input data into this model or API, and in return, you receive the corresponding embeddings. It’s like having a tool that transforms your data into a numerical format, making it suitable for various machine-learning applications.
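As a hedged sketch, this is roughly what that looks like with the open-source sentence-transformers library (the model name below is just one popular choice; any embedding model or hosted API follows the same feed-data-in, get-vectors-out pattern):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model (one of many available choices).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Feed raw text in, get fixed-size numerical vectors back.
sentences = ["The Comedy of Errors", "Statistical Consequences of Fat Tails"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): two 384-dimensional vectors
```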

How do you calculate the similarity between two vectors?

There are several ways to calculate the similarity between two vectors; the most commonly used are Euclidean distance, cosine similarity, dot product, and Manhattan distance.

In this article, our focus will be solely on Euclidean distance and Cosine similarity.

Euclidean distance.

It is like a straight-line measurement between two points in space. Imagine you have a point A and a point B on a piece of paper. The Euclidean distance is the length of the imaginary straight line connecting these two points. It’s the most direct path from one point to the other.

To calculate it, you might use the Pythagorean theorem, which you may have learned in geometry. This theorem helps find the length of one side of a right-angled triangle when you know the lengths of the other two sides. In the case of Euclidean distance, these “sides” are the differences in the coordinates (like x and y values) of the two points.

So, in essence, Euclidean distance is a way of measuring the straight-line distance between two points, taking into account the differences along each dimension (like horizontal and vertical distances on the paper).

In a multidimensional space, the formula extends to:

d(A, B) = √((a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)²)
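The same calculation as a quick NumPy sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])

# Straight-line (Euclidean) distance: square root of the sum of squared differences.
distance = np.sqrt(np.sum((a - b) ** 2))
# Equivalent shortcut: np.linalg.norm(a - b)

print(distance)  # 1.0
```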

Cosine similarity.

Imagine you have two arrows, A and B, pointing in different directions. If these arrows point in almost the same direction, the cosine similarity between them is high (close to 1). If they are at right angles (perpendicular), the cosine similarity is zero. If they point in completely opposite directions, the cosine similarity is negative (close to −1).

In more technical terms, cosine similarity measures the cosine of the angle between two vectors. Vectors are just a way to represent quantities that have both magnitude and direction, like those arrows. So, when we talk about the cosine similarity between two vectors, we’re essentially looking at how aligned their directions are.

In the context of text or data, each vector could represent the frequency of words in a document, for example. High cosine similarity indicates that the documents are similar, while low cosine similarity suggests dissimilarity. It’s a way to measure similarity that’s robust to the length of the vectors, focusing more on their directional agreement.
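And a quick NumPy sketch of the cosine calculation (the dot product of the vectors divided by the product of their lengths):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])

# Cosine similarity: how aligned the two directions are, ignoring magnitude.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos_sim)  # ~0.99, the vectors point in almost the same direction
```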

Which distance to use?

Deciding between Euclidean distance, cosine similarity, or any other search method hinges on the characteristics of your data and the specific needs of your task. For instance, cosine similarity shines in natural language processing (NLP) and text analysis, emphasizing directional alignment over magnitude. On the other hand, Euclidean distance proves more beneficial in scenarios where the spatial arrangement and absolute magnitude of vectors are pivotal considerations. Don’t be overly concerned if these distinctions aren’t immediately clear — the choice of similarity calculation method is typically determined by AI professionals or departments familiar with the nuances of your data and problem domain.

Now let’s rock the search…

Starting with the popular k Nearest Neighbors search (kNN).

Similarity search is a broad term encompassing various mechanisms that involve searching through vast spaces of objects. This becomes increasingly crucial in the era of large information repositories, especially when dealing with objects lacking a natural order, such as extensive collections of images, sounds, and other complex digital entities.

In simple terms, vector similarity search is the process of calculating distances between an input vector and all available vectors. The results are then ordered by distance, and the top k (requested number) closest vectors are returned to the caller. This is usually called kNN search.

Example:

Let’s revisit our bookshelf analogy. Picture yourself with a new book, “Romeo and Juliet” by William Shakespeare. You assign your book a code and search for similar books to find precisely where it belongs on your bookshelf.

Here’s a simple illustration, where each book’s code is represented by a 2-dimensional vector:

In a basic “brute force algorithm”, you calculate the similarity between your input vector and all existing vectors, choosing the most similar books to determine where your new book should be placed.
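Here is a minimal brute-force sketch in Python, reusing the made-up three-digit book codes from earlier and the Euclidean distance described above (the code for “Romeo and Juliet” is, of course, invented for illustration):

```python
import numpy as np

library = {
    "The Comedy of Errors": np.array([1, 2, 3]),
    "The Winter's Tale": np.array([1, 2, 4]),
    "Statistical Consequences of Fat Tails": np.array([5, 6, 7]),
}

query = np.array([1, 2, 2])  # hypothetical code for "Romeo and Juliet"
k = 2

# Brute force: compute the distance to every vector, then keep the k closest.
scored = [(title, np.linalg.norm(vec - query)) for title, vec in library.items()]
scored.sort(key=lambda pair: pair[1])

print(scored[:k])  # the two most similar books
```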

While this approach is effective, especially in low-dimensional spaces, it has some limitations, such as scalability and cost consequences. High-dimensional vectors, those with hundreds or thousands of dimensions, introduce additional challenges. This is where the curse of dimensionality comes into play, making data increasingly sparse, and computational requirements grow exponentially with the number of dimensions.

For exact kNN search, common index structures like k-d trees (multidimensional binary search trees) and R-trees are employed. These dynamic index structures excel at handling low-dimensional vectors (like 2D or 3D), providing efficient solutions. However, their performance degrades rapidly as the number of dimensions increases, a consequence of the curse of dimensionality.

Approximate Nearest Neighbors search (ANN) algorithms

To address the challenges posed by high-dimensional data, we turn to Approximate Nearest Neighbors (ANN) search algorithms. These algorithms, such as the locality-sensitive hashing (LSH) family, are designed to perform better in higher dimensions, although precision may not be ideal. There are also more sophisticated approaches, like those based on forests of random projection trees or navigable graph structures, that aim to balance precision and efficiency, though they can be complex to implement and maintain.

But don’t lose sleep over all these algorithms unless you have a very specific case. Existing databases like Elasticsearch or Milvus can handle your vector search efficiently.
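For illustration only, here is roughly what an approximate search looks like with Faiss and an HNSW index; the data is random and the parameters are just reasonable defaults, not a tuned setup:

```python
# pip install faiss-cpu numpy
import numpy as np
import faiss

dim = 128
vectors = np.random.random((10_000, dim)).astype("float32")

# HNSW graph index: approximate search that scales to high-dimensional data.
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = number of graph neighbors per node
index.add(vectors)

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)  # 5 approximate nearest neighbors

print(ids[0], distances[0])
```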

When is vector similarity search used?

Vector similarity search is employed in various products and services across different industries. Here are some examples:

E-commerce and Retail:

  • Product Recommendations: Online retail platforms use vector similarity search to suggest products similar to the ones customers have viewed or purchased.
  • Visual Search: E-commerce platforms utilize image embeddings for visual search, enabling users to find products similar to the images they provide.

Search Engines:

  • Content Retrieval: Search engines leverage vector similarity search to improve the accuracy and relevance of search results, especially in semantic search applications.

Social Media Platforms:

  • Content Recommendation: Social media platforms employ vector similarity search to recommend content, friends, or connections based on users’ preferences and interactions.

Video Streaming Services:

  • Content Recommendations: Video streaming platforms use vector similarity search to recommend movies or TV shows based on user viewing history and preferences.

Genomics and Bioinformatics:

  • DNA Sequencing: Genomic research relies on vector similarity search to analyze DNA sequences, identify genetic similarities, and discover variations.

Cybersecurity:

  • Anomaly Detection: Vector similarity search is used in cybersecurity products to detect anomalies in network behavior and identify potential threats.

Content Management Systems:

  • Content Filtering: Content management systems use vector similarity search for efficient content filtering and retrieval based on the similarity of documents.

Recommendation Engines:

  • Personalized Recommendations: Various recommendation engines across industries, such as music streaming services and news aggregators, employ vector similarity search to provide personalized recommendations to users.

Healthcare:

  • Drug Discovery: Vector similarity search aids in drug discovery by analyzing molecular structures and identifying compounds with similar characteristics.

Data Analytics:

  • Clustering and Pattern Recognition: Vector similarity search is applied in data analytics for clustering similar data points and identifying patterns in large datasets.

These examples highlight the versatility of vector similarity search across different domains, demonstrating its effectiveness in enhancing search accuracy, recommendation systems, and data analysis.

Industry libraries and databases that use kNN/ANN search.

Several popular libraries/frameworks and databases are designed for implementing vector similarity search. Here are some notable ones:

Annoy: It is a C++ library with Python bindings that efficiently handles nearest neighbors search, particularly well-suited for large datasets.

Faiss: Developed by Facebook AI Research, Faiss is a highly efficient library with GPU acceleration, specializing in similarity search and clustering of dense vectors.

Milvus: It is an open-source vector database that supports similarity search and is designed to scale with large volumes of vectors.

NMSLIB (Non-Metric Space Library): It is a collection of algorithms for non-metric and metric space indexing and searching. It offers bindings for various programming languages.

Scikit-learn: A prominent Python machine learning library that provides modules for nearest neighbors search, clustering, and dimensionality reduction, making it versatile for vector similarity search.

Elasticsearch: A distributed search and analytics engine that can be configured to handle vector similarity search use cases.

ANN (Approximate Nearest Neighbors): It offers algorithms for approximate nearest neighbor searching, supporting various distance metrics. It is available in C++ with Python bindings.

Gensim: It is a Python library for topic modeling and document similarity analysis. It includes implementations of algorithms for vector space modeling.

Rapids cuML: Part of the NVIDIA RAPIDS ecosystem, it provides GPU-accelerated machine learning algorithms, including those for nearest neighbors search.

Hnswlib: A C++ library with Python bindings that focuses on approximate nearest neighbors search using Hierarchical Navigable Small World (HNSW) graphs.

Azure SQL Database: With the vector search capability in Azure AI Search (previously Azure Cognitive Search), you can easily run vector search over data stored in Azure SQL Database.

These libraries and databases cover a spectrum of use cases, ranging from approximate nearest neighbors searches to scalable vector databases. Depending on your specific needs and programming language preferences, you can choose the library that aligns best with your requirements.
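As a quick example of the “just use a library” advice, here is a sketch with scikit-learn’s NearestNeighbors on random data, using cosine distance (a convenient starting point before reaching for a dedicated vector database):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

vectors = np.random.random((1_000, 64))  # 1,000 stored 64-dimensional vectors
query = np.random.random((1, 64))

# Fit an index and retrieve the 3 nearest neighbors by cosine distance.
nn = NearestNeighbors(n_neighbors=3, metric="cosine")
nn.fit(vectors)
distances, indices = nn.kneighbors(query)

print(indices[0], distances[0])
```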

Wrap up the exploration

I hope you’ve enjoyed our journey into vector similarity search and either gained or refreshed your knowledge of the topic. It’s evident that vector similarity search is not merely a technical concept but a transformative force, reshaping how backend engineers approach data retrieval in our data-driven era. The combination of mathematical vectors with modern search algorithms becomes not just a tool but a necessary asset in the arsenal of forward-thinking backend engineers, guiding us into a future where precision meets scalability in the pursuit of knowledge and innovation.

In the next articles, I’m planning to dive deeper into how to calculate similarity between vectors with some code examples and write a simple brute force vector search application.
