Vector Search For AI — Part 1 — Vector Similarity Search Algorithms

Serkan Özal
10 min read · Oct 16, 2023


Source: https://weaviate.io/blog/distance-metrics-in-vector-search

Data is key in the fast-evolving field of Artificial Intelligence (AI). Vector similarity search methods and vector databases are crucial tools in this context. Similarity search helps quickly locate and match similar data points within large datasets, which is essential for tasks like image recognition, language processing, and recommendation systems. Vector databases, in turn, efficiently organize and manage these vast and complex sets of data. Together, these two powerful tools support the impressive advancements and abilities we see in today’s AI applications.

Vector similarity search is pivotal for the optimal functionality of vector databases in the diverse landscape of AI. This technique is indispensable for efficiently identifying and retrieving the most relevant vectors from large datasets, thereby facilitating swift and accurate data-driven decisions. With the exploding volume of multi-dimensional data generated daily, vector databases require robust and precise mechanisms for data retrieval. Vector similarity search responds to this need by enabling the efficient comparison and ranking of vectors, significantly reducing the time and computational resources required for querying large datasets. Consequently, the synergy between vector similarity search and vector databases is crucial, with the former providing the necessary retrieval precision and speed that the latter needs to effectively serve AI applications, enhancing their performance and reliability in various tasks and domains.

Vector databases and vector similarity search methodologies have both evolved in tandem with the broader development of computer science, data management, and artificial intelligence. The concept of vectors, foundational to these technologies, originates from linear algebra and has been applied computationally since the advent of digital computers in the mid-20th century.

The development and refinement of vector similarity search techniques have been driven by the growing necessity to analyze and interpret large datasets efficiently. In the 1970s and 1980s, the advent of relational databases prompted researchers to explore innovative data indexing and searching methods, leading to the emergence of early vector space models. These models represented data (like text) as vectors in high-dimensional space, facilitating more effective similarity searches.

Vector databases, by contrast, are a more recent development, with their evolution accelerating in the 21st century due to the unprecedented explosion of data. As various domains started generating massive amounts of multi-dimensional data, the need for specialized databases capable of efficiently storing and querying this data became apparent. Vector databases were designed to meet this demand, offering optimized structures and algorithms specifically tailored for handling vector data.

Over the years, continuous advancements in hardware, algorithms, and data structures have contributed to the enhancement of vector similarity search and vector databases. The ongoing research and innovation in these fields promise further improvements, ensuring these technologies remain integral in the continually expanding and evolving landscape of AI and data science.

In the complex and expansive world of data science, efficient and accurate retrieval of information is crucial. As we delve deeper into an era where data is overwhelmingly abundant and diverse, the significance of vector similarity search methods cannot be overstated. These powerful algorithms, designed to quickly identify and rank the most relevant pieces of data within massive datasets, are fundamental to the success of various applications across fields such as artificial intelligence, machine learning, and information retrieval. In this article, I’ll explore the most popular and effective vector similarity search methods that have become indispensable tools for data scientists and researchers globally. Through understanding these methods, readers will gain insight into the mechanisms that allow for precise and speedy data retrieval, ultimately powering the data-driven technologies we rely on every day.

Algorithms

Dot Product

Overview:

The Dot Product is a straightforward, yet powerful, similarity measure used extensively in machine learning, data mining, and statistics for finding the similarity between two vectors. It’s particularly important in the realm of cosine similarity and is fundamental for algorithms employed in search engines, recommendation systems, and other data-driven applications.

Definition and Formula:

Given two vectors, A and B, the Dot Product is computed as follows:

A = (A₁, A₂, …, Aₙ)
B = (B₁, B₂, …, Bₙ)

A · B = Σᵢ₌₁ⁿ Aᵢ × Bᵢ = A₁B₁ + A₂B₂ + … + AₙBₙ

Applications in Similarity Search:

  • Cosine Similarity Basis: The Dot Product is fundamental for calculating cosine similarity, which is a widely used metric in similarity search. When the vectors are normalized, the dot product essentially provides the cosine of the angle between the two vectors, offering an effective measure of similarity (see the short sketch after this list).
  • Efficient Retrieval: It allows for efficient retrieval of similar vectors in large databases, making it vital for search engines and recommendation systems where speed is crucial.
  • Machine Learning Models: Used in training machine learning models, especially in deep learning, where the dot product operation is employed extensively during the forward and backward propagation stages.
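As a quick illustration of the cosine-similarity connection above, here is a minimal sketch (my own example, not part of the original reference code): once vectors are L2-normalized, the dot product of the normalized vectors equals their cosine similarity, so a plain dot product suffices at query time.

// Illustrative sketch (not from the original post): after L2 normalization,
// dotProduct(l2Normalize(a), l2Normalize(b)) equals cosineSimilarity(a, b).
// Assumes a non-zero input vector.
public float[] l2Normalize(float[] v) {
    double sqrSum = 0.0;
    for (float x : v) {
        sqrSum += x * x;
    }
    double norm = Math.sqrt(sqrSum);
    float[] normalized = new float[v.length];
    for (int i = 0; i < v.length; i++) {
        normalized[i] = (float) (v[i] / norm);
    }
    return normalized;
}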

Advantages:

  • Speed: It’s computationally efficient, providing fast calculations which are crucial for real-time applications.
  • Simplicity: Due to its straightforward formula, it’s easy to implement and understand.

Limitations:

  • Magnitude Sensitivity: The dot product is sensitive to the magnitude of the vectors, so it may not accurately represent similarity when the magnitudes of the vectors being compared differ significantly. For example, (1, 1) and (10, 10) point in the same direction, yet (10, 10) produces a dot product ten times larger against any query vector.

Conclusion:

The Dot Product is a foundational vector similarity search algorithm with widespread applications and relevance in various fields. Its simplicity and efficiency make it a go-to choice for professionals working with large datasets and requiring fast similarity computations. Understanding its characteristics, applications, advantages, and limitations is crucial for effectively leveraging its capabilities in practice.

Reference Implementation:

public double dotProduct(float[] v1, float[] v2) {
    // Accumulate in double to match the return type and the other implementations
    double dotProd = 0.0;
    for (int i = 0; i < v1.length; i++) {
        dotProd += v1[i] * v2[i];
    }
    return dotProd;
}

Cosine Similarity

Overview:

Cosine Similarity is a widely used metric for measuring similarity between two vectors, often employed in the fields of information retrieval, text mining, and machine learning. It gauges the cosine of the angle between two vectors, providing insights into their orientation in a multi-dimensional space.

Definition and Formula:

Given two vectors, A and B, the Cosine Similarity between these vectors is computed as:

A = (A₁, A₂, …, Aₙ)
B = (B₁, B₂, …, Bₙ)

cos(θ) = (A · B) / (‖A‖ × ‖B‖) = Σᵢ₌₁ⁿ Aᵢ Bᵢ / ( √(Σᵢ₌₁ⁿ Aᵢ²) × √(Σᵢ₌₁ⁿ Bᵢ²) )

Applications in Similarity Search:

  • Document Similarity: Often used in natural language processing to measure similarity between texts, aiding in document retrieval and clustering.
  • Recommendation Systems: Cosine Similarity is employed in collaborative filtering to generate recommendations by comparing user or item profiles.
  • Image Comparison: It’s applied in computer vision for comparing feature vectors of images.

Advantages:

  • Angle Measurement: It measures the cosine of the angle between vectors, making it effective for comparing documents of different lengths.
  • Normalization: The metric inherently normalizes vector lengths, making it sensitive to the direction of the data, not the magnitude.

Limitations:

  • Zero Vector Issue: It does not handle zero vectors well, as it becomes undefined.
  • Not a Metric: Cosine Similarity doesn’t satisfy the triangle inequality and therefore is not a true metric.

Conclusion:

Cosine Similarity is a robust and versatile vector similarity search algorithm widely applied across various domains. Though it offers several advantages, like insensitivity to magnitude, understanding its limitations is crucial for effective application. With careful implementation, it remains a valuable tool for professionals working on similarity search, document retrieval, and recommendation systems in the data science and AI fields.

Reference Implementation:

public double cosineSimilarity(float[] v1, float[] v2) {
    double dotProd = 0.0;
    double v1SqrSum = 0.0;
    double v2SqrSum = 0.0;
    for (int i = 0; i < v1.length; i++) {
        dotProd += v1[i] * v2[i];
        v1SqrSum += Math.pow(v1[i], 2);
        v2SqrSum += Math.pow(v2[i], 2);
    }
    return dotProd / (Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum));
}
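Given the zero-vector limitation noted above, a defensive variant (my own sketch, not part of the reference implementation) can guard against the undefined case by checking the norms before dividing:

// Illustrative zero-vector-safe variant: returns 0 when either vector has
// zero norm, instead of producing NaN from a division by zero.
public double safeCosineSimilarity(float[] v1, float[] v2) {
    double dotProd = 0.0;
    double v1SqrSum = 0.0;
    double v2SqrSum = 0.0;
    for (int i = 0; i < v1.length; i++) {
        dotProd += v1[i] * v2[i];
        v1SqrSum += v1[i] * v1[i];
        v2SqrSum += v2[i] * v2[i];
    }
    double normProduct = Math.sqrt(v1SqrSum) * Math.sqrt(v2SqrSum);
    return normProduct == 0.0 ? 0.0 : dotProd / normProduct;
}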

Manhattan Distance (L1 Distance)

Overview:

Manhattan Distance, also known as the L1 norm or City Block Distance, measures the distance between two points as the sum of the absolute differences of their coordinates — the length of a path that moves only along grid lines, resembling the layout of streets in a city. This approach is popular in various domains, including computer science, information theory, and statistics.

Definition and Formula:

Given two vectors, A and B, the Manhattan Distance between these vectors is computed as:

A = (A₁, A₂, …, Aₙ)
B = (B₁, B₂, …, Bₙ)

d(A, B) = Σᵢ₌₁ⁿ |Aᵢ − Bᵢ|

Applications in Similarity Search:

  • Clustering Algorithms: Manhattan Distance is often used in various clustering algorithms where distance measurement between data points is crucial.
  • Classification Tasks: In machine learning, it can be employed as a heuristic function for classification tasks.
  • Image Analysis: This measure is often used in image analysis and computer vision for comparing similarity between images.

Advantages:

  • Relevance in Grid-Like Paths: It’s particularly useful for applications that have grid-like architectures, where diagonal movement isn’t possible.
  • Robustness to Outliers: Manhattan Distance is less sensitive to outliers than Euclidean Distance, as it doesn’t square (and thereby emphasize) large differences.

Limitations:

  • Not the Shortest Distance: It doesn’t always represent the shortest distance between two points, as it’s constrained to movement along grid lines. For example, the Manhattan distance between (0, 0) and (3, 4) is 7, while the straight-line (Euclidean) distance is 5.

Conclusion:

Manhattan Distance offers a straightforward and effective approach for measuring similarity in various applications, particularly where data is organized in a grid-like structure. Though simple, understanding its applications, advantages, and limitations is essential for effectively deploying it in similarity search and analysis tasks. Recognizing when and where to utilize Manhattan Distance will be vital for professionals and researchers engaged in data science, machine learning, and related fields.

Reference Implementation:

public double manhattanDistance(float[] v1, float[] v2) {
    double sumAbsDiff = 0.0;
    for (int i = 0; i < v1.length; i++) {
        sumAbsDiff += Math.abs(v1[i] - v2[i]);
    }
    return sumAbsDiff;
}

Euclidean Distance (L2 Distance)

Overview:

Euclidean Distance is a popular measure of similarity used in data mining, machine learning, and statistics. It calculates the “straight-line” distance between two points in Euclidean space, serving as a straightforward measure of vector dissimilarity.

Definition and Formula:

Given two vectors, A and B, the Euclidean Distance between these vectors is computed as:

A = (A₁, A₂, …, Aₙ)
B = (B₁, B₂, …, Bₙ)

d(A, B) = √( Σᵢ₌₁ⁿ (Aᵢ − Bᵢ)² )

Applications in Similarity Search:

  • Clustering Algorithms: Euclidean Distance is fundamental for clustering algorithms like K-Means, aiding in assigning data points to clusters by minimizing the distance to the cluster center (see the sketch after this list).
  • Image Similarity: It’s often applied in computer vision to find similar images through comparing the distance between feature vectors.
  • Recommendation Systems: Used in recommendation systems, Euclidean Distance helps find similar items or users.
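To make the K-Means connection concrete, here is a minimal sketch of the assignment step (illustrative only; it reuses the euclideanDistance method from the reference implementation below):

// Illustrative K-Means assignment step: each point is assigned to the
// centroid with the smallest Euclidean distance.
public int assignToNearestCentroid(float[] point, float[][] centroids) {
    int nearestIdx = 0;
    double minDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
        double dist = euclideanDistance(point, centroids[i]);
        if (dist < minDist) {
            minDist = dist;
            nearestIdx = i;
        }
    }
    return nearestIdx;
}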

Advantages:

  • Intuitiveness: The measure is straightforward and intuitive, reflecting the physical distance between points in space.
  • Simplicity: With an uncomplicated formula, it’s easy to implement and understand.

Limitations:

  • Scale Sensitivity: Euclidean Distance is sensitive to the scale of features, necessitating feature scaling for accurate results.
  • High Dimensionality Issues: In high-dimensional spaces, distances between points tend to become uniform (the “curse of dimensionality”), making Euclidean Distance less effective.

Conclusion:

Euclidean Distance is a crucial vector similarity search algorithm, finding application across various domains. While simple and intuitive, its limitations, especially with high-dimensional data or features of different scales, must be considered. Understanding its nuances is vital for those looking to effectively implement Euclidean Distance in similarity search tasks and data analysis.

Reference Implementation:

public double euclideanDistance(float[] v1, float[] v2) {
    double sumSqrDiff = 0.0;
    for (int i = 0; i < v1.length; i++) {
        sumSqrDiff += Math.pow(v1[i] - v2[i], 2);
    }
    return Math.sqrt(sumSqrDiff);
}
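A common optimization worth noting (not covered in the reference implementation above): when Euclidean distance is only used to rank or compare candidates, the square root can be skipped, because the square root is monotonic and does not change the ordering.

// Illustrative squared-distance shortcut: safe whenever only relative
// ordering matters, e.g. nearest-neighbor ranking.
public double squaredEuclideanDistance(float[] v1, float[] v2) {
    double sumSqrDiff = 0.0;
    for (int i = 0; i < v1.length; i++) {
        double diff = v1[i] - v2[i];
        sumSqrDiff += diff * diff;
    }
    return sumSqrDiff;
}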

Benchmark

You can find the benchmark code on GitHub here: https://github.com/serkan-ozal/java-vector-distance-benchmark
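The actual benchmark harness lives in the repository linked above; as a rough illustration of the approach (class, method, and parameter names here are mine and may differ from the repository), a JMH benchmark over such distance functions might look like this:

// Hypothetical JMH benchmark sketch (illustrative; see the linked repository
// for the real benchmark code).
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class VectorDistanceBenchmark {

    @Param({"256", "1024"})
    int dimension;

    float[] v1;
    float[] v2;

    @Setup
    public void setup() {
        v1 = randomVector(dimension);
        v2 = randomVector(dimension);
    }

    private static float[] randomVector(int dim) {
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) {
            v[i] = ThreadLocalRandom.current().nextFloat();
        }
        return v;
    }

    @Benchmark
    public double dotProduct() {
        double dotProd = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dotProd += v1[i] * v2[i];
        }
        return dotProd;
    }

    @Benchmark
    public double euclideanDistance() {
        double sumSqrDiff = 0.0;
        for (int i = 0; i < v1.length; i++) {
            double diff = v1[i] - v2[i];
            sumSqrDiff += diff * diff;
        }
        return Math.sqrt(sumSqrDiff);
    }
}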

Environment Configurations

I have run the benchmarks on GitHub Actions using GitHub-hosted runners:


Benchmark Configurations

Benchmark Results

Conclusion

In this blog post, we delved into the world of vector similarity search algorithms, comparing the strengths and weaknesses of “Dot Product”, “Cosine Similarity”, “Manhattan Distance”, and “Euclidean Distance”. The insights gained from this comparative analysis provide a valuable foundation for our next journey — improving the performance of these algorithms. In an era marked by an ever-expanding wealth of data and increasing complexity, optimizing vector similarity search is paramount.

First and foremost, it’s crucial to recognize that no single algorithm reigns supreme in all scenarios. The choice of the most suitable algorithm depends on the nature of the data, the dimensionality, and the specific problem at hand. Therefore, flexibility in algorithm selection is key.

In our forthcoming blog posts, we will explore ways to enhance the performance of these algorithms. This involves fine-tuning parameters, leveraging hardware acceleration, and implementing efficient indexing strategies. We will also delve into techniques like dimensionality reduction, approximate nearest neighbor search, and data preprocessing to further boost their capabilities.

Moreover, the field of vector similarity search is dynamic, with ongoing research and innovation. Staying updated with the latest advancements and emerging algorithms is essential for ensuring your applications stay at the cutting edge.

In conclusion, as we navigate the vast sea of data, these algorithms serve as our guiding stars. Understanding their nuances and continuously seeking ways to optimize their performance will empower us to unlock the true potential of vector similarity search, enabling us to extract meaningful insights, drive innovation, and tackle complex data challenges in our ever-evolving digital landscape. Stay tuned for our next blog post, where we explore the practical steps to enhance the performance of these algorithms and keep them shining bright in the data-driven universe.


Serkan Özal

AWS Serverless Hero | Founder & CTO @ Thundra | Serverless Researcher | JVM Hacker | Oracle OpenSource Contributor | AWS Certified | PhD Candidate