Exploring Distance Metrics for Vector Embedding Clustering

Csakash
7 min read · Sep 9, 2024


Part 1.1 of vector embedding essentials

Check out the blog for a clearer understanding: link

In machine learning and vector embeddings, one of the key tasks is to group similar vectors together — a process known as clustering. To do this effectively, you must define a way to measure how “similar” or “close” different vectors are to each other. This measure is often calculated as the distance between two points in a multidimensional space.

In this blog, we’ll explore various types of distance metrics used in vector embeddings, discuss how to calculate them mathematically, and demonstrate technical applications of these distances, with worked examples to make the topic clearer.

Note: the formulas below are written in LaTeX notation, so use a LaTeX renderer to read them.

1. Euclidean Distance

The Euclidean distance is perhaps the most intuitive way to measure the distance between two points in a vector space. It is simply the straight-line distance in Euclidean (n-dimensional) space.

Formula
For two vectors \( \mathbf{X} = (x_1, x_2, …, x_n) \) and \( \mathbf{Y} = (y_1, y_2, …, y_n) \), the Euclidean distance is calculated as:

\[
d(\mathbf{X}, \mathbf{Y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]

Example
Let’s calculate the Euclidean distance between two points:
- \( \mathbf{X} = (2, 3) \)
- \( \mathbf{Y} = (5, 7) \)

\[
d(\mathbf{X}, \mathbf{Y}) = \sqrt{(5-2)^2 + (7-3)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
\]
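As a quick sanity check, here is a minimal NumPy sketch of the same calculation, using the example points above:

```python
import numpy as np

X = np.array([2, 3])
Y = np.array([5, 7])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((X - Y) ** 2))
print(euclidean)              # 5.0
print(np.linalg.norm(X - Y))  # same result via NumPy's norm
```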

2. Manhattan Distance (L1 Distance)

The Manhattan distance, also called L1 distance, is the sum of the absolute differences between corresponding coordinates of two vectors. It is often referred to as the “city block” distance since it mimics the movement along streets laid out in a grid.

Formula
\[
d(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{n} |x_i - y_i|
\]

Example
Using the same points \( \mathbf{X} = (2, 3) \) and \( \mathbf{Y} = (5, 7) \):

\[
d(\mathbf{X}, \mathbf{Y}) = |5-2| + |7-3| = 3 + 4 = 7
\]
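The same pair of points in a minimal NumPy sketch:

```python
import numpy as np

X = np.array([2, 3])
Y = np.array([5, 7])

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(X - Y))
print(manhattan)  # 7
```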

3. Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. Unlike Euclidean or Manhattan distances, cosine similarity focuses on the direction of the vectors rather than their magnitude. This is particularly useful for high-dimensional data like text embeddings, where the magnitude of the vectors can vary widely.

Formula
\[
\text{Cosine Similarity}(\mathbf{X}, \mathbf{Y}) = \frac{\mathbf{X} \cdot \mathbf{Y}}{||\mathbf{X}|| \, ||\mathbf{Y}||}
\]

Example
For vectors \( \mathbf{X} = (2, 3) \) and \( \mathbf{Y} = (5, 7) \):

1. Dot product: \( \mathbf{X} \cdot \mathbf{Y} = 2 \times 5 + 3 \times 7 = 31 \)
2. Magnitude of \( \mathbf{X} \): \( ||\mathbf{X}|| = \sqrt{2^2 + 3^2} = \sqrt{13} \)
3. Magnitude of \( \mathbf{Y} \): \( ||\mathbf{Y}|| = \sqrt{5^2 + 7^2} = \sqrt{74} \)

Cosine similarity = \( \frac{31}{\sqrt{13} \times \sqrt{74}} \approx 0.999 \)
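A minimal NumPy sketch of the same computation (note that for clustering, cosine distance is usually taken as one minus the similarity):

```python
import numpy as np

X = np.array([2, 3])
Y = np.array([5, 7])

# Cosine similarity: dot product divided by the product of the magnitudes
cos_sim = np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
print(round(cos_sim, 4))  # 0.9995

# Cosine distance, as used by many clustering libraries
cos_dist = 1 - cos_sim
```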

4. Hamming Distance

Hamming distance is used for comparing two binary vectors. It counts the number of positions at which the corresponding elements are different.

Formula
\[
d(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i)
\]
Where \( \mathbf{1}(x_i \neq y_i) \) is 1 if \( x_i \neq y_i \) and 0 otherwise.

Example
For binary vectors \( \mathbf{X} = (1, 0, 1, 1) \) and \( \mathbf{Y} = (1, 1, 0, 1) \):

\[
d(\mathbf{X}, \mathbf{Y}) = 0 + 1 + 1 + 0 = 2
\]
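A minimal NumPy sketch:

```python
import numpy as np

X = np.array([1, 0, 1, 1])
Y = np.array([1, 1, 0, 1])

# Hamming distance: count of positions where the vectors differ
hamming = np.sum(X != Y)
print(hamming)  # 2
```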

5. Minkowski Distance

The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. It includes a parameter \( p \) that defines the distance type:

- When \( p = 1 \), it’s equivalent to Manhattan distance.
- When \( p = 2 \), it’s equivalent to Euclidean distance.
- Other values of \( p \) create different distance metrics.

Formula
\[
d(\mathbf{X}, \mathbf{Y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
\]

Example
For \( \mathbf{X} = (2, 3) \), \( \mathbf{Y} = (5, 7) \), and \( p = 3 \):

\[
d(\mathbf{X}, \mathbf{Y}) = \left( |5-2|^3 + |7-3|^3 \right)^{1/3} = (27 + 64)^{1/3} = 91^{1/3} \approx 4.50
\]
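A minimal NumPy sketch that also confirms the p = 1 and p = 2 special cases:

```python
import numpy as np

X = np.array([2, 3])
Y = np.array([5, 7])

def minkowski(x, y, p):
    # Minkowski distance: p-th root of the sum of |differences|^p
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(X, Y, 1))            # 7.0  -> Manhattan
print(minkowski(X, Y, 2))            # 5.0  -> Euclidean
print(round(minkowski(X, Y, 3), 2))  # 4.5
```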

6. Jaccard Similarity

Jaccard similarity is used to measure the similarity between two sets. It calculates the ratio of the intersection of two sets to their union.

Formula
\[
\text{Jaccard Similarity}(\mathbf{X}, \mathbf{Y}) = \frac{|\mathbf{X} \cap \mathbf{Y}|}{|\mathbf{X} \cup \mathbf{Y}|}
\]

Example
For binary vectors \( \mathbf{X} = (1, 1, 0, 0) \) and \( \mathbf{Y} = (1, 0, 1, 0) \), treat each position as set membership: the intersection counts positions where both vectors are 1, and the union counts positions where at least one is 1.

\[
|\mathbf{X} \cap \mathbf{Y}| = 1, \quad |\mathbf{X} \cup \mathbf{Y}| = 3
\]

Jaccard similarity = \( \frac{1}{3} \approx 0.33 \).
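A minimal NumPy sketch for binary vectors:

```python
import numpy as np

X = np.array([1, 1, 0, 0])
Y = np.array([1, 0, 1, 0])

# Intersection: positions where both vectors are 1
# Union: positions where at least one vector is 1
intersection = np.sum((X == 1) & (Y == 1))
union = np.sum((X == 1) | (Y == 1))
print(intersection / union)  # 0.333...
```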

Applications of Distance Metrics in Clustering

When applying these metrics to vector embeddings in clustering (such as K-means, DBSCAN, etc.), each method works better for different types of data:

  • Euclidean Distance: Works well in dense, low-dimensional spaces.
  • Manhattan Distance: Best suited for grid-like data structures.
  • Cosine Similarity: Commonly used in high-dimensional spaces (like document vectors).
  • Hamming Distance: Great for comparing binary strings.
  • Minkowski Distance: General-purpose, flexible metric depending on \( p \).
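To see how the choice of metric changes which neighbors are returned, here is a small sketch using scikit-learn's NearestNeighbors on toy 2-D points (the data is illustrative only):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D "embeddings"; in practice these would come from a model
points = np.array([[2, 3], [5, 7], [1, 1], [6, 6], [2, 8]])
query = np.array([[3, 4]])

# The same query can have different nearest neighbors under different metrics
for metric in ["euclidean", "manhattan", "cosine"]:
    nn = NearestNeighbors(n_neighbors=2, metric=metric).fit(points)
    distances, indices = nn.kneighbors(query)
    print(f"{metric:10s} neighbors={indices[0]} distances={np.round(distances[0], 3)}")
```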

How Distance Metrics Help in Vector Databases, Vector Embedding, and Indexing

Distance metrics play a vital role in the organization, retrieval, and comparison of high-dimensional data, especially in the context of vector databases, vector embeddings, and indexing systems. Here’s how these metrics are integrated into these areas:

1. Vector Databases

Vector databases store and manage high-dimensional vector data (such as text, image, or audio embeddings). The ability to perform efficient similarity searches, such as finding the closest vectors to a query vector, is critical in these systems.

Role of Distance Metrics

Similarity Search: When you search for similar vectors in a vector database (e.g., finding images similar to a query image), the database uses a distance metric like Euclidean distance or cosine similarity to rank vectors by their proximity to the query.

Nearest Neighbor Search (k-NN): Vector databases frequently perform k-nearest neighbor (k-NN) queries. The choice of distance metric (Euclidean, Cosine, etc.) directly impacts the results, as it defines how the nearest vectors are selected.

For example, if you’re storing word embeddings in a vector database, cosine similarity might be used to find words with similar meanings by measuring the angle between vectors in high-dimensional space.

Example Use Case
Image Search: In a database where each image is represented as a high-dimensional embedding, a user uploads a query image, and the system retrieves images that are closest to the query based on Euclidean distance between embeddings.
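At its core, a similarity search is a ranking of stored embeddings by distance to the query. The sketch below shows the brute-force version of this idea in NumPy; random vectors stand in for real image embeddings, and production databases use indexes rather than scanning every row:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.random((1_000, 128))  # 1,000 stored embeddings, 128 dimensions each
query = rng.random(128)        # embedding of the query image

# Rank all stored vectors by Euclidean distance to the query (smaller = more similar)
dists = np.linalg.norm(db - query, axis=1)
top_k = np.argsort(dists)[:5]  # ids of the 5 closest stored items

print(top_k, np.round(dists[top_k], 3))
```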

2. Vector Embedding

Vector embeddings are representations of objects (such as words, images, or audio) in a continuous, dense vector space. These embeddings capture the semantic or structural relationships between objects, where similar objects have vectors close to each other.

Role of Distance Metrics

Clustering Embeddings: Distance metrics are crucial for clustering embeddings. For example, when performing K-Means Clustering on word embeddings, Euclidean distance can help to group similar words into clusters based on their proximity in the vector space.
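A minimal sketch of this idea with scikit-learn's KMeans (random vectors stand in for real word embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.random((200, 50))  # 200 "word embeddings", 50 dimensions each

# K-Means assigns each embedding to the cluster whose centroid is closest
# in (squared) Euclidean distance
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(embeddings)
print(kmeans.labels_[:10])          # cluster id of the first 10 embeddings
```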

Measuring Semantic Similarity: In text embeddings (like word2vec or BERT), cosine similarity is often used to determine the similarity between words or sentences. This enables tasks like semantic search or recommendation systems.

For example, in recommendation systems, embeddings of user preferences and product features can be compared using cosine similarity to suggest the most relevant items.

Example Use Case
Text Embedding for Search: A search engine might embed all its documents into vector form. When a user submits a search query, the system embeds the query and retrieves the top documents that are closest to the query vector, often using cosine similarity.

3. Indexing

In large-scale vector databases, searching through millions or billions of vectors to find similar ones can be computationally expensive. To address this, vector indexing methods like LSH (Locality Sensitive Hashing), HNSW (Hierarchical Navigable Small World graphs), or Annoy (Approximate Nearest Neighbors Oh Yeah) are used to improve the efficiency of similarity search.

Role of Distance Metrics
Efficient Search: Indexing algorithms use distance metrics like Euclidean distance or Manhattan distance to organize vectors into efficient data structures (such as trees or graphs) that allow for fast retrieval of nearest neighbors.

Approximate Nearest Neighbor (ANN): In high-dimensional spaces, exact nearest neighbor searches can be very slow. Indexing methods often rely on approximate methods to find vectors that are “close enough” to the query vector, using distance metrics to balance speed and accuracy.

For example, Annoy uses Euclidean distance and Angular distance (cosine similarity) to create trees that partition the vector space and enable quick nearest-neighbor searches.
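As an illustration, here is a small sketch of building and querying an Annoy index, assuming the `annoy` package is installed; random vectors stand in for real embeddings:

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

dim = 64
index = AnnoyIndex(dim, "angular")  # "angular" is Annoy's cosine-style metric

rng = np.random.default_rng(0)
for i in range(1_000):
    index.add_item(i, rng.random(dim).tolist())

index.build(10)  # 10 trees; more trees = better recall, slower build
query = rng.random(dim).tolist()
print(index.get_nns_by_vector(query, 5))  # ids of the 5 approximate nearest neighbors
```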

Example Use Case
Real-time Recommendations: In e-commerce, a system might need to recommend products in real-time based on the user’s past behavior. A vector index using HNSW can retrieve similar products quickly by using cosine similarity to compare vectors, ensuring fast and relevant recommendations.
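A sketch of the same pattern with the `hnswlib` package, which implements HNSW (the data here is random and purely illustrative):

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim = 64
rng = np.random.default_rng(0)
product_vectors = rng.random((10_000, dim)).astype(np.float32)

# Build an HNSW graph index over the product embeddings using cosine distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=product_vectors.shape[0], ef_construction=200, M=16)
index.add_items(product_vectors)

user_vector = rng.random(dim).astype(np.float32)
labels, distances = index.knn_query(user_vector, k=5)  # 5 most similar products
print(labels, distances)
```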

4. Improving Accuracy with Different Distance Metrics

Depending on the nature of the data, the choice of distance metric can significantly impact the performance of vector embeddings, indexing, and databases:

High-dimensional data (e.g., word embeddings) often benefits from using cosine similarity because it measures the angle between vectors, focusing on the direction rather than magnitude.

Dense, continuous data (e.g., image embeddings) may be better suited for Euclidean distance, as this measures the absolute difference in vector magnitudes.

Example Use Case
In a multimedia search engine, storing image and text embeddings, using the right distance metric for each type of data can lead to improved search accuracy and user satisfaction. Cosine similarity might be used for text, while Euclidean distance could be used for images to deliver better search results based on the nature of the data.

5. Multi-Metric Use Cases

In some systems, multiple distance metrics are used depending on the task or type of data. For example:
Hybrid systems may use cosine similarity for text-based searches and Euclidean distance for visual searches, depending on whether the task involves semantic similarity or physical resemblance.

Hierarchical clustering algorithms can switch between different distance metrics at various levels of granularity, starting with Manhattan distance for coarse comparisons and refining results with Minkowski distance for more precise grouping.

Conclusion

In vector databases, vector embeddings, and indexing systems, distance metrics serve as the foundation for organizing, comparing, and retrieving data. The choice of metric — whether it be Euclidean, Manhattan, Cosine, or Hamming — has profound implications for performance, accuracy, and efficiency. Understanding the right distance metric for a given task is critical for building powerful machine learning applications and large-scale search systems.

By selecting and implementing the appropriate distance measure, vector-based systems can optimize tasks like similarity search, recommendation, clustering, and efficient indexing of high-dimensional data.
