Unlocking Insights: Understanding Vector Similarity in Machine Learning Part 1

What Is Vector Similarity?

Subhadeep Choudhury
6 min read · Oct 13, 2023

Vector similarity is like a measure of how alike two things are in a multi-dimensional space. Imagine comparing songs — we turn each song into a vector based on its features and then calculate how close these vectors are. If they point in similar directions, it means the songs are similar; if they point in different directions, they’re different. This idea helps computers make recommendations or find similarities in various things, like music, movies, and games.

Importance Of Vector Similarity In Machine Learning

Vector similarity in machine learning refers to a measure of how similar two vectors are in a multi-dimensional space. Vectors are often used to represent data, and vector similarity is a fundamental concept in various ML tasks, including information retrieval, natural language processing, recommendation systems, and more. Similarity measures are crucial for finding similar documents, recommending products, clustering data, and more.

There are several common similarity measures used in machine learning, and the choice of measure depends on the specific task and the nature of the data. This blog will walk through several similarity metrics and examine their nuances.

The similarity measures that will be covered in this blog are:

  1. Cosine Similarity
  2. Euclidean Distance (L2)
  3. Manhattan Distance (L1)
  4. Jaccard Similarity
  5. Mahalanobis Distance
  6. Hamming Distance
  7. Minkowski Distance

This blog has been divided into two parts. In the first part, we will discuss Cosine Similarity, Euclidean Distance, and Manhattan Distance. In the second part, we will discuss the remaining similarity measures and understand in detail which similarity measure to use in different problem scenarios.


Cosine Similarity

Cosine Similarity measures the cosine of the angle between two vectors. Here’s how it works with a mathematical expression:

Imagine you have two sets of numbers, A and B, which represent two things you want to compare. These sets can be thought of as vectors in a multi-dimensional space. The cosine similarity between these two vectors is calculated using this formula:

Cosine Similarity (A, B) = (A • B) / (||A|| * ||B||)

Now, let’s break this down in layman’s terms:

  1. (A • B): This part calculates the dot product of the two vectors A and B. The dot product is a measure of how much the two vectors “overlap” in direction. If they point in the same direction, the dot product is large; if they are perpendicular, the dot product is small.
  2. ||A|| and ||B||: These represent the magnitudes or lengths of the vectors A and B. In other words, they show how long each vector is.

So, the cosine similarity formula measures how much the two vectors A and B are aligned with each other. Its value ranges from -1 to 1. If the vectors are very similar (i.e., they point in roughly the same direction), the cosine similarity will be close to 1. If they are orthogonal (perpendicular), it will be 0, and if they point in opposite directions, it will be close to -1.

Code Implementation Of Cosine Similarity:

import numpy as np

# Define two vectors A and B
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Calculate the dot product of A and B
dot_product = np.dot(A, B)

# Calculate the magnitudes (lengths) of A and B
magnitude_A = np.linalg.norm(A)
magnitude_B = np.linalg.norm(B)

# Calculate the cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)

print("Cosine Similarity between A and B:", cosine_similarity)
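One property worth noting: cosine similarity depends only on the direction of the vectors, not their magnitude. A quick sketch to demonstrate this (the helper name `cos_sim` is my own, not part of any library):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Scaling B by any positive constant changes its length but not its
# direction, so the cosine similarity stays exactly the same.
print(cos_sim(A, B))
print(cos_sim(A, 10 * B))
```

Both lines print the same value, which is why cosine similarity is popular for text embeddings, where document length inflates vector magnitude without changing meaning.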

Euclidean Distance (L2)

Euclidean Distance is a measure of the “straight-line” distance between two points in space. It’s a way to calculate the length of the shortest path between these two points. This concept is often used in geometry and data analysis.

In layman’s terms, imagine you have two points, A and B, in a two-dimensional space (like a piece of paper). The Euclidean Distance between these points is calculated using this formula:

Euclidean Distance (A, B) = √((x2 - x1)² + (y2 - y1)²)

Here’s what this formula means:

  • (x1, y1): The coordinates of point A.
  • (x2, y2): The coordinates of point B.

The formula calculates the horizontal (x-axis) and vertical (y-axis) differences between the two points, squares these differences, adds them together, and then takes the square root of the sum to get the final Euclidean Distance.

In simple terms, it’s like finding the length of the shortest path between two points if you were to walk in a straight line from A to B on a flat surface. The distance is a measure of how far apart these two points are in a direct, straight-line fashion.

Euclidean Distance vs. Cosine Similarity in Layman's Terms

Analogy: Meeting New People at a Party

Imagine you’re at a party, and you want to figure out how similar you are to other people based on two factors: height and interests.

Euclidean Distance:

  • Think of Euclidean distance as measuring the “crowdedness” of the party floor.
  • Your similarity with someone is like being in a room and comparing how far apart you are in terms of height. The shorter the distance, the more similar you are in height.
  • If you’re looking for people with similar heights to have eye-level conversations, Euclidean distance helps you find them.
  • Euclidean distance doesn’t consider the direction or the angle of the room; it just cares about how close you are in terms of height.

Cosine Similarity:

  • Now, think of cosine similarity as measuring how well you “connect” with people based on shared interests.
  • Instead of considering height, cosine similarity focuses on the direction in which people are looking.
  • If you and someone else are facing in the same direction, you have a lot in common in terms of interests (cosine similarity is high).
  • If you’re at different angles, it means you have different interests (cosine similarity is low).
  • Cosine similarity doesn’t care about the distance between people; it’s purely about the angle or alignment of their interests.
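The analogy above can be made concrete with a pair of vectors that point in exactly the same direction but sit far apart. A minimal sketch (the example values are my own):

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction as a, but much farther from the origin

# Euclidean distance: how far apart the points are
euclidean = np.linalg.norm(a - b)

# Cosine similarity: how aligned the directions are
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # large: the points are far apart
print(cosine)     # 1.0: the vectors point the same way
```

Here Euclidean distance says the two points are very different, while cosine similarity says they are identical in direction. Which answer is "right" depends entirely on whether magnitude matters for your task.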

Code Implementation Of Euclidean Distance

import numpy as np

# Define two data points as NumPy arrays
point1 = np.array([2, 3, 4])
point2 = np.array([1, 1, 1])

# Calculate the Euclidean distance
distance = np.linalg.norm(point1 - point2)

print("Euclidean Distance:", distance)

Manhattan Distance (L1)

Imagine a grid with streets running horizontally and vertically; the Manhattan distance is like finding the shortest path between two points by moving along the streets. You can move only at right angles, and you add up the distance you travel along each street to get the total distance.

For example, if you want to find the Manhattan distance between two points, (3, 5) and (8, 9), you would calculate it like this:

Manhattan Distance = |8 - 3| + |9 - 5| = 5 + 4 = 9

So, the Manhattan distance between these two points is 9 units.

Mathematically, the Manhattan distance between two points (x1, y1) and (x2, y2) is calculated as:

Manhattan Distance = |x2 - x1| + |y2 - y1|

Code Implementation For Manhattan Distance

def manhattan_distance(point1, point2):
    # Make sure both points have the same dimension
    if len(point1) != len(point2):
        raise ValueError("Both points must have the same dimension")

    # Sum the absolute differences along each coordinate
    distance = sum(abs(x - y) for x, y in zip(point1, point2))
    return distance

# Example usage
point1 = (3, 5)
point2 = (8, 9)

distance = manhattan_distance(point1, point2)
print("Manhattan Distance:", distance)
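Since the rest of this post uses NumPy, it may be worth noting that the same result can be obtained with a vectorized one-liner, which scales better to high-dimensional arrays (a sketch using the same example points):

```python
import numpy as np

point1 = np.array([3, 5])
point2 = np.array([8, 9])

# L1 (Manhattan) distance: sum of absolute coordinate differences
distance = np.sum(np.abs(point1 - point2))
print("Manhattan Distance:", distance)  # 9
```

Equivalently, `np.linalg.norm(point1 - point2, ord=1)` computes the same L1 norm of the difference vector.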

In the next part, we will be discussing the remaining similarity metrics.

Follow me:

Linkedin: https://www.linkedin.com/in/subhadeep-choudhury-109b6b197/
