The Significance of Distance and Similarity measures in Clustering

Shivani Shekhawat
Published in AI Skunks
Mar 14, 2023

Abstract

Clustering is an unsupervised learning technique that helps in finding natural groupings in data. It is widely used in various fields such as marketing, biology, and social sciences. However, clustering algorithms require the use of distance or similarity measures to determine the similarity between data points. In this article, we will explore the different distance and similarity measures used in clustering and how they can impact the clustering results. We will also provide real-life examples and use Python to demonstrate these measures with a dataset.

Introduction:

Imagine you are a fashion retailer trying to understand the purchasing habits of your customers. By applying clustering techniques to your customer data, you can group together customers who share similar buying patterns, such as those who prefer high-end fashion, those who frequently buy discounted items, or those who favor athletic wear. This can help you to tailor your marketing efforts and product offerings to different customer segments, increasing sales and customer satisfaction. Clustering is the process of dividing a set of objects into groups or clusters based on their similarities. The similarity between objects is typically measured using a distance or similarity measure. The choice of distance or similarity measure can greatly impact the clustering results, as different measures can lead to different clusters. In this article, we will explore the different distance and similarity measures used in clustering and their impact on the clustering results.

Metric and Metric Spaces:

A metric is simply the mathematical notion of a distance function, with certain well-behaved properties.

Given a set X, we say X is a metric space if it comes equipped with a special function d(x, y) that can compute the distance between any two points x, y of X. Specifically, d must satisfy the axioms of a metric.

d: X×X→R

must satisfy the following properties for any choice of elements x, y, z ∈ X.

  • d(x, y) ≥ 0 (non-negativity),
  • d(x, y) = 0 ⟺ x = y (identity of indiscernibles),
  • d(x, y) = d(y, x) (symmetry),
  • d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

The first two bullets say that the distance between any two things can never be negative (hence "non-negativity") and that the distance between two things can only be zero if those two things are actually the same thing (the "identity of indiscernibles").

The third bullet is a matter of perspective: the distance function reads as "the distance between x and y," and this shouldn't change based on which element comes first in the sentence. This is the "symmetry" condition.

The triangle inequality says that the lengths of the edges of triangles make sense if you measure them with d. Thinking of x, y, and z as the vertices of a triangle, we don't want the length of one edge to be longer than the combined lengths of the other two edges; it's a basic fact of Euclidean geometry that such a triangle cannot be drawn.
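As a quick sanity check, the following sketch (plain Python, standard Euclidean distance) numerically verifies the four axioms on a handful of random points; the helper function and the random test points are purely illustrative.

import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Draw a few random 3-D points and check the metric axioms on every pair/triple
points = [tuple(random.uniform(-10, 10) for _ in range(3)) for _ in range(20)]
for x in points:
    for y in points:
        assert euclidean(x, y) >= 0                              # non-negativity
        assert math.isclose(euclidean(x, y), euclidean(y, x))    # symmetry
        assert (euclidean(x, y) == 0) == (x == y)                # identity of indiscernibles
        for z in points:
            # triangle inequality (tiny tolerance for floating-point rounding)
            assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-9
print("All metric axioms hold for these sample points.")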

Clustering Data Structures:

The dissimilarity/similarity matrix is calculated by iterating over each element and computing its dissimilarity/similarity to every other element. Let A be a dissimilarity matrix of size N×N and B a set of N elements; A[i, j] is the dissimilarity (or similarity) between elements B[i] and B[j].

for i = 1 to N do
    for j = 1 to N do
        A[i, j] = Dissimilarity(B[i], B[j]) // or Similarity(B[i], B[j])
    end-for
end-for
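For concreteness, here is a minimal Python/NumPy sketch of the same loop; the dissimilarity function below is just a placeholder (plain Euclidean distance) for whichever measure you choose.

import numpy as np

def dissimilarity(x, y):
    # Placeholder measure: Euclidean distance between two vectors
    return np.linalg.norm(x - y)

B = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # N elements
N = len(B)
A = np.zeros((N, N))                                 # N x N dissimilarity matrix
for i in range(N):
    for j in range(N):
        A[i, j] = dissimilarity(B[i], B[j])
print(A)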

Distance and Similarity Measures:

Clustering algorithms are built on similarity and distance (dissimilarity) measures. Distance is usually preferred for quantitative data, while similarity is preferred for qualitative data. The distance metric is central to clustering techniques, and choosing the appropriate metric for a particular dataset is a difficult task: distance-based clustering algorithms use these measures to group comparable data points into the same cluster while placing dissimilar, far-away points into different clusters. Similarity measures are usually discussed in two- or three-dimensional spaces, and their behavior on high-dimensional datasets is far less well studied. An effective distance metric improves the performance of a machine learning model, whether for classification or clustering.

“Distance is a way to measure similarity, and similarity is a way to measure distance.” — Xiaowei Xu

Distance Measures:

Distance measures are used to determine the dissimilarity between two data points. There are several distance measures commonly used in clustering, let’s explore some of these measures:

  1. Euclidean Distance: This is the most common distance measure used in clustering. It is based on the Pythagorean theorem and gives the shortest (straight-line) distance between two points. Euclidean distance works well for data with continuous variables, such as height, weight, or age. If p = (p1, p2, …, pn) and q = (q1, q2, …, qn) are two points in Euclidean n-space, then the distance d from p to q (or from q to p) is given by the Pythagorean formula:

d(p, q) = √((q1 − p1)² + (q2 − p2)² + … + (qn − pn)²)

Calculating the Euclidean distance using Python's math module:

import math

def euclidean_distance(point1, point2):
    # Calculate the sum of the squared differences between the coordinates
    squared_diffs = [(point1[i] - point2[i]) ** 2 for i in range(len(point1))]
    sum_squared_diffs = sum(squared_diffs)
    # Take the square root of the sum to get the Euclidean distance
    distance = math.sqrt(sum_squared_diffs)
    return distance

point1 = (1, 2, 3)
point2 = (4, 5, 6)
distance = euclidean_distance(point1, point2)
print(distance)

Output: 5.196152422706632

2. Manhattan Distance: Also known as city block or taxicab distance, this measure sums the absolute differences between coordinates. Manhattan distance can be used for data with discrete variables, such as the number of bedrooms in a house, and is particularly suited to vectors that describe objects on a uniform grid, like a chessboard or city blocks. The "taxicab" name reflects the intuition for what the measure computes: the shortest path a taxicab would take between city blocks (coordinates on the grid). For two vectors in an integer feature space, it often makes more sense to calculate Manhattan distance than Euclidean distance.

Manhattan Distance Formula: d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
#create function to calculate Manhattan distance
def manhattan(a, b):
    return sum(abs(val1 - val2) for val1, val2 in zip(a, b))

#define vectors
A = [2, 4, 4, 6]
B = [5, 5, 7, 8]

#calculate Manhattan distance between vectors
manhattan(A, B)

Output: 9

3. Cosine Similarity: This similarity measure is commonly used for text-based data or other high-dimensional data. It measures the cosine of the angle between two vectors in a multidimensional space. Cosine similarity ranges from -1 to 1, with 1 indicating that the vectors point in the same direction, 0 indicating that they are orthogonal (perpendicular) to each other, and -1 indicating that they point in opposite directions.

It follows from the dot product formula: a · b = ∥a∥ ∥b∥ cos θ, so cos θ = (a · b) / (∥a∥ ∥b∥).

Cosine similarity cares only about the angle between the two vectors, not the distance between them. Assume there is another vector c pointing in the same direction as b but much farther away.

What would be the cosine similarity between b and c? It is 1, since the angle between b and c is 0 and cos(0) = 1.

Even though the distance between b and c is large compared to the distance between a and b, cosine similarity depends only on the direction of the vectors, not on their magnitudes.

import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# toy word-count vectors for three example sentences
sentence_m = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
sentence_h = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])
sentence_w = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])

# We should expect sentence_m and sentence_h to be more similar
print(cos_sim(sentence_m, sentence_h)) # 0.5
print(cos_sim(sentence_m, sentence_w)) # 0.25

Output: 0.5, 0.25

4. Mahalanobis Distance: This measures the distance between a point and a distribution. It takes the covariance of the data into account and is useful when the data is non-spherical or has correlated dimensions. Formally, the Mahalanobis distance from a point x to a distribution with mean μ and covariance matrix S is d(x) = √((x − μ)ᵀ S⁻¹ (x − μ)).
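As a rough sketch, SciPy's scipy.spatial.distance.mahalanobis function (which expects the inverse covariance matrix as its third argument) can be used like this; the toy data below is purely illustrative.

import numpy as np
from scipy.spatial.distance import mahalanobis

# Toy 2-D data with correlated dimensions (illustrative only)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500)

mu = X.mean(axis=0)                                # the distribution's mean
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

point = np.array([2.0, 1.0])
print(mahalanobis(point, mu, inv_cov))  # Mahalanobis distance from the point to the distribution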

Similarity Measures:

Similarity measures are used to determine the similarity between two data points. There are several similarity measures commonly used in clustering, including:

  1. Jaccard Similarity: This similarity measure is commonly used for binary data, such as the presence or absence of a feature. It is defined as the intersection of the two sets divided by the union of the two sets.
J(A, B) = |A ∩ B| / |A ∪ B|
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

#define Jaccard Similarity function
def jaccard(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

#find Jaccard Similarity between the two sets
jaccard(a, b)

Output: 0.4

2. Pearson Correlation: This similarity measure is commonly used for continuous variables and measures the linear correlation between two variables. It ranges from -1 to 1: a value of 1 indicates a perfect positive linear relationship (as one variable increases, the other increases at a constant rate), a value of -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases at a constant rate), and a value of 0 indicates no linear relationship between the two variables.

The Pearson product-moment correlation coefficient is commonly represented by the Greek letter ρ (rho) and is defined as

ρ(X, Y) = cov(X, Y) / (σX σY)

where cov is the covariance and σX, σY are the standard deviations of X and Y. The covariance can be expressed in terms of means and expectations:

cov(X, Y) = E[(X − μX)(Y − μY)]
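As a quick sanity check, NumPy's corrcoef returns the Pearson correlation directly; the small arrays below are made up for illustration.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])    # perfectly linearly related to x
z = np.array([5, 3, 8, 1, 7])     # no clear linear relationship with x

print(np.corrcoef(x, y)[0, 1])    # 1.0 -> perfect positive correlation
print(np.corrcoef(x, z)[0, 1])    # some value between -1 and 1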

3. Hamming Distance: This measure is commonly used for categorical or binary variables. It counts the number of positions at which two equal-length strings (or vectors) differ.
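A minimal sketch of the Hamming distance between two equal-length strings (the example strings are illustrative):

def hamming_distance(s1, s2):
    # Count the positions at which the two sequences differ
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is defined only for equal-length sequences")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))  # 3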

The significance of Measuring Distances and Similarity in Clustering:

  1. Helps to Identify Patterns and Relationships: Measuring distances and similarity between data points is crucial in identifying patterns and relationships between data points. By calculating the distance or similarity between data points, clustering algorithms can group similar data points together, and identify clusters that are not immediately visible.
  2. Improves Accuracy: Measuring distances and similarities between data points can significantly improve the accuracy of clustering algorithms. By accurately measuring the distances or similarities between data points, clustering algorithms can group similar data points together, and minimize the number of misclassified data points.
  3. Reduces Computational Complexity: Choosing an efficient distance or similarity measure helps to keep the computational cost of clustering manageable, allowing algorithms to process large datasets quickly and accurately.
  4. Enables the Use of Different Clustering Algorithms: Measuring distances and similarities between data points enables the use of different clustering algorithms. Different algorithms require different distance or similarity measures, and the ability to measure distances and similarities accurately enables the use of a wide range of clustering algorithms.

Which measure should we choose, then?

The choice of distance metrics is crucial since it has a significant impact on the outcomes of the clustering. The default distance measurement for the majority of clustering tools is the Euclidean distance.

Several dissimilarity measures may be selected depending on the data type and the research question. For instance, the analysis of gene expression data frequently uses correlation-based distance.

Clustering Algorithms:

When choosing a clustering algorithm, you should consider whether the algorithm scales to your dataset. Datasets in machine learning can have millions of examples, but not all clustering algorithms scale efficiently, since many work by computing the similarity between all pairs of examples. We will discuss some of the traditional families of algorithms below.


Partitioning-based

Partitional clustering divides data objects into nonoverlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.

These techniques require the user to specify the number of clusters, indicated by the variable k. Many partitional clustering algorithms work through an iterative process to assign subsets of data points into k clusters. Two examples of partitional clustering algorithms are k-means and k-medoids.

These algorithms are both nondeterministic, meaning they could produce different results from two separate runs even if the runs were based on the same input.

Partitional clustering methods have several strengths:

  1. They work well when clusters have a spherical shape.
  2. They’re scalable with respect to algorithm complexity.

They also have several weaknesses:

  1. They’re not well suited for clusters with complex shapes and different sizes.
  2. They break down when used with clusters of different densities.

K-Means:

K-means is a popular unsupervised learning algorithm used for clustering. The goal of the algorithm is to divide a set of points into K clusters, where each point belongs to the cluster with the nearest mean.

Here’s how the algorithm works:

  1. Initialization: Choose K initial centroids randomly from the set of points.
  2. Assignment step: Assign each point to the nearest centroid, creating K clusters.
  3. Update step: Calculate the mean of the points in each cluster and move the centroid to that mean.
  4. Repeat steps 2 and 3 until the centroids no longer move or a maximum number of iterations is reached.

Choosing the appropriate number of clusters, K, is one of the difficulties with K-means. One popular technique is the elbow method, which plots the sum of squared distances between each point and its assigned centroid as a function of the number of clusters and looks for the point where the curve bends.
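Here is a hedged sketch of the elbow method using scikit-learn's inertia_ attribute (the sum of squared distances to the closest centroid); the synthetic blob data is purely illustrative.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # illustrative data

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the closest centroid

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Sum of squared distances (inertia)')
plt.title('Elbow Method')
plt.show()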

Hierarchical Based Clustering

Hierarchical clustering is a type of clustering algorithm used to group objects based on their similarities or dissimilarities. In hierarchical clustering, objects are grouped together in a hierarchical manner, forming a tree-like structure called a dendrogram. The two main types of hierarchical clustering are agglomerative and divisive clustering.

Agglomerative clustering is a bottom-up approach, where each object initially represents a single cluster, and the algorithm merges the closest clusters together until all objects belong to the same cluster. The distance between clusters is measured using a linkage criterion, such as single-linkage, complete-linkage, or average-linkage.

Divisive clustering, on the other hand, is a top-down approach, where all objects initially belong to a single cluster, and the algorithm recursively splits the cluster into smaller sub-clusters until each object forms its own cluster. Divisive clustering is generally less commonly used than agglomerative clustering, as it is computationally expensive.
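As a brief illustration of agglomerative clustering, the sketch below uses scikit-learn's AgglomerativeClustering and shows where the linkage criterion is chosen; the synthetic blobs are illustrative only.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # illustrative data

# linkage can be 'single', 'complete', 'average', or 'ward' (Ward requires Euclidean distance)
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)
print(labels[:10])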

Density Based

Density-based clustering is a clustering technique that groups together data points that are closely packed together in dense regions of the data space. The basic idea behind density-based clustering is to define a dense region as a region with a high density of data points, and to consider data points that are close to each other as belonging to the same cluster.

The most commonly used density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which works by defining a neighborhood around each data point, based on a specified radius, and then clustering together points that lie within a certain density threshold.

The main advantage of density-based clustering is that it can identify clusters of arbitrary shapes and sizes, and it is less sensitive to outliers and noise in the data than other clustering techniques. However, it can be sensitive to the choice of parameters, such as the radius and density threshold used to define the neighborhoods.
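A minimal DBSCAN sketch with scikit-learn, where eps plays the role of the neighborhood radius and min_samples the density threshold described above; the moon-shaped toy data is illustrative.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighborhood radius, min_samples = density threshold
labels = db.fit_predict(X)
print(set(labels))                    # cluster labels; -1 marks points treated as noise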

Some of the applications of density-based clustering include image segmentation, anomaly detection, and pattern recognition in data mining.

Demonstration:

To demonstrate the use of distance and similarity measures in clustering, we will use the Breast Cancer Wisconsin dataset. The dataset contains 569 instances, each described by 30 numeric features. We will apply clustering algorithms with different distance and similarity measures to group the tumors as malignant (cancerous) or benign (non-cancerous).

Start by importing the libraries and loading the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans, Birch
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
cancer_df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))

We can look at basic information about the dataset using the info() and describe() methods.
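For example:

# Basic information and summary statistics for the dataset
cancer_df.info()
cancer_df.describe()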

Next, let’s visualize the distribution of the target variable using a countplot:

sns.countplot(data=cancer_df, x='target')
plt.title('Distribution of Target Variable')
plt.show()

Finally, let’s visualize the distribution of the features using histograms:

features = cancer_df.columns[:-1]
cancer_malignant = cancer_df[cancer_df['target'] == 0]
cancer_benign = cancer_df[cancer_df['target'] == 1]

fig, axs = plt.subplots(nrows=10, ncols=3, figsize=(20, 50))
for i, feature in enumerate(features):
    ax = axs[i // 3, i % 3]
    ax.hist(cancer_malignant[feature], bins=20, alpha=0.5, color='red', label='Malignant')
    ax.hist(cancer_benign[feature], bins=20, alpha=0.5, color='blue', label='Benign')
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.legend()
plt.show()

This will show us the distribution of each feature in the dataset, separated by the target variable. We can see that some features have different distributions for malignant and benign samples, which indicates that they may be important for clustering analysis.

Note: To see the full output, please refer to the Colab notebook linked at the end of the article.

Feature Selection:

Before performing clustering analysis, it's important to select the relevant features to include in the analysis. One way to do this is to use feature selection methods, such as the SelectKBest method from scikit-learn, which selects the top k features based on their scores under a given statistical test.

from sklearn.feature_selection import SelectKBest, f_classif

X = cancer_df.iloc[:, :-1]
y = cancer_df.iloc[:, -1]

# Select the top 10 features
selector = SelectKBest(f_classif, k=10)
selector.fit(X, y)
X_selected = selector.transform(X)

# Get the names of the selected features
selected_features = X.columns[selector.get_support(indices=True)]
print('Selected features:', selected_features)

This will select the top 10 features based on their F-test scores and transform the dataset to only include these features.

Dimensionality Reduction:

After selecting the relevant features, we can perform dimensionality reduction to reduce the number of features in the dataset and to visualize the data in lower dimensions. One way to do this is to use principal component analysis (PCA), which transforms the dataset into a new set of orthogonal variables that explain the maximum variance in the data.

from sklearn.decomposition import PCA

# Perform PCA on the selected features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_selected)

# Visualize the data in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title('PCA on Breast Cancer Wisconsin Dataset')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()

This will perform PCA on the selected features and visualize the data in 2D, with the colors representing the target variable. We can see that the data is well separated into two clusters based on the target variable.

Clustering Analysis:

After performing feature selection and dimensionality reduction, we can perform clustering analysis on the reduced dataset. One way to do this is to use the k-means algorithm, which partitions the data into k clusters based on the distance between the data points and the cluster centers.

from sklearn.cluster import KMeans

# Perform k-means clustering on the reduced dataset
kmeans = KMeans(n_clusters=2, random_state=42)
y_pred = kmeans.fit_predict(X_pca)

# Visualize the clustering results in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred)
plt.title('k-means Clustering on Breast Cancer Wisconsin Dataset')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()

This will perform k-means clustering on the reduced dataset and visualize the clustering results in 2D, with the colors representing the predicted clusters.

We can see that the k-means algorithm is able to separate the data into two clusters that are similar to the original target variable. However, there is some overlap between the clusters, which indicates that the clustering may not be perfect.

Overall, the feature selection, dimensionality reduction, and clustering analysis give us an understanding of the relevant features in the Breast Cancer Wisconsin dataset and of how well the samples separate into natural groups.

Before running the algorithms with different measures, let's first understand the silhouette score. So what is it?

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.

Here’s an example of how to perform KMeans clustering using Euclidean distance:

# Perform KMeans clustering with Euclidean distance
kmeans_euclidean = KMeans(n_clusters=2)
kmeans_euclidean.fit(X_pca)
y_kmeans_euclidean = kmeans_euclidean.predict(X_pca)

# Visualize the results using PCA
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans_euclidean)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('KMeans Clustering with Euclidean Distance')
plt.show()

# Calculate the silhouette score
silhouette_score_euclidean = silhouette_score(X_pca, y_kmeans_euclidean)
print(f'Silhouette score for KMeans clustering with Euclidean distance: {silhouette_score_euclidean}')

Next, let's perform BIRCH clustering using cosine similarity. Since scikit-learn's Birch operates on feature vectors, here each point is represented by its row of cosine similarities to every other point:

from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(X_pca)

# Perform BIRCH clustering with cosine similarity
birch_cosine = Birch(n_clusters=2, threshold=0.5)
birch_cosine.fit(cosine_sim)
y_birch_cosine = birch_cosine.predict(cosine_sim)

# Visualize the results using PCA
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_birch_cosine)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('BIRCH Clustering with Cosine Similarity')
plt.show()

# Calculate the silhouette score
silhouette_score_cos = silhouette_score(X_pca, y_birch_cosine)
print(f'Silhouette score for BIRCH clustering with cosine similarity: {silhouette_score_cos}')

KMeans with Mahalanobis Distance

scikit-learn's KMeans only supports Euclidean distance internally, so one common workaround is to "whiten" the data with its inverse covariance matrix: Euclidean distance on the whitened data equals Mahalanobis distance (with the global covariance) on the original data.

# KMeans does not accept a custom metric, so we whiten the data instead:
# Euclidean distance in the whitened space = Mahalanobis distance in the original space
cov = np.cov(X_pca, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(cov))
X_whitened = X_pca @ L

# Perform KMeans clustering on the whitened data
kmeans_mahalanobis = KMeans(n_clusters=2)
y_kmeans_mahalanobis = kmeans_mahalanobis.fit_predict(X_whitened)

# Visualize the results in the original PCA space
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans_mahalanobis)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('KMeans Clustering with Mahalanobis Distance')
plt.show()

# Calculate the silhouette score on the whitened data
silhouette_score_mahalanobis = silhouette_score(X_whitened, y_kmeans_mahalanobis)
print(f'Silhouette score for KMeans clustering with Mahalanobis distance: {silhouette_score_mahalanobis}')

Conclusion

In conclusion, distance and similarity measures play an important role in clustering algorithms. The choice of distance or similarity measure can greatly impact the clustering results, and different measures may be more appropriate for different types of data. In this article, we explored several common distance and similarity measures used in clustering and provided real-life examples. We also demonstrated the use of these measures with a dataset of Breast Cancer Wisconsin using Python. Clustering is a powerful technique that can help uncover natural groupings in data and provide insights into complex systems.

References

  1. Scikit-learn official documentation
  2. Towards Data Science. The algorithms were referred directly from the scikit-learn official documentation. Visualization was referred from the Machine Learning with scikit-learn Quick Start Guide and Towards Data Science. The remaining code was written independently.
  3. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144059
  4. https://www-users.cse.umn.edu/~kumar001/dmbook/ch8.pdf
  5. https://machinelearningmastery.com/distance-measures-for-machine-learning/

MIT License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE


Shivani Shekhawat
AI Skunks

MSIS Graduate Student at Northeastern University | Exploring the world of Data