Relationship between Cosine Similarity and Euclidean Distance.
Many of us are unaware of the relationship between cosine similarity and Euclidean distance. Knowing it is extremely helpful when we need to use one in place of the other. One application of this idea is converting the K-Means clustering algorithm into Spherical K-Means, where cosine similarity is used as the measure to cluster data.
Use Case
We often want to cluster text documents to discover patterns. K-Means is a natural first choice for this, but scikit-learn's K-Means implementation uses Euclidean distance to cluster similar data points.
It is also well known that cosine similarity gives you a better measure of similarity than Euclidean distance when dealing with text data.
So we may want to run K-Means using cosine distance, which is not possible with the scikit-learn implementation.
We can use a hack: if we can somehow express Euclidean distance as a proportionate measure of cosine distance, then this can be achieved.
Mathematics
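The relationship follows from expanding the squared Euclidean distance as a dot product (a sketch of the standard derivation):

$$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v$$

If $u$ and $v$ are L2-normalised, then $\|u\| = \|v\| = 1$ and $u \cdot v = \text{CosineSimilarity}(u, v)$, so the expression collapses to:

$$\|u - v\|^2 = 2 - 2\,u \cdot v = 2\,\bigl(1 - \text{CosineSimilarity}(u, v)\bigr)$$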
Proof with Code
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

np.random.seed(42)
test_array = np.random.rand(3, 100)

# The raw vectors are not unit length: their squared norms differ.
for item in range(test_array.shape[0]):
    element = test_array[item]
    print(element.transpose().dot(element))
output
30.868488161326475
33.289148886695116
35.31491104309238
Normalizing Vectors
X_normalized = preprocessing.normalize(test_array, norm='l2')
euclidean_dist = euclidean_distances(X_normalized)
squared_euclidean = np.square(euclidean_dist)
print (squared_euclidean)
output
[[0. 0.55794124 0.54552104]
[0.55794124 0. 0.56962493]
[0.54552104 0.56962493 0. ]]
Computing 2 × (1 − cosine similarity)
adjusted_cosine_distance = 2 - 2*cosine_similarity(X_normalized)
print (adjusted_cosine_distance)
output
[[6.66133815e-16 5.57941240e-01 5.45521039e-01]
[5.57941240e-01 2.22044605e-16 5.69624926e-01]
[5.45521039e-01 5.69624926e-01 6.66133815e-16]]
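The two matrices above match entry for entry. Rather than eyeballing the printed values, we can confirm the agreement programmatically with `np.allclose` (a minimal check built on the same arrays as above):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

np.random.seed(42)
test_array = np.random.rand(3, 100)
X_normalized = preprocessing.normalize(test_array, norm='l2')

# Squared Euclidean distance on the normalised vectors.
squared_euclidean = np.square(euclidean_distances(X_normalized))

# The same quantity derived from cosine similarity.
from_cosine = 2 - 2 * cosine_similarity(X_normalized)

# The two matrices agree up to floating-point error.
print(np.allclose(squared_euclidean, from_cosine))  # True
```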
We can see from the outputs above that when the vectors u and v are normalised, there exists a direct relationship between cosine similarity and Euclidean distance.
For normalised vectors:
Squared Euclidean Distance(u, v) = 2 × (1 − Cosine Similarity(u, v))
Squared Euclidean Distance(u, v) = 2 × Cosine Distance(u, v)
Hack: if an algorithm only accepts Euclidean distance as its metric and you want to use cosine distance instead, L2-normalise the input vectors first. The resulting Euclidean distances are a monotonic function of the cosine distances, so you will get the same results as clustering by cosine distance.
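Putting the hack together, here is a minimal sketch of the Spherical K-Means idea from the introduction: normalise the rows, then run ordinary scikit-learn KMeans. The random matrix stands in for real document vectors (e.g. TF-IDF rows), and note that true Spherical K-Means also re-normalises the centroids between iterations, which plain KMeans does not do:

```python
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for TF-IDF document vectors.
np.random.seed(42)
docs = np.random.rand(30, 100)

# L2-normalise each row so Euclidean distance tracks cosine distance.
docs_normalized = preprocessing.normalize(docs, norm='l2')

# Plain scikit-learn KMeans on the normalised rows now clusters
# by (a monotonic function of) cosine distance.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(docs_normalized)
print(labels)
```

Every row keeps its direction but is scaled to unit length, so documents pointing the same way end up close together regardless of their original magnitudes.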
Hopefully the explanation above has clarified the relationship between Euclidean distance and cosine similarity.