Linear algebra perspective on Cosine Similarity

Zoe Zhu · Published in The Startup · 5 min read · Dec 5, 2019

Lots of clustering and similarity-based methods use distance metrics to measure similarity. For example, KNN is a similarity-based algorithm: the class membership of an observation depends on how similar its features are to those of the training observations. That similarity depends on the distance between observations, which may be calculated using several distance metrics: Manhattan, Minkowski, Euclidean. The most commonly used distance metric is the Euclidean distance. Most of the time we try different distance metrics and choose the one that yields the best model performance and accuracy.
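As a minimal sketch, here is how the three metrics can give different numbers for the same pair of observations (the feature vectors below are made up; scipy's distance helpers are assumed available):

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical feature vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.cityblock(a, b))       # Manhattan distance: 5.0
print(distance.euclidean(a, b))       # Euclidean distance: ~3.61
print(distance.minkowski(a, b, p=3))  # Minkowski distance with p = 3
```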

Euclidean Distance:


The Euclidean distance between data objects A and B is the dotted line w. If we think about it in a linear algebra way, we can get the Euclidean distance of A and B from the norm of the vector w. Subtracting the two vectors gives a vector D that is essentially the same as w, so we can compute the magnitude of w using the Pythagorean theorem or, equivalently, the dot product of D with itself.
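A minimal numpy sketch of this idea, with two made-up vectors A and B:

```python
import numpy as np

A = np.array([2.0, 5.0])
B = np.array([6.0, 2.0])

D = A - B                      # the difference vector; its length equals w
dist_norm = np.linalg.norm(D)  # Euclidean distance as the norm of D
dist_dot = np.sqrt(D.dot(D))   # the same value from the dot product of D with itself

print(dist_norm, dist_dot)     # both print 5.0 (a 3-4-5 right triangle)
```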

Euclidean distance can also be utilized as a measure of similarity, although it is really a measure of dissimilarity: the longer the Euclidean distance, the greater the difference. The measure is influenced by the vectors' magnitudes, so it is widely used when you care about the actual distances/differences between objects.


Cosine Distance:

From a linear algebra perspective, we can get the cosine distance from vector A and B's dot product and the vector norms, where ∥A∥ and ∥B∥ are the norms of A and B. The norm or magnitude of a vector is computed as the Euclidean distance between its two endpoints (from the origin to its tip).
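Written out in the same notation, the cosine similarity is

cos(θ) = (A · B) / (∥A∥ ∥B∥)

and the cosine distance is usually taken as 1 − cos(θ).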

Let us use cosine distance to measure the relationships in a simulated book-rating dataset:

  • A cosine value of 1 means the vectors point in the same direction, i.e. the two persons' ratings are similar. A value of zero means the vectors are orthogonal, i.e. unrelated. A value of -1 means the vectors point in opposite directions (opposed tastes). In the case above, a good approach is to recommend Jessica Black's highly rated books to Desiree, but to recommend the books she dislikes to Betty, because of their distinctive tastes in books. Jessica and Betty are not unrelated; in fact, they are probably very related in book taste, just in a negative way (see the numeric sketch after this list).
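A small numeric sketch of this kind of comparison (the ratings below are made up and stand in for the simulated dataset):

```python
import numpy as np

def cosine_similarity(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical ratings of the same five books
jessica = np.array([5.0, 4.0, 5.0, 1.0, 2.0])
desiree = np.array([4.0, 5.0, 5.0, 2.0, 1.0])
betty   = np.array([1.0, 2.0, 1.0, 5.0, 5.0])

print(cosine_similarity(jessica, desiree))  # ~0.97: very similar taste
print(cosine_similarity(jessica, betty))    # ~0.52: much weaker agreement
```

Note that raw ratings are all non-negative, so the plain cosine never reaches -1 here; centering the ratings first, which is exactly the Pearson correlation discussed below, is what lets opposed tastes show up as negative values.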

Cosine Similarity and Pearson Correlation coefficient:

  • We have talked about the Pearson correlation coefficient a lot; it plays a big part in explaining relationships. The R-squared in linear regression tells us how much variance is explained by our model, and a correlation heatmap can examine multicollinearity issues … Causal links are hard to establish in the real world, so the correlation coefficient seems critical for humans to understand the patterns of the world. Intuitively we know that correlation is how things covary together: when one attribute changes, another changes with it. It is the degree to which two or more attributes or measurements on the same group of elements show a tendency to vary together.
Excuse my handwriting

Cosine similarity is the normalized dot product. The covariance is really the centered average dot product (no normalization), which is unbounded: it varies from negative infinity to positive infinity. The correlation coefficient, however, is the cosine similarity between the centered versions of x and y, and it is bounded between -1 and 1. Pearson correlation is centered cosine similarity. The standard deviation looks very much like a Euclidean distance: up to a factor of 1/√n, it is the norm of the difference between a vector and its mean vector.
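A quick numpy check of this identity, on two made-up vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

x = np.array([2.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 7.0])

pearson = np.corrcoef(x, y)[0, 1]                                # Pearson correlation
centered_cosine = cosine_similarity(x - x.mean(), y - y.mean())  # cosine of centered vectors

print(pearson, centered_cosine)  # the two numbers agree (~0.85)
```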

The difference between the cosine similarity measure and the Pearson coefficient is what each measure is invariant to. If x were shifted to x + 1, the cosine similarity would change. The Pearson correlation, on the other hand, is invariant to both scale and location changes of x and y, so shifting x by +1 or -1 makes no difference.
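The same toy vectors make the point (a sketch, not a proof):

```python
import numpy as np

def cosine_similarity(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

x = np.array([2.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 7.0])

# Shifting x changes the cosine similarity...
print(cosine_similarity(x, y), cosine_similarity(x + 1, y))

# ...but leaves the Pearson correlation unchanged
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x + 1, y)[0, 1])
```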

Cosine Similarity Applied in Data Science/Machine Learning:

Cosine similarity typically works well in certain contexts, especially with text or other sorts of high-dimensional data, but it depends on the problem and data.

  • Recommender systems: as in the example dataset, we can recommend based on people's tastes in books, movies, etc.
  • Text mining/Natural language processing: measuring the similarity between text documents. In text mining we typically use a TF-IDF vectorizer; because the resulting vectors are usually length-normalized, calculating the dot product directly gives us the cosine similarity score. Vectorization deals with extracting features from the dataset; in the text-mining context these are the word frequencies, which are then converted into a vector (see the sketch after this list).
  • … other applications I don’t know about yet.
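A minimal scikit-learn sketch of the TF-IDF idea (the documents are made up; TfidfVectorizer L2-normalizes each row by default, so pairwise dot products are already cosine similarities):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Three toy documents for illustration
docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "linear algebra and cosine similarity",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized TF-IDF vectors
similarities = linear_kernel(tfidf)            # dot products = cosine similarity matrix

print(similarities.round(2))  # high for the two "cat" sentences, near 0 for the third
```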

I hope this article sheds some light on how you think about the correlation coefficient and cosine similarity. It helps me visualize the relationship more intuitively to think of the two as vectors in a vector space and to look at the angle between them.

