Cosine distance and cosine similarity
- Okay, Milana, there is a mistake: cosine similarity cannot be negative.
- Oh, but it can.
Let’s discuss a few questions about cosine similarity and cosine distance, two of the most important concepts in NLP.
What is cosine similarity?
Cosine similarity is a measure of how similar two vectors (word, sentence, or feature embeddings) are to each other. Essentially, it is the cosine of the angle between the two vectors.
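For two vectors v and w it is computed as the dot product divided by the product of their lengths:
cosine similarity = (v · w) / (||v|| * ||w||)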
What is the range for cosine similarity?
The range of cosine similarity is from -1 to 1: -1 means the vectors are completely opposite ("python" vs "security of code"), 0 means they are orthogonal, i.e. no correlation ("university knowledge" vs "work"), and 1 means they point in the same direction ("chatgpt" vs "hype"). It can again be explained with angles: picture two vectors pointing in opposite directions. The angle between them is 180 degrees, so the cosine is equal to -1.
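A quick sanity check with toy 2-D vectors (the vectors are made up just to hit the three cases):
import numpy as np

def cos_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 0.0])
print(cos_sim(v, np.array([-1.0, 0.0])))  # -1.0 -> opposite directions (180 degrees)
print(cos_sim(v, np.array([0.0, 1.0])))   #  0.0 -> orthogonal, no correlation (90 degrees)
print(cos_sim(v, np.array([2.0, 0.0])))   #  1.0 -> same direction, length does not matter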
What is cosine distance?
cosine distance = 1 - cosine similarity
The range of cosine distance is from 0 to 2: 0 means identical vectors, 1 means no correlation, and 2 means completely opposite vectors.
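For reference, scipy.spatial.distance.cosine returns exactly this quantity, 1 - cosine similarity; a minimal sketch with the same toy vectors as above:
import numpy as np
from scipy.spatial.distance import cosine  # cosine distance, i.e. 1 - cosine similarity

v = np.array([1.0, 0.0])
print(cosine(v, np.array([1.0, 0.0])))   # 0.0 -> identical vectors
print(cosine(v, np.array([0.0, 1.0])))   # 1.0 -> no correlation
print(cosine(v, np.array([-1.0, 0.0])))  # 2.0 -> completely opposite vectors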
Why use cosine distance?
While cosine similarity measures how similar two vectors are, cosine distance measures how different they are.
In real applications, which function to choose depends on the task: you can use cosine similarity as part of a loss function or as a distance measure for clustering.
In PyTorch, you can check nn.CosineSimilarity and nn.CosineEmbeddingLoss; I personally use the latter as a loss function to learn embeddings of pairs.
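A minimal sketch of nn.CosineEmbeddingLoss on a batch of pairs (the batch size, embedding dimension, and margin below are arbitrary assumptions, not values from this article):
import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.5)  # margin is a hyperparameter; 0.5 is just an example

# Pretend these are encoder outputs for 8 pairs of texts, 128-dimensional each
emb_a = torch.randn(8, 128, requires_grad=True)
emb_b = torch.randn(8, 128, requires_grad=True)

# target = 1 for pairs that should be similar, -1 for pairs that should be pushed apart
target = torch.tensor([1., 1., -1., 1., -1., -1., 1., -1.])

loss = loss_fn(emb_a, emb_b, target)
loss.backward()
print(loss.item())

# nn.CosineSimilarity simply returns the per-pair similarities instead of a loss
print(nn.CosineSimilarity(dim=1)(emb_a, emb_b).shape)  # torch.Size([8])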
Normalisation, how does it affect cosine similarity?
Well, it depends… on the type of normalisation you use. If you use the normalisation technique we discussed previously, dividing a vector by its length, you will see no difference, because cosine similarity only depends on the direction of the vectors. However, if you normalise your embeddings with Z-score normalisation, (v - mean) / std, you will get different results, since subtracting the mean shifts the vectors and changes their direction. Just keep it in mind.
import numpy as np

v = np.array([1, 2, 3, 4, 5])
w = np.array([6, 7, 3, 1, 8])

# Cosine similarity of the raw vectors
cos_before_norm = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(f'Cosine similarity before normalisation is {cos_before_norm}')

# Z-score normalisation: subtract the mean and divide by the standard deviation
normalised_v = (v - np.mean(v)) / np.std(v)
normalised_w = (w - np.mean(w)) / np.std(w)

cos_after_norm = np.dot(normalised_v, normalised_w) / (np.linalg.norm(normalised_v) * np.linalg.norm(normalised_w))
print(f'Cosine similarity after Z-score normalisation is {cos_after_norm}')

# Cosine similarity before normalisation is 0.7806258942254461
# Cosine similarity after Z-score normalisation is -0.10846522890932814
v = np.array([1, 2, 3, 4, 5])
w = np.array([6, 7, 3, 1, 8])

# Cosine similarity of the raw vectors
cos_before_norm = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(f'Cosine similarity before normalisation is {cos_before_norm}')

# L2 normalisation: divide each vector by its length (the direction is unchanged)
normalised_v = v / np.linalg.norm(v)
normalised_w = w / np.linalg.norm(w)

cos_after_norm = np.dot(normalised_v, normalised_w) / (np.linalg.norm(normalised_v) * np.linalg.norm(normalised_w))
print(f'Cosine similarity after normalisation with length is {cos_after_norm}')

# Cosine similarity before normalisation is 0.7806258942254461
# Cosine similarity after normalisation with length is 0.780625894225446
Soft cosine measure
The soft cosine measure is one of the common modifications, proposed in 2014. The basic idea is to use an additional similarity matrix over all features of the vectors. I consider it an efficient method for the TF-IDF case: for instance, when we need to capture that words such as "play" and "game" are close in a semantic sense, which plain cosine similarity cannot do for sentences like "a player will play a game they like to play" and "they play the game they like". The similarity matrix can be constructed from word-embedding similarities, Levenshtein distance, or whatever you want.
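A rough NumPy sketch of the idea (the toy vocabulary and the values in the similarity matrix S are made up for illustration):
import numpy as np

def soft_cosine(a, b, S):
    # Soft cosine: weight every pair of features (i, j) by their similarity S[i, j]
    num = a @ S @ b
    denom = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return num / denom

# Toy vocabulary ["play", "game", "cat"]; "play" and "game" are assumed
# to be semantically close (similarity 0.8), "cat" is unrelated to both.
S = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
a = np.array([1.0, 0.0, 0.0])  # counts for a sentence that only contains "play"
b = np.array([0.0, 1.0, 0.0])  # counts for a sentence that only contains "game"

plain_cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(plain_cos)             # 0.0 -> plain cosine sees no overlap at all
print(soft_cosine(a, b, S))  # 0.8 -> soft cosine captures that "play" and "game" are related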
A take-away message: cosine similarity is a very important concept, but you should not use it without checking how your vectors are normalised and whether semantic similarity between features needs to be taken into account.
Thank you for your time! Ask questions in the comments and let’s keep in touch on LinkedIn.
If you found the information helpful and want to thank me in other ways, you can buy me a coffee.