Cosine Similarity: Are you heading in the right direction?

Today, I want to introduce a very simple, yet powerful, technique for recommendation — Cosine Similarity. Before discussing it, however, an important model — the Vector Space Model — has to be introduced.

Vector Space Model

Generally, recommendation systems, especially text-based ones, revolve around examining the “closeness” between two pieces of content. That is, answering the question: are they talking about, or representing, the same thing? That question is inherently semantic. Unfortunately, assigning meaning to a list of characters (which is basically what a word is) with minimal human training is a tough challenge, one that leading scientists are still trying to solve effectively and accurately today. Meanwhile, there are numerous powerful mathematical and statistical techniques in the Data Mining community that cannot be applied directly, because they are semantics-free.

The Vector Space Model (VSM) is a powerful way to approach such semantic problems. Instead of looking at a text article through the meaning of each word, VSM treats the article as a whole as a bag of words, where each word occurs independently of the others. If we treat each word as carrying its own independent meaning, then the number of times a word occurs indicates how far the article leans toward that word’s meaning. With that analogy, we can map an article onto a multi-dimensional space as a vector, with each coordinate based on how often the corresponding word occurs. Furthermore, a collection of articles, such as the web, becomes a collection of vectors. Suddenly, all the powerful and mature linear algebraic techniques from the Data Mining community can be used.
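As a minimal sketch of that mapping (the `to_vector` helper and its naive whitespace tokenizer are assumptions made for illustration — a real system would also handle punctuation, stemming, and stop words):

```python
from collections import Counter

def to_vector(text):
    # One dimension per distinct word; the coordinate is the word's count.
    # Naive lowercase + whitespace tokenization, purely for illustration.
    return Counter(text.lower().split())

doc = to_vector("the cat sat on the mat")
print(doc["the"])  # 2 -- the article leans twice as hard toward "the"
```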

Now, one intuition for expressing the similarity between two vectors might be to compare their end points. For example, it sounds like a great idea to compare two articles based on the Euclidean distance between their end points. However, that approach’s flaw is exposed when one article is a copy of the other, but with the whole text duplicated multiple times. That is, if one article contains “x, y, z” as its text, the other contains “x, y, z, x, y, z, x, y, z”. In this case, the Euclidean approach fails: even though the two are exactly the same semantically, their vectors are 2√3 (about 3.46) units apart.
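A quick sketch of that failure, with word-count dicts standing in for vectors (the `euclidean` helper is hypothetical, written just for this example):

```python
import math

def euclidean(u, v):
    # Distance between the two end points, over the union of both word sets;
    # a word missing from one article counts as 0 in that dimension.
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0) - v.get(w, 0)) ** 2 for w in words))

a = {"x": 1, "y": 1, "z": 1}   # "x, y, z"
b = {"x": 3, "y": 3, "z": 3}   # the same text written out three times
print(euclidean(a, b))  # 3.464... -- "far apart", despite identical meaning
```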

Cosine Similarity

To remedy that flaw, another approach is introduced. Cosine Similarity (CosSim) accounts for the ratio between the different words — x:y:z in the previous example — while discarding each article’s total word count. In other words, CosSim effectively normalizes every article vector to a magnitude of 1 unit while maintaining the ratios between the words. The key to enforcing that is to compare the angles of the vectors: this way, we factor out the repeated sets of words without having to normalize each article vector explicitly.

How about a meta-analogy? Imagine a blank paper as a pole sticking straight out of the ground: it is free from influence, because no words indicate what it is really about. As we write words onto the paper, each word acts as a rope pulling the pole in its own direction. If we happen to use a word more than once, the pull of that word’s rope grows proportionally stronger. When we finish writing, the final tilt of the pole represents what the article is about. Now imagine two articles, and thus two poles, stemming from the same point on the ground. One way to express how close these two poles are is the angle between the two tilted poles. That is cosine similarity.

Okay, I think I have whetted your appetite enough. Now that I’ve explained the reasoning behind using VSM and CosSim, I can reveal the greatest magic behind cosine similarity. The equation for calculating CosSim between two article vectors is A · B / (||A|| ||B||): the dot product of vectors A and B, divided by the product of the magnitude (length) of vector A and the magnitude of vector B. That’s it!
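The formula translates almost line for line into code. A minimal sketch, using plain Python lists as dense word-count vectors and the duplicated-article example from earlier:

```python
import math

def cos_sim(a, b):
    # Dot product of A and B, divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

a = [1, 1, 1]  # "x, y, z"
b = [3, 3, 3]  # "x, y, z" repeated three times
print(round(cos_sim(a, b), 6))  # 1.0 -- a perfect match; the repetition factors out
```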

There are many tools in all manner of programming languages to perform these operations; if not, they are very easy to implement as well. The specific difficulty I’ve come across while implementing this method is transforming the sparse representation of each article, which takes fewer resources, into one that encompasses all dimensions (words) from both article vectors. Yet even this challenge is easily overcome using set operations, such as uniting the two sets of words.
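A sketch of that idea, under the assumption that each article is stored sparsely as a word-count dict: uniting the two key sets supplies every dimension present in either article, with missing words counting as 0.

```python
import math

def sparse_cos_sim(u, v):
    words = set(u) | set(v)  # unite the two sets of words: all dimensions
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in words)
    mag_u = math.sqrt(sum(c * c for c in u.values()))
    mag_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (mag_u * mag_v)

a = {"cosine": 2, "similarity": 1}
b = {"cosine": 1, "distance": 1}
print(round(sparse_cos_sim(a, b), 3))  # 0.632 -- they overlap on one word
```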

Cosine similarity, however, is by no means the end game for recommendation. It is not the best solution, and many state-of-the-art techniques outshine CosSim in terms of accuracy. Its true strength lies in its simplicity and flexibility. More often than not, it will not give you the best accuracy you are seeking, but it will provide an acceptable solution with minimal effort.

Originally published at