A comparison of cosine similarity vs Euclidean distance in ALS recommendation engine
A complete analysis of finding similar movies in two methods with ALS results
In the recommendation system, a common function is to find similar movies or users by making use of the ALS results. With ALS matrix factorization, we can easily achieve this by using the latent factors it derives.
The problem is which similarity metric to choose: there’re several popular ways/methods to compute the similarity, and among which the two most common methods are cosine similarity and euclidean distance.
Here are the mathematical definitions of each:
Cosine similarity measures the angle between the two vectors:
Euclidean distance measures the distance of two points:
According to my research, it’s suggested to apply cosine similarity when the magnitude of features doesn’t matter or even disrupt the outputs (for example, in finding similar text by word frequency), and apply Euclidean distance when the magnitude is important (for example, if the factors have actual meaning).
However, it’s still confusing for me to understand the actual eFfects simply on the definition or abstract case description, so I decided to explain the different results of the two metrics with some simple examples:
Eg1. Finding similar animals. Cosine similarity may find a cat more similar to a tiger because they are both from the feline family; while Euclidean distance may find a wolf more similar to a tiger because they are both fierce predators of similar size.
Eg2. Finding similar plants. Cosine similarity may find a small yellow flower more similar to a large white flower of the same family, while Euclidean distance may find a large red flower more similar to a large white flower of a different family but similar size.
Eg3. Finding similar songs. In the second link above, the author uses Euclidean distance that will choose songs that have similar values on three factors (valence, energy, and key) in total, while cosine will choose songs that have a similar ratio/distribution across the factors.
Eg4. Finding similar students. A is the top student with exam scores of 100, 100, 100 for English, Maths, Geography respectively. B is a Math genius with scores of 50, 150, 50. C is a normal student with scores of 70, 70, 70. Cosine similarity will find A and C are similar (computed similarity is 1.00>0.87), while Euclidean distance will find A and B are similar (86.60>51.96).
Eg5. Finding similar movies (in theory). Back to our case, based on the above inference, I expect cosine similarity to find movies that have similar distribution on four latent factors (which might imply similarity in genres or structure of the movie, etc.), while Euclidean distance will find those similar in the factor values. (However, since we cannot make sense of its factors, we cannot truly understand how ALS works or predict the results.)
In a nutshell, cosine similarity is the similarity of ratio/scale, while Euclidean distance is the similarity of actual value.
Now it seems better to understand right? (At least this works for me.)
We can see that both methods make sense in a way, so it is up to the user to decide which result can best fulfill their purpose. In the example of finding similar students, if someone values the balance across different subjects, then they would probably prefer cosine similarity; if someone values the outstanding intelligence, they might think Euclidean distance makes more sense.
Now let’s go back to the movie recommendation engine and see how the two methods perform in the case of finding similar movies. (GitHub link to my movie recommendation engine for reference)
To begin with, I would like to assume that if a movie is from a series, then the most similar movies must include other movies in the same series. So I decided to search similar movies for Harry Potter and the Sorcerer’s Stone (2001) and expected to get results containing other Harry Potter movies.
Here’s what I got from both similarity metrics:
Well, maybe later Harry Potter movies had changed drastically from the first one, or at least in the model’s opinion. This could be possibly true since the director and actors changed and the plot developed over the ten years. I am most relieved to see cosine similarity has put HP and the Chamber of Secrets (2002) at the top of its list.
How about The Avengers (2012)? Could ALS find other Marvel blockbuster movies?
Again, we could see that cosine similarity has found two directly related movies of Captain America, and most of the other movies are also of Action/Adventure genres.
On the contrary, Euclidean distance again finds movies I am not familiar with (because I wasn’t a huge movie lover, to be honest), so it’s hard for me to make sense of the results. The only one I know was Ape (2012), which seems to be a blockbuster. And the genres are quite confusing as well.
When comparing the results by Euclidean distance, I was surprised to find that they contain a lot of the same movies for the two different searches! I thought that Harry Potter and Marvel series were not that similar to each other? Maybe someone familiar with these movies could understand Euclidean’s choice.
After all the analysis, I prefer to use cosine similarity to compute similarity scores. However, an ultimate problem still exists: when searching for similar movies, which kind of similarity are people actually searching for? It seems to largely depend on each person’s preference, so the only solution is to let users manually choose which movies are more similar in their opinion.
That is to say, we’d better use A/B testing to decide which method has a better performance in real life. ❦︎
Comments and feedbacks are welcome! ;-)