A comparison of cosine similarity vs Euclidean distance in ALS recommendation engine

A complete analysis of finding similar movies in two methods with ALS results

Zishan Cheng
Nerd For Tech
5 min readJun 15, 2021

--

In the recommendation system, a common function is to find similar movies or users by making use of the ALS results. With ALS matrix factorization, we can easily achieve this by using the latent factors it derives.

The problem is which similarity metric to choose: there’re several popular ways/methods to compute the similarity, and among which the two most common methods are cosine similarity and euclidean distance.

Here are the mathematical definitions of each:

Cosine similarity measures the angle between the two vectors (the closer to 1, the more similar):

cosine similarity

Euclidean distance measures the distance of two points (the smaller distance, the more similar):

Euclidean distance

According to my research, it’s suggested to apply cosine similarity when the magnitude of features doesn’t matter or even disrupt the outputs (for example, in finding similar text by word frequency), and apply Euclidean distance when the magnitude is important (for example, if the factors have actual meaning).

However, it’s still confusing for me to understand the actual eFfects simply on the definition or abstract case description, so I decided to explain the different results of the two metrics with some simple examples:

Eg1. Finding similar animals. Cosine similarity may find a cat more similar to a tiger because they are both from the feline family; while Euclidean distance may find a wolf more similar to a tiger because they are both fierce predators of similar size.

Eg2. Finding similar plants. Cosine similarity may find a small yellow flower more similar to a large white flower of the same family, while Euclidean distance may find a large red flower more similar to a large white flower of a different family but similar size.

Eg3. Finding similar songs. In the second link above, the author uses Euclidean distance that will choose songs that have similar values on three factors (valence, energy, and key) in total, while cosine will choose songs that have a similar ratio/distribution across the factors.

Eg4. Finding similar students. A is the top student with exam scores of 100, 100, 100 for English, Maths, Geography respectively. B is a student with scores of 50, 50, 50. C is another student with scores of 70, 70, 70. Cosine similarity will find A and C or A and B are equally similar (computed similarity is 1.00 in both cases), while Euclidean distance will find A and C are more similar (86.60>51.96).

Eg5. Finding similar movies (in theory). Back to our case, based on the above inference, I expect cosine similarity to find movies that have similar distribution on four latent factors (which might imply similarity in genres or structure of the movie, etc.), while Euclidean distance will find those similar in the factor values. (However, since we cannot make sense of its factors, we cannot truly understand how ALS works or predict the results.)

In a nutshell, cosine similarity is the similarity of ratio/scale, while Euclidean distance is the similarity of actual value.

Now it seems better to understand right? (At least this works for me.)

We can see that both methods make sense in a way, so it is up to the user to decide which result can best fulfill their purpose. In the example of finding similar students, if someone values the balance across different subjects, then they would probably prefer cosine similarity; if someone values the outstanding intelligence, they might think Euclidean distance makes more sense.

Now let’s go back to the movie recommendation engine and see how the two methods perform in the case of finding similar movies. (GitHub link to my movie recommendation engine for reference)

To begin with, I would like to assume that if a movie is from a series, then the most similar movies must include other movies in the same series. So I decided to search similar movies for Harry Potter and the Sorcerer’s Stone (2001) and expected to get results containing other Harry Potter movies.

Here’s what I got from both similarity metrics:

top 20 similar movies by cosine similarity
top 20 similar movies by Euclidean distance

Well, maybe later Harry Potter movies had changed drastically from the earlier ones, or at least in the model’s opinion. This could be possibly true since the director and actors changed and the plot developed over the ten years. Euclidean distance seems to do a better job, as there’re three HP movies in its top 20 list. I am also relieved to see cosine similarity has put HP and the Chamber of Secrets (2002) at the top of its list, though it’s the only one from the HP series on its top 20 list.

How about The Avengers (2012)? Could ALS find other Marvel blockbuster movies?

top 20 similar movies by cosine similarity
top 20 similar movies by Euclidean distance

Again, we could see that Euclidean distance again outperforms cosine similarity, with 5 Marvel movies on its list, compared to only two. For the two directly related movies of Captain America that appear on both lists, Euclidean distance has ranked them higher (i.e. consider them with higher similarity) than cosine similarity.

After all the analysis, I prefer to use Euclidean distance to compute similarity scores. However, an ultimate problem still exists: when searching for similar movies, which kind of similarity are people actually searching for? It seems to largely depend on each person’s preference, so the only solution is to let users manually choose which movies are more similar in their opinion.

That is to say, we’d better use A/B testing to decide which method has a better performance in real life. ❦︎

Comments and feedbacks are welcome! ;-)

--

--