Neighborhood-based collaborative filtering leverages the behavior of other users to infer what our user might enjoy. It can find people similar to our user and recommend what they liked, or recommend items that other people bought after buying what our user has bought. The same idea works for items as well!

PRODUCING SIMILAR ITEMS/PEOPLE

We can measure the similarity between items, or between people, in the following ways:

  • Using similarity metrics (e.g., cosine similarity)
  • SPARSITY: a big challenge when measuring similarity from behavior data
    * There are so many movies that most of them go unwatched by any given user
    * Collaborative filtering is tough to make work unless we have lots of behavior data
    * We cannot compute cosine similarity if no items/people are in common. In other words, how can we find similarity if there is nothing shared to compare?
    * Storing sparse data naively consumes a lot of space; sparse arrays solve this (see the sketch after this list)
  • The quality and quantity of data are MORE IMPORTANT than the algorithm we choose!
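
To make the sparsity point concrete, here is a minimal sketch of item-item cosine similarity on a sparse ratings matrix, using scipy sparse storage so the empty cells cost no memory. The tiny matrix is made up purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
# In real data the vast majority of cells are 0, hence sparse storage.
ratings = csr_matrix(np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 5, 4],
]))

# Item-item similarity: items live in columns, so transpose first.
# cosine_similarity accepts sparse input directly.
item_sim = cosine_similarity(ratings.T)
print(np.round(item_sim, 2))
```

Note that items 0 and 2 above share no raters, so their cosine similarity comes out as 0: with nothing in common, there is nothing to compare.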

OTHER WAYS TO COMPUTE SIMILARITY

There are many different ways to compute similarity. Some of them are listed below:

  • Cosine Similarity Metric
  • Adjusted Cosine Similarity: the same formula as Pearson, but centered on each user's average rating
  • Pearson Similarity: the same formula as adjusted cosine, but centered on each item's average rating
  • Spearman Rank Correlation: same as Pearson, but computed on ranks
  • MSD Similarity: easier to understand, but tends to perform worse
  • Jaccard Similarity: good for implicit data; cosine similarity can also be applied to the same binary data!

Let us understand each of the similarity metrics mentioned above.

1. Adjusted Cosine Similarity

  • Applicable mostly to measuring the similarity of users based on their ratings
  • Takes into consideration that people rate on varying baselines (a tough grader's 3 may be an easy grader's 5)
  • Adjusted cosine attempts to normalize away these differences
  • (x_i − x̄): we look at each rating's deviation from that user's mean instead of the raw rating. Sounds good, but data sparsity can mess you up; worth experimenting with if the data is NOT sparse (see the sketch below)
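
A minimal sketch of the idea, with a made-up ratings matrix; the adjusted_cosine helper is hypothetical, not from any library:

```python
import numpy as np

# Rows = users, columns = items; np.nan marks an unrated cell.
ratings = np.array([
    [5.0, 3.0, np.nan],
    [4.0, 2.0, 1.0],
    [1.0, 5.0, 4.0],
])

def adjusted_cosine(r, i, j):
    """Cosine similarity of items i and j after centering on user means."""
    user_means = np.nanmean(r, axis=1)                # each user's baseline
    both = ~np.isnan(r[:, i]) & ~np.isnan(r[:, j])    # users who rated both
    di = r[both, i] - user_means[both]                # deviation, not raw rating
    dj = r[both, j] - user_means[both]
    return di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj))

print(adjusted_cosine(ratings, 0, 1))
```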

2. Pearson Similarity

  • Centers each rating on the average rating for that item across all users
  • A good approach in the real world of sparse data
  • MEASURES SIMILARITY BETWEEN PEOPLE BY HOW MUCH THEY DIVERGE FROM THE AVERAGE PERSON'S BEHAVIOUR
    i.e. people who all hate Star Wars will end up with similar Pearson scores
  • In Surpriselib, adjusted cosine is exposed as user-based Pearson similarity, and this item-centered flavor as item-based Pearson similarity (see the sketch below)
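
As a minimal sketch in Surpriselib itself, assuming the built-in MovieLens 100k dataset; the user and item ids below are arbitrary examples:

```python
from surprise import Dataset, KNNBasic

data = Dataset.load_builtin("ml-100k")     # prompts to download on first run
trainset = data.build_full_trainset()

# user_based=False gives the item-centered Pearson described above;
# flip it to True for the user-centered (adjusted cosine) flavor.
sim_options = {"name": "pearson", "user_based": False}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

# Raw MovieLens ids are strings; predict user 196's rating for item 302.
print(algo.predict(uid="196", iid="302").est)
```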

3. Spearman Rank Correlation

Its advantage is that it deals with ordinal data effectively. Unfortunately, ordinal rating data is rarely used in the real world.

  • Same as Pearson similarity, but with one change:
  • Instead of using rating scores directly, use ranks
  • Interchange 'avg rating (x̄_i)' with 'rank amongst all movies' (usually not worth it!!). See the sketch below.
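
A minimal sketch via scipy, on made-up ratings over five co-rated movies; spearmanr is literally Pearson applied to ranks:

```python
from scipy.stats import spearmanr

# Two users' ratings on the same five co-rated movies (made-up data).
user_a = [5, 3, 4, 1, 2]
user_b = [4, 2, 5, 1, 3]

# spearmanr ranks each list, then computes Pearson correlation on the ranks.
rho, _ = spearmanr(user_a, user_b)
print(rho)   # 0.8
```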

4. Mean Squared Difference Similarity

  • MSD(x, y): computes the error, i.e. the mean squared difference between the ratings of x and y over the items they have in common
  • MSDSim(x, y) = 1 / (MSD(x, y) + 1): converts that error into a similarity score (see the sketch below)

Note: in practice, cosine similarity usually outperforms MSDSim.
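
A minimal sketch, assuming the 1 / (MSD + 1) definition used by Surpriselib; the ratings are made up:

```python
import numpy as np

def msd_sim(x, y):
    """MSD similarity between two users over their co-rated items."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    msd = np.mean((x - y) ** 2)   # the "error": how far apart the ratings are
    return 1.0 / (msd + 1.0)      # +1 avoids division by zero; higher = more alike

print(msd_sim([5, 3, 4], [4, 3, 5]))   # 0.6
```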

5. Jaccard Similarity

  • Since it just counts things up, we are not looking at the actual rating values at all, which loses pretty important information
  • But with implicit ratings, all we have are binary 'did' / 'did not' signals
    In that case, Jaccard can be a reasonable choice, and it is fast to compute (see the sketch below)
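
A minimal sketch on made-up implicit data, where each user is reduced to the set of items they interacted with:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared items / all items either user touched."""
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

user_a = {"item1", "item2", "item3"}
user_b = {"item2", "item3", "item4"}
print(jaccard(user_a, user_b))   # 2 shared / 4 total = 0.5
```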

In the next section, we will have a look at user-based as well as item-based collaborative filtering.
