Neighborhood-based collaborative filtering leverages the behavior of other users to infer what our user might enjoy. It can find people similar to our user and recommend what they liked, or recommend items that other people bought after buying what our user has bought. The same idea works for items as well!

PRODUCING SIMILAR ITEMS/PEOPLE

We can measure the similarity between items, or between people, in the following ways:

  • Using similarity metrics (e.g., cosine similarity)
  • SPARSITY: a big challenge when measuring similarity from behavior data
    * There are so many movies that most of them go unwatched by any given user
    * Collaborative filtering is tough to make work unless we have lots of behavior data
    * We cannot compute cosine similarity if no items/people are in common. In other words, how can we find similarity if there is nothing shared to compare?
    * Storing sparse data naively consumes a lot of space; sparse arrays solve this (see the sketch after this list)
  • The quality and quantity of data are MORE IMPORTANT than the algorithm we choose!
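
To make the sparsity point concrete, here is a minimal sketch of item-item cosine similarity on a sparse ratings matrix, using scipy sparse storage so the empty cells cost no memory. The tiny matrix is made up purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
# In real data the vast majority of cells are 0, hence sparse storage.
ratings = csr_matrix(np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 5, 4],
]))

# Item-item similarity: items live in columns, so transpose first.
# cosine_similarity accepts sparse input directly.
item_sim = cosine_similarity(ratings.T)
print(np.round(item_sim, 2))
```

Note that items 0 and 2 above share no raters, so their cosine similarity comes out as 0: with nothing in common, there is nothing to compare.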

OTHER WAYS TO COMPUTE SIMILARITY

There are many different ways to compute similarity. Some of them are listed below:

  • Cosine Similarity Metric
  • Adjusted Cosine Similarity: the same formula as Pearson, but centered on each user's average rating
  • Pearson Similarity: the same formula as adjusted cosine, but centered on each item's average rating
  • Spearman Rank Correlation: same as Pearson, but computed on ranks
  • MSD Similarity: easier to understand, but tends to perform worse
  • Jaccard Similarity: good for implicit data; cosine similarity can also be applied to the same binary data!

Let us understand each of the similarity metrics mentioned above.

1. Adjusted Cosine Similarity

  • Applicable mostly to measuring the similarity of users based on their ratings
  • Takes into consideration that people rate on varying baselines (a tough grader's 3 may be an easy grader's 5)
  • Adjusted cosine attempts to normalize away these differences
  • (x_i − x̄): we look at each rating's deviation from that user's mean instead of the raw rating. Sounds good, but data sparsity can mess you up; worth experimenting with if the data is NOT sparse (see the sketch below)
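
A minimal sketch of the idea, with a made-up ratings matrix; the adjusted_cosine helper is hypothetical, not from any library:

```python
import numpy as np

# Rows = users, columns = items; np.nan marks an unrated cell.
ratings = np.array([
    [5.0, 3.0, np.nan],
    [4.0, 2.0, 1.0],
    [1.0, 5.0, 4.0],
])

def adjusted_cosine(r, i, j):
    """Cosine similarity of items i and j after centering on user means."""
    user_means = np.nanmean(r, axis=1)                # each user's baseline
    both = ~np.isnan(r[:, i]) & ~np.isnan(r[:, j])    # users who rated both
    di = r[both, i] - user_means[both]                # deviation, not raw rating
    dj = r[both, j] - user_means[both]
    return di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj))

print(adjusted_cosine(ratings, 0, 1))
```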

2. Pearson Similarity

  • Centers each rating on the average rating for that item across all users
  • A good approach in the real world of sparse data
  • MEASURES SIMILARITY BETWEEN PEOPLE BY HOW MUCH THEY DIVERGE FROM THE AVERAGE PERSON'S BEHAVIOUR
    i.e. people who all hate Star Wars will end up with similar Pearson scores
  • In Surpriselib, adjusted cosine is exposed as user-based Pearson similarity, and this item-centered flavor as item-based Pearson similarity (see the sketch below)
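
As a minimal sketch in Surpriselib itself, assuming the built-in MovieLens 100k dataset; the user and item ids below are arbitrary examples:

```python
from surprise import Dataset, KNNBasic

data = Dataset.load_builtin("ml-100k")     # prompts to download on first run
trainset = data.build_full_trainset()

# user_based=False gives the item-centered Pearson described above;
# flip it to True for the user-centered (adjusted cosine) flavor.
sim_options = {"name": "pearson", "user_based": False}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

# Raw MovieLens ids are strings; predict user 196's rating for item 302.
print(algo.predict(uid="196", iid="302").est)
```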

3. Spearman Rank Correlation

Its advantage is that it deals with ordinal data effectively. Unfortunately, ordinal rating data is rarely used in the real world.

  • Same as Pearson similarity, but with one change:
  • Instead of using rating scores directly, use ranks
  • Interchange 'avg rating (x̄_i)' with 'rank amongst all movies' (usually not worth it!!). See the sketch below.
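
A minimal sketch via scipy, on made-up ratings over five co-rated movies; spearmanr is literally Pearson applied to ranks:

```python
from scipy.stats import spearmanr

# Two users' ratings on the same five co-rated movies (made-up data).
user_a = [5, 3, 4, 1, 2]
user_b = [4, 2, 5, 1, 3]

# spearmanr ranks each list, then computes Pearson correlation on the ranks.
rho, _ = spearmanr(user_a, user_b)
print(rho)   # 0.8
```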

4. Mean Squared Difference Similarity

  • MSD(x, y): computes the error, i.e. the mean squared difference between the ratings of x and y over the items they have in common
  • MSDSim(x, y) = 1 / (MSD(x, y) + 1): converts that error into a similarity score (see the sketch below)

Note: in practice, cosine similarity usually outperforms MSDSim.
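
A minimal sketch, assuming the 1 / (MSD + 1) definition used by Surpriselib; the ratings are made up:

```python
import numpy as np

def msd_sim(x, y):
    """MSD similarity between two users over their co-rated items."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    msd = np.mean((x - y) ** 2)   # the "error": how far apart the ratings are
    return 1.0 / (msd + 1.0)      # +1 avoids division by zero; higher = more alike

print(msd_sim([5, 3, 4], [4, 3, 5]))   # 0.6
```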

5. Jaccard Similarity

  • Since it just counts things up, we are not looking at the actual rating values at all, which loses pretty important information
  • But with implicit ratings, all we have are binary 'did' / 'did not' signals
    In that case, Jaccard can be a reasonable choice, and it is fast to compute (see the sketch below)
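
A minimal sketch on made-up implicit data, where each user is reduced to the set of items they interacted with:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared items / all items either user touched."""
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

user_a = {"item1", "item2", "item3"}
user_b = {"item2", "item3", "item4"}
print(jaccard(user_a, user_b))   # 2 shared / 4 total = 0.5
```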

In the next section, we will have a look at user-based as well as item-based collaborative filtering.
