Improving Collaborative Filtering With Clustering

Jackson Wu
3 min read · May 27, 2019


One of the main problems in applying collaborative filtering techniques is the time complexity of forming the peer groups. Normally, you would calculate the pairwise similarity for every pair of users in a user-based system and use k-nearest neighbors to find the top-k most similar users. If we have m users, this process is O(m^2). That will not do when m is on the order of tens of millions; it would be far too costly.

An alternative method of forming peer groups is to use modified k-means clustering to group the nearest users/items together. This forms fewer peer groups, since we are not building a separate peer group for each possible target user. Once we have these new peer groups, we can calculate the top-k closest users in the same way as the nearest-neighbor methods.

Now, let’s go into greater detail on how k-means fits in with collaborative filtering and how k-means needs to be modified to work with sparse input.

K-means clustering (image source)

K-Means

First, a brief conceptual overview of k-means.

K-means works by choosing k initial centroids, assigning each data point to its closest centroid, recomputing each centroid as the center of its newly formed cluster, and repeating those steps until convergence.

In user-based collaborative filtering, these sample data points would have n dimensions since each user is represented as a row in the m x n matrix. The process of forming the clusters looks like this:

  1. For all rows u of the m x n matrix (users), we assign u to the closest centroid using a distance function (like the Manhattan distance or the Euclidean distance).
  2. Reassign each of the k centroids to the center of its newly formed cluster.
  3. Repeat until the assignments stop changing (see the sketch below).
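Here is roughly what those steps look like in code. This is only a minimal NumPy sketch on a dense matrix (the function name and the choice of Euclidean distance are mine); the next section covers the modification needed for sparse ratings.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means on the rows of X (one row per user)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random rows.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each row to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the center of its newly formed cluster.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 3: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```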

Dealing With Sparsity

The main way we accommodate the sparse input is to use only the specified ratings. We can find the mean of each cluster by using only the specified ratings within that cluster, and we can find the distance between each user and each centroid by using only the specified ratings they have in common.
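Concretely, those two sparse-aware operations might look something like this, assuming unspecified ratings are stored as NaN (the helper names here are hypothetical, not from any particular library):

```python
import numpy as np

def sparse_distance(u, c):
    """Distance between a user row u and a centroid c, computed only over
    the dimensions where both have a specified (non-NaN) value."""
    common = ~np.isnan(u) & ~np.isnan(c)
    if not common.any():
        return np.inf                      # no overlap: treat as infinitely far
    return np.linalg.norm(u[common] - c[common])

def cluster_mean(rows):
    """Centroid of a cluster: per-item mean over specified ratings only.
    Items nobody in the cluster rated stay NaN and are skipped by the
    distance mask above."""
    return np.nanmean(rows, axis=0)
```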

(Note that all of the above works for item-based collaborative filtering when we cluster columns instead of rows.)

So, say we have implemented this correctly and we now have k clusters of users. Every time we have to compute pairwise similarity for some target user, we only compute this similarity across users in the same cluster!
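As a rough sketch of what that restriction looks like, again assuming NaN for unspecified ratings and Pearson correlation as the similarity measure (the function names are mine):

```python
import numpy as np

def pearson_sim(ru, rv):
    """Pearson correlation over the items both users have rated (NaN = unspecified)."""
    common = ~np.isnan(ru) & ~np.isnan(rv)
    if common.sum() < 2:
        return 0.0
    a, b = ru[common], rv[common]
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def peer_group(target, cluster_members, R, k=20):
    """Top-k most similar users to `target`, searched only within its own cluster."""
    sims = [(pearson_sim(R[target], R[v]), v) for v in cluster_members if v != target]
    sims.sort(reverse=True)
    return sims[:k]                        # list of (similarity, user) pairs
```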

This will drastically improve runtime, especially when the clusters are fine-grained. Keep in mind that this efficiency, as always, comes at a cost; this time, the cost is accuracy.

Prediction function (from Recommender Systems: The Textbook)

From here, we’re golden. Once we have the peer group, it’s the same as before. We can use the same neighborhood prediction function from the user-based collaborative filtering we’ve already seen, and voila: piping hot predicted ratings.
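A sketch of that last step, using the standard mean-centered, similarity-weighted average and the (similarity, user) pairs returned by the peer-group sketch above:

```python
import numpy as np

def predict_rating(target, item, peers, R):
    """Predict R[target, item] from the target's peer group.

    `peers` is a list of (similarity, user) pairs; unspecified ratings are NaN.
    Uses the usual mean-centered, similarity-weighted average."""
    mu_target = np.nanmean(R[target])
    num, den = 0.0, 0.0
    for sim, v in peers:
        r_vj = R[v, item]
        if np.isnan(r_vj):
            continue                                  # peer hasn't rated this item
        num += sim * (r_vj - np.nanmean(R[v]))        # mean-centered peer rating
        den += abs(sim)
    return mu_target if den == 0 else mu_target + num / den
```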

Clustering isn’t the only way to efficiently find peer groups for neighborhood-based methods. You can also improve the training stage of collaborative filtering with dimensionality reduction, which I cover in this article.
