Item-item collaborative filtering with binary or unary data

Victor
Published in Rn Engineering
Aug 1, 2017

There are many examples out there of different types of collaborative filtering methods and user-user/item-item recommenders, but very few that use binary or unary data. Most of them make use of the MovieLens dataset with its 1–5 star ratings. But binary data is just as common, if not more common, in real-life projects, so I thought it might be appropriate to write up a simple (non-state-of-the-art) example.

Overview

Our goal is to build a simple recommender using the last.fm dataset. We want to be able to make song recommendations for a given user but also get similar songs to a given track. As a bonus, we would also like it to be fairly fast.

This guide builds on the Collaborative Filtering with Python post by Salem Marafi with the same dataset but with a slightly different approach and some additional bits and pieces. We’ll be using Python with pandas, numpy, scipy and sklearn.

We will start by looking at some theory, equations and quick demonstrations in Excel and after that dive into the code.

Collaborative filtering

Collaborative filtering (versus content-based filtering) means we don't care about any properties of an item itself, only about who else has liked, viewed, ignored or somehow consumed it. We don't consider items similar because they are both blues songs, but because they were listened to by a similar set of users (who might or might not be hardcore blues fans).

In user-user collaborative filtering, we look at the similarity between users and find the users who are most similar to any given user and then recommend items based on their preferences.

Item-item collaborative filtering takes a slightly different approach and starts with the items. Here we first compute the similarities between different items (using the ratings) and then make item recommendations based on the items any given user has liked or consumed.

For this example, we’ll be using the item-item approach and then what you could call user-item to get the user-specific recommendations.

The Last.fm dataset

The dataset we’ll be using is a subset of the last.fm dataset. It’s a sparse matrix covering 285 artists and 1226 users, recording which users have listened to which artists. We don’t care about when a user listened to a song, as we assume that music tastes are fairly static over time.

You can download it here.

A small part of the original dataset matrix

The Google spreadsheet

A smaller version of the dataset is available in the spreadsheet linked below. The full version got too heavy for Google Sheets but it shows the basic operations and calculations in a more manageable way. Might still take a second to load though.

Item-Item calculations

As mentioned above, we start by computing the item-item relationships of our songs. Our final goal here is to construct a new item-by-item matrix containing the weights (relationships) between each pair of songs, where a perfect correlation equals 1 and no correlation at all equals 0.

An example of the type of matrix we want to create. Notice how each item perfectly correlates with itself.

To get this relationship between items we calculate the cosine similarity between the items using this formula:
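Written out in standard notation, with $\vec{r_i}$ and $\vec{r_j}$ as the rating vectors of items $i$ and $j$, the cosine similarity is:

```latex
\mathrm{sim}(i, j) = \cos(\vec{r_i}, \vec{r_j}) = \frac{\vec{r_i} \cdot \vec{r_j}}{\|\vec{r_i}\|_2 \, \|\vec{r_j}\|_2}
```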

This gives us the centered cosine similarity, which expands to be the same as the Pearson correlation.

Basically what it means is that we take the dot product of our different item vectors and divide it by the product of the vectors' norms.

The norm of a vector is its Euclidean length (or L2-norm), which means the square root of the sum of its values squared:
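For a vector $\vec{x}$, that is:

```latex
\|\vec{x}\|_2 = \sqrt{\sum_{k} x_k^2}
```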

To calculate the cosine similarity between an item in column D and item in row 5 in something like Excel or Google Sheets we would do the following:

  1. The dot product of the vectors:

SUMPRODUCT(ratings-norm!$E$2:$E$535,ratings-norm!D$2:D$535)

  2. Euclidean distance of the vectors:

SQRT(SUMSQ(ratings-norm!D2:D535))
SQRT(SUMSQ(ratings-norm!E2:E535))

  3. Putting it all together:

SUMPRODUCT(ratings-norm!$E$2:$E$535,ratings-norm!D$2:D$535)/(SQRT(SUMSQ(ratings-norm!D$2:$D$535))*SQRT(SUMSQ(ratings-norm!$E$2:$E$535)))

Normalize user vectors to unit vectors

There is still one aspect we’ve not yet touched upon but that we’ll need for our item-item matrix. This is the idea of normalizing the user vectors so that a user with many ratings contributes less to any individual rating. This is to say that a like from a user who has only liked 10 items is more valuable to us than a like from someone who likes everything she comes across.

We do this by first calculating the magnitude of each user's rating vector: the square root of the sum of the squares of all of that user's ratings.

In other words, the same as we did for our L2 norm before but along the user-vector instead of the item-vector.

We then create the new unit vector by dividing the rating by the magnitude:
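With $\vec{r}_u$ as user $u$'s rating vector, the magnitude and the resulting unit-vector entries can be written as:

```latex
\|\vec{r}_u\|_2 = \sqrt{\sum_{j} r_{uj}^2}, \qquad \hat{r}_{ui} = \frac{r_{ui}}{\|\vec{r}_u\|_2}
```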

This normalization is applied before we do our similarity calculations, so instead of each rating being a one or a zero, we have some value between 0 and 1 representing the weight of that user's rating.

Item-item summary:

  1. Normalize user vectors to unit vectors.
  2. Construct a new item by item matrix.
  3. Compute the cosine similarity between all items in the matrix.

User-Item calculations

Now that we have our similarity matrix it’s time to make some user recommendations. To do this we look at the user's known likes and their similarities. Basically, we want to create a score for each item in our dataset for that user and then we can simply choose the n items with the highest score.

To get the score we use this formula:
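In symbols, with $w_{ij}$ as the similarity weight between items $i$ and $j$, and $r_{uj}$ as user $u$'s (normalized) rating of item $j$, a standard form of this score is:

```latex
\mathrm{score}(u, i) = \frac{\sum_{j \in N} w_{ij} \, r_{uj}}{\sum_{j \in N} w_{ij}}
```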

So we get the score for user u and item i by summing, for every item j, the weight wij multiplied by the user's rating ruj. We then divide by the sum of all the weights wij for that item.

Again, to calculate the score for an item in matrix row 4 and user in ratings row 360 in Excel or Google Sheets we would do the following:

SUMPRODUCT(item-matrix!B4:JZ4,ratings-norm!B360:JZ360) / SUM(item-matrix!B4:JZ4)

If we were dealing with star rating data or similar we would also add some normalization to compensate for users having different personal rating scales but since the only option is 0 or 1 we don’t need to care about that.

Neighborhoods

Depending on the size of our dataset it might be inefficient to do scoring for all items for every user. Chances are we’re only interested in the score for the top 5 or 10 items and don’t really care about the score for the 500th one.

So instead we can limit our calculations to a neighborhood of only the n items most similar to the user's own ratings and then do the scoring on those. That's what the j∈N in the formula above describes. It says "for items j in neighborhood N".

User-item summary:

  1. Define a neighborhood of items.
  2. Calculate the score for all items for a specific user.
  3. Sort by the n highest scores (most recommended)

OK, let’s build it!

First things first. We’ll import the libraries we’ll need and load the dataset from the .csv file. We’ll also create a new dataframe data_items without our user ids.
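A minimal sketch of this step. The file name is an assumption, and a tiny inline frame stands in for the real CSV so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# With the real dataset this would be something like (file name assumed):
# data = pd.read_csv('lastfm-matrix-germany.csv')
# Here a tiny inline frame stands in for the CSV.
data = pd.DataFrame({
    'user': [1, 2, 3],
    'beyonce': [1, 0, 1],
    'rihanna': [1, 1, 0],
    'bob dylan': [0, 1, 1],
})

# Drop the user id column so data_items holds only the binary play matrix.
data_items = data.drop('user', axis=1)
```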

Then we’ll get started on the item-item calculations. First, we normalize the user ratings in data_items, and then run calculate_similarity to generate a new item by item matrix with our similarities called data_matrix. Lastly, we print the 11 most similar items to an artist.
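These two steps might look roughly like this. It's a plain-numpy sketch on a tiny stand-in matrix; the helper names follow the ones mentioned above, but `normalize_users` is an illustrative name of my own:

```python
import numpy as np
import pandas as pd

def normalize_users(df):
    """Scale each user row to unit length so heavy listeners count less."""
    magnitude = np.sqrt((df ** 2).sum(axis=1))
    return df.divide(magnitude, axis=0)

def calculate_similarity(df):
    """Item-item cosine similarity on the (already normalized) ratings."""
    mat = df.values
    norms = np.linalg.norm(mat, axis=0)
    sim = (mat.T @ mat) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)

# Toy binary data in place of the last.fm matrix.
data_items = pd.DataFrame({
    'beyonce': [1, 0, 1, 1],
    'rihanna': [1, 0, 1, 0],
    'bob dylan': [0, 1, 0, 1],
}, index=[1, 2, 3, 4])

data_norm = normalize_users(data_items)
data_matrix = calculate_similarity(data_norm)

# Most similar items to an artist, the artist itself first with similarity 1.
print(data_matrix.loc['beyonce'].sort_values(ascending=False).head(11))
```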

In this case, the 11 most similar items to Beyoncé are:

  1. beyonce (of course)
  2. the pussycat dolls
  3. rihanna
  4. christina aguilera
  5. alicia keys
  6. justin timberlake
  7. britney spears
  8. leona lewis
  9. maria mena
  10. kelly clarkson
  11. nelly furtado

Next is the user-item step where we generate recommendations for a specific user (in this case user with id 5985). Here we use the score function from above to calculate the score for each item in the user vector. We then drop the user's known likes and display the top n recommended artists.
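A sketch of that scoring step, on a tiny made-up similarity matrix (in the real script, data_matrix and the user vector come from the previous steps):

```python
import pandas as pd

def score_user(user_vector, data_matrix):
    """Score every item for one user: similarity-weighted sum of the user's
    ratings, divided by the sum of all similarities for that item."""
    return data_matrix.dot(user_vector).div(data_matrix.sum(axis=1))

# Toy similarity matrix and one user's likes.
items = ['joy division', 'the smiths', 'bob dylan']
data_matrix = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.3],
     [0.2, 0.3, 1.0]],
    index=items, columns=items)
user_vector = pd.Series([0.0, 0.0, 1.0], index=items)  # likes Bob Dylan only

scores = score_user(user_vector, data_matrix)
known = user_vector[user_vector > 0].index  # drop the user's known likes
recommendations = scores.drop(known).sort_values(ascending=False)
print(recommendations)
```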

This tells us that the user's known likes are Bob Dylan and The Cure, and the top five recommended artists are:

  1. joy division
  2. the smiths
  3. david bowie
  4. yann tiersen
  5. the rolling stones

However, in the code above we did not use the notion of a neighborhood but rather calculated scores for every single item. Again, we probably don’t need to know that “scooter” is one of the absolute least recommended artists for this user.

So instead we construct a neighborhood of the top 10 artists most similar to Bob Dylan and The Cure and then score only those. Here we can play around with the size of the neighborhood to find a good balance.
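A sketch of the neighborhood version, again on a tiny made-up similarity matrix; the artist names and similarity values are purely illustrative:

```python
import pandas as pd

# Toy item-item similarity matrix standing in for the real data_matrix.
items = ['bob dylan', 'the cure', 'joy division', 'the smiths', 'scooter']
data_matrix = pd.DataFrame(
    [[1.0, 0.3, 0.2, 0.4, 0.0],
     [0.3, 1.0, 0.7, 0.6, 0.1],
     [0.2, 0.7, 1.0, 0.5, 0.0],
     [0.4, 0.6, 0.5, 1.0, 0.1],
     [0.0, 0.1, 0.0, 0.1, 1.0]],
    index=items, columns=items)

known_user_likes = ['bob dylan', 'the cure']
n = 3  # neighborhood size; tune this for the quality/speed balance

# Union of the n items most similar to each of the user's likes.
neighbourhood_items = set()
for item in known_user_likes:
    top = data_matrix.loc[item].sort_values(ascending=False).head(n).index
    neighbourhood_items.update(top)
neighbourhood_items = list(neighbourhood_items)

# Score only inside the neighborhood.
neighbourhood = data_matrix.loc[neighbourhood_items, neighbourhood_items]
user_vector = pd.Series(
    [1.0 if i in known_user_likes else 0.0 for i in neighbourhood_items],
    index=neighbourhood_items)
score = neighbourhood.dot(user_vector).div(neighbourhood.sum(axis=1))
score = score.drop(known_user_likes, errors='ignore')

ranked = score.sort_values(ascending=False)
print(ranked)
```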

Using the neighborhood we get the following result for the top 5 recommendations. It’s not completely different from the first example (which would not make sense) but the ordering is different and Tom Waits pushed Yann Tiersen off the list.

  1. joy division
  2. the smiths
  3. the rolling stones
  4. david bowie
  5. tom waits

Now we can also play around with some different user ids. For user 7434 who liked AC/DC, Cypress Hill and Rage Against the Machine we get:

  1. eminem
  2. bob marley
  3. portishead
  4. audioslave
  5. the prodigy

The full code

Summary

So there we have it, a simple binary item-item recommender in Python. The above script takes around 200ms to run, and there are of course many ways to speed it up, but it's fairly OK for this example.
