Collaborative Filtering

Now let's look at what happens if we don't have the features x_1 and x_2.

Akshita Guru
Operations Research Bit
8 min read · Aug 5, 2024


Welcome back! In the last article we saw how to use linear regression to predict movie ratings; if you haven't read it yet, click here.

Image by https://team.inria.fr/zenith/zenith-seminar-friday-10-oct-at-330-pm-mohamed-reda-bouadjenek/

Below is the data we had before. But what if, instead of having these numbers for x_1 and x_2, we didn't know in advance what the values of the features x_1 and x_2 were? Let's replace them with question marks.

Image from last article

Now, just for the purposes of illustration, let's say we had somehow already learned parameters for the four users: w¹ = [5, 0] and b¹ = 0 for user one, w² = [5, 0] and b² = 0 for user two, w³ = [0, 5] and b³ = 0 for user three, and w⁴ = [0, 5] and b⁴ = 0 for user four.

We'll worry later about how we might have come up with these parameters w and b, but let's say we have them already. As a reminder, to predict user j's rating on movie i, we use the dot product w^j · x^i plus b^j.

To simplify this example, all the values of b are actually equal to 0. So let’s ignore b for the rest of this example.

The question is: given these values for w that we have up here, what choice of x_1 would reproduce the ratings the users actually gave?

  • Well, one possible choice would be if the features for that first movie were [1, 0], in which case w¹ · x¹ would equal 5, w² · x¹ would equal 5, and similarly w³ or w⁴ dotted with this feature vector x_1 would equal 0.
  • So if you have the parameters for all four users, and you have four ratings for this movie that you want to match, you can take a reasonable guess at a feature vector x_1 for movie one that would make good predictions for those four ratings (see the sketch after this list).
  • Similarly, given these parameter vectors, you can also try to come up with a feature vector x_2 for the second movie, a feature vector x_3 for the third movie, and so on, so that the algorithm's predictions on these additional movies come close to the ratings actually given by the users.
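To make this concrete, here is a minimal NumPy sketch of the guess above. It assumes the four parameter vectors listed earlier and the ratings of movie one from the example table (5, 5, 0, 0); all the names are illustrative, not part of any particular library.

```python
import numpy as np

# Parameters we assume have already been learned for the four users (all b^j = 0).
W = np.array([
    [5.0, 0.0],  # w1
    [5.0, 0.0],  # w2
    [0.0, 5.0],  # w3
    [0.0, 5.0],  # w4
])
b = np.zeros(4)

# Ratings of movie one by the four users, taken from the example table.
y_movie1 = np.array([5.0, 5.0, 0.0, 0.0])

# Candidate feature vector for movie one.
x1 = np.array([1.0, 0.0])

# Predicted rating of movie one by each user j: w^j . x^1 + b^j
predictions = W @ x1 + b
print(predictions)                         # [5. 5. 0. 0.]
print(np.allclose(predictions, y_movie1))  # True: x1 = [1, 0] reproduces the ratings
```

So the feature vector x_1 = [1, 0] makes every prediction match the rating each user actually gave.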

Let’s come up with a cost function for actually learning the values of x_1 and x_2. By the way, notice that this works only because we have parameters for four users. That’s what allows us to try to guess appropriate features, x_1.

This is why, in a typical linear regression application, if you had just a single user, you wouldn't have enough information to figure out what the features x_1 and x_2 should be; in the linear regression context, you can't come up with features x_1 and x_2 from scratch. But in collaborative filtering we have ratings from multiple users of the same item, the same movie, and that's what makes it possible to guess values for these features.

  • Given w¹, b¹, w², b², and so on through w^(n_u), b^(n_u) for the n_u users, suppose we want to learn the features x^i for a specific movie i. Here is a cost function we could use: as usual, we want to minimize the squared error.
  • The predicted rating by user j on movie i is w^j · x^i + b^j, so we take the squared difference from the actual rating y(i, j). As before, we sum over the users j, but only over those j for which r(i, j) = 1, and we add a factor of 1/2 as usual: J(x^i) = (1/2) Σ_{j : r(i,j)=1} (w^j · x^i + b^j − y(i, j))².
  • We define this as a cost function for x^i. If we minimize it as a function of x^i, we are choosing the features for movie i. In other words, over all the users j that have rated movie i, we try to minimize the squared difference between the rating predicted from our choice of features x^i and the actual rating the user gave.
  • Then, if we want to add a regularization term, we add the usual (λ/2) Σ_{k=1}^{n} (x_k^i)², where n, as usual, is the number of features in x^i.
  • Lastly, to learn all the features x¹ through x^(n_m), since we have n_m movies, we can take this cost function and sum it over all of the movies.
  • That is, we sum from i = 1 through n_m, the number of movies, and the result becomes a cost function for learning the features of all of the movies in the dataset.
  • So, if we have the parameters w and b for all the users, then minimizing this cost function as a function of x¹ through x^(n_m), using gradient descent or some other algorithm, will give you a pretty good guess at good features for the movies.
  • This is pretty remarkable: for most machine learning applications the features have to be given externally, but in this algorithm we can actually learn the features for a given movie.
  • Here's the cost function for learning the features, which is what we just derived, next to the cost function for learning w and b from the last article. It turns out that if we put the two together, the squared-error term is exactly the same in both.
  • Notice that summing over j of all the values of i with r(i, j) = 1 is the same as summing over i of all the values of j with r(i, j) = 1: either way, we are just summing over all user-movie pairs for which there is a rating. So we put the two cost functions together and write this summation more explicitly as a sum over all pairs (i, j) where we do have a rating, take the regularization term from learning the parameters w and b, take the regularization term from learning the features x, and add them all up. This ends up being our overall cost function for learning w, b, and x: J(w, b, x) = (1/2) Σ_{(i,j) : r(i,j)=1} (w^j · x^i + b^j − y(i, j))² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (w_k^j)² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^i)². (A small NumPy sketch of this combined cost appears after this list.)
  • It turns out that if you minimize this cost function as a function of w and b as well as x, then this algorithm actually works.
  • Suppose we had three users and two movies, and ratings exist for four of these user-movie pairs but not for the other two. The summation on top goes over the users first.
  • That is, we sum over the users first and then have one term for each movie that user has rated. An alternative way to carry out the summation is to first look at movie 1 (that's what the other summation does) and include all the users that rated movie 1, then look at movie 2 and have a term for all the users that rated movie 2.
  • In both cases we're just summing over the same four cells where a user rated the corresponding movie. That's why the summation on top and the summation here are two ways of summing over all of the pairs where a user rated that movie. So how do we minimize this cost function as a function of w, b, and x? One thing you could do is use gradient descent.
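Below is a minimal NumPy sketch of this combined cost function, assuming the ratings are stored in a matrix Y with a companion 0/1 matrix R marking which entries were actually rated; the function and variable names are illustrative.

```python
import numpy as np

def collaborative_filtering_cost(X, W, b, Y, R, lambda_):
    """Squared-error cost over all rated (movie, user) pairs, plus regularization.

    X : (num_movies, num_features)  feature vector x^i for each movie
    W : (num_users, num_features)   parameter vector w^j for each user
    b : (num_users,)                bias b^j for each user
    Y : (num_movies, num_users)     Y[i, j] = rating of movie i by user j
    R : (num_movies, num_users)     R[i, j] = 1 if user j rated movie i, else 0
    """
    # Predicted rating of every movie by every user: w^j . x^i + b^j
    predictions = X @ W.T + b
    # Squared error, kept only where a rating exists (r(i, j) = 1).
    squared_error = ((predictions - Y) ** 2) * R
    cost = 0.5 * np.sum(squared_error)
    # Regularization over all the parameters w and all the features x.
    cost += (lambda_ / 2.0) * (np.sum(W ** 2) + np.sum(X ** 2))
    return cost
```

Summing the masked squared-error matrix row-first or column-first gives the same number, which is exactly the point made above about the two ways of writing the summation.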

With collaborative filtering, the cost function is no longer a function of just w and b; it is now a function of w, b, and x. Here, w and b denote the parameters of all of the users, and x, informally, denotes the features of all of the movies.

  • If we're able to take partial derivatives with respect to the different parameters, we can continue to update the parameters w and b as before. But now we need to optimize with respect to x as well, so we also update each of the features x using gradient descent in the same way.
  • It turns out that if you do this, you actually find pretty good values of w and b, as well as x. In this formulation of the problem, w and b are parameters, and x is also a parameter.
  • Then finally, to learn the values of x, we update x as x minus the learning rate times the partial derivative with respect to x of the cost J(w, b, x) (see the sketch after this list).
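Here is a sketch of one gradient descent step on that cost, with the partial derivatives written out by hand for the matrix layout used in the cost sketch above; a real implementation might instead use an autodiff framework to compute these derivatives, and the names here are illustrative.

```python
import numpy as np

def gradient_descent_step(X, W, b, Y, R, lambda_, alpha):
    """One simultaneous gradient descent update of the features X and parameters W, b."""
    # Prediction error, zeroed out wherever no rating exists.
    error = (X @ W.T + b - Y) * R
    # Partial derivatives of the cost with respect to w, b, and x.
    grad_W = error.T @ X + lambda_ * W   # dJ/dw^j
    grad_b = np.sum(error, axis=0)       # dJ/db^j
    grad_X = error @ W + lambda_ * X     # dJ/dx^i
    # Move every parameter a small step alpha in the downhill direction.
    return X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b
```

Repeating this step until the cost stops decreasing yields values of w, b, and x that fit the observed ratings.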

The algorithm we derived is called collaborative filtering. The name refers to the fact that multiple users have rated the same movie, collaboratively giving us a sense of what that movie may be like. That allows us to guess appropriate features for the movie, which in turn allows us to predict how other users who haven't yet rated that same movie might rate it in the future.

Collaborative filtering is this gathering of data from multiple users, a collaboration between users, that helps you predict ratings even for other users in the future.

I hope you found this explanation of the collaborative filtering algorithm interesting. Keep checking back for more new content.

You can connect with me on the following:

Linkedin | GitHub | Medium | email : akshitaguru16@gmail.com
