Wonder how a Recommendation System works?

Manraj Chalokia · Published in CodeX · Aug 14, 2021

We all use Netflix, YouTube, and Amazon extensively in our daily lives. Ever wondered how their recommendation systems work? Over time, the recommendations seem to have become increasingly personalized. Of course, why wouldn’t they?

We rate movies, like or dislike videos, and purchase products. There is a lot of data out there, but is it enough to build a robust recommendation system?

This article outlines how we could build recommendation systems for our applications. I will not delve too deep into the math and will try to keep things as simple as possible.

There are two major recommendation system techniques that I will talk about:

1. Content-based recommendation

2. Collaborative Filtering

Let’s talk a little about the content-based recommendation system first, and then I will hop on to the collaborative filtering technique: the math magic.

What exactly does content-based recommendation mean? How do we model this as a machine learning problem?

There are usually two major components in a recommender system: the user and the product. Amazon, Netflix, and YouTube have millions of users and millions of products (movies, documentaries, series, videos).

We can visualize our data as a data matrix (I will use D to refer to this matrix throughout the article) where each row indicates a user and each column indicates a product, or vice versa. In both cases, the size of this matrix is going to be huge, and when I say huge, I mean VERY HUGE. What does each cell in this N x M matrix represent (N: number of users, M: number of products)? These cell values can either be ratings or Boolean (true/false) values, say, whether a person did or did not watch a video.
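To make this concrete, here is a tiny toy version of D in Python. Every number and size below is made up for illustration; a real D would be millions by millions and stored in a sparse format:

```python
import numpy as np

# Toy data matrix D: rows = users, columns = products.
# 0 marks "no rating yet"; real systems store this in sparse
# formats (e.g., scipy.sparse) since N and M run into the millions.
D = np.array([
    [5, 0, 0, 3],
    [0, 4, 0, 0],
    [1, 0, 5, 0],
])

n_users, n_products = D.shape
sparsity = 1.0 - np.count_nonzero(D) / D.size
print(n_users, n_products, round(sparsity, 2))  # 3 4 0.58
```

Even in this toy example, more than half of the cells are empty, which previews the sparsity problem discussed next.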

This matrix is going to be very sparse. Why? How many movies do you rate on Netflix, and how many videos do you like or dislike? Very few, and this behavior is similar across most users, which makes the data matrix extremely sparse.

In content-based recommendation systems, you will have to come up with feature vectors for the users and for the product.

You would then use these feature vectors to predict the data matrix values (i.e., whether a person would give a 1-, 2-, 3-, 4-, or 5-star rating to a movie, or whether he/she would watch a particular video). This boils down to a standard classification/regression problem.
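A minimal sketch of this idea, assuming we already have hand-crafted feature vectors (all the feature values and ratings below are invented for illustration): concatenate the user and product vectors into one input per known cell, then fit a plain least-squares regressor.

```python
import numpy as np

# Hypothetical hand-crafted feature vectors, two features each,
# say "action content" and "drama content" (numbers made up).
user_feats = np.array([[0.9, 0.1],   # user 0 prefers action
                       [0.2, 0.8]])  # user 1 prefers drama
prod_feats = np.array([[1.0, 0.0],   # product 0: pure action
                       [0.0, 1.0],   # product 1: pure drama
                       [0.5, 0.5]])  # product 2: a mix

# The non-empty cells of D as (user, product, rating) triples.
observed = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 2.0), (1, 1, 5.0)]

# One input vector per known cell: user features + product features.
X = np.array([np.concatenate([user_feats[u], prod_feats[p]])
              for u, p, _ in observed])
y = np.array([r for _, _, r in observed])

# Fit a least-squares linear regressor on the observed ratings.
w = np.linalg.lstsq(X, y, rcond=None)[0]

# Predict the empty cell D[0, 2]: how would user 0 rate the mixed movie?
pred = float(np.concatenate([user_feats[0], prod_feats[2]]) @ w)
print(round(pred, 2))
```

Any regression or classification model could stand in for the least-squares step; the point is only that known cells become labeled training examples.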

There is a small problem with this way of building a recommendation system. The data matrix D is going to be very sparse, so the number of cells we can train on is substantially smaller than the number of cells we need to predict. How do we ensure that our content-based recommendation system is going to be robust?

Okay, let's jump to another elegant technique: Collaborative Filtering.

I would prefer to call it math’s magic. Machine learning borrows ideas from a lot of other domains: statistics, mathematics, probability, thermodynamics, and whatnot.

There is a technique named matrix factorization that can be used to build what is known as a collaborative filtering-based recommendation system.

A brief about Matrix Factorization:

Let’s say we have our data matrix/ratings matrix D. We can write it as the product of two separate matrices, U and P-transpose. This is exactly analogous to how we would write 10 = 2 * 5. The matrices U and P-transpose are the factors of the data matrix D.

Dimensions of matrix D: N * M (N: no. of users, M: no. of products, D: sparse matrix which contains our ratings)

Dimensions of Matrix U: N * f

Dimensions of Matrix P: M * f, implying that P-transpose would have a dimension of f * M.

I will use P.T as a short form for P-transpose.

(Don’t worry about f yet, it’s a hyperparameter, we can tune it later)
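The dimensions above are easy to sanity-check in a few lines. The sizes here are toy numbers, not real ones:

```python
import numpy as np

# Toy sizes (hypothetical): 6 users, 4 products, f = 2 latent factors.
N, M, f = 6, 4, 2
rng = np.random.default_rng(0)

U = rng.standard_normal((N, f))   # one f-dimensional row per user
P = rng.standard_normal((M, f))   # one f-dimensional row per product

D_hat = U @ P.T                   # (N, f) x (f, M) -> (N, M)
print(D_hat.shape)                # (6, 4)
```

So U x P.T always lands back on the N x M shape of D, no matter what f we pick.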

I will come to why we need these matrices in a bit, but first let’s figure out how to find U and P.

We solve optimization problems in machine learning, don’t we? Yes, and we can use stochastic gradient descent to find these matrices. Our loss is summed over all the non-empty cells (ratings) in D:

(over all non-empty cells (i, j) of D): ∑ [(Dij - Ui · Pj)²]

Dij: The non-empty rating given by User i on Product j.

Ui: the ith row of matrix U, an f-dimensional vector for user i.

Pj: jth row of matrix P or jth column of P.T (P’s transpose). This too is an f dimensional vector.
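Here is a minimal SGD sketch for this loss. All the sizes, ratings, the learning rate, and the epoch count below are arbitrary toy choices, not values from any real system:

```python
import numpy as np

rng = np.random.default_rng(42)
N, M, f = 5, 4, 2  # toy sizes; f is the hyperparameter from earlier

# The non-empty cells of D as (i, j, rating) triples (made-up data).
ratings = [(0, 0, 5.0), (0, 3, 3.0), (1, 1, 4.0),
           (2, 0, 1.0), (2, 2, 5.0), (3, 3, 2.0), (4, 1, 3.0)]

# Initialize the factor matrices with small random values.
U = 0.1 * rng.standard_normal((N, f))
P = 0.1 * rng.standard_normal((M, f))

lr = 0.05  # learning rate (arbitrary choice)
for epoch in range(500):
    for i, j, r in ratings:
        err = r - U[i] @ P[j]   # Dij - Ui . Pj
        Ui_old = U[i].copy()
        # Gradient steps on (Dij - Ui . Pj)^2 w.r.t. Ui and Pj.
        U[i] += lr * err * P[j]
        P[j] += lr * err * Ui_old

# The squared loss over the known ratings should now be tiny.
loss = sum((r - U[i] @ P[j]) ** 2 for i, j, r in ratings)
print(round(loss, 4))
```

Note that each rating only updates one row of U and one row of P, which is what makes SGD practical even when D has millions of rows and columns.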

Okay, now we have the matrices U & P.

U and P.T are the factors of our data matrix D, implying that U x P.T ≈ D, with of course some small errors (the loss is not always zero).

We fill the empty cells in D.

If we wish to fill D(2, 3), we take the dot product of the vectors U(2) and P(3), i.e., the 2nd row of U and the 3rd row of P. What does D(2, 3) mean? It is the predicted rating that user 2 would give to product 3.

We have a matrix D now, where all the ratings are filled. Now, I guess it’s pretty easy and straightforward to go ahead and recommend a movie/video to a user.

Since D is now filled (with no empty cells), we can sort the column values for each user (i.e., within each row) and recommend the top-ranked products. This is how collaborative filtering works. Isn’t this magic done using maths? 😀
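The final recommendation step can be sketched like this, starting from a hypothetical already-filled matrix (every value below is made up for illustration):

```python
import numpy as np

# A hypothetical filled-in matrix (= U x P.T), 2 users x 4 products.
D_filled = np.array([
    [4.8, 1.2, 3.9, 2.1],
    [1.1, 4.6, 2.0, 4.9],
])
already_rated = {0: {0}, 1: {3}}  # products each user has already rated

recommendations = {}
for user in range(D_filled.shape[0]):
    order = np.argsort(-D_filled[user])  # product indices, best first
    # Sort by predicted rating and skip products the user already knows.
    recommendations[user] = [int(j) for j in order
                             if j not in already_rated[user]][:2]

print(recommendations)  # {0: [2, 3], 1: [1, 2]}
```

Skipping already-consumed items is a practical detail the sorted row does not handle by itself, since those cells tend to hold the highest values.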

Before I put an ending note on this article: I chose the names U and P to make things a little intuitive.

U: N * f matrix represents f dimensional feature vectors for all the N users.

P: M * f matrix represents f dimensional feature vectors for all the M products/videos/movies.

I hope you like it!! 😀
