Matrix Factorization Intuition for Movie Recommender System

Himang Sharatun
Published in SkyshiDigital · Apr 1, 2018

Have you ever heard about Barry Schwartz’s Paradox of Choice? Well, I suppose most of you have never heard about it, since cute cats on the internet are obviously more interesting than a fantastic book about how options affect human happiness, right? Anyway, this paradox was introduced by Barry Schwartz in his book, which argues that there is a saturation point beyond which more options become a bad thing. How? Yes, having choices means having the freedom to choose but, when we have too many choices, they paralyze us, because the complexity of the decision grows with the number of options we have. To put it simply, more choices means more time spent comparing the options. Not to mention, according to Schwartz, more choices make us more likely to regret our decision. Why? Because “Why didn’t I choose the other option?” is the first question we ask when there is a problem with our choice. Yup, options make us complain more about what we’ve got rather than accept it and be content.

Fortunately, in this era where machine learning seems to solve everything, this paradox can be avoided by implementing a recommender system. In fact, recommender systems have been implemented on popular websites such as YouTube, Amazon, etc. to help their customers make choices more easily and, more importantly, find the most suitable product for them. With a recommender system, we don’t need to search through millions of YouTube videos to find one that interests us. The recommender system simplifies our decision-making process by providing a list of 10–20 videos that it chooses based on our preferences. What a time to be alive.

There are a lot of algorithms for implementing a recommender system, and one that has attracted a lot of attention from researchers is Matrix Factorization (MF), popularized by Yehuda Koren et al. MF is an algorithm that performs better on sparse data than conventional collaborative filtering. In this article I would like to explain how MF works in plain, friendly English.

Preference is just a set of numbers

To understand the concept of a recommender system, especially MF, let’s pay attention to the following conversation:

Andy : Could you give me a movie recommendation? I want to watch something this weekend.
Bob : What about Black Panther?
Andy : What kind of movie is that?
Bob : It’s a unique superhero movie that combines action and comedy perfectly.
Andy : Hmm… I don’t really like action movies. I’m looking for something like Spotlight, which is not only a drama but also portrays journalistic philosophy with a strong moral message.
Bob : Oh I see, I think you’re going to love Nothing But the Truth. It’s about a journalist who is threatened with imprisonment for refusing to reveal a source to the CIA.
Andy : That’s it. I think that movie suits me better. Thanks, now I have something to watch this weekend.
Bob : No problem.

The conversation above is an analogy for how MF works. Andy is a user looking for a movie recommendation and Bob is the recommender system. At the end of the conversation, Bob can finally give Andy a suitable recommendation because he knows what kind of movie Andy likes and he knows a lot about movies. If Bob didn’t know what kind of movie Andy likes, or didn’t know what kind of movie Nothing But the Truth is, he would not be able to recommend it and would only give random recommendations that don’t suit Andy, like what happened at the beginning of the conversation when Bob recommended Black Panther. The way Bob discovers Andy’s movie preference from movies Andy has already watched, and then matches that preference with movie characteristics, is exactly the principle that MF tries to mimic. Now that I think about it, every cool machine learning algorithm, like neural networks, is usually inspired by human behavior. By the way, Black Panther is a cool movie too. I watched it with my co-workers and they kept “Hu hu hu”-ing around for weeks. LOL.

Let’s get back to our recommender system. Unlike Bob, a computer doesn’t understand abstract concepts such as how good a movie is, how cool the action is, how funny it is, etc. Therefore, in MF, we need to represent user preferences and movie characteristics as sets of numbers. So for Andy, because he likes dramatic journalism movies and doesn’t like action superhero movies, his preference could be described as follows:

and the characteristics of the movies mentioned in the conversation can be described as follows:

Since preference and movie characteristics are just numbers, the recommender system only needs to apply a dot product between Andy’s preference and every movie’s characteristics to find the most suitable movie for Andy. The table below shows the calculation used to decide the recommendation for Andy.

As you can see, Black Panther gets the lowest score while Nothing But the Truth gets the highest score, so it’s obvious that Bob will recommend Nothing But the Truth to Andy.
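To make that concrete, here is a tiny sketch of the scoring step in Python. The feature values below are entirely made up for illustration (imagine the dimensions are something like [drama, journalism, action, comedy]), but the mechanics are exactly the dot product described above.

```python
import numpy as np

# Hypothetical feature values, e.g. dimensions [drama, journalism, action, comedy]
andy_preference = np.array([0.9, 0.8, 0.1, 0.2])

movies = {
    "Black Panther":         np.array([0.2, 0.0, 0.9, 0.7]),
    "Spotlight":             np.array([0.9, 0.9, 0.1, 0.0]),
    "Nothing But the Truth": np.array([0.9, 0.9, 0.2, 0.0]),
}

# Score every movie with a dot product and list them from best to worst match
scores = {title: float(np.dot(andy_preference, features))
          for title, features in movies.items()}

for title, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{title}: {score:.2f}")
```

With these made-up numbers, Nothing But the Truth comes out on top and Black Panther comes out last, matching Bob’s reasoning.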

How on earth do we get these numbers?

Now you must be wondering how on earth we get this set of numbers, right? Don’t worry, I will explain it, but from now on let’s call this set of numbers features. If you are familiar with machine learning you must have at least heard the term feature. A feature is basically a number that describes a characteristic of an item in our data and, like I said before, it exists just to make the computer able to understand our data.

In the conversation, Bob knows that Andy likes dramatic journalism movies because Andy says that he likes “Spotlight”, and from that information Bob infers Andy’s movie preference. MF discovers the set of numbers using a similar principle: it uses the movies that the user has already watched and rated. So in this case, say Andy gives a 5 out of 5 rating to “Spotlight” and a 1 out of 5 rating to “Black Panther”. From that information MF will extract the features of Andy’s preference and the features of each movie’s characteristics. Easy, right?
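In code, the only thing MF actually gets to see is a sparse rating matrix like the one below. Andy’s two ratings come from the example above; the second user and the missing entries are made up for illustration.

```python
import numpy as np

# Rows = users, columns = [Black Panther, Spotlight, Nothing But the Truth]
# 0 marks a rating we don't know yet and would like the model to predict.
ratings = np.array([
    [1, 5, 0],   # Andy: 1/5 for Black Panther, 5/5 for Spotlight, hasn't seen NBTT
    [5, 0, 2],   # another, entirely hypothetical user
])
```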

Nope, it is complex as hell. To extract features from ratings we need to use statistical modeling to create an update rule for training. Yes, an MF recommender system, just like any other machine learning algorithm, requires training to create a model. In MF we will create a prediction model to predict which movies a user might give a high rating to. Before we go to statistical modeling hell, let’s look at the picture below, which explains the general process of an MF recommender system.

To implement MF, we first generate random numbers, and then, through an iterative training process, those numbers are updated until their dot product reproduces the previous ratings as closely as possible. Once we have the preference and movie characteristic features, we multiply the two matrices (a dot product again) to get the rating predictions. These rating predictions are then sorted to decide which movies are the most suitable for the user.
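In matrix form, that process boils down to approximating the known rating matrix with the product of two smaller, initially random matrices. A rough skeleton might look like the sketch below (the names and sizes are my own; the training step that refines the random numbers is covered in the rest of the article):

```python
import numpy as np

def init_factors(num_users, num_movies, num_features, seed=0):
    """Start from random user-preference and movie-characteristic features."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_users, num_features))   # user preference features
    Q = rng.random((num_movies, num_features))  # movie characteristic features
    return P, Q

P, Q = init_factors(num_users=2, num_movies=3, num_features=2)

# Rating predictions are just the dot product of the two feature matrices ...
predicted_ratings = P @ Q.T

# ... and recommendations come from sorting each user's predictions, best first.
recommendation_order = np.argsort(-predicted_ratings, axis=1)
```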

Let’s go to statistical modeling hell

Now, we finally have to go through the process of creating a statistical model for training. The first thing we need to do to create any statistical model is to determine its goal. In our case, the goal is to minimize the error between the predicted ratings and the known ratings. Why? Because in MF we use previously rated movies as the reference for discovering user preferences and movie characteristics. Remember that MF is a predictive machine learning model, and the simplest way to build a prediction model is to compare its output with known data. Therefore our model’s objective is to minimize the error when we compare the predictions with the known data, i.e. the ratings.
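Written out, that objective is just the squared error over the known ratings (I have also slipped in the regularization term β that we will meet later), in the standard MF notation:

```latex
\min_{P,\,Q} \; \sum_{(i,j)\,:\;r_{ij}\ \text{known}}
\left[ \bigl( r_{ij} - p_i \cdot q_j \bigr)^2
     + \beta \bigl( \lVert p_i \rVert^2 + \lVert q_j \rVert^2 \bigr) \right]
```

Here p_i is user i’s preference feature vector and q_j is movie j’s characteristic feature vector, so p_i · q_j is exactly the rating prediction from the dot product above.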

The training process in MF basically just keeps updating the preference and movie characteristic features until we find the feature representation with the minimum error compared to the known data. To update the features, we could just do it randomly, but of course that would take a long time to find the best features. Therefore we need to define a rule that updates the features efficiently. Ideally, we want an update rule that reduces the error in each iteration, so that it minimizes the number of iterations needed to discover the features. In statistics, this problem is solved with gradient descent, as depicted in the following formula.

Source: Yeung, 2010
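For reference, the standard update rule from that tutorial, written out with α as the learning rate and β as the regularization term (both explained below), where q_{kj} is the k-th characteristic feature of movie j and p_{ik} the k-th preference feature of user i:

```latex
e_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj}

p_{ik} \leftarrow p_{ik} + \alpha \bigl( 2\, e_{ij}\, q_{kj} - \beta\, p_{ik} \bigr)

q_{kj} \leftarrow q_{kj} + \alpha \bigl( 2\, e_{ij}\, p_{ik} - \beta\, q_{kj} \bigr)
```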

For those who have never heard of gradient descent, it is simply a method that measures the direction of the error using the gradient. So, for example, if the known rating is 4 but the prediction is 2.5, the update rule should push the features towards higher numbers to reduce the error, not the opposite.
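In numbers, that example is simply:

```latex
e = r - \hat{r} = 4 - 2.5 = 1.5 > 0
```

and a positive error tells the update rule to push the features up.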

To implement gradient descent we need to partially differentiate the error with respect to the preference features and the movie characteristic features separately. Remember that our rating prediction is affected by both of these features, and if we took an ordinary derivative our update rule would become more complex, since we would need to take both features into consideration at the same time. With partial derivatives, we assume that one of the features is constant, and thus does not affect the prediction, and we only differentiate with respect to the other feature to find the direction of the update. For a detailed explanation of how gradient descent works, you could watch this video.
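Concretely, holding the movie features fixed while differentiating the squared error with respect to the user features (and vice versa) gives the standard result:

```latex
\frac{\partial\, e_{ij}^2}{\partial p_{ik}} = -2\, e_{ij}\, q_{kj}
\qquad\text{and}\qquad
\frac{\partial\, e_{ij}^2}{\partial q_{kj}} = -2\, e_{ij}\, p_{ik}
```

Moving against these gradients (and adding the regularization term) is exactly what the update rule above does.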

You need to understand that gradient descent ONLY tells you in which direction the update should go; it DOESN’T tell you how big the update should be. Yes, you could just use the raw gradient to update the features, but the steps would be too big, causing the error to bounce up and down before it finds the ideal features, or in the worst case preventing it from ever finding them. Therefore, in our update rule, we use a learning rate to regulate how big each update is. Determining the optimal learning rate takes trial and error, because if it is too small the training takes more time, and if it is too big the updates overshoot.

The classic problem in any supervised machine learning is overfitting, a condition where the model manages to predict accurately on the data used in the training process but is not able to predict accurately on new data. Therefore, in our update rule, we use beta, the regularization term, to avoid overfitting. Regularization works by limiting the influence of any single feature on the model. So instead of assuming that our training data is perfect and extracting everything from it, we only partially extract the features from the data. That way, the model created by training can predict well not only for inputs inside the training data but also for new ones.
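Putting the update rule, the learning rate, and the regularization term together, a minimal training loop might look like the sketch below. The variable names and the default values for alpha, beta, and the number of epochs are my own choices for illustration, not tuned values.

```python
import numpy as np

def matrix_factorization(R, num_features=2, alpha=0.002, beta=0.02, epochs=5000):
    """Factor the known entries of R (0 = unknown) into user and movie features."""
    num_users, num_movies = R.shape
    rng = np.random.default_rng(0)
    P = rng.random((num_users, num_features))   # user preference features
    Q = rng.random((num_movies, num_features))  # movie characteristic features

    for _ in range(epochs):
        for i in range(num_users):
            for j in range(num_movies):
                if R[i, j] > 0:                      # only learn from known ratings
                    e = R[i, j] - P[i] @ Q[j]        # prediction error
                    # Gradient step, scaled by the learning rate alpha,
                    # with regularization beta shrinking the features
                    P[i] += alpha * (2 * e * Q[j] - beta * P[i])
                    Q[j] += alpha * (2 * e * P[i] - beta * Q[j])
    return P, Q

# Predicted ratings, including the previously unknown ones, are just P @ Q.T
P, Q = matrix_factorization(np.array([[1, 5, 0], [5, 0, 2]]))
predictions = P @ Q.T
```

In practice you would also track the total error each epoch and stop early once it stops improving, rather than running a fixed number of epochs.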

Usually in my articles I put the full code to explain how an algorithm is implemented, but for this article I will ask you to visit this amazing blog, which explains how MF is implemented in Python. If you have further questions or want to discuss MF implementations for recommender systems, please feel free to contact me through my email: himang@skyshi.io.

See you in the next article.
