Model-Based Recommendation System with Matrix Factorization — the ALS Model and the Math Behind It

Matrix factorization is widely used in recommendation systems. We are going to build one with the ALS model and explain the math behind it.

Jeffery chiang
Analytics Vidhya
4 min read · May 16, 2021


Collaborative filtering is the most widely implemented and mature type of recommendation system. We are going to build a model-based recommender using matrix factorization, with the ALS model provided by PySpark.

GitHub Link: https://github.com/chiang9/Recommendation_system_pyspark/blob/main/ALS_model/movielen%20ALS.ipynb

Figure 1. Utility Matrix

A sparse matrix R can be built from user-item interactions and their ratings. The scores can be obtained from explicit user ratings or derived from user behavior. Collaborative filtering methods use the matrix R to produce recommendations.

Figure 2. Matrix Factorization

The goal of the matrix factorization method is to decompose the utility matrix into a user latent matrix U and a product latent matrix P, such that

R ≈ U Pᵀ

There are many methods to factorize the utility matrix, such as Singular Value Decomposition (SVD) and Probabilistic Latent Semantic Analysis (PLSA). Alternating Least Squares (ALS) is an iterative process that optimizes the factorization model.

Alternating Least Squares (ALS)

The ALS model is one of the most popular methods in collaborative filtering. To see the math behind the model, we first define the objective using the RMSE loss:

RMSE = √( (1/|K|) Σ_{(u,i)∈K} (r_ui − u_uᵀ p_i)² )

where the real rating r_ui comes from R, the prediction u_uᵀ p_i comes from U Pᵀ, and K is the set of observed (user, item) pairs.

Assume there are m users and n items. Then R is an m × n matrix, U is m × k, and P is n × k, where k is the number of latent factors.

In order to avoid overfitting, we add an L2-norm regularization term to our objective function, such that

L = Σ_{(u,i)∈K} (r_ui − u_uᵀ p_i)² + λ ( Σ_u ‖u_u‖² + Σ_i ‖p_i‖² )

Objective function

Next, we take the partial derivative of the objective with respect to each user vector u_u and set it to zero:

∂L/∂u_u = −2 Σ_i (r_ui − u_uᵀ p_i) p_i + 2λ u_u = 0

Solving this regularized least-squares problem gives

u_u = (Pᵀ P + λI)⁻¹ Pᵀ r_u

where the sum and the matrix P are restricted to the items rated by user u, and r_u is the vector of those ratings.

By applying the same process, we obtain the solution with respect to each item vector p_i:

p_i = (Uᵀ U + λI)⁻¹ Uᵀ r_i

where U and r_i are restricted to the users who rated item i.

Therefore, we have closed-form solutions for both U and P. By fixing one, we can solve exactly for the other. By alternating between the latent matrices U and P iteratively, we optimize the factorization of the utility matrix.
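The alternating updates above can be sketched in plain NumPy. This is a toy illustration of the algorithm, not Spark's distributed implementation; the rating matrix, k, λ, and iteration count below are made-up example values:

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20):
    """Alternating Least Squares on a rating matrix R.

    R    : (m, n) rating matrix, zeros where unobserved
    mask : (m, n) binary matrix, 1 where a rating is observed
    k    : number of latent factors
    lam  : L2 regularization strength (lambda)
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k)) * 0.1
    P = rng.standard_normal((n, k)) * 0.1
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix P, solve (P^T P + lambda I) u_u = P^T r_u for each user,
        # using only the items that user actually rated.
        for u in range(m):
            obs = mask[u] == 1
            U[u] = np.linalg.solve(P[obs].T @ P[obs] + reg,
                                   P[obs].T @ R[u, obs])
        # Fix U, solve the symmetric problem for each item.
        for i in range(n):
            obs = mask[:, i] == 1
            P[i] = np.linalg.solve(U[obs].T @ U[obs] + reg,
                                   U[obs].T @ R[obs, i])
    return U, P
```

Each inner solve is exactly the closed-form update derived above, so the loss is non-increasing at every half-step.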

Data Source

In this example, we will be using the MovieLens dataset (ml-100k).

link: https://grouplens.org/datasets/movielens/

Let’s Start

In this example we are going to use PySpark with the MovieLens dataset.

We cache the train and test DataFrames so that later Spark actions can reuse them without recomputing the splits.

Next, we use the CrossValidator to tune the hyperparameters. In the Spark ALS model we can set various parameters, such as rank, maxIter, and regParam; the full list can be found at https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#pyspark.ml.recommendation.ALS

Here rank is the number of latent factors, k.

We can extract the best model and its parameters from the CrossValidator after training.

We can generate item recommendations for specific users, or find users who might be interested in a specific item, using the following methods: recommendForAllUsers, recommendForAllItems, recommendForUserSubset, and recommendForItemSubset.

In Addition…

There are two machine learning modules in Spark: spark.ml and spark.mllib.

The spark.ml module uses the DataFrame API, while spark.mllib uses RDDs; mllib is in maintenance mode and being slowly deprecated.

Conclusion

The ALS model is a powerful tool for building recommendation systems, and Apache Spark provides a convenient API for it. However, most of the time the model alone does not handle problems like data sparsity and cold start well. We need to combine it with other strategies and user-behavior analysis.

Thank you for reading, and wish you a wonderful day.
