Singular Value decomposition (SVD) in recommender systems for Non-math-statistics-programming wizards
This is an attempt to give some intuition behind SVD usage and recommender systems. So this article is not a mathematical doctrine, or a research paper reference. It serves as a high level intuition on this topic for beginners exploring SVD in recommender systems.
SVD in the context of recommendation systems is used as a collaborative filtering (CF) algorithm. For those of you who don’t know, collaborative filtering is a method to predict a rating for a user item pair based on the history of ratings given by the user and given to the item. Most CF algorithms are based on user-item rating matrix where each row represents a user, each column an item. The entries of this matrix are ratings given by users to items.
SVD and Matrix factoriztion
SVD is a matrix factorization technique that is usually used to reduce the number of features of a data set by reducing space dimensions from N to K where K < N. For the purpose of the recommendation systems however, we are only interested in the matrix factorization part keeping same dimensionality. The matrix factorization is done on the user-item ratings matrix. From a high level, matrix factorization can be thought of as finding 2 matrices whose product is the original matrix.
Each item can be represented by a vector `qi`. Similarly each user can be represented by a vector `pu` such that the dot product of those 2 vectors is the expected rating
`qi` and `pu` can be found in such a way that the square error difference between their dot product and the known rating in the user-item matrix is minimum
For our model to be able to generalize well and not over-fit the training set, we introduce a penalty term to our minimization equation. This is represented by a regularization factor λ multiplied by the square sum of the magnitudes of user and item vectors.
To illustrate the usefulness of this factor imagine we have an extreme case where a low rating given by a user to a movie with no other rating from this user. The algorithm will minimize the error by giving `qi`a large value. This will cause all rating from this user to other movies to be very low. This is intuitively wrong. By adding the magnitude of the vectors to the equation, giving vectors large value will minimize the equation and thus such situations will be avoided (more here).
To reduce the error between the predicted and actual value, the algorithm make use of some characteristics of the dataset. In particular for each user-item (u,i) pair we can extract 3 parameters. µ which is the average ratings of all items, `bi`which is the average rating of item i minus µ and `bu` which is the average rating given by user u minus µ which makes the expected rating:
Thus, the final equation to minimize is:
Minimizing with Stochastic Gradient Descent (SGD)
The above equation is minimized using stochastic gradient descent algorithm. From a high level perspective SGD starts by giving the parameters of the equation we are trying to minimize initial values and then iterating to reduce the error between the predicted and the actual value each time correcting the previous value by a small factor. This algorithm uses a factor called learning rate γ which determines the ratio of the old value and the new computed value after each iteration. Practically, when using high γ one might skip the optimal solution whereas when using low γ values a lot of iterations are needed to reach optimal value (more here)
Now that you have enough information about SVD, it is time to use it. Surprise is a excellent python library that is well documented and easy to use. It implements SVD and other algorithms for recommender systems. Here is a simple tutorial to get you started.
Disclaimer: I am in no way a mathematics professional, or a recommendation systems expert. I work on these topics from pure personal interest. So there might be some mistakes and any correction is more than welcome. It will benefit me and all the readers :)