Matrix Factorisation — Part 7

Rakesh4real · Published in Fnplus Club · Jul 29, 2019

Collaborative filtering is a good method. But if it is that good, why are we seeking an alternative approach?

  • It is sensitive to NOISY DATA and SPARSE DATA
  • Good results are obtained only if both of the conditions below are satisfied:
    — Large data set is available
    — Data is nice and clean

MODEL BASED METHODS

Instead of finding items or users similar to each other, model-based methods apply Machine Learning and Data Science. The basic steps involved are:

  • Training our model with data
  • Predicting using the model
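
As a minimal sketch of these two steps with surpriselib (the SVD model and the built-in MovieLens dataset are just example choices here):

```python
from surprise import SVD, Dataset
from surprise.model_selection import train_test_split

# Load a small example dataset (MovieLens 100k, downloaded on first use)
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)

model = SVD()
model.fit(trainset)                # step 1: train our model with data
predictions = model.test(testset)  # step 2: predict using the model
```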

MATRIX FACTORISATION

  • It has many sub-categories of techniques.
  • CREEPY! It manages to find broader features of users/items on its own (e.g. Action or Romantic). The math doesn’t know what to call the newly found features, which are simply described by matrices.

MAIN IDEA:

Describe users and items as combinations of different amounts of the same underlying attributes.

For example, we may say Bob likes action movies as well as comedy movies. The same can be expressed as

Bob = 50% Action + 20% Comedy

Matching movies to users through these shared attributes is a good basis for predictions.
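
To make the idea concrete, here is a toy sketch (the numbers and the two-feature space are invented purely for illustration): if Bob and each movie live in the same latent-feature space, a predicted affinity is just a dot product.

```python
import numpy as np

# Hypothetical latent-feature space: (Action, Comedy)
bob          = np.array([0.5, 0.2])  # Bob = 50% Action + 20% Comedy
action_movie = np.array([0.9, 0.1])  # mostly action
comedy_movie = np.array([0.1, 0.8])  # mostly comedy

print(bob @ action_movie)  # 0.47 -> stronger match, recommend this first
print(bob @ comedy_movie)  # 0.21 -> weaker match
```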

PRINCIPAL COMPONENT ANALYSIS (PCA) ON MOVIE RATINGS

PCA is a statistical procedure which reduces the dimensions of our user-item matrix while losing as little important information as possible!

It is used as:

  • A dimensionality reduction tool (we can’t say what the newly reduced dimensions mean)
  • A feature extraction tool

Note: Often, the dimensions it finds correspond to features humans have learnt to associate with items (e.g. how ‘action-ey’ a movie is)

Whatever it is about movies that causes individuals to rate them differently, PCA extracts those ‘latent features’.
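
As a rough sketch of how this looks in code (the random matrix below merely stands in for a dense user-item rating matrix; real ratings would first need their missing values handled, as discussed later):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
R = rng.random((100, 50))             # stand-in for 100 users x 50 movies

pca = PCA(n_components=5)             # keep 5 latent features
user_factors = pca.fit_transform(R)   # each user is now described by 5 numbers
print(pca.explained_variance_ratio_)  # how much variance each feature retains
```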

If R (the original matrix) has some missing values, we can reconstruct R by filling in the blank values!

SIGMA MATRIX

It is a simple diagonal matrix, used only to scale the values we get to the proper magnitude:

  • We can multiply this Sigma matrix into M or U-Transpose, and R will still be just the product of two matrices
  • Reconstruct R to get the ratings of all users for all items!
  • We can even predict ratings using dot products, via the reconstruction R = M · Sigma · U-Transpose described above
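
Here is a minimal numpy sketch of this factor-and-reconstruct idea, on a tiny complete matrix (plain SVD needs every value present):

```python
import numpy as np

R = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])        # tiny, fully known user-item matrix

M, sigma, Ut = np.linalg.svd(R)     # factors: M, Sigma (as a vector), U-Transpose
Sigma = np.diag(sigma)              # promote Sigma to a diagonal matrix

R_reconstructed = M @ Sigma @ Ut    # recovers R
R_two_factors = (M @ Sigma) @ Ut    # fold Sigma into M: still a product of two matrices
```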

SINGULAR VALUE DECOMPOSITION (SVD)

A way of computing M, Sigma and U-Transpose all at once, efficiently, giving all three factors in one shot!

  • Very accurate results
  • Used widely during Netflix prize competition

But wait! Before applying PCA to the original matrix R, how do we deal with missing values?

  • You may fill missing values with averages of some sort… but there is a better way!

Assume we have some values for any given row/column in M and U-Transpose

  • We can treat this as an ‘error minimisation’ problem
  • Find the values of those rows and columns that best minimise the error on the known ratings in R
  • Use Stochastic Gradient Descent (Just one way to do it)
  • Apache Spark uses a different technique called ALS (Alternating Least Squares)
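
A bare-bones sketch of that SGD idea (the learning rate, factor count and toy ratings are placeholder choices, not tuned values):

```python
import numpy as np

# Known ratings as (user, item, rating) triples; everything else is missing
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
n_users, n_items, k = 3, 2, 2                  # k = number of latent factors

rng = np.random.default_rng(42)
M = rng.normal(scale=0.1, size=(n_users, k))   # user factors
U = rng.normal(scale=0.1, size=(n_items, k))   # item factors

lr = 0.01
for epoch in range(200):
    for u, i, r in ratings:
        err = r - M[u] @ U[i]      # error on a known rating only
        M[u] += lr * err * U[i]    # nudge both factor vectors
        U[i] += lr * err * M[u]    # to shrink that error

print(M @ U.T)  # the full product now also predicts the missing cells
```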

Note: You might be confused here, because we are talking about learning the values of the matrices M and U.

Actually, we are predicting ratings rather than computing them directly, which is what SVD does. We are not doing real SVD recommendations, because SVD can’t run on missing data. Hence, it is an ‘SVD-inspired algorithm’, not SVD itself. The winner of the Netflix prize was a variation of SVD called SVD++.

TIPS:

  • You can see the source code for the SVD algorithm used in surpriselib on GitHub, but it is quite complex to understand.
  • Never write your own algorithm. The odds are too high that you will mess up somewhere. Use a third-party library that has been used and validated by others.

SVD v/s SVD++

  • The difference lies in the loss function used during SGD
  • In SVD++, merely rating an item at all is an indication of some sort of interest in the item, no matter what the rating was!
  • SVD++ outperformed SVD
  • Worked really well!!
  • It is spooky, because we don’t fully understand how the latent features work
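
For reference, the SVD++ prediction rule from Koren’s 2008 paper folds that implicit signal in as an extra term, where N(u) is the set of items user u has rated at all:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \Big( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \Big)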

Given the usage of SVD in the Netflix competition and its success, a lot of research focused on SVD. Listed below are just some of the algorithms that were developed.

I. Factorisation Machines

  • Well suited for predicting ratings/clicks in a Recommendation System
  • More ‘general purpose’
  • Handles sparse data with ease, unlike SVD, which has to be shoehorned into the problem
  • Available in Amazon SageMaker
  • Only downside: WORKS ONLY WITH CATEGORICAL DATA!
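
For reference, Rendle’s second-order factorisation machine scores a feature vector x (one-hot encoded user, item, and any other categorical fields) as:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j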

II. TimeSVD++

  • Used for predicting the next item in a series of events.

III. Factorized Personalized Markov Chains

  • Same as above: used for predicting the next item in a series of events.

IV. PLSA, Probabilistic Latent Semantic Analysis (Promising!)

  • Extracts latent features from content itself
  • Use PLSA on ‘movie titles’ and ‘descriptions’ and match them up with users
  • As it is content-based, it doesn’t do well on its own. Combine it with user behaviour data.

HYPERPARAMETER TUNING(SVD)

Hyperparameters are the parameters chosen by the ML engineer in order to produce effective predictions. Finding the optimal hyperparameters is called hyperparameter tuning.

  • Big problem in ML in general
  • A lot of algorithms are very sensitive to parameters such as Learning rates
  • Hyperparameters vary with datasets

In the case of SVD, the hyperparameters are:

  • How many latent factors to extract?
  • How many dimensions to reduce to?

Note: Both of them depend on the nature of the data we have.

Example Code

The code below helps us understand how we can tune hyperparameters in surpriselib using the GridSearchCV package.

Hyperparameter Tuning
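
Along those lines, here is a sketch of tuning SVD with surpriselib’s GridSearchCV (the parameter values below are arbitrary examples, not recommendations):

```python
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')

# 'param_grid' defines what to test; every combination gets tried
param_grid = {'n_factors': [50, 100],
              'n_epochs': [20, 30],
              'lr_all': [0.005, 0.010]}

# 'measures' defines how to test; 'cv' sets the number of folds
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best RMSE found
print(gs.best_params['rmse'])  # the combination that achieved it
```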

The GridSearchCV package:

  • Helps in hyperparameter tuning
  • Defines a grid of the different parameter values you want to try (every possible combination is tested)
  • ‘param_grid’ defines what to test
  • ‘measures’ defines how to test
  • ‘cv’ sets the number of cross-validation folds

SLIM — Sparse Linear Methods (2011)

MAIN IDEA: a ‘sparse aggregation’ of the other things the user has rated, multiplied by a ‘sparse aggregation’ of weights (magic!)

  • Exciting results!
  • Outperformed all other recommenders (Hit Rate in Top-N recommendations)
  • Applicable not only to movies but to books, music, credit cards and much more!
  • Because the user has rated only some items, weights exist (and can be computed) only for those items; hence ‘sparse’ aggregation
  • Extend to the entire user-item rating matrix with the formula below
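
In the notation of the original SLIM paper (Ning & Karypis, 2011), with A as the user-item matrix and W as a sparse item-item weight matrix (diag(W) = 0, so an item cannot recommend itself):

\tilde{a}_{ij} = a_i^T w_j \qquad \Longrightarrow \qquad \tilde{A} = A W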

HOW WEIGHTS ARE COMPUTED?

  • Complicated
  • So complicated that many can’t implement it in code
  • Similar to SVD, but requires more advanced mathematical concepts like
    — Matrix Frobenius Norm
    — L1 Norm Regularisation
  • If SLIM comes your way, give it a second look — Promising Stuff!
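
For the curious, the optimisation those concepts come from (again following the 2011 paper) learns the weights by minimising a squared reconstruction error plus both regularisers:

\min_{W} \; \tfrac{1}{2}\|A - AW\|_F^2 + \tfrac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1 \quad \text{subject to } W \ge 0,\ \mathrm{diag}(W) = 0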
