Solving that #NetflixStruggle

Recommenders


One of the biggest #firstWorldProblems is deciding what to watch on Netflix. The struggle is very much real, but there is a solution: recommenders.

There are three general types of recommenders used:
1. Context based (Pandora’s Music Genome)
2. User based (Netflix, Last.fm)
3. Item based (Amazon’s)
Types 2 and 3 are collaborative filters, meaning they use information provided by users to recommend items to other users.
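To make the user-based vs. item-based distinction concrete, here is a minimal sketch on a hypothetical toy ratings matrix (the data and the cosine-similarity choice are mine, for illustration only): item-based filters compare columns, user-based filters compare rows.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items (0 = unrated).
# Hypothetical data for illustration only.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Item-based: compare column vectors (how users rated the two items).
sim_items_0_1 = cosine_sim(ratings[:, 0], ratings[:, 1])

# User-based: compare row vectors (how the two users rated all items).
sim_users_0_1 = cosine_sim(ratings[0], ratings[1])
```

Either way, the recommender never needs to know anything about the items themselves, only how users rated them.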

To fuel our recommenders, we need:


User preferences: our feature matrix, with users as columns and their preferences for our products as rows. From these preferences, through clustering of our users and our products, we can find latent qualities to featurize on.

For a basic movie recommender dependent upon user ratings as features:

Python code:

def build_feature_matrix(self):
    # Merge ratings with item metadata, then pivot into a
    # movie-by-user matrix; -1 marks movies a user has not rated.
    main = pd.merge(self.data, self.item, how='inner', on='movie_id')
    main = pd.pivot_table(main, values='rating', columns='user_id', index='movie_id')
    main = main.fillna(-1)
    return main

Product similarity: a matrix of our products and their similarities, usually measured as a distance between two products' qualities.

Python code:

def compute_similarity(self, feature):
    # Pairwise distances between movies, computed only over
    # users who rated both movies (a rating of -1 means unrated).
    sim_matrix = np.zeros((self.rows, self.rows))
    for i in range(self.rows):
        for j in range(i + 1, self.rows):
            rated_both = (feature.iloc[i, :] != -1) & (feature.iloc[j, :] != -1)
            vec1 = feature.iloc[i, :][rated_both]
            vec2 = feature.iloc[j, :][rated_both]
            sim_matrix[i, j] = distance.euclidean(vec1, vec2)
    # Mirror the upper triangle to make the matrix symmetric.
    sim_matrix = sim_matrix.T + sim_matrix
    return sim_matrix

With our feature similarity matrix in hand…

We can then build a recommendation matrix per user.

  • Within this matrix, movies the user has seen are compared against movies the user has not seen, using values from the similarity matrix.
  • Calculate a weighted value for each unseen movie by multiplying the similarity between the seen and unseen movies by the user's rating for the seen movie.
  • With our weighted values, find the user's predicted rating by summing the weighted values and dividing by the sum of the similarities.
  • Sort the predicted values and return the top predicted ratings for that user.
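The steps above can be sketched in a few lines. This is a minimal illustration on hypothetical toy data, and it assumes a similarity matrix where higher means more similar (with the Euclidean distances computed earlier, smaller means more similar, so they would need to be inverted first):

```python
import numpy as np

# Hypothetical item-item similarity matrix (higher = more similar)
# and one user's ratings, with -1 marking unseen movies.
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])
ratings = np.array([5.0, -1.0, 3.0])

seen = ratings != -1
predictions = {}
for movie in np.where(~seen)[0]:
    # Weight each seen movie's rating by its similarity to this unseen movie.
    weights = sim[movie, seen]
    weighted = weights * ratings[seen]
    # Predicted rating: weighted sum, normalized by total similarity.
    predictions[movie] = weighted.sum() / weights.sum()

# Sort unseen movies by predicted rating, best first.
recommendations = sorted(predictions, key=predictions.get, reverse=True)
```

Normalizing by the sum of similarities keeps the prediction on the same scale as the original ratings.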

But… GraphLab has a method for all of the above.

Recently, my class at Zipfian was pitched by GraphLab. Their package, geared towards data scientists, aims to bridge the gap between Big Data storage and analytics tools.

With that, here’s a quick look at code doing the above using GraphLab:

import graphlab as gl

# GraphLab's main data structure: the SFrame, built for sparse data.
main = gl.SFrame.read_csv(datafile)

# Hold out 20% of the data to validate against:
(trainN, testN) = main.random_split(.8)

# Create the recommender with user, item, and target columns.
# The item_means method predicts the target from each item's
# mean rating, regardless of user.
model = gl.recommender.create(trainN, user_column='userid',
                              item_column='track', target_column='rating',
                              method='item_means')

# Get the top recommendations:
rec = model.recommend()

# To see how well our model did, we use the RMSE metric:
rmse = model.evaluate(testN)
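For context on that last line, RMSE is just the root of the mean squared difference between actual and predicted ratings; a minimal sketch on hypothetical numbers:

```python
import numpy as np

# Hypothetical actual vs. predicted ratings for four movies.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 2.5])

# RMSE: root of the mean squared error. Lower is better;
# it is in the same units as the ratings themselves.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
```

An RMSE of 0 would mean the model predicted every held-out rating exactly.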


Overall…

Our discoveries in life, from movies to daily purchases, are made better by recommenders. Recommenders take information about our preferences, featurize it, find the similarity between products, and from there recommend products.

They work best with more data. The more these machines know about your likes and dislikes, the better the recommendation. So, the next time you bust Netflix’s chops for giving you a bad recommendation, it’s probably your fault for not rating the movies you’ve seen when asked, or for sharing your account without making individual profiles!

PS. Netflix currently uses a recommender that blends a little bit of the above, with a lot of their version of context based recommendation. This is also very similar to Pandora’s recommender.
