Clustering Algorithm

Chitra Rajasekaran
4 min read · Apr 12, 2018


We all know about Netflix, the world's largest on-demand internet streaming and online DVD movie rental service provider.

  • Founded on August 29, 1997 by Marc Randolph and Reed Hastings; headquartered in Los Gatos, California.
  • It has 69 million members in over 60 countries, enjoying more than 100 million hours of TV shows and movies per day.
  • Its most successful original show: House of Cards.
Netflix Founders

Netflix — A view of recommendations

Netflix puts it as: “The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.” (www.netflixprize.com)

So what did they want? A substantial improvement to their existing recommendation system, for which they offered a $1 million prize.

Now that we have the background, let's move on to the objective and the dataset.

Objective: “Which movie will you like?”, given that you have seen ‘The Dark Knight’, ‘Batman Begins’ and ‘Pineapple Express’.

How do we figure this out?

Assumption: We have the Netflix movie rating dataset and RStudio installed.

First let us take some time to go through the clustering algorithms.

The idea of clustering is to group items together based on their attributes. The data is typically unlabeled, and similarity is measured using the distance between two points. Example: consider going to a board game shop and sorting a pile of game boxes into groups of similar games.

Types of Clustering:

  1. K-means — Partitional clustering
  • A memory-based method that measures the distance between the query instance (movieID, userID) and every instance in the training set. We find the K training instances with the smallest distance from the query instance and average their ratings; that average is the predicted rating for the query instance. We can use the Euclidean, Manhattan, Minkowski or Mahalanobis distance formulas, and which one to use depends entirely on the dataset (a small sketch follows this list).
  • We can ask: how important is this distance measure?
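
To make the distance formulas above concrete, here is a minimal R sketch comparing two hypothetical rating vectors; the users, movies and values are made up for illustration.

```
# Hypothetical ratings by two users for the same four movies
john  <- c(4, 5, 3, 2)
chris <- c(5, 4, 3, 1)

# Euclidean and Manhattan distances computed by hand
sqrt(sum((john - chris)^2))   # Euclidean: sqrt(1 + 1 + 0 + 1) = 1.73
sum(abs(john - chris))        # Manhattan: 1 + 1 + 0 + 1 = 3

# Base R's dist() supports the same measures (and Minkowski via a p parameter)
dist(rbind(john, chris), method = "euclidean")
dist(rbind(john, chris), method = "minkowski", p = 3)
```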

It is very important to normalize the features. For example, what if I am too generous with my movie ratings and someone else is too conservative? We would be comparing ratings with strong personal biases, which can result in a flawed similarity measure. The solution is to normalize the data.

  • We can then ask: what is this normalization? How will it change my ratings, and won't we lose the original ratings?

We calculate the mean rating for every user, along with the standard deviation of that user's ratings. For every rating, we subtract the user's mean and divide by the user's standard deviation. With the ratings on a common scale, the people with the least distance from you contribute more to the prediction than those farther away.
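
As a minimal sketch (with a made-up ratings table), the per-user normalization described above can be done in base R like this:

```
# Hypothetical ratings table: userId, movieId, rating on a 1-5 scale
ratings <- data.frame(
  userId  = c("John", "John", "Chris", "Chris", "Vicky", "Vicky"),
  movieId = c(1, 2, 1, 3, 2, 3),
  rating  = c(5, 3, 4, 2, 1, 2)
)

# Mean and standard deviation of each user's ratings
user_mean <- tapply(ratings$rating, ratings$userId, mean)
user_sd   <- tapply(ratings$rating, ratings$userId, sd)

# Subtract the user's mean and divide by the user's standard deviation
ratings$normalized <- (ratings$rating - user_mean[ratings$userId]) /
  user_sd[ratings$userId]
ratings
```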

Imaginary Example:

  • Suppose we have a dataset for 3 users (John, Chris, Vicky) along with their userIDs and movie ratings.
  • We calculate the mean and standard deviation vectors.
  • We normalize the data.
  • Our query instance is q(John, Batman): we wish to calculate how much John will like Batman on a scale of 1–5.
  • To predict this, we need to identify John's neighbors who rated this movie.
  • The users who rated this movie are Chris and Vicky. Based on the distance formula, we find that the nearest neighbor is Chris (say, with a value of 0.756).
  • This prediction output is in normalized form. To get the actual predicted rating, we compute (0.756 * s.d.) + mean rating, as sketched below.
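
To turn the normalized value back into a 1–5 rating, the formula from the last step can be applied directly; John's mean and standard deviation below are hypothetical numbers chosen for illustration.

```
# Normalized prediction from the nearest neighbour (value from the example above)
normalized_prediction <- 0.756

# John's own rating statistics (hypothetical)
john_mean <- 3.5
john_sd   <- 1.2

# De-normalize: (normalized value * s.d.) + mean rating
predicted_rating <- normalized_prediction * john_sd + john_mean
predicted_rating   # roughly 4.4 on the 1-5 scale
```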

2. Agglomerative — Hierarchical Clustering

  • A clustering-based nearest neighbor approach in which we obtain the genres for every movie from external sources. We create a vector with one cell per genre and count the number of movies the user has rated in that particular genre. This captures the collective opinion of the users.
  • Given a query instance q(movieID, userID), we find all the genres for that movie. For each genre, we calculate the distance of the movie from the cluster centres for that genre and combine the two clusters with the smallest centroid distance. We average the per-genre predicted ratings to get the predicted rating for q. (A clustering sketch follows this list.)
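
Here is a minimal sketch of that idea in R, with made-up genre counts: build a user-by-genre matrix of how many movies each user has rated per genre, then cluster the users agglomeratively with hclust.

```
# Hypothetical counts of movies each user has rated per genre
genre_counts <- matrix(
  c(10, 2, 0,    # John
     8, 1, 1,    # Chris
     0, 6, 9),   # Vicky
  nrow = 3, byrow = TRUE,
  dimnames = list(c("John", "Chris", "Vicky"),
                  c("Thriller", "Horror", "Comedy"))
)

# Agglomerative (bottom-up) clustering: repeatedly merge the two closest clusters
hc <- hclust(dist(genre_counts), method = "average")
plot(hc)                 # dendrogram of the users

# Cut the tree into two clusters, as in the example below
cutree(hc, k = 2)
```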

Imaginary Example:

  • We use the same example as above (a dataset for 3 users: John, Chris and Vicky, along with their userIDs and movie ratings).
  • We find the genres for every movie.
  • We convert the (user, movie) data to (user, genre) data.
  • We cluster the users into two clusters.
  • The query instance, as before, is q(John, Movie1).
  • The per-genre clusters correspond to the genres of ‘Movie1’.
  • For example, if the two output genres are ‘Thriller’ and ‘Horror’ and John rates ‘Thriller’ movies, we try to find the single (k = 1) nearest cluster for the Thriller genre, as sketched below.
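
Here is a minimal sketch of that final step, with hypothetical centroid values and ratings: pick the Thriller cluster whose centroid is closest to John's profile, then average that cluster's ratings of Movie1 as the per-genre prediction.

```
# Hypothetical Thriller-genre cluster centroids and John's Thriller profile value
thriller_centroids <- c(cluster1 = 9.0, cluster2 = 1.5)
john_thriller      <- 10

# k = 1: the single nearest cluster by centroid distance
nearest_cluster <- names(which.min(abs(thriller_centroids - john_thriller)))
nearest_cluster            # "cluster1"

# Average Movie1 rating inside that cluster is the Thriller-genre prediction;
# averaging across genres would give the final predicted rating for q(John, Movie1)
movie1_ratings_in_cluster <- c(4, 5, 4)   # hypothetical
mean(movie1_ratings_in_cluster)
```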

The Netflix dataset and the clustering algorithm, with the step-by-step code and process to follow in RStudio, can be found at the link below.
