Build your personalized movie recommender with Scala and Spark

Knoldus Inc.
Knoldus - Technical Insights
3 min readJun 30, 2016

In this blog I will explain what is a recommendation engine in general, and How to build a personalized recommendation model using Scala and Spark Collaborative filtering algorithm.

What is a Recommendation Engine?

I assume you’ve shopped online for books or visited movie review sites to pick top rated movies to watch. You must have been seen top rated movie lists which have been voted as best movies. That is not not a recommendation. When you browse through several categories or clicked several catchy posters and you get lines like, “People who bought/watched this also bought/watched that”. That is an example of recommendation. It assumes that if you share a similar taste with someone, you are going to like what they liked.

Preparing the Data Set

MovieLens is a database which was prepared by the GrpupLens Research Project at the University of Minnesota. The data set we are going to use for building the recommendation engine contains over hundred thousands rated movies. Each user has rated at least 20 movies. Ratings of different sizes (1 million,

Ratings

User IDMovie IDRating115123223244

Movies

Movie IDNameGenre1Toy Story (1995)Animation|Children’s|Comedy2Jumanji (1995)Adventure|Children’s|Fantasy3Grumpier Old Men (1995)Comedy|Romance4Waiting to Exhale (1995)Comedy|Drama

About the Algorithm:

To make a preference prediction for any user, Collaborative filtering uses a preference by other users of similar interests and predicts movies of your interests which is unknown to you. Spark MLlib uses Alternate Least Squares (ALS) to make recommendation. Here is a glimpse of collaborative filtering method:













































MoviesUsersM1M2M3M4U12431U20044U33223U42?3?
Here user ratings on movies are represented as a matrix where a cell represents ratings for a particular movie by a user. The cell with “?” represents the movies which is user U4 is not aware of or haven’t seen. Based on the current preference of user U4 the cell with “?” can be filled with approximated rating of users which are having similar interest as user U4.

Building Recommendation Model

Spark MLlib uses Alternate Least Squares (ALS) to build recommendation model.

Loading the Data as Rating

[code lang=”scala”]
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile(“your_dir/ratings.dat”)
val ratings = data.map(_.split(“::”) match { case Array(user, item, rate, timestamp) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val movies = movieRecommendationHelper.getMovieRDD.map( _.split(“::”))
.map { case Array(movieId,movieName,genre) => (movieId.toInt ,movieName) }
val myRatingsRDD = movieRecommendationHelper.topTenMovies //gets 10 popular movies. See <a href=”https://github.com/akashsethi24/Machine-Learning/blob/master/src/main/scala/movie/RecommendMovie.scala">Code</a> for for details
val training = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 <= 3 }.persist val test = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 > 3}.persist
[/code]

Training the Model

The rating data is split into two part. The variable training contains 47% of data for training. Rest is kept for evaluating the model. We are going to join the Rating of user U4 (the user for which we are going to predict) for the movies he has seen so far so that we can learn taste of user U4 about movies.

[code lang=”scala”]
val rank = 8
val iteration = 10
val lambda = 0.01
val model = ALS.train(training.union(myRatingsRDD), rank, iteration, lambda)
val moviesIHaveSeen = myRatingsRDD.map(x => x.product).collect().toList
val moviesIHaveNotSeen = movies.filter { case (movieId, name) => !moviesIHaveSeen.contains(movieId) }.map( _._1)
[/code]

Evaluating the Model

Now let us evaluate the model on test data and get the prediction error. The prediction error is calculated as Root Mean Square Error (RMSE)

[code lang=”scala”]
val predictedRates =
model.predict(test.map { case Rating(user,item,rating) => (user,item)} ).map { case Rating(user, product, rate) =>
((user, product), rate)
}.persist()

val ratesAndPreds = test.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictedRates)

val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) => Math.pow((r1 — r2), 2) }.mean()
[/code]
The RMSE value was: 0.95.

Making Recommendation for User U4

Here a new user id (0) is associated with each movie user u4 has not seen so far. And MatrixFactorization model is used to predict values for the unknown movies. Method predict returns Rating(user,item,rating) RDD which is again sorted by Rating to recommend best movies out of entire list.

[code lang=”scala”]
val recommendedMoviesId = model.predict(moviesIHaveNotSeen.map { product =>
(0, product)}).map { case Rating(user,movie,rating) => (movie,rating) }
.sortBy( x => x._2, ascending = false).take(20).map( x => x._1)
[/code]
The above code predict values of (user,Item) where user id is 0 i.e. of U4 and item is all movies he hasn’t watched. The result is further sorted by the rating and the top 20 movie Ids are returned.
You can view entire code example at: GitHub

References

[1] Collaborative Filtering Apache Spark Documentation.

--

--

Knoldus Inc.
Knoldus - Technical Insights

Group of smart Engineers with a Product mindset who partner with your business to drive competitive advantage | www.knoldus.com