Movie Recommendation with Collaborative Filtering in Pyspark

Abid Merchant
Published in Analytics Vidhya
3 min read · Jul 26, 2020

Hey Fellas,

Today we will explore a cool machine learning algorithm called collaborative filtering with ALS (Alternating Least Squares), which is typically used to recommend a movie, song, or other item based on a user's preferences (judged by the ratings that user has given). This algorithm is used by many renowned companies like YouTube, Reddit, and Netflix, and by me (for this article :P). Cool, isn't it?

So basically, a collaborative filtering model takes a user column, an item column, and a ratings column as feature columns and trains itself to recommend items for each user. So, let's not waste much time on the boring theoretical part and jump into the fun coding part. Still, if you want to read more about the algorithm, you can visit here.

Now, let's begin! I will be using the publicly available MovieLens dataset, which can be found here. So, first let's read and display the data.

ratings = spark.read.option("inferSchema", True)\
    .option("header", True)\
    .csv("ml-latest-small/ratings.csv")
ratings.show(5)
(Output: the first five rows of the movie ratings dataset)

Now let's move forward, initiate our ALS model, and split our dataset into two parts for training and testing (the de facto 80:20 ratio).

from pyspark.ml.recommendation import ALS

als = ALS(maxIter=10, regParam=0.5, userCol="userId",
          itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
train, test = ratings.randomSplit([0.8, 0.2])
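As a side note, randomSplit is non-deterministic unless you pass a seed. What it does is conceptually simple; here is an illustrative plain-Python sketch (not Spark's distributed implementation, where the split sizes are approximate):

```python
import random

# Illustrative plain-Python version of an 80:20 random split;
# Spark's randomSplit does the same thing in a distributed way.
random.seed(42)  # fixing a seed makes the split reproducible

rows = list(range(100))  # stand-in for the ratings rows
train = [r for r in rows if random.random() < 0.8]
test = [r for r in rows if r not in train]

print(len(train) + len(test))  # 100: every row lands in exactly one split
```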

Just a quick overview of the parameters used: the three parameters userCol, itemCol, and ratingCol set the feature columns; maxIter is the number of iterations ALS will run (can be decided as per use case); regParam specifies the regularization parameter in ALS (default is 0.1); and coldStartStrategy="drop" means that when predictions are generated for a user or item that was not seen during training (so no factors were learned for it), those rows are dropped instead of producing NaN predictions.
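To build some intuition for what ALS actually learns (an illustrative sketch only, not Spark's internal implementation): each user and each item gets a small latent-factor vector, and a predicted rating is simply the dot product of the two. The factor values below are made up for illustration:

```python
# Hypothetical learned latent factors (2 factors per user/item).
# In real ALS these are learned by alternating least-squares updates.
user_factors = {1: [0.9, 0.1], 2: [0.2, 0.8]}
item_factors = {10: [1.0, 0.0], 20: [0.0, 1.0]}

def predict(user, item):
    # Predicted rating = dot product of user and item factor vectors
    return sum(u * i for u, i in zip(user_factors[user], item_factors[item]))

print(predict(1, 10))  # 0.9: user 1's factors align with item 10's
print(predict(2, 20))  # 0.8: user 2's factors align with item 20's
```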

So, now let's train our model and generate predictions.

# Training the model
alsModel = als.fit(train)

# Generating predictions
prediction = alsModel.transform(test)
prediction.show(10)
(Output: predictions generated on the test set)

So, some predictions seem accurate but some suck (let's be frank :P). Anyway, now that our model is trained, let's check how good it is.

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mse", labelCol="rating",
                                predictionCol="prediction")
mse = evaluator.evaluate(prediction)
print(mse)

Output: 1.0168009658367847
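Under the hood, the evaluator is just computing the mean squared error over the (rating, prediction) pairs. A quick sketch with made-up numbers:

```python
# Mean squared error: the average of squared differences between
# actual ratings and model predictions (lower is better).
actual = [4.0, 3.5, 5.0]     # made-up actual ratings
predicted = [3.5, 3.0, 4.0]  # made-up model predictions

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```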

Hmm, the error is not bad either (I unnecessarily judged the model at the start); an MSE of about 1.02 means the predictions are off by roughly one rating point on average. So, the model seems to be working quite well. Let's ask it to recommend the top 3 movies for every user.

recommended_movie_df = alsModel.recommendForAllUsers(3)
recommended_movie_df.show(10, False)

The output comes as a list of recommendations for each user, which can easily be transformed into columns. You can also generate recommendations for a specific subset of users by passing a dataset of users to the recommendForUserSubset() function.
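Here is an illustrative plain-Python sketch of that flattening step (in PySpark itself you would typically explode() the recommendations column); the user IDs, movie IDs, and scores below are made up:

```python
# Shape of recommendForAllUsers output: one row per user, each carrying
# a list of (movieId, score) recommendations. Flatten it into one row
# per (user, movie, score) triple.
nested = [
    (1, [(603, 4.9), (50, 4.8), (318, 4.7)]),
    (2, [(260, 4.6), (1196, 4.5), (50, 4.4)]),
]

flat = [(user, movie, score)
        for user, recs in nested
        for movie, score in recs]

print(flat[0])  # (1, 603, 4.9)
```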

So, that's all from my side for this one. I hope you liked my article; if you have any suggestions, please comment below. Do check out my previous articles as well, the link is given below:

So, see you next time, ta ta, bella ciao!
