Building an Implicit Recommendation Engine in PySpark

Uttasarga Singh
The Startup
Published in
6 min readNov 20, 2020
Recommendation System

Imagine you are on a spree of shopping online and you looked up various Winter Jackets that you will purchase, so that you can withstand the Cold Weather with Winds this year (It is going to be a bit windy this year). You did love a jacket which belongs to a particular brand named: 1. You viewed the jacket for several minutes, read the whole description of the jacket material, the size etc. and then you wanted to explore some other brands to see if they have the similar items like the one you viewed for this long. And Boom! You are recommended several products which quite match with the product you viewed earlier. In this post, We are going to develop a Recommendation System using PySpark ML-Lib’s Alternating Least Squares API. You have in hand, a data set which consists of the Following:

Variables belonging to the data set

Here, We will consider the 4 important variables for modeling dataset:

Renamed the Columns for ease of use!

Building a Recommendation System comes with its own challenges; Implicit Data and Explicit Data being the first challenge to overcome. Implicit Data means a data which is generated on a Real-Time manner by the User; and the User doesn’t have any idea about this. Implicit Data examples are: Number of Views on a product by a user, Time Spent by a user exploring a Product, Recent Clicks of the User on a Website etc. These type of Data are not refined and Data Scientists have a huge Task in their hands to refine the Data set so that they can include this in the Model. If we, as a Data Scientist; are able to include these data sets in the Model, then our Model will train itself on such Real-time data and give us accurate predictions regarding the potential products that a Customer could purchase. One of the Implicit Data used in this Model building is the Quantity of the Items purchased by a Single User. These Items do belong to a certain Brand and hence, we are having the Number assigned to each brand; so that recommendation System can recommend a User, products belonging to his/her favored brand.

Further, there is exploration about the modeling dataset for better idea about various metrics that comes into my mind.

Out of the 9M Rows in the Dataset, these are the Customers present.
Total Products
Total Brands

After exploring about the Data more, We should start diving in for the task for building a Recommendation System. We will use the technique of Collaborative Filtering; where we will use Matrix Factorization technique in a way in which a large number would be factored into two smaller prime numbers. We end up with two or more lower dimensional matrices and their product is equal to the original matrix.

We want to solve a matrix with consists of million dimensions but a User wont have these many dimensions for his/her tastes. Even if the user did see many items on a website, it just portrays that the User is having various interests. Using ALS gives us a lead in dissolving a bigger matrix into 2 parts namely; all users by their taste and Items by their taste matrix. These dimensions are known as Latent Features that we learn about the User from the observed Data. Recommending Items becomes easier because ALS represent each user as a vector of their taste while representing items as a vector of the types of taste they are a part of.

If I get back with the ALS functioning and their value prediction:

Rkj = Uk *T(Pj) where R is the Rank Matrix and k is the User in the user Matrix U, while P is the Product Matrix and Pj relates to product J. If we need to predict how User Uk would rate product Pj, we will multiply those two vectors. This is called Alternating Least Squares because we generate the two matrices U and P by first fixing U and optimizing P and then fixing P and optimizing U. Optimizing generally means here that we are minimizing least squares error on those recorded values that we have in the Rank Matrix already.

Now, since the Data is implicit, we don’t have a rating for the Items which user likes and the User gives a rating explicitly so that we can conclude whether he/she likes it or not. But, imagine of we have an implicit data which shows the total number of purchases that are done from a single brand, where User 1 purchases 2 products of Brand 5, while User 2 purchases 1 product of the same brand. This comes to the idea of User preference; whether the user would like any other item of this brand. We can denote this as:

Preference, P =(0,1); where 0 denotes no preference while 1 denotes user likes it.

The Next Model is R; recording which denotes the number of purchases from a brand: Rui.

If we integrate the above two methods, we will get this:

Pui = 1 if Rui>0 & Pui = 0 if Rui = 0

This means that if you purchase a product from a specific brand i.e. Pui = 1, you like that brand or you don’t like it if it is 0.

Now, if we add a notion of confidence in the Model, Cui; we can model it in the following Manner:

Cui = 1 + a*Rui

which helps in having some confidence in smaller recordings, while having greater confidence in larger recordings.

Alpha, a is the tuning parameter which measures the difference of weights between recording of 0 and recording of 1.

If we integrate all the above models together, we are implementing ALS in order to minimize the following equation:

Cui(Pui — Uu * X-Ti).

So, let us start implementing the Model in PySpark.

ALS Class in MlLib.
Included ParamGrid for testing various Parameters of the Model
A Single model is tested but please add various parameters to check the Best one.

I included a single Model because I am implementing this on my local machine and hence I get Out.Of.memory Error because of memory constraints. But, if you have Azure Databricks to work on, please feel free to implement 4 models together.

The Predicted Values which denotes whether User will purchase the product again.
Recommendations from various Products for USER 1591
Products that USER 1591 purchased already.

Here, Brand 107 is the prime favorite brand for user 1591 as he is getting recommendations from that brand, while he purchased products of that brand and also some neighboring brands as well.

The whole Code for this Model is on GitHub. Please feel free to reach out to me if you need a help for the same.

Please connect with me on other Social Media Platforms and you can check out some of my Projects in GitHub as well.

  1. LinkedIn
  2. GitHub
  3. Twitter
  4. Reddit
  5. Quora

I look forward to help someone who is interested in collaboratively working on new Machine Learning Algorithms to learn from.

--

--

Uttasarga Singh
The Startup

Machine Learning Engineer / Software Developer. More than 3 years of experience in developing/deploying Machine learning Models and Web-based applications.