Product Recommender System: How The Journey Ends

Good day, everyone! It’s me again, your favorite data scientist extraordinaire, and I’m here with the very last article of our series. Yes, I am sad too, but I won’t be away for too long, so stay tuned!

But back to today’s article. I am here to present how the whole recommender system works, combining all the methods that I came up with and described in my previous articles.

Apart from that, I will also remind you why we went through so many adventures to build this recommendation engine: the end goal of our quest was to create a data-driven model for product recommendations that works for every eCommerce store in Moosend’s database.

This model should be able to produce personalized product recommendations based on each customer’s past interactions, combine information from multiple shops, and work for a plethora of different eCommerce stores.

Let’s recap first

In order to come to a successful conclusion, I will need to describe the model in full.

First order of business was preprocessing, data transformation and the preparation for the model itself.

We needed to get the products out of the database and perform Named Entity Recognition. The goal was to match each product to a customer gender, using the exchange loop we talked about previously, and then to create the Product Clusters. These clusters contained the same or very similar products across all shops in the database.
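The exact NER pipeline and the exchange loop live in the earlier articles, so I won’t repeat them here. Just to make the clustering idea tangible, here is a tiny, purely illustrative sketch of grouping similar product titles across shops; the TF-IDF plus KMeans combination below is an assumption for the example, not the exact method we used.

```python
# Illustration only: cluster similar product titles across shops with
# TF-IDF + KMeans. The actual pipeline (NER, gender matching via the
# exchange loop) is described in the earlier articles of this series.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

product_titles = [
    "Espresso Coffee Beans 1kg",    # shop A
    "Coffee Beans Espresso 1 kg",   # shop B
    "Men's Cotton T-Shirt Blue",    # shop A
    "Blue Cotton T-Shirt for Men",  # shop C
]

# Character n-grams are forgiving of small wording differences between shops.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(product_titles)

# n_clusters is purely illustrative; in practice it depends on the catalogue.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(product_titles, labels)))   # the two coffees land in one cluster, the t-shirts in another
```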

Afterwards, we went ahead and created Shop Clusters that contained the most similar shops, in order to get the most out of the learning process and to reduce both the calculation time and the size of the matrix itself.

You see, these consumed more memory than we could afford.
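To give you a feel for what shop clustering can look like in practice, here is a small, purely illustrative sketch that groups shops by how their catalogues are spread across the product clusters; the feature choice and the KMeans call below are assumptions for the example, not a dump of our actual pipeline.

```python
# Illustration only: describe each shop by the share of its catalogue that
# falls into each product cluster, then group similar shops with KMeans.
import numpy as np
from sklearn.cluster import KMeans

# rows = shops, columns = share of each product cluster in the shop's catalogue
shop_profiles = np.array([
    [0.7, 0.2, 0.1],   # mostly coffee
    [0.6, 0.3, 0.1],   # also mostly coffee
    [0.1, 0.1, 0.8],   # mostly apparel
])

shop_cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shop_profiles)
print(shop_cluster)    # e.g. [0 0 1]: the two coffee-heavy shops end up together
```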

Lastly, we did our best to roll the information from customers’ interactions with individual products up into the product clusters. That way, we could share the findings that came from studying a big eCommerce store with a smaller one. This was made possible by sharing the purchase patterns.
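In code terms, that roll-up is essentially a group-by from product level to product-cluster level. Here is a minimal sketch with pandas; all the column names are made up for the illustration.

```python
# Minimal sketch: aggregate product-level interactions up to product clusters.
import pandas as pd

interactions = pd.DataFrame({
    "customer_id":     [1, 1, 2],
    "product_id":      ["A", "B", "C"],
    "product_cluster": [10, 10, 42],   # A and B are the "same" product in different shops
    "score":           [1.0, 0.5, 2.0],
})

# Customers now interact with clusters, so patterns learned in one shop
# carry over to the equivalent products of another shop.
cluster_interactions = (interactions
                        .groupby(["customer_id", "product_cluster"], as_index=False)["score"]
                        .sum())
print(cluster_interactions)
```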

Having all the information at the ready, my team and I sailed to another destination: data transformation. We used the data in conjunction with the Interaction Score Formula, in order to generate the interest score for every customer-product interaction.

The last step of this part of the quest was to decay the generated interest score, based on how much time had passed since each interaction.
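The exact Interaction Score Formula and its decay are described in the earlier article, so here is only a minimal sketch of the decay idea, assuming an exponential half-life, which is one common choice rather than necessarily ours.

```python
# Illustration only: decay an interest score with an exponential half-life.
def decayed_score(raw_score: float, days_since_interaction: float,
                  half_life_days: float = 30.0) -> float:
    """Halve the interest score every `half_life_days` days."""
    return raw_score * 0.5 ** (days_since_interaction / half_life_days)

print(decayed_score(2.0, 0))    # 2.0  - a purchase made today keeps its full weight
print(decayed_score(2.0, 30))   # 1.0  - a month-old purchase counts half as much
print(decayed_score(2.0, 90))   # 0.25 - a three-month-old one barely registers
```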

After we managed to process the interaction data, we fed it into our model. We fashioned an interaction matrix (R) for every shop cluster, with one row per customer and one column per product cluster. We made sure to leave the cells empty when there were no interactions.
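Keeping those cells empty is exactly what a sparse matrix does for you, so here is a small sketch of how such an R could be held in memory; the numbers are made up for the illustration.

```python
# Sketch of an interaction matrix R for one shop cluster: a sparse matrix only
# stores the observed customer / product-cluster scores, so the "empty" cells
# cost us nothing.
import numpy as np
from scipy.sparse import coo_matrix

customer_idx = np.array([0, 0, 1])        # row indices (customers)
cluster_idx = np.array([10, 42, 10])      # column indices (product clusters)
scores = np.array([1.5, 2.0, 0.5])        # decayed interest scores

R = coo_matrix((scores, (customer_idx, cluster_idx)), shape=(2, 100))
print(R.nnz, "observed interactions out of", R.shape[0] * R.shape[1], "cells")
```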

The next step was to remove the bias from R and then decompose the R matrix into P and Q.

Our last step was to calculate the dot product of P and Q and add the bias we removed previously.

The name of this brand new little trinket (the matrix the process generated, that is) was R-hat, meaning that it was the same as the matrix R, except it had all its cells filled with the predicted interest score for every customer-product cluster interaction.
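To show the mechanics of “remove the bias, learn P and Q, rebuild R-hat” in one place, here is a bare-bones numpy sketch on a made-up, tiny R. The production model, of course, relies on a proper library and the tuned parameters we will get to below; this is only an illustration, with a simple global-mean bias standing in for the full bias terms.

```python
# Bare-bones matrix factorization sketch: de-bias R, learn P and Q on the
# observed cells, then rebuild R-hat = P.Q^T + bias.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[2.0, np.nan, 1.0],
              [np.nan, 1.5, np.nan],
              [1.0, 0.5, np.nan]])          # NaN marks the empty cells
observed = ~np.isnan(R)

bias = np.nanmean(R)                        # a simple global-mean bias for the toy example
k = 2                                       # latent factors (just for the sketch)
P = rng.normal(scale=0.1, size=(R.shape[0], k))   # customer factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # product-cluster factors

lr = 0.01
for _ in range(500):                        # plain gradient descent on the observed cells
    for u, i in zip(*np.where(observed)):
        err = (R[u, i] - bias) - P[u] @ Q[i]
        grad_p, grad_q = err * Q[i], err * P[u]
        P[u] += lr * grad_p
        Q[i] += lr * grad_q

R_hat = P @ Q.T + bias                      # dot product plus the bias we removed earlier
print(np.round(R_hat, 2))                   # every cell now holds a predicted interest score
```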

Model Workflow

Behind The Recommender

Our journey resulted in the R-hat matrix, holding the predicted interest scores for every customer-product cluster interaction.

Generally speaking, what we had to do was recommend the products with the highest predicted scores. To make this work, though, and to make the model more “intelligent”, we had to add some simple, man-made logic behind the algorithm.

What was that, you ask?

Well, it depended on what each scientist would like to achieve. If, for instance, you’d like to avoid recommending a product that someone has already purchased, this is where you’ll have to pinpoint your most important parameter.

But why would you ever recommend a product that someone has already purchased?

This, my fellow adventurer, is something that depends solely on the nature of the product itself.

For example, we would very much love to see a recommendation for the same coffee type or brand, or the same shampoo type or brand, but we can’t recommend the same t-shirt or the same construction tool.
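Here is a small sketch of what that hand-made layer on top of the predicted scores could look like: rank by predicted interest and skip the items a customer has already bought, unless the item is the repeatable kind. The function and its inputs are made up for the illustration.

```python
# Illustration only: rank items by predicted score and drop already-purchased
# ones, except for "repeatable" products like coffee or shampoo.
import numpy as np

def recommend(predicted_scores: np.ndarray, already_purchased: set[int],
              repeatable: set[int], top_n: int = 5) -> list[int]:
    ranked = np.argsort(predicted_scores)[::-1]          # highest predicted score first
    picks: list[int] = []
    for item in ranked:
        if item in already_purchased and item not in repeatable:
            continue                                     # no second identical t-shirt
        picks.append(int(item))
        if len(picks) == top_n:
            break
    return picks

scores = np.array([0.1, 0.9, 0.7, 0.4, 0.8, 0.2])        # one row of R-hat, say
print(recommend(scores, already_purchased={1, 4}, repeatable={4}))
```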

Important Parameters To Keep In Mind

Everything is well and good so far, isn’t it? I bet you’ve got a key question though: What kind of parameters did we take into account?

Let me tell you a thing or two about parameters and parameter tuning. First off, here are the most important parameters that you should keep in mind along the way:

i) the learning rate (how quickly our model changes its beliefs),

ii) the number of latent factors (in how many terms we describe your customers and your products),

iii) the choice of gradient descent optimizer (how aggressively or how smoothly the model changes the weights to find the optimum result [adagrad, adadelta]),

iv) the epochs (how many times we pass over the data until the model learns) and

v) the loss function we use to measure the error.

Generally, there are no hard rules for tuning the above parameters, but I can share some tips that I picked up through experience.

Remember to keep the learning rate low, that is, between 0.001 and 0.01.

As far as the latent factors are concerned: these determine in how many terms we describe every customer and every product in the P and Q matrices. The size of the matrix and the complexity of the situation led us to represent every customer and every product cluster with 100 latent factors.

Mind you, I discovered a pot of gold through that one: More latent factors can lead to better results, but adding more and more makes zero difference after a certain point.

As for the gradient descent optimizer, adadelta worked best for us. Both adagrad and adadelta gave the same results at the end of the process; however, adadelta needed fewer epochs and, by extension, less time.

Finally, when it comes to the loss function, I personally chose to go with Bayesian Personalised Ranking (BPR), as it gave the best results in the simulation process we came up with.
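To put all of those choices in one place, here is what they could look like in a library such as LightFM, which happens to expose every one of these knobs (latent factors, adagrad/adadelta schedules, BPR loss, epochs). Treat it as an illustration of the configuration, not as our production code.

```python
# Illustration only: the tuned choices expressed through LightFM's API,
# fitted on a tiny stand-in for one shop cluster's R matrix.
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# 3 customers x 4 product clusters, binary interactions for the toy example.
interactions = coo_matrix(np.array([[1, 0, 0, 1],
                                    [0, 1, 0, 0],
                                    [1, 0, 1, 0]], dtype=np.float32))

model = LightFM(
    no_components=100,             # 100 latent factors per customer / product cluster
    learning_schedule="adadelta",  # needed fewer epochs than adagrad for us
    loss="bpr",                    # Bayesian Personalised Ranking
    learning_rate=0.005,           # kept low (0.001-0.01); mainly relevant for the adagrad schedule
    random_state=42,
)
model.fit(interactions, epochs=30, num_threads=1)

# Predicted scores for customer 0 over all product clusters: one row of R-hat.
print(model.predict(0, np.arange(interactions.shape[1])))
```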

Let’s talk Numbers!

Here’s the model’s performance at the cluster and individual product levels:

Our new model is a clear improvement over the previous one, which is a huge win for various reasons. Take the following one, for example: think of a recommender engine that leads a customer exactly where you wanted them, namely to the purchase of a product.

Think of that purchase having been made with around 8% accuracy at the individual product level, that is, 8 hits out of every 100 sets of 5 products our system recommended.
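To make the arithmetic behind that figure concrete, here is a tiny sketch of how such a hit rate can be computed; the way I read the metric here (a set counts as a hit if the customer bought one of its 5 products) and all the names are assumptions for the example.

```python
# Toy illustration of the metric above: hit rate = hits / number of recommendation sets.
def hit_rate(recommendation_sets: list[list[str]], purchases: list[set[str]]) -> float:
    hits = sum(1 for recs, bought in zip(recommendation_sets, purchases)
               if any(p in bought for p in recs))
    return hits / len(recommendation_sets)

recs = [["A", "B", "C", "D", "E"]] * 100   # 100 sets of 5 recommended products
bought = [{"A"}] * 8 + [{"Z"}] * 92        # in 8 cases the customer bought a recommended product
print(hit_rate(recs, bought))              # 0.08 -> the ~8% figure above
```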

This can take the shop’s revenue to a whole new level!

If my journey through recommender systems taught me anything, it was that the strongest metric any data pirate like myself could ever have is accuracy.

I also needed to analyze customer segments, in order to map out how the information we hold about a customer can influence the recommendations at both the cluster and the individual product level.

And if you need some more insight on customer segments and validation, here’s your trusted map!

Takeaway

But what did we achieve through building this model, besides accuracy?

We managed to pool the information from similar eCommerce stores, so that small and medium eCommerce stores without much data could benefit from the larger ones.

Not only that, but our system also works for a variety of different businesses, because it was “taught” how to adjust to different customer purchase logic.

But is performance our only indicator of success? No, not by far.

Our golden achievement is that our recommender system is efficient in terms of memory and calculations, seeing as my crew and I managed to build a matrix that was smaller in size and therefore more manageable.

This means that it can actually “use” the techniques I mentioned before without becoming too large or too slow, as there is less computational complexity and less memory to be consumed.
