
Part 1: Merchants Recommendation with GRU4Rec and Offline Evaluation

Shu Ming Peh · Published in ShopBack Tech Blog · Apr 28, 2021 · 6 min read

TL;DR: This is the first post in a trilogy. We are a new MLE team, and we managed to build an in-house recommendation model along with our ML process flow.

Introduction and motivations
Our team was freshly put together in September 2020, and we were pulled in numerous directions to get started. With some planning and scoping, we managed to narrow our focus to two challenges: create a scalable workflow for (future) ML processes, and build + serve an in-house real-time recommendation system by early December 2020 (to replace a third-party solution¹, referenced as the current model henceforth). There wasn’t anything wrong with the current model; it was doing a great job. But we felt this would be a good challenge for a new team since it involved the entire lifecycle of an ML project. In addition, having an in-house model would also translate to more control and flexibility over the model architecture. In the interest of scope, we will elaborate on our ML Ops workflow in a future post.

Three technical components stood out in building the recommendation system: building the model, serving the model in real time, and having a model retraining pipeline. This will be a trilogy: we cover how we built and evaluated the model in this post, and the remaining two components in subsequent posts.

Building the model
Standing on the shoulders of giants, we drew inspiration from the current model and dived into the area of sequential recommender systems (SRS). We were thankful that there is an array of academic papers² (and blog posts³) summarizing the different available models. We eventually shortlisted two models (GRU4REC and STAMP), since it was documented that they are more effective in online evaluation⁴, but ultimately decided on GRU4REC given that its implementation is much simpler.

Before we go into the details of the model, it might be worth explaining what an SRS is: it suggests items to a user by modelling the sequential dependencies over the user-item interactions in a sequence.

An SRS requires some retention of memory to determine which parts of the sequence are useful and which are not, and, as you might have guessed, we rely on a recurrent neural network (RNN) architecture for our model, using gated recurrent units (GRU) to retain that memory. Here we try to predict the user’s next merchant⁵ click, as illustrated below.
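To make this concrete, here is a toy Python sketch (the merchant names are just the examples from footnote 5, not real data) of how a single click session turns into next-click prediction pairs:

```python
# A toy session of merchant clicks, in chronological order (hypothetical data).
session = ["shopee", "lazada", "nike", "foodpanda"]

# Each prefix of the sequence is an input; the click that follows it is the target.
pairs = [(session[:i], session[i]) for i in range(1, len(session))]
for history, target in pairs:
    print(history, "->", target)
# ['shopee'] -> lazada
# ['shopee', 'lazada'] -> nike
# ['shopee', 'lazada', 'nike'] -> foodpanda
```

In practice GRU4REC trains on many sessions in parallel mini-batches rather than enumerating prefixes like this, but the prediction task is the same.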

Vanilla GRU4REC
GRU is a variant of the RNN; we will skip the explanation of GRUs here and instead provide links with elaborate details⁶.

Figure: the GRU4REC architecture (source: https://arxiv.org/pdf/1511.06939.pdf)

The vanilla GRU4REC only considers the user’s clicked/purchased items within a session (or timeframe) and tries to predict the next click/purchase.

As such, the model architecture is relatively simple: we apply one-hot encoding to the item sequence (input) and pass it to the GRU layer, which feeds into a feed-forward layer that ultimately predicts a (likelihood-)ranked list of the next item. A sketch of this flow is below.
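As a minimal PyTorch sketch of that flow (the layer sizes, class name, and merchant indices are illustrative, not our production model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaGRU4Rec(nn.Module):
    """One-hot item sequence -> GRU -> feed-forward -> score per item."""

    def __init__(self, num_items: int, hidden_size: int = 100):
        super().__init__()
        self.num_items = num_items
        self.gru = nn.GRU(input_size=num_items, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_items)  # feed-forward layer over all items

    def forward(self, item_ids, hidden=None):
        # item_ids: (batch, seq_len) integer merchant indices
        one_hot = F.one_hot(item_ids, self.num_items).float()  # (batch, seq_len, num_items)
        out, hidden = self.gru(one_hot, hidden)
        scores = self.head(out[:, -1, :])  # next-item scores taken from the last step
        return scores, hidden

model = VanillaGRU4Rec(num_items=1000)
session = torch.tensor([[12, 7, 345]])     # one session of hypothetical merchant indices
scores, _ = model(session)
top10 = torch.topk(scores, k=10).indices   # likelihood-ranked list of next merchants
```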

GRU4REC + Contextual features
We wanted to add more features to the vanilla GRU4REC, which required minor changes to the above. There are a few ways⁷ to go about adding features, but we decided on concatenating the contextual features to the input.

We added contextual features such as user purchase behaviour, platform, and date + time of clicks. Essentially, there is no change to the model architecture except the input dimensions, which go from (n x 1 x num_items) to (n x 1 x (num_items + contextual_features)), as sketched below.
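A hedged sketch of that change (the feature sizes here are made up for illustration): the one-hot item vector and the encoded contextual features are concatenated along the last dimension, and only the GRU’s input size grows.

```python
import torch
import torch.nn as nn

num_items, num_context = 1000, 16             # illustrative sizes, not our real ones
gru = nn.GRU(input_size=num_items + num_context, hidden_size=100, batch_first=True)

item_one_hot = torch.zeros(32, 1, num_items)  # (n, 1, num_items)
context = torch.rand(32, 1, num_context)      # (n, 1, contextual_features), e.g. platform, time of click
gru_input = torch.cat([item_one_hot, context], dim=-1)  # (n, 1, num_items + contextual_features)
out, hidden = gru(gru_input)                  # the rest of the architecture is unchanged
```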

Offline evaluation
We had to internally measure whether the new model was at least on par with the current model. The primary evaluation metric we used was mean reciprocal rank⁸ (MRR) @10: if the next item is not within the top 10 predictions, the MRR for that event is 0.
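For reference, a minimal implementation of MRR@10 (the function name and toy data are ours, purely illustrative):

```python
def reciprocal_rank_at_10(ranked_items, true_next_item):
    """1/rank of the true next item if it is in the top 10 predictions, else 0."""
    top10 = list(ranked_items)[:10]
    return 1.0 / (top10.index(true_next_item) + 1) if true_next_item in top10 else 0.0

# MRR@10 is the mean over all evaluation events:
events = [(["a", "b", "c"], "b"), (["x", "y", "z"], "q")]  # (ranked predictions, true next item)
mrr = sum(reciprocal_rank_at_10(preds, truth) for preds, truth in events) / len(events)
print(mrr)  # (1/2 + 0) / 2 = 0.25
```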

Both models were trained and evaluated on the same (training and evaluation) datasets. This is also the step where the new model’s hyperparameters⁹ are tuned, by (grid) searching for the combination of hyperparameters that achieves the best results, sketched below.
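Sketched out, the grid search looks something like this (the hyperparameter names, values, and the train/evaluate helpers are hypothetical placeholders, not our actual grid):

```python
from itertools import product

grid = {                                   # hypothetical search space
    "hidden_size": [50, 100, 200],
    "learning_rate": [1e-3, 1e-2],
    "dropout": [0.0, 0.3],
}

best_mrr, best_params = 0.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = train_gru4rec(params)          # hypothetical training helper
    mrr = evaluate_mrr_at_10(model)        # hypothetical helper: offline MRR@10 on the held-out set
    if mrr > best_mrr:
        best_mrr, best_params = mrr, params
```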

This is how the new and current models matched up in the offline evaluation:

And the breakdown by country:

There is a recurring trend: the countries where the current model performs slightly better than the new model (in offline evaluation) have fewer merchants than the other countries, and their top 10 merchants take up a majority share of purchases. The next recommendation might be suboptimal there simply because users tend to revisit the more popular merchants.

Yet overall, given that the new model’s overall MRR is almost on par with the current model’s, we felt confident enough to look towards the next step: serving the new model in production.

Closing thoughts
We are somewhat pleased that we were able to reference and build a model architecture that is on par with the current model. Looking back, this was the simplest and least time-consuming step, considering it is more binary (and concrete, in whether we are able to proceed) than the subsequent parts. Stay tuned for our next post, which will cover how we track and store our model with MLflow and the evaluation of our online A/B test. Share your thoughts with us below, and give us a few claps if you found this useful or interesting!

❗️ Interested in what else we work on?
Follow us (ShopBack Engineering blog | ShopBack LinkedIn page) now to get further insights on what our tech teams do!

❗❗️ Or… interested in us? (definitely, we hope)
Check out here how you might be able to fit into ShopBack — we’re excited to start our journey together!

[1]: The third party solution uses a HRNN model architecture
[2]: Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison — Zhu Sun, Di Yu
[3]: RecSys 2020 — Takeaways and Notable Papers: https://eugeneyan.com/writing/recsys2020/
[4]: From the lab to production: A case study of session-based recommendations in the home-improvement domain — Pigi Kouki, Ilias Fountalis
[5]: Merchant examples: Shopee, Lazada, Nike and Foodpanda
[6]: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21, https://en.wikipedia.org/wiki/Gated_recurrent_unit
[7]: https://recbole.io/docs/_images/gru4recf.png
[8]: MRR is a measure to evaluate outputs that return a ranked list of objects (to queries)
[9]: Parameters which define the model architecture are referred to as hyperparameters, and the process of searching for the ideal model architecture is thus referred to as hyperparameter tuning
