Using an RNN recommendation engine on a daily basis: feedback, tweaks & lessons learned

Nicolas Seichepine
Decathlon Digital
Feb 9, 2024 · 13 min read

In a previous post, Building a RNN Recommendation Engine with TensorFlow, we detailed a recommender system used at Decathlon. This system is designed to rank ~10K products for audiences of ~100K to ~10M users around the globe. To do so, users’ purchase histories are treated as sequences of products. The recommender system can therefore rely on the training of a recurrent neural network (RNN; throughout this post, this actually refers to a full RNN-LSTM network to which a self-attention layer has been added) to generate a ranked list of products likely to follow previously purchased items. This sequential modeling led to significant improvements over a more classical matrix-factorization-based collaborative filtering approach.

This RNN-based recommender system has been running in production for more than 2 years, is now deployed in 20+ countries and is queried ~10M times weekly through an API. Many insights were gained from observing its daily behavior. Those insights led us to several modifications of the initial solution, including post-processing and algorithmic tweaks, which are covered throughout this post.

Each section below presents a specific modification. We first discuss the diversity of the model outputs, then the context making a user eligible for recommendations, the encoding of the “nature” of the products and the encoding of the temporal information. Finally, we validate the adequacy of the model’s architecture and cost function with respect to the nature of the input data and the actual behavior of a user.

We advise the reader to first skim through the original post to get acquainted with the initial technical solution, which will facilitate the reading of this post.

Output diversity and relevance

Given inputs (user id, user history, …), a recommendation system provides a ranked list of items, from the most to the least likely to match the user’s interest. We emphasize that:

  1. Being “of interest” is actually not well-defined, and may vary depending on the context.
  2. The most “practical” ranking also depends on where, when and how results are displayed.
  3. At the end of the day, a user is shown only a few items from the catalog — the top ranked products — on specific selling floors across different channels: email, mobile application or website. Some crucial questions arise from this observation.

Should we recommend previously purchased items?

… Or more specifically: “Is it interesting to provide a direct link towards a consumable product that a user has already bought?”. It’s tempting to answer in the negative: whether through a direct bookmark, a “favorite products” feature, links to past purchases or simply previous experience, it’s probably easy enough for the user to retrieve this product anyway. From a business perspective, display slots are limited in number, hence should be kept for highly valuable uses rather than simple reminders or shortcuts.

Considering the nature of the RNN and its training, it is interesting to note that a product already present in a given user’s purchase history is likely to appear as a prediction for that user only if multiple purchases of that product are a common pattern. Buying the same product in different baskets is an informative behavior that the model must be able to capture. Hence it would be quite aggressive, and potentially harmful, to remove redundant items from training sequences.

It thus seems valuable to preserve the training data and model integrity, while implementing a simple post-filtering operation that removes products similar to the ones previously purchased. In particular, implementing such a filtering operation had a positive impact, with a +2% conversion rate increase (significant at the 99.9+% level).
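
A minimal sketch of such a post-filter (filter_already_purchased is a hypothetical helper; for simplicity only exact matches are removed here, whereas the filter described above also targets similar products):

def filter_already_purchased(predictions: list, purchase_history: list) -> list:
    """Drop recommended products the user already bought, preserving rank order."""
    already_bought = set(purchase_history)
    return [product for product in predictions if product not in already_bought]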

Should we trade relevance for diversity?

Let’s consider the following exaggerated example: imagine that a user is a serious runner, and has already purchased most equipment apart from a GPS watch. From the model’s viewpoint, it’s very natural to predict that a GPS watch is likely to be the next purchase. In this case, a selection of the (numerous) GPS watches would be ranked at the top of the predictions and thus displayed as the only recommended items. Again, this would be a waste of display slots: if the user is indeed interested in GPS watches, they’ll probably click on any displayed watch to inspect its characteristics, and from there will have links towards other watches anyway. But if the user is not interested, the opportunity window closes immediately when only watches are displayed, while some slots could have been used for truly different products.

This behavior may implicitly mean that the model should be trained at a coarser level, e.g., on sequences of product families, before refining the prediction at a later stage of the recommendation pipeline; that approach is currently being considered. In the meantime, a simpler approach is to rely again on post-processing, which yielded a positive relative improvement (significant at the 95+% level) of the conversion rate at deployment time. The post-processing consists in looping first through all predefined product categories, in the order in which they appear in the raw predictions, then for each category through the raw predictions themselves. This is highlighted in the following scheme (figure 1), where background colors represent product categories:

Figure 1: reordering to maximize diversity

Implementation performance is a worthy question, as one may have to rank up to ~10,000 products for up to ~10 million users. This is illustrated by the following example, where keeping track of ranks in a “pointer” spirit saves a lot of computation by reducing the theoretical complexity from O(n²) to O(n log n):

from collections import defaultdict


def reorder_predictions_naive(predictions: list, products_to_categories: dict):
    """Round-robin over categories, O(n²) worst case: the remaining
    predictions are rescanned on every pass."""
    output_predictions = []
    left_predictions = predictions.copy()
    while len(left_predictions):
        blacklisted_categories = set()
        for pred_idx, pred_product in enumerate(left_predictions):
            category = products_to_categories.get(pred_product, 0)
            if category in blacklisted_categories:
                continue
            else:
                blacklisted_categories.add(category)
                output_predictions.append(pred_product)
                left_predictions[pred_idx] = None
        left_predictions = [pred_product for pred_product in left_predictions
                            if pred_product is not None]
    return output_predictions


def reorder_predictions_idx_sorting(predictions: list, products_to_categories: dict):
    """Same reordering in O(n log n): each product receives a sort key combining
    the occurrence count of its category (major) and its original rank (minor)."""
    base = len(predictions)
    rank_to_pred_product = {}
    categories_occurrences = defaultdict(int)
    for pred_idx, pred_product in enumerate(predictions):
        category = products_to_categories.get(pred_product, 0)
        categories_occurrences[category] += 1
        rank_to_pred_product[base * categories_occurrences[category] + pred_idx] = pred_product
    return [rank_to_pred_product[index] for index in sorted(rank_to_pred_product)]
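
For illustration, both functions produce the same category-diverse ordering (hypothetical product ids and categories):

predictions = ["shoe_a", "shoe_b", "watch_a", "shoe_c", "watch_b"]
categories = {"shoe_a": "shoes", "shoe_b": "shoes", "shoe_c": "shoes",
              "watch_a": "watches", "watch_b": "watches"}

assert (reorder_predictions_naive(predictions, categories)
        == reorder_predictions_idx_sorting(predictions, categories)
        == ["shoe_a", "watch_a", "shoe_b", "watch_b", "shoe_c"])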

User eligibility for recommendations

On top of the nature of the purchased items, the sheer counts of purchased items and distinct baskets are key features of input sequences. In particular:

  • The larger these numbers, the more information and patterns can be captured by the model to then provide good predictions.
  • The sequences’ length must be greater than or equal to two at training time, as one element is used as the target.

If sequences are made of purchased products only (“sales only” in figure 2), situations where at most two products can be associated with a given user over a typical 1-year time window occur frequently. These situations are of limited use, if not completely unusable, and affect 69.47% of users, as highlighted in the following graph (figure 2):

Figure 2: how many products can be associated with a given user?

It’s therefore quite natural to try to add some more data such that:

  • More users qualify for a prediction by the RNN-based system.
  • We avoid, as much as possible, generating predictions using short sequences.

One could imagine many ways to provide more insights to the system, e.g., any kind of user knowledge representation used as input at a later stage/layer of the neural network. Rather, we’ve chosen to borrow the main idea of session-based recommender systems by enriching the purchase sequences with web-visit sequences (“sales & web” in figure 2); after all, browsing through a product page is indeed a marker of interest for that product.

To avoid confusing the algorithm, one must specify whether products in the input sequence were purchased or simply viewed online. This is done with very little modification:

  • Keep a column “is_purchase” that specifies the nature of the interaction for each product of the sequence (see table 1 below).
  • Concatenate this column to the retained embedding (see figure 3 below).
Table 1: training data snippet
Figure 3: incorporating multiple kinds of data
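
In Keras terms, this boils down to concatenating the binary flag to the learned product representation. A minimal sketch, with hypothetical dimensions:

import tensorflow as tf

vocab_size, embedding_dim, max_seq_len = 10_000, 100, 20  # hypothetical values

product_ids = tf.keras.Input(shape=(max_seq_len,), dtype=tf.int32, name="product_ids")
is_purchase = tf.keras.Input(shape=(max_seq_len, 1), dtype=tf.float32, name="is_purchase")

product_embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim)(product_ids)
# Each product representation gains one extra dimension: 1 = purchased, 0 = viewed.
enriched = tf.keras.layers.Concatenate(axis=-1)([product_embeddings, is_purchase])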

This simple modification actually increases the proportion of users for whom we have relevant information at training/inference time, as can be seen in the preceding diagram (figure 2: e.g., the proportion of users with [0–2] associated products decreases from 69.47% to 64.47%).

But from a more downstream perspective, a production AB-test highlighted a 2.5% click-through rate (CTR) uplift at deployment time (significant at the 95% level).

Can product descriptions/nature be directly provided to the model?

As mentioned above, each product is represented by an embedding vector, which in practice is an array of floats that represents the product’s nature. These embeddings were previously learned during the training task via the tensorflow.keras.layers.Embedding layer. This left the training process fully responsible for identifying the key features of each product, from sequences of products that were purchased or simply browsed.

However, this learning process might be suboptimal, considering that:

  • The data used during training is time-dependent, relatively limited, and country-specific.
  • We could leverage external product information independently from the way products have been purchased (e.g., textual product descriptions, sports-products relationships, products’ composition, product positioning within a structured products hierarchy, …). This information is shared across countries and relatively stable over time.

One can therefore decide to compute embeddings in a separate process, with the ability to use any available data.

From a coding perspective, reusing those embeddings in the usual training code simply amounts to:

new_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,       # number of rows of embedding_weights_matrix
    output_dim=embedding_dim,   # number of columns of embedding_weights_matrix
    weights=[embedding_weights_matrix],
    trainable=False,            # freeze the precomputed embeddings
)(tf_input)

In this instance, embedding_weights_matrix is the result of the previous computation (and input_dim/output_dim must match its shape). As a nice side effect, the number of trainable parameters is reduced (e.g., -16% for France using precomputed embeddings of size 100), thus reducing the overall training time.

The many ways to compute (relevant) product embeddings are beyond the scope of this article. Nevertheless, let us mention that for now, we have been deriving product embeddings via a Word2Vec procedure applied to baskets of items. This led to a +3% conversion rate uplift (significant at the 99.9+% level).
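
To give a rough idea of what such a procedure can look like (the actual pipeline is not detailed here), one could treat each basket as a “sentence” of product ids and fit a standard gensim Word2Vec model; ids and hyperparameters below are purely illustrative:

from gensim.models import Word2Vec

# Each basket is one "sentence" whose "words" are product ids.
baskets = [
    ["running_shoes", "gps_watch", "energy_gel"],
    ["gps_watch", "heart_rate_belt"],
    ["running_shoes", "energy_gel"],
]

model = Word2Vec(sentences=baskets, vector_size=100, window=10, min_count=1, sg=1)
# Rows are ordered by model.wv.index_to_key; this matrix can then be fed
# to the frozen Embedding layer shown above.
embedding_weights_matrix = model.wv.vectors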

How can the model fully capture temporal dynamics?

In the input data provided to the algorithm, each product is associated with a “recency” feature, namely a bucketed version of the number of days elapsed between:

  • The interaction date with the product (purchase/view).
  • The current date.

This clearly provides the algorithm with useful information about the context in which interactions were observed, but this may not be enough.

Let’s consider a toy example where we have bucketed recency such that one bucket = 1/100 year (i.e., ~3.6 days are sent to the same bucket), and observe the following sequences, where Wi are winter products while Si are summer products:

  • W1, W2, S1 / 99, 99, 50
  • W1, W2, W3 / 99, 99, 2

Basically, some users only practice winter sports, while others practice both winter sports and summer activities. The simplest way would be to train such that:

  • [(W1, 99), (W2, 99)] → S1
  • [(W1, 99), (W2, 99)] → W3

Which means:

  1. That summer products and winter products are equiprobable after a sequence of winter products.
  2. That the current date plays no role in the prediction.

This may lead to awkward situations where we end up recommending summer products right in the middle of winter.

Rather than associating a product with its recency, we may choose to associate a product with the recency of the next product in the sequence, which would amount to:

  • [(W1, 99), (W2, 50)] → S1
  • [(W1, 99), (W2, 2)] → W3

And at prediction time, we may insert an arbitrary “1” recency to force a prediction for the current time:

  • [(W1, 99), (W2, 50), (S1, 1)] → ?
  • [(W1, 99), (W2, 2), (W3, 1)] → ?
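
A minimal sketch of this recency shift on a raw sequence (pair_with_next_recency is a hypothetical helper, not the production preprocessing):

def pair_with_next_recency(products: list, recencies: list):
    """Pair each product with the recency of the *next* interaction;
    the last product becomes the training target."""
    inputs = [(product, recencies[idx + 1]) for idx, product in enumerate(products[:-1])]
    return inputs, products[-1]

# ([('W1', 99), ('W2', 50)], 'S1')
print(pair_with_next_recency(["W1", "W2", "S1"], [99, 99, 50]))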

That’s what was done in the initial system. It still feels awkward, though, that the training has no direct access to the time-delta information (relating the next product’s recency to the current product’s recency can only be done through the memory belt of the LSTM cell) and that products cannot be directly grouped according to their recency. Hence the temptation to work with:

  • [(W1, 99, 99), (W2, 99, 50)] → S1
  • [(W1, 99, 99), (W2, 99, 2)] → W3

Which, if only the last element of the sequence is used as a target, may be simplified to:

  • [(W1, 99), (W2, 99)] + 50 → S1
  • [(W1, 99), (W2, 99)] + 2 → W3

This has been implemented in conjunction with a neural network architectural change, for which results are discussed in the next section.

From ordered sequences to unordered sequences

All along this post, we’ve mentioned that the input data is treated as an ordered sequence of products. We also acted as if predicting the last element of a sequence mimicked what happens when trying to predict a future purchase.

This representation does not actually depict reality perfectly:

  1. Within a given basket, there is no definite ordering of products (a cashier’s receipt only matches the order in which products were scanned, which might be loosely related to store layout or to some users’ logic in their browsing, but there is definitely no semantic signal equivalent to purchasing a product after having purchased and used a previous one).
  2. Predicting a future purchase would be equivalent, at training time, to predicting the last basket given previous purchases. Predicting the last item of the sequence is equivalent only if the last basket is made of a single element (which holds for only ~40% of users in France, e.g.); otherwise it rather amounts to predicting one element of the last basket knowing past baskets and… other elements of the same basket. That last point actually affects the training all along the sequence, as the attention layer, which is supposed to be causal (using only past information), may use past and current information even though it avoids future leakage.

This amounts to saying that session-based recommendation algorithms do not perfectly fit the requirements of next-basket recommendation problems. To fix this modeling issue, we actually considered two solutions:

  • Work with sequences of sets of products rather than with sequences of products:
Figure 4: handling baskets explicitly as sets
  • Keep working with sequences of products, but relax the way we interpret sequences and isolate all products of the last basket:
Figure 5: handling baskets implicitly, with proper positional encoding

Regarding the former, one can imagine representing a set of products:

  • Either by encoding tuples of ids and leaving their representations up to the tf.keras.layers.Embedding layer, which may be difficult, considering that the theoretical number of baskets is pow(number_of_products, max_basket_size) and in any case quite high (~10M in practice for a set of 10K products).
  • Or by first encoding individual products, then defining the representation of a set by some kind of reduction of the representations of its members, agnostic to the set size (e.g., average pooling); see the diagram above and the sketch below.
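
A minimal sketch of the pooling-based option, with hypothetical shapes (padding and masking are omitted for brevity):

import tensorflow as tf

vocab_size, embedding_dim, max_basket_size = 10_000, 100, 8  # hypothetical values

# A sequence of baskets, each padded to max_basket_size product ids.
basket_ids = tf.keras.Input(shape=(None, max_basket_size), dtype=tf.int32)
item_embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim)(basket_ids)
# Reduce each basket to a single vector, agnostic to its size (average pooling).
basket_embeddings = tf.reduce_mean(item_embeddings, axis=2)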

Regarding the latter, one can replace the LSTM + attention layers with the encoder part of an actual transformer architecture, where the recency feature is used as positional encoding. This way, the ordering is managed implicitly: the model still gets the information that a product has been purchased before another one (through the positional encoding) if applicable, while two products of the same basket get an identical positional encoding without any specific assumption on their order. Still, the attention layer is fed elements in a given order, and the question of causality within that layer remains: the simplest way to avoid any spurious assumption is in fact to make the attention bidirectional while totally isolating the last basket, used as the target during training.
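
A minimal sketch of such an encoder block, where a learned embedding of the recency bucket plays the role of positional encoding (all names and dimensions are hypothetical):

import tensorflow as tf

vocab_size, num_recency_buckets, d_model = 10_000, 100, 128  # hypothetical values

product_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
recency_buckets = tf.keras.Input(shape=(None,), dtype=tf.int32)

# Recency buckets act as positional encoding: same basket, same position.
x = tf.keras.layers.Embedding(vocab_size, d_model)(product_ids)
x = x + tf.keras.layers.Embedding(num_recency_buckets, d_model)(recency_buckets)

# No causal mask: attention is bidirectional, so products of a same basket
# attend to each other symmetrically, with no spurious ordering assumption.
attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)
encoded = tf.keras.layers.LayerNormalization()(x + attention)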

Given the relative implementation complexities of those two approaches, the latter was retained.

One aspect that has not been discussed yet is the loss function:

  • With a single product as a target, tf.keras.losses.SparseCategoricalCrossentropy was used.
  • This approach is not feasible when the target is a basket (hence multiple products), since that loss does not accept a multi-dimensional target.
  • We actually tried a full tf.keras.losses.CategoricalCrossentropy, but the computational impact was prohibitive on our data (slowdown factor > 50).

Luckily for us, the cross-entropy loss we are using is actually separable for disjoint ground truths, as can be seen in the following example. Let i ∈ [1..n] be the products’ index. A basket B is the set of indices of products that were purchased by a specific user at a specific time; it comprises elements pj, with j ∈ [1..m], m being the number of elements in that basket. One can then denote by Bj a “fictional” basket that only comprises pj. If the yi are the outputs of our model and the target puts a uniform weight 1/m on each element of B, the loss is then:

$$\mathcal{L}(B) = -\sum_{i=1}^{n} t_i \log y_i = -\frac{1}{m}\sum_{j=1}^{m} \log y_{p_j} = \frac{1}{m}\sum_{j=1}^{m} \mathcal{L}(B_j), \qquad t_i = \begin{cases} 1/m & \text{if } i \in B \\ 0 & \text{otherwise} \end{cases}$$

Which means we can simply “explode” sequences ending with a size-m basket into m similar sequences ending with a size-1 basket (see the example below in table 2: data for each user are replaced by fictional users with edited sequences), keep our sparse cross-entropy loss, and actually find ourselves working with exactly the loss we were interested in.
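
A quick numerical check of this separability (illustrative values only):

import numpy as np

y = np.array([0.2, 0.5, 0.3])  # model outputs over n = 3 products
basket = [0, 2]                # indices of the m = 2 purchased products

# Full cross-entropy with a uniform target over the basket...
target = np.zeros_like(y)
target[basket] = 1 / len(basket)
full_loss = -(target * np.log(y)).sum()

# ...equals the average of the per-product sparse cross-entropies.
assert np.isclose(full_loss, np.mean([-np.log(y[j]) for j in basket]))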

By deduplicating sequences ending with multi-product baskets, we admittedly artificially augment the dataset size, affecting the training time. But in practice the observed augmentation lies around 70%, which is much more acceptable than what was happening with a non-sparse cross-entropy loss.

Table 2: snippet of training data reworked to handle multi-products outputs
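
A minimal sketch of that “explosion” step (explode_last_basket is a hypothetical helper, not the production preprocessing):

def explode_last_basket(history: list, last_basket: list) -> list:
    """Turn one sequence ending with an m-product basket into m training
    examples that each keep the full history but target a single product."""
    return [(history, product) for product in last_basket]

# Two fictional users, each targeting one product of the original basket.
print(explode_last_basket([("W1", 99), ("W2", 50)], ["S1", "S2"]))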

This has been implemented and is currently being AB-tested in production. The relative improvement of the test-set accuracy metric exceeded 14%.

Conclusion

Throughout this post, we’ve discussed specific aspects of an RNN recommendation engine, with an emphasis on the nature of the data that flows through the model, and how it may naturally steer the algorithm’s behavior. This mirrors the process that was followed at Decathlon and led to the implementation of several tweaks. We hope that those tweaks and the associated feedback will benefit your own recommendation algorithms, and in any case will have provided valuable insights about the key technical features of a recommender system.

… With many thanks to everyone involved in the development of the recommendation engine and/or the writing of this post including but not limited to Guillaume Gautier & David Dégardin!
