Winning the SIGIR eCommerce Challenge on session-based recommendation with Transformers

Gabriel Moreira
NVIDIA Merlin
Jul 15, 2021

It has been a great year for NVIDIA in RecSys competitions: we have won four contests in the last 12 months — the ACM RecSys Challenge 2020 and 2021 (both organized by Twitter), the WSDM WebTour Workshop Challenge 2021 (organized by Booking.com) and now the SIGIR 2021 Workshop on E-commerce Data Challenge (organized by Coveo) — in a collaboration between scientists and engineers from the Merlin and Kaggle Grandmasters Of NVIDIA (KGMON) teams.

This post is about our winning solution for the Session-based Recommendation task of the SIGIR eCommerce Workshop Data Challenge 2021, with which we placed 1st on the Subsequent Items Prediction leaderboard and 2nd on the Next Item Prediction leaderboard.

Our solution included data augmentation and feature engineering techniques, and its models consisted of an ensemble of two different Transformer architectures — Transformer-XL and XLNet — trained with autoregressive and autoencoding approaches inspired by Natural Language Processing (NLP). We leveraged the rich information provided by the dataset and explored different ways to combine tabular data of user interaction events (e.g., clicks, add-to-cart, remove-from-cart, purchases, search queries) with unstructured data (product descriptions and images) in a multi-modal approach.

Our solution is described in detail in our paper and the solution code is available on GitHub. In this post, we briefly describe the competition and our solution, and provide some takeaways that might help you build your own session-based recommender system.

The competition and dataset

The SIGIR 2021 Workshop on E-commerce Data Challenge was organized by a partnership between academic researchers and the company Coveo. The competition presented two tasks: (1) session-based recommendation and (2) purchase prediction (cart abandonment). We competed in both tasks, but for space reasons this post describes only our solution for task (1). The source code and a description of our approach for task (2) are also available in our GitHub repo.

Session-based recommendation is an important task for domains such as e-commerce, news, and streaming video and music services, where users might be untraceable, their histories can be short, and their tastes can change rapidly. In such settings, recommendations must be provided based purely on the interactions that happen in the current session.

For this competition, models were evaluated on their ability to predict the immediate next product the user interacts with in a session (Mean Reciprocal Rank — MRR) and to predict all products the user subsequently interacts with in the session, up to a maximum of 20 after the current event (F1 score).
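As an illustration, here is a minimal sketch of both metrics for a single session, assuming `ranked_items` is the model's ranked recommendation list; the official cutoffs and details are defined by the organizers' evaluation script.

```python
def mrr(next_item, ranked_items):
    """Reciprocal rank of the ground-truth next item in the recommendations."""
    if next_item not in ranked_items:
        return 0.0
    return 1.0 / (ranked_items.index(next_item) + 1)

def f1(ranked_items, subsequent_items):
    """F1 between the recommended items and all subsequent items of the session."""
    hits = len(set(ranked_items) & set(subsequent_items))
    if hits == 0:
        return 0.0
    precision = hits / len(ranked_items)
    recall = hits / len(subsequent_items)
    return 2 * precision * recall / (precision + recall)
```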

This competition provided a very rich dataset in terms of the diversity of data that could be relevant for personalized recommendation in e-commerce. It contains more than 37 million events distributed in almost 5 million sessions and is composed of three tables: browsing events, search events and sku content. The dataset includes logged user events on product pages (view, detail, add-to-card, remove-from-cart and purchases), views of non-product pages (page views) like FAQ and promotions and search events. It also contains product metadata information like the quantized price, the product category and vectors representing the text and image of the product. More details of the dataset and tasks can be found in the competition paper and GitHub repo.

Preprocessing and Feature Engineering

Feature engineering is the process of extracting meaningful attributes from data that help ML models detect patterns and make accurate predictions. Since in this competition we are interested in predicting the next user interactions within a session, it is natural to leverage features from the previous session interactions, as they carry contextual information about what the user is currently looking for.

To encode interaction features, we used traditional encoding techniques and the GPU-accelerated ETL library NVTabular. For categorical features like the event type, product category and sub-category, and the bucketed price, we used the label encoding (categorify) technique, which converts the original categorical values into contiguous ids so that the embedding tables for those features are memory-efficient on the model side. We also created some numerical features, like the product price divided by the average price of the products in its category (to surface patterns on whether the user is targeting cheaper or more expensive products), the product recency (time since the product was first interacted with) and temporal cycling features from the hour of day and day of the week, using standardization for normalization. These and many more feature engineering techniques are available as ops in the NVTabular library, as exemplified in the following code snippet (adapted from our preprocessing notebook).

Listing 1. Feature engineering with NVTabular
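The original listing was an embedded notebook snippet; below is a minimal sketch of such an NVTabular pipeline, with hypothetical column and file names standing in for the real ones from our preprocessing notebook.

```python
import numpy as np
import nvtabular as nvt
from nvtabular import ops

# Categorical features -> contiguous ids (memory-efficient embedding tables)
cat_feats = ["event_type", "category", "sub_category", "price_bucket"] >> ops.Categorify()

# Numerical features -> standardized (zero mean, unit variance)
num_feats = ["relative_price", "product_recency"] >> ops.Normalize()

# Cyclical encoding of the hour of day (analogous ops for the day of week)
hour_sin = ["hour"] >> ops.LambdaOp(lambda col: np.sin(2 * np.pi * col / 24)) >> ops.Rename(postfix="_sin")
hour_cos = ["hour"] >> ops.LambdaOp(lambda col: np.cos(2 * np.pi * col / 24)) >> ops.Rename(postfix="_cos")

workflow = nvt.Workflow(cat_feats + num_feats + hour_sin + hour_cos)
workflow.fit_transform(nvt.Dataset("interactions.parquet")).to_parquet("processed/")
```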

After the preprocessing of interaction features, it is necessary to group interactions by session, as each training example should be a session for next-click prediction. In that case, features are aggregated as lists sorted by the interaction timestamp. This can be accomplished with the Groupby op, as in the following example. With the ListSlice op, we can then truncate the session list features; in our case, we truncated sessions to their last 30 interactions.

Listing 2. Grouping interactions by sessions with NVTabular
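Again a sketch rather than the exact listing, with illustrative column names:

```python
import nvtabular as nvt
from nvtabular import ops

# Aggregate each feature into a list per session, sorted by timestamp
grouped = (
    ["session_id", "timestamp", "item_id", "event_type", "relative_price"]
    >> ops.Groupby(
        groupby_cols=["session_id"],
        sort_cols=["timestamp"],
        aggs={"item_id": ["list"], "event_type": ["list"], "relative_price": ["list"]},
        name_sep="-",
    )
)

# Truncate each session's list features to the last 30 interactions
truncated = grouped[["item_id-list", "event_type-list", "relative_price-list"]] >> ops.ListSlice(-30)

workflow = nvt.Workflow(grouped["session_id"] + truncated)
sessions = workflow.fit_transform(nvt.Dataset("processed/"))
```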

In addition, some preprocessing approaches were very instrumental in improving model performance, as described below.

Removing repeated interactions - In e-commerce datasets, users may interact with the same product many times through different event types, e.g., by clicking, checking product details, adding it to or removing it from the cart, or purchasing it. For this dataset, 13% of interactions were repeated within sessions. Recommendations were evaluated on the ability to predict the next item the user will interact with, regardless of event type. To address this, we kept only the first event with a product and summarized the user's level of interest in that product by means of other features: the number of interactions with the same product within the session, and flags indicating whether the user checked the product details or added the product to the cart within the session. This approach can be helpful for use cases where you want to recommend only items the user has not seen yet.
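A minimal sketch of this deduplication with pandas, assuming hypothetical column names:

```python
import pandas as pd

df = pd.read_parquet("interactions.parquet").sort_values(["session_id", "timestamp"])

grp = df.groupby(["session_id", "product_sku"])
# Number of times the user interacted with this product within the session
df["sku_interactions"] = grp["event_type"].transform("size")
# Flags summarizing the level of interest in the product
df["sku_detailed"] = grp["event_type"].transform(lambda s: (s == "detail").any())
df["sku_added_to_cart"] = grp["event_type"].transform(lambda s: (s == "add").any())

# Keep only the first event with each product in the session
df = df.drop_duplicates(subset=["session_id", "product_sku"], keep="first")
```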

Sessions augmentation - When users browse websites, many interactions might not be associated with an item/product. In this dataset, for example, 70% of logged user events were page views on non-product URLs, like promotions and catalogs. Our hypothesis was that we could learn more fine-grained sequential patterns if we included the page views on non-product URLs in the sequence of user interactions, together with the product events. So we encoded product SKUs and page view URLs (treated as "virtual" products) together in a single categorical feature. This approach largely improved our recommendation accuracy, and is worth trying for other session-based recommendation problems where page views not associated with items are available.
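Conceptually, the fused categorical feature can be built in a couple of lines. The column names below (`product_sku_hash`, `hashed_url`) follow the competition dataset, but treat this as a sketch:

```python
import pandas as pd

events = pd.read_parquet("browsing_events.parquet")

# Fuse product SKUs and non-product page-view URLs into a single
# categorical id space, treating page views as "virtual" products
events["item_id"] = events["product_sku_hash"].where(
    events["product_sku_hash"].notnull(),
    "url_" + events["hashed_url"].astype(str),
)
```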

Model architectures, training and evaluation

Multi-modal features processing

Data might be available in multiple formats, either structured (e.g. tabular) or unstructured (e.g. text, images, audio, video). When related to the prediction task, such information can be very valuable for ML models. But although neural networks are very flexible in combining different types of features, it is not always clear how best to combine features of different natures and scales. In this competition, we explored different ways to represent and combine the multi-modal features available in the dataset and discovered some tricks and techniques that helped increase performance.

In particular, we empirically observed that the best approach for combining categorical features represented by embeddings with numerical features (scalars) was to apply layer normalization individually to each feature before concatenation, which we call feature-wise layer normalization.

For the pre-trained vectors provided in the dataset based on text (search query and product description vectors) and images (product image vectors), we found that it was better to apply L2-normalization than to use the original vectors or to apply layer normalization. We had used this technique previously to normalize articles' textual representations in the news domain. L2-normalization makes the feature scales similar, while also preserving the similarity relationships between similar products.
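A minimal PyTorch sketch of both normalization choices (the feature dimensions are illustrative, not the ones from our solution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWiseLayerNorm(nn.Module):
    """Applies LayerNorm to each feature individually before concatenation."""
    def __init__(self, feature_dims):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(d) for d in feature_dims])

    def forward(self, features):
        # features: list of tensors, each shaped (batch, seq_len, dim_i)
        return torch.cat([ln(f) for ln, f in zip(self.norms, features)], dim=-1)

fwln = FeatureWiseLayerNorm([64, 16, 8])
feats = [torch.randn(32, 30, d) for d in (64, 16, 8)]
fused = fwln(feats)  # (32, 30, 88)

# Pre-trained description/image vectors: L2-normalize instead, so feature
# scales match while similarity relationships between products are preserved
image_vec = F.normalize(torch.randn(32, 30, 50), p=2, dim=-1)
```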

Model architecture

Our base neural network architecture for the competition is presented below. From the bottom up: all interaction features are normalized, concatenated and combined by a Fully Connected (FC) layer to produce an interaction embedding. The sequence of interaction embeddings is then fed to a Transformer architecture (Transformer-XL or XLNet), which learns sequential patterns and outputs a vector for each position in the sequence. Those outputs are then projected by an FC layer into prediction vectors.

Fig 1. Base neural architecture using Transformers
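The following PyTorch sketch mirrors Fig. 1 with a stock Transformer encoder as a stand-in; the actual solution used the HuggingFace Transformer-XL and XLNet implementations, and all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class SessionEncoder(nn.Module):
    """Fused interaction features -> Transformer -> per-position prediction vectors."""
    def __init__(self, fused_dim, d_model=256, n_heads=8, n_layers=3):
        super().__init__()
        self.interaction_fc = nn.Linear(fused_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.prediction_fc = nn.Linear(d_model, d_model)

    def forward(self, fused_feats, padding_mask=None):
        x = self.interaction_fc(fused_feats)  # interaction embeddings (B, S, d_model)
        h = self.transformer(x, src_key_padding_mask=padding_mask)
        return self.prediction_fc(h)          # prediction vectors (B, S, d_model)
```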

To provide contextual information about the item to be predicted, we leveraged the Latent Cross technique, which proposes combining contextual information after the sequential data processing, in a post-fusion approach using element-wise operations. We generate context-aware prediction vectors by combining the prediction vectors projected from the Transformer outputs with contextual information. The contextual information can be any feature related to the session or to the next interaction. In this case, we included the search context, computed as the average of the search query vectors (if search queries happened within the session). The details and a formal description of the post-fusion of other contextual information can be found in our paper.
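In its multiplicative form, the post-fusion boils down to an element-wise modulation; a sketch, assuming the context vector has already been projected to the model dimension:

```python
import torch

def latent_cross(prediction_vectors, context):
    """Element-wise post-fusion of a context vector (e.g. the mean of the
    session's search query vectors) with the sequential model's outputs."""
    # prediction_vectors: (batch, seq_len, d_model)
    # context: (batch, 1, d_model), broadcast over sequence positions
    return prediction_vectors * (1.0 + context)
```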

In the output layer, we use the tying embeddings technique originally proposed for NLP, in which we share the weights of the item id embedding table with the output layer, followed by a softmax to predict the relevance scores over all items. Tying embeddings creates a shared semantic space between input and predicted items and saves the memory of potentially huge embedding tables for datasets with high-cardinality item ids. This technique was first applied to RecSys in the GRU4Rec paper and was rediscovered independently by Kaggle Grandmaster Jean-Francois Puget, who demonstrated it to be especially effective in the NVIDIA.AI team's solution for the Booking.com Challenge.

As the cardinality of item ids for this dataset was not very high (~508k), we treated the recommendation task as a multi-class classification problem and used the cross-entropy loss.
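A sketch of the tied output layer and loss, continuing the `SessionEncoder` sketch above (sizes illustrative):

```python
import torch
import torch.nn as nn

n_items, d_model = 508_000, 256
item_embedding = nn.Embedding(n_items, d_model)  # also used to embed input item ids

def item_logits(prediction_vectors):
    # Tying embeddings: the output projection reuses the input item embedding
    # table, so input and predicted items share one semantic space
    return prediction_vectors @ item_embedding.weight.t()

loss_fn = nn.CrossEntropyLoss()
logits = item_logits(torch.randn(32 * 30, d_model))  # one logit row per position
targets = torch.randint(n_items, (32 * 30,))         # ids of the next items
loss = loss_fn(logits, targets)
```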

Training and evaluation approach

The training set covers 3 months of user interaction data and the test set covers the subsequent month, so we reserved the last 3 weeks of the train set for validation. We observed from data analysis that the session lengths in the test set were close to half the session lengths in the train set. Based on that, we split the validation sessions into two halves: the first half for inference and the second half for metrics evaluation.

In real machine learning projects, we cannot use future data that is not available at inference time, but in machine learning competitions all available data is generally used, including test set data if possible. We observed that 6.7% of the items present in the test set (i.e., new items) were not seen in the train set. Thus, we decided to include the public part (first half) of the test sessions in the training, so that we could learn embeddings for recently released products. We tried three approaches to train on the test set: (1) concatenating train, validation and test sets and shuffling; (2) same as (1), but sorting the data by time; and (3) pre-training with the train and validation sets and fine-tuning only the item embeddings using the test set. The latter approach performed the best (2% better than the others), so we stuck with it, as illustrated below.

Fig 2. Illustration of training and evaluation strategy

For fine-tuning, we froze all the weights of the network except the item id embeddings, whose weights are shared with the output layer. This approach led to a 2% accuracy improvement compared with retraining the whole network on the test data, and we hope to explore this technique further in the future. For our evaluation setup, we followed an analogous approach, pre-training models with just the training data and fine-tuning with the validation data. Only the first half of the validation sessions was used for fine-tuning, so that their lengths were compatible with the test sessions.
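In PyTorch, such selective fine-tuning is a few lines; a sketch, assuming the model exposes its tied item embedding table as `model.item_embedding` (as in the earlier sketch):

```python
import torch

# Freeze everything except the item id embedding table,
# whose weights are shared with the output layer
for param in model.parameters():
    param.requires_grad = False
model.item_embedding.weight.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```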

Models were trained using a 5-fold strategy, meaning that for each validation fold a model was trained using the Out-Of-Fold (OOF) sessions, corresponding to about 80% of the data. The full pipeline (pre-training, fine-tuning, evaluation and prediction over the test set) took on average 265 min (std. 53 min) per model, on an instance with a V100 GPU with 32 GB of memory and 8 CPUs. The models' throughput during inference for next-click prediction averaged 800 sessions/second.
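A sketch of the fold assignment over sessions, assuming `sessions` is the preprocessed DataFrame (the concrete split logic lives in our repo):

```python
import numpy as np
from sklearn.model_selection import KFold

session_ids = np.sort(sessions["session_id"].unique())
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (oof_idx, fold_idx) in enumerate(kfold.split(session_ids)):
    oof_sessions = session_ids[oof_idx]    # ~80% of sessions: train the fold model
    fold_sessions = session_ids[fold_idx]  # held-out fold: validate the fold model
    # train and evaluate one model per fold, then ensemble the fold predictions
```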

Results and Ensembling

For our final solution in the competition we ensembled four variations of the base neural architecture presented in Fig. 1. All neural architectures use tabular features and product description vectors, but vary as follows:

  • XLNET-IM — XLNET with image vectors
  • XLNET-S — XLNET with search context (post-fusion)
  • XLNET-IM-FC — XLNET with image vectors and item frequency capping (details on this in our paper)
  • TransfoXL-IM — Transformer-XL with image vectors

Each of those four architectures was trained using three different hyperparameter configurations that performed well on our CV scores after hyperparameter tuning. The hyperparameters used for our models, the corresponding command lines and the source code for reproducibility are available in the solution's GitHub repository.

Table 1 shows the leaderboard scores for each architecture and its three hyperparameter configurations (suffixed 1, 2 and 3) after ensembling their 5-fold predictions. The top-4 single models were XLNET-IM-2, XLNET-IM-3, XLNET-IM-FC-2 and XLNET-IM-FC-3, which all include the product image vectors as input features. Embedding the original item ids (XLNET-IM-2, XLNET-IM-3) worked better than applying frequency capping to the item id (XLNET-IM-FC-2, XLNET-IM-FC-3). It can also be observed that Transformer-XL, trained with the causal LM approach, performed worse than XLNet, which was trained with masked LM.

Table 1. Leaderboard scores for individual architectures and ensembles

For each model, we saved the top-100 recommended items per session and used a weighted sum of their scores to produce the final recommendation lists. In Table 1 we also present the final ensemble and full ensemble leaderboard results. Our full ensemble was composed of the test set predictions provided by 4 architectures x 3 hyperparameter configurations x 5 folds, resulting in 60 models. In fact, some of the predictions for our final ensemble finished right at the deadline hour and we had a last-minute memory issue when ensembling those large prediction files. Thus, our final ensemble submission used only models from 2 folds (24 models), which scored 1st on the Leaderboard (LB) for F1 (0.0744) and 2nd on the MRR LB (0.2771), very close to 1st place. As soon as the LB opened again the next day for probing, we submitted our full ensemble, which scored 0.0747 for F1 and 0.2783 for MRR, and would have placed 1st for that metric too. The full ensemble improved the best single model's LB score by 1.4% for MRR and 0.9% for F1.
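A sketch of such a weighted-sum ensemble for one session, assuming each model contributes its top-100 as (item, score) pairs; the exact weighting scheme we used is in the paper and repo:

```python
from collections import defaultdict

def ensemble(per_model_topk, weights, k=20):
    """Weighted sum of per-model item scores; returns the top-k item ids."""
    totals = defaultdict(float)
    for topk, weight in zip(per_model_topk, weights):
        for item, score in topk:  # each model's top-100 (item, score) pairs
            totals[item] += weight * score
    return sorted(totals, key=totals.get, reverse=True)[:k]
```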

If you're interested in more details, please check our paper and source code. In the paper, we present some interesting analyses of the model predictions to better understand the recommendation accuracy with respect to different characteristics of sessions (e.g. different session lengths) and items (e.g. popular/infrequent items and cheap/expensive products).

Conclusion

In this post, we shared our winning solution for the session-based recommendation task of the SIGIR 2021 Workshop on E-commerce Data Challenge and how we used an ensemble of Transformer models to effectively learn users' browsing patterns and predict the next interacted items. The proposed architecture leveraged multi-modal information from tabular, textual and image data for more accurate recommendations.

The library we used in this and other competitions will soon be open-sourced under the name Transformers4Rec. It wraps the popular HuggingFace Transformers NLP library for sequential and session-based recommendation, and was born from our research in this area. Stay tuned to the Merlin repo or join us at ACM RecSys 2021, where we'll be announcing its launch!

Competing team

The participating team for the SIGIR eCommerce Data Challenge was Gabriel de Souza Pereira Moreira, Sara Rabhi, Ronay Ak and Md Yasin Kabir, supported by our manager Even Oldridge.


Gabriel Moreira is a PhD and Senior Applied Researcher at NVIDIA working on LLMs and Recommender Systems, and a Google Developer Expert for ML since 2019.