Winning the RecSys2021 Challenge by a Diverse Set of XGBoost and Neural Network Models

Benedikt Schifferer
NVIDIA Merlin
Jul 22, 2021 · 6 min read

The ACM Conference on Recommender Systems organizes a yearly data science competition. This year was particularly special, as Twitter hosted it for the second time in a row. Last year, we won the RecSys2020 competition, and because all participants published their solutions with code, everyone had an excellent starting point for this year. To defend our title, we needed to improve on our previous solution. And yes, we did it :)! For the second time in a row, we won the RecSys competition with the highest score in all metrics, and this blog post focuses on the new techniques we developed this year. Our solution is an ensemble of stacked models with, in total, 5 XGBoost and 3 neural network models. First, let’s take a look at the competition.

RecSys 2021 Challenge

In 2020, Twitter provided a dataset of ~150–200M tweet-user pairs with the goal of predicting user engagement with a tweet: Will a user like, reply to, retweet or retweet with comment (quote) a tweet? For the RecSys Challenge in 2021, the data structure and the goal stayed the same, but the dataset size increased to almost 1 billion tweet-user pairs over 4 consecutive weeks. The evaluation metrics per target, Average Precision (AP) and Relative Cross Entropy (RCE), were updated to reflect fair recommendations by calculating them with respect to the authors’ popularity: the metrics were computed per group, defined by the authors’ follower count, and then averaged. But that was not all.
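For intuition, here is a rough, unofficial sketch of the two metrics. RCE measures the improvement in log loss over a naive predictor that always outputs the positive rate; the official evaluation code defines the exact naive baseline and grouping, and the column names below ("target", "prediction", "follower_group") are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, log_loss


def relative_cross_entropy(y_true, y_pred):
    """RCE: percentage improvement in log loss over a naive predictor."""
    # Naive baseline: always predict the positive rate of the data.
    positive_rate = y_true.mean()
    baseline = log_loss(y_true, np.full_like(y_pred, positive_rate))
    return (1.0 - log_loss(y_true, y_pred) / baseline) * 100.0


def grouped_score(df, metric):
    """Average a metric over author-popularity groups (grouping is illustrative)."""
    per_group = df.groupby("follower_group").apply(
        lambda g: metric(g["target"].values, g["prediction"].values)
    )
    return per_group.mean()


# Usage: grouped_score(df, average_precision_score) for AP,
#        grouped_score(df, relative_cross_entropy) for RCE.
```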

To provide a more realistic production environment, runtime and computational resource limitations were added during inference. Participants could train offline with any resources, but they had to upload their code, features and models to a server that scored the test dataset of ~15M rows on a single-core CPU with 64GB memory. The evaluation was limited to 24 hours, which works out to roughly 6 ms per example (24 h ≈ 86,400 s divided by ~15M rows). This constraint limited the potential model architectures; for example, a BERT model processing the tweet text would not be fast enough.

There was a final twist in this year’s competition. Two weeks before the final deadline, the hosts released the validation dataset (a.k.a. the public test dataset) with targets and allowed using it for training. This changed the competition, as new data became available.

Stacking instead of Retraining or Fine-Tuning

Model architecture of the winning solution of the RecSys2021 competition

For 2.5 months, we only had access to the training dataset, which covered the first 3 weeks of data, and we developed 3 XGBoost models and 3 neural network models on it. After the validation dataset was released, the question became: how do we leverage the additional information? This is actually a common problem in many production systems. A recommendation model is trained and deployed, but the service collects new data every minute. Common solutions are to retrain the model from scratch or to fine-tune the existing model on the new data.

We experimented with both approaches and found that the best-performing technique was to stack our existing models. Stacking means training a new model that uses the predictions of other models as input; it learns how to ensemble already-trained models. Stacking has multiple advantages. First, the stage-2 model (the stacked model) can be less complex, as it only calibrates the existing stage-1 models. Second, the stage-2 model requires less data to train, as the stage-1 models’ predictions already contain rich information. We trained two different stage-2 XGBoost models using the predictions of the 3 XGBoost and 3 neural network models and ensembled them with a simple average. We experimented with many combinations of the six stage-1 models, and our analysis showed that using all six performed best.
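A minimal sketch of the stacking idea with XGBoost is shown below. The random data and the hyperparameters are placeholders, not our exact configuration; in the real pipeline, the six input columns are the engagement probabilities predicted by the stage-1 models.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n = 10_000

# Placeholders for the six stage-1 model outputs (3 XGBoost + 3 neural networks)
# on the data used to fit the stacked model.
stage1_preds = rng.random((n, 6))
labels = rng.integers(0, 2, size=n)

# Stage 2: a small XGBoost model that learns how to combine (calibrate)
# the stage-1 predictions. Hyperparameters here are illustrative only.
dtrain = xgb.DMatrix(stage1_preds, label=labels)
stage2 = xgb.train(
    {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4},
    dtrain,
    num_boost_round=200,
)

# At inference time, run the six stage-1 models first, then feed their
# predictions to the stage-2 model.
stage2_pred = stage2.predict(xgb.DMatrix(stage1_preds))
```

In our final submission, we averaged the predictions of two such stage-2 models.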

As the stacked models can be trained faster and on less data, we think this could be a new approach for production systems: stacked models are retrained at a high frequency to capture the latest trends, while the stage-1 models are updated at a lower frequency.

Our final solution predicted the test dataset in 23 hours and 40 minutes on a single-core CPU with 64GB memory. What if we could use GPUs during inference?

GPU-accelerated inference will be the new black

When the hosts announced the competition rules, including the runtime and computational resources, there was a discussion about allowing GPUs for inference. Adding latency requirements is understandable, so that the developed solutions can actually be deployed to production; a user will not wait 5 seconds on a website to get recommendations. Last year we showed how GPUs accelerated our training pipelines, reducing the runtime from 11 hours to ~2 minutes, a 280x speed-up. This year, we focused on inference. What is the value of GPUs in a production system?

We used a set of open source libraries, Forest Inference Library, NVTabular, RAPIDS cuDF, PyTorch and TensorFlow, to accelerate our pipeline end-to-end on a single NVIDIA A100 GPU with 40GB memory. The experiments reduced the prediction time from 23 hours and 40 minutes on a single-core CPU down to 5 minutes and 30 seconds. That is a speed-up of ~260x!!
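As an illustration of the GPU inference path, cuML’s Forest Inference Library can load a trained XGBoost model and score a cuDF DataFrame directly on the GPU. This is a minimal sketch, not our exact pipeline; the file paths are placeholders and the exact arguments can vary between cuML versions.

```python
import cudf
from cuml import ForestInference

# Load a previously trained XGBoost model into FIL for GPU inference
# ("xgb_model.bin" is a placeholder path to a saved XGBoost model).
fil_model = ForestInference.load("xgb_model.bin", model_type="xgboost")

# Read the preprocessed test features with cuDF; the data stays in GPU memory.
features = cudf.read_parquet("test_features.parquet")

# Predict engagement probabilities on the GPU.
probabilities = fil_model.predict(features)
```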

Normally, a cloud instance couples a fixed ratio of CPU cores to host memory. For a fair comparison, we used the cheapest Google Cloud Platform instance with 64GB memory and enabled all of its CPU cores. An e2-highmem-8 instance with 8 CPU cores and 64GB memory needs 8 hours and 24 minutes. Our GPU-accelerated solution is still 92x faster.

Bar charts visualizing the speed-up per inference step (left) and the cost savings (right)

The left bar chart shows the speed-up of a single NVIDIA A100 versus an 8-core CPU per pipeline step. GPUs accelerated every step of our pipeline, from preprocessing (imputing NaNs, extracting features from text, decoding BERT tokens) to feature engineering and XGBoost or neural network predictions. Each step has a similar relative share of the runtime, so it is important to accelerate the pipeline end-to-end.
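As an example of GPU-accelerated preprocessing, an NVTabular workflow can express missing-value imputation and categorical encoding and run them on the GPU. This is a hedged sketch: the column names are placeholders, not the challenge schema, and our actual feature engineering was far more extensive.

```python
import nvtabular as nvt
from nvtabular import ops

# Continuous features: fill missing values, then normalize.
cont = ["tweet_token_count", "author_follower_count"] >> ops.FillMissing() >> ops.Normalize()

# Categorical features: map raw values to contiguous integer ids.
cats = ["language", "tweet_type"] >> ops.Categorify()

workflow = nvt.Workflow(cont + cats)

train = nvt.Dataset("train.parquet")
test = nvt.Dataset("test.parquet")

workflow.fit(train)                                         # compute statistics on the GPU
workflow.transform(test).to_parquet("test_preprocessed")    # transform and write out
```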

You may think that this is great, but GPUs are much more expensive. We analyzed the dollar cost of running the CPU and GPU versions on Google Cloud Platform. Running an e2-highmem-8 instance for 8 hours and 24 minutes costs ~$3, whereas running an a2-highgpu-1g for 5 minutes and 30 seconds costs ~$0.34 (see the right bar chart). Using GPUs accelerates the inference pipeline by 92x and simultaneously reduces the cost by ~8x-9x. Another option is to trade some of that speed-up to deploy larger models with higher accuracy, since GPUs can run them within the latency constraints.
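The cost figures follow directly from the on-demand instance prices; a quick back-of-the-envelope check (the per-hour prices are approximate values from the competition period and vary by region):

```python
# Approximate GCP on-demand prices in USD per hour (assumed, region-dependent).
E2_HIGHMEM_8_PER_HR = 0.36    # 8 vCPUs, 64 GB memory
A2_HIGHGPU_1G_PER_HR = 3.67   # 1x A100 40GB GPU

cpu_cost = E2_HIGHMEM_8_PER_HR * (8 + 24 / 60)    # 8 h 24 min  -> ~$3.0
gpu_cost = A2_HIGHGPU_1G_PER_HR * (5.5 / 60)      # 5 min 30 s  -> ~$0.34

print(cpu_cost, gpu_cost, cpu_cost / gpu_cost)    # cost ratio ~9x
```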

Isn’t something missing?

The ACM RecSys 2021 competition ran for almost 4 months, and we developed an ensemble of stacked models using, in total, 5 XGBoost and 3 neural network models. As you may have noticed, we haven’t talked about the details of our feature engineering, XGBoost models or neural network architectures. We developed many new tricks and techniques that we cannot fit into a single blog post. If you are interested in learning more about our RecSys 2021 solution, we recommend checking out:

Stay tuned to our NVIDIA Merlin blog. In the coming weeks, we will share technical blog posts about scaling embedding tables in PyTorch and TensorFlow, accelerating TensorFlow Keras embeddings on GPUs, and accelerating boosted trees with the Forest Inference Library.

There is more

NVIDIA KGMON, a Kaggle Grandmaster team, and the NVIDIA Merlin team won three competitions in the last 6 months. If you are interested in building world-class recommender systems, we recommend checking out our other blog posts about our winning solutions for the WSDM WebTour 21 Challenge organized by Booking.com and the SIGIR 2021 Workshop on E-commerce Data Challenge.

Furthermore, scaling and deploying recommendation systems is challenging. NVIDIA Merlin is an open source framework that accelerates recommender system pipelines end-to-end on GPUs. Feature engineering, training and inference can be deployed easily with GPU acceleration. Check out our examples and documentation.

Team

The winning solution was a collaboration of the participating team in the RecSys2021 challenge: Chris Deotte, Bo Liu, Benedikt Schifferer and Gilberto Titericz.


Benedikt Schifferer is a Deep Learning Engineer at NVIDIA working on recommender systems. Prior to that, he graduated with an MSc in Data Science from Columbia University.