Training Deep Learning Based Recommender Systems 9x Faster with TensorFlow

Published in NVIDIA Merlin · Jan 6, 2021

By Even Oldridge and Benedikt Schifferer

Building recommender systems (RecSys) at scale is a non-trivial process. The data is huge, training takes a long time, and getting models into production takes thought and care. TensorFlow provides an ecosystem that helps here, with tf.data and Feature Columns for transforming data, TFRecords as a standardized data format, and TensorFlow Serving as a dedicated inference server. This has made TensorFlow the framework of choice for many companies designing and deploying recommender systems, and for good reason.

However, when it comes to ETL and training, there are gaps in GPU kernel coverage and implementation details that make training recommender systems much slower than it could be. Unlike the computer vision and NLP domains, where GPUs have been applied effectively, recommenders aren't able to effectively leverage the parallel compute and high memory bandwidth that GPUs have to offer. In our previous blog post we dove into some of the conceptual aspects of that problem; at its heart, the dataloaders aren't able to keep the GPU utilized. To help solve this we used the same tools that power NVTabular, Dask-cuDF and RAPIDS, to build accelerated data loaders, along with a few other layers that further accelerate recommenders on the GPU.

While NVTabular supports multiple frameworks (TensorFlow, PyTorch, and fast.ai), our focus in this blog is on its integration with the TensorFlow ecosystem. There are some TensorFlow specifics that require additional thought to fully leverage these accelerations. We'll walk through experiments on speed, convergence, and GPU utilization, and finally show how to use the NVTabular dataloader to speed up your own recommender system workflows in TensorFlow.

Kicking the Tires

In our previous post we argued that item-by-item dataloading, and even the windowed dataloading that TensorFlow uses, is an issue. A quick experiment shows that the throughput of TensorFlow training is approximately the same as that of the dataloader, so if we can improve dataloading throughput, we're likely to get an improvement in total training time. How much faster can we make things on the dataloading side, though? Using our dataloader, which loads large chunks of data into a buffer on the GPU and shuffles it there, we're able to show a 24x speedup over the native TensorFlow windowed dataloader, going from 177K samples/s to a whopping 4.2 million. That's just the dataloader speedup; training the model also takes time, so the end-to-end speedup isn't as large, but we've eliminated the bottleneck caused by inefficient dataloading.
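For context, throughput numbers like these can be estimated by simply iterating over a dataloader and counting samples per second. Below is a minimal, framework-agnostic sketch of that measurement; the loader, batch size, and batch count are placeholders rather than our benchmark code.

```python
import time

def estimate_throughput(dataloader, batch_size, num_batches=500):
    """Rough samples/second estimate for any iterable of batches."""
    batches = iter(dataloader)
    next(batches)                      # warm-up: skip one-time setup cost
    start = time.perf_counter()
    for _ in range(num_batches):
        next(batches)
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed

# e.g. print(f"{estimate_throughput(train_loader, batch_size=65536):,.0f} samples/s")
```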

Let's see what this means for training. In Figure 1 below we share the results of our benchmark, which compares the NVTabular dataloader to the TF native one in two configurations, full and mixed precision. From the graph we see an 8.1x speedup at full precision and a 9.3x speedup when using mixed precision. Mileage may vary depending on the dataset and model configuration, but if you're training with tabular data this should be a substantial speedup, and we encourage you to try it out for yourself on your data.

Figure 1. Speedup over the TensorFlow native windowed dataloader. Experiment details below.

Each experiment uses the same hardware, a DGX A100 with 40GB of GPU memory, with data stored on local NVMe. We use the Criteo Click Ads Prediction dataset, training on 150M examples from the first day and validating on 150M examples from the third day. The neural network uses embedding layers for the categorical features and an MLP tower with four hidden layers of 1024 neurons each. You can find a link to our experiment notebook here.
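As a point of reference, a comparable architecture can be sketched in Keras as below. The cardinalities, embedding dimension, optimizer, and metric are placeholders rather than the exact configuration from the benchmark notebook, and the mixed precision policy corresponds to the "mixed" runs (this particular API requires TF 2.4+).

```python
import tensorflow as tf

# Mixed precision, as used in the "mixed" runs (TF 2.4+ API).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

def build_model(cat_cardinalities, num_continuous, embedding_dim=64):
    """Embedding layers for the categorical features feeding a 4x1024 MLP tower."""
    cont_input = tf.keras.Input(shape=(num_continuous,), name="continuous")
    cat_inputs, embeddings = [], []
    for name, cardinality in cat_cardinalities.items():
        inp = tf.keras.Input(shape=(1,), name=name, dtype=tf.int64)
        emb = tf.keras.layers.Embedding(cardinality, embedding_dim)(inp)
        cat_inputs.append(inp)
        embeddings.append(tf.keras.layers.Flatten()(emb))

    x = tf.keras.layers.Concatenate()(embeddings + [cont_input])
    for _ in range(4):                          # 4 hidden layers, 1024 neurons each
        x = tf.keras.layers.Dense(1024, activation="relu")(x)
    # Keep the output in float32 for numerical stability under mixed precision.
    out = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(x)

    model = tf.keras.Model(inputs=cat_inputs + [cont_input], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model
```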

We also see a big increase in GPU utilization, with average utilization jumping from 45% to 80% at full precision and from 20% to 60% at mixed precision. The NVTabular dataloader feeds data to the GPU at a much faster rate, and dataloading is no longer the bottleneck in training, which allows the GPU to shine!

Figure 2. GPU utilization for different experiments over time. Utilization is smoothed by moving average over a time interval of 20 seconds.
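As an aside, utilization traces like the ones in Figure 2 don't require anything exotic to reproduce: sampling NVML while training runs in another process is enough. Here is a small sketch using the pynvml bindings; the sampling interval and duration are arbitrary, and this is not the exact script behind the figure.

```python
# pip install pynvml
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # first GPU

samples = []
for _ in range(60):                                  # one sample per second for a minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)                         # percent of time the GPU was busy
    time.sleep(1)

print(f"average GPU utilization: {sum(samples) / len(samples):.0f}%")
pynvml.nvmlShutdown()
```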

Too good to be true?

The NVTabular dataloader is able to speed up the training pipeline by 9x with mixed precision, but does it have an impact on model quality? Data is grabbed in multiple large contiguous chunks and (optionally) shuffled on the GPU. We don't use the same shuffling mechanism as TensorFlow's windowed dataloader, and we don't provide a full shuffle of the data. As an aside, we'll have a blog post out soon on shuffling and data ordering, but for brevity's sake let's continue. As data scientists and ML engineers, we train deep learning models to achieve high accuracy, and data ordering can affect accuracy through the randomness in batch creation. We need to validate that training a deep learning model with the NVTabular data loader results in the same convergence behavior. Thankfully, in our testing that has been the case; we see nearly identical convergence with either dataloader.

Figure 3. Convergence behavior for TensorFlow and NVTabular data loader. (L) Training (R) Final Validation Score

Phenomenal cosmic power! (itty bitty code change)

When we built the dataloader we had the existing ecosystem and ease of integration in mind. Our KerasSequenceLoader API is modeled after tf.data.experimental.make_batched_features_dataset. It's initialized with a data schema that defines which columns are categorical and which are continuous, as well as the label, so that the corresponding tensor types can be concatenated together when passing data to the framework.

https://gist.github.com/bschifferer/78e2ca51d004ead898ea9d3d96879c6a
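The gist above shows the exact code; as a rough sketch, initialization and training look something like the following. The paths, column names, and batch size are placeholders, and argument names may differ slightly between NVTabular versions, so treat the example notebooks as the source of truth.

```python
import glob

# NVTabular's accelerated TensorFlow dataloader; KerasSequenceValidater runs
# validation with the same accelerated loading path, as a Keras callback.
from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater

TRAIN_PATHS = sorted(glob.glob("./train/*.parquet"))      # placeholder paths
VALID_PATHS = sorted(glob.glob("./valid/*.parquet"))
CATEGORICAL_COLUMNS = ["C1", "C2"]                        # placeholder schema
CONTINUOUS_COLUMNS = ["I1", "I2"]
LABEL_COLUMNS = ["label"]

train_dataset = KerasSequenceLoader(
    TRAIN_PATHS,
    batch_size=65536,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    engine="parquet",
    shuffle=True,
    buffer_size=0.06,            # fraction of GPU memory used for buffering/shuffling
    parts_per_chunk=1,
)

valid_dataset = KerasSequenceLoader(
    VALID_PATHS,
    batch_size=65536,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    engine="parquet",
    shuffle=False,
)

# `model` is any compiled Keras model whose named inputs match the columns above.
model.fit(train_dataset, callbacks=[KerasSequenceValidater(valid_dataset)], epochs=1)
```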

Our TensorFlow solution also includes a number of model layers critical to achieving optimal performance, which will be covered in a future post. The most important of these is a custom embedding layer designed to handle single-hot embedding lookups more efficiently. We've also developed a utility that converts TensorFlow Feature Column definitions into NVTabular so that the data transformations can be done more efficiently. A more advanced example of TensorFlow feature engineering and training is available in our MovieLens dataset example.
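To make the conversion idea concrete, the utility takes standard Feature Column definitions like the ones sketched below and moves the equivalent transformations into an NVTabular workflow. The column names, bucket sizes, and boundaries here are hypothetical, and the conversion call itself is shown in the MovieLens example rather than here.

```python
import tensorflow as tf

# Ordinary TensorFlow Feature Column definitions for a couple of hypothetical inputs.
# NVTabular's conversion utility (see the MovieLens example) accepts definitions like
# these and produces an equivalent NVTabular workflow, so hashing, bucketizing and
# similar transformations happen ahead of time on the GPU instead of inside the
# TensorFlow input pipeline.
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=100_000
)
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 35, 50, 65])

feature_columns = [
    tf.feature_column.embedding_column(user_id, dimension=64),
    tf.feature_column.indicator_column(age_buckets),
]
```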

Try It Out for Yourself

Our benchmarks show a speedup of 8–9x with consistent convergence behavior and seamless integration into a TensorFlow pipeline. Check out our TensorFlow dataloader notebooks for Rossmann and MovieLens, and our detailed notebook on TensorFlow training, as additional resources. We've worked hard to make it easy for you to try this out on your own data. The experiments from this benchmark can also be found here. You may also be interested in accelerating your full pipeline with NVIDIA Merlin; for additional information, see our other Merlin blog posts and our post on training deep learning models with HugeCTR.

NVTabular is an open source project, so please check out the repo if you have any questions. We'd love to hear what you're working on and any gaps you see. You can reach us through our GitHub with any issue or feature request, or by leaving a comment here. And finally, if you're as passionate about recommender systems as we are, please check out this open role on the team. We're growing fast and would love to work with you to help make RecSys fast and easy to use on the GPU.

This work would not have been possible without the awesome engineering efforts of Alec Gunny, who wrote much of the code required for the dataloader and TensorFlow layers.
